Project Gutenberg - Characters, Authors and their Gender
Steps to reproduce
For each book, we first split the input text into sentences to improve the performance of character extraction. To do so, we used SegTok, a Python sentence segmentation module that identifies sentence terminals such as ‘.’, ‘?’ and ‘!’ and disambiguates them when they appear in the middle of a sentence, e.g., in abbreviations and website links. We evaluated the accuracy of SegTok by randomly sampling 110 segmented sentences and manually checking whether each was correctly segmented with respect to the paragraph in which it was originally embedded. Only two of these 110 sentences were incorrectly segmented, giving an accuracy of 98.18%.
Next, we extracted named characters from each sentence using a natural language processing (NLP) technique called Named Entity Recognition (NER). NER seeks to locate named entities mentioned in unstructured text and classify them into pre-defined categories such as person names, organizations, locations and monetary values. For this dataset, we were interested only in extracting person names from the text of the books. Once the names were extracted, we counted the number of times each character is mentioned in the book.
To avoid counting the same character more than once when it is named in slightly different ways (e.g., ‘Darcy’ and ‘Mr. Darcy’), we disambiguated the character extractions from the previous step. We used the SequenceMatcher class from Python's difflib library to do so. This class compares two strings and provides a similarity score between 0 (no match at all) and 1 (complete match, i.e., the strings are identical). We treated pairs of extracted character names with a similarity score of 0.70 or above as duplicates. This threshold was selected after some sampling and manual verification. This disambiguation also allowed us to count and record the number of unique characters extracted from each book.
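As an illustration of the disambiguation step, the following is a minimal sketch rather than the exact code used to build the dataset. It assumes a greedy strategy in which each extracted name is merged into the first previously seen name that reaches the 0.70 similarity threshold. The function names similarity and merge_mentions and the greedy grouping itself are hypothetical; only SequenceMatcher and the threshold come from the description above.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import Iterable

SIMILARITY_THRESHOLD = 0.70  # threshold stated in the text


def similarity(a: str, b: str) -> float:
    """Return the SequenceMatcher ratio of two names, from 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def merge_mentions(mentions: Iterable[str],
                   threshold: float = SIMILARITY_THRESHOLD) -> Counter:
    """Count mentions per character, merging near-duplicate names.

    Assumption (not from the source): a greedy pass in which each name is
    folded into the first already-seen name whose similarity meets the
    threshold; otherwise it starts a new character entry.
    """
    counts: Counter = Counter()
    for name in mentions:
        for canonical in counts:
            if similarity(name, canonical) >= threshold:
                counts[canonical] += 1
                break
        else:
            # No sufficiently similar name seen so far: new character.
            counts[name] = 1
    return counts


if __name__ == "__main__":
    extracted = ["Mr. Darcy", "Elizabeth", "Darcy", "Darcy", "Elizabeth"]
    print(merge_mentions(extracted))
```

In this sketch, ‘Darcy’ and ‘Mr. Darcy’ receive a score of about 0.71 and are therefore counted as one character, while ‘Elizabeth’ remains a separate entry. The number of keys in the returned counter corresponds to the number of unique characters.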
To assess the accuracy of this disambiguation heuristic, we randomly sampled 76 character pairs that it had marked as duplicates and found that 72 were correctly disambiguated, yielding an accuracy of 94.74%.
We also counted the number of male (he, him, his) and female (she, her, hers) pronouns that appear in each book, using simple string-based pattern matching.
Finally, to classify the extracted characters as male or female, we used the Python Gender_Detector library, which was developed using data from the Global Name Data project and infers gender from a first name. Using this library, we heuristically tagged each extracted character as male or female. We evaluated the accuracy of this method by randomly sampling 100 extracted characters and manually checking their predicted gender against their actual gender. There was only one error, yielding an accuracy of 99%.
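The pronoun count described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: it assumes case-insensitive matching on whole words, so that ‘he’ inside ‘the’ or ‘her’ inside ‘here’ is not counted. The function name count_gendered_pronouns and the tokenization rule are hypothetical.

```python
import re
from typing import Dict

MALE_PRONOUNS = {"he", "him", "his"}
FEMALE_PRONOUNS = {"she", "her", "hers"}

# Assumption: a word is a maximal run of ASCII letters after lowercasing.
WORD_RE = re.compile(r"[a-z]+")


def count_gendered_pronouns(text: str) -> Dict[str, int]:
    """Count whole-word male and female pronouns in a text."""
    counts = {"male": 0, "female": 0}
    for token in WORD_RE.findall(text.lower()):
        if token in MALE_PRONOUNS:
            counts["male"] += 1
        elif token in FEMALE_PRONOUNS:
            counts["female"] += 1
    return counts


if __name__ == "__main__":
    sample = "She gave him her book. He said it was his, not hers."
    print(count_gendered_pronouns(sample))
```

Matching on whole tokens rather than raw substrings is a design choice of this sketch; a plain substring search would overcount, because many common words contain ‘he’ or ‘her’.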