Project Gutenberg - Characters, Authors and their Gender

Published: 6 December 2021 | Version 1 | DOI: 10.17632/cmkmx2gzb3.1
Akarsh Nagaraj


The raw data for the study was obtained from a cleaned version of Project Gutenberg, which contains 3,036 English books as text files, written by 142 authors between 1700 and 1950. Of the 142 authors, 14 are female.

The first file, 'books_data_json.txt', contains the processed book data as a JSON (JavaScript Object Notation) text file, with one dictionary for each of the 3,036 books. Each book's title is the top-level key, and its value is another dictionary holding information about the book. The keys in this dictionary are:

'author_male?': True if the author is male, False if the author is female
'author_year': active decade of the author
'characters': a list of all characters in the book
'character_count': {'male': number of male characters, 'female': number of female characters}
'character_occurrence_count': {'male': number of times male characters are referenced, 'female': number of times female characters are referenced}
'pronoun_count': {'male': number of male pronouns in the book, 'female': number of female pronouns in the book}

The second file, 'Authors_Metadata.xlsx', contains information about each author in the dataset: gender, active year, genres of their work, and the number of their books in the dataset. All author information was collected manually from internet resources such as Wikipedia.
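As a rough illustration of how the per-book dictionaries described above might be consumed, the sketch below parses a toy record and computes the female share of characters, mentions, and pronouns. It assumes the file is valid JSON with the documented keys; the sample record, its values (including the numeric form of 'author_year'), and the helper names `female_share` and `summarize` are invented for illustration and are not taken from the dataset.

```python
import json

# Toy record that mimics the documented layout. The title and all values are
# invented for illustration; they are NOT taken from the dataset.
SAMPLE_JSON = """
{
  "Example Book": {
    "author_male?": true,
    "author_year": 1810,
    "characters": ["Anna", "John", "Mary"],
    "character_count": {"male": 1, "female": 2},
    "character_occurrence_count": {"male": 40, "female": 60},
    "pronoun_count": {"male": 300, "female": 500}
  }
}
"""

COUNT_FIELDS = ("character_count", "character_occurrence_count", "pronoun_count")


def female_share(counts):
    """Female fraction of a {'male': m, 'female': f} pair, or None if both are zero."""
    total = counts["male"] + counts["female"]
    return counts["female"] / total if total else None


def summarize(books):
    """Map each book title to the female share for every count field."""
    return {
        title: {field: female_share(info[field]) for field in COUNT_FIELDS}
        for title, info in books.items()
    }


# For the real file, something like json.load(open("books_data_json.txt", encoding="utf-8"))
# should work if the file is valid JSON (an assumption).
books = json.loads(SAMPLE_JSON)
summary = summarize(books)
print(summary["Example Book"])
```

Returning None for a book with no counted characters or pronouns avoids a division by zero and keeps such books distinguishable from books with a genuine share of 0.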


Steps to reproduce

For each book, as a first step, we split the input text into sentences to improve the performance of character extraction. To do so, we used a Python-based sentence segmentation module called SegTok, which identifies sentence terminals such as '.', '?' and '!' and disambiguates them when they appear in the middle of a sentence, e.g., in abbreviations and website links. We evaluated the accuracy of SegTok by randomly sampling 110 segmented sentences and manually checking whether each was correctly segmented with respect to the paragraph in which it was originally embedded. Of these 110 sentences, only two were incorrectly segmented, giving an accuracy of 98.18%.

Next, we extracted named characters from each sentence using an NLP technique called Named Entity Recognition (NER). NER seeks to locate named entities mentioned in unstructured text and classify them into pre-defined categories such as person names, organizations, locations, and monetary values, to name a few. For this dataset, we were only interested in extracting person names from the text of the books. Once the names were extracted, we measured the number of times each character is mentioned in the book.

To ensure that we did not count the same character more than once when it is named in slightly different ways (e.g., 'Darcy' and 'Mr. Darcy'), we needed to disambiguate the character extractions from the previous step. We used the SequenceMatcher class from Python's difflib library to do so. This class compares two strings and provides a similarity score between 0 (no match at all) and 1 (complete match, i.e., the strings are the same). We treated string pairs, representing character extraction pairs, with a similarity score of 0.70 or above as duplicates. This threshold was selected after some sampling and manual verification. The disambiguation also allowed us to count and record the number of unique characters extracted from each book.
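The duplicate check described above can be sketched with difflib's `SequenceMatcher.ratio()` and the stated 0.70 threshold. How duplicate pairs were actually grouped is not specified in this description, so the greedy strategy below (each name joins the first earlier name it resembles) is an assumption, as are the helper names `similarity` and `merge_mentions` and the mention counts, which are made up.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.70  # similarity cutoff stated in the description


def similarity(a, b):
    """Similarity score in [0, 1] between two strings (case-sensitive)."""
    return SequenceMatcher(None, a, b).ratio()


def merge_mentions(mention_counts, threshold=THRESHOLD):
    """Greedily merge names whose similarity to an earlier name meets the threshold.

    The first spelling seen becomes the canonical name. This grouping
    strategy is an assumption, not the authors' documented procedure.
    """
    merged = {}
    for name, count in mention_counts.items():
        for canonical in merged:
            if similarity(name, canonical) >= threshold:
                merged[canonical] += count
                break
        else:
            # No sufficiently similar name seen so far: start a new character.
            merged[name] = count
    return merged


# Hypothetical mention counts, for illustration only.
mentions = {"Mr. Darcy": 5, "Darcy": 3, "Elizabeth": 7}
merged = merge_mentions(mentions)
print(merged)
print(len(merged), "unique characters")
```

For 'Darcy' and 'Mr. Darcy' the score is 2 × 5 / (5 + 9) ≈ 0.714, just above the threshold, so the two spellings are counted as one character.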
To assess the accuracy of this disambiguation heuristic, we randomly sampled 76 character pairs that it had marked as duplicates and found that 72 were correctly disambiguated, yielding an accuracy of 94.74%.

We also counted the number of male (he, him, his) and female (she, her, hers) pronouns that appear in each book, using simple string-based pattern matching.

Finally, to classify the extracted characters as male or female, we used the Python-based Gender_Detector library, developed using data from the Global Name Data project, which determines the gender of a character from the first name. Using this library, we were able to heuristically tag each extracted character as male or female. We evaluated the accuracy of this method by randomly sampling 100 extracted characters and manually checking their actual gender against the predicted gender. There was only one error, yielding an accuracy of 99%.
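The pronoun count can be approximated with a few lines of pattern matching. The sketch below uses the pronoun lists given above; the exact matching rules used for the dataset are not stated, so lowercasing the text and matching whole alphabetic words are assumptions, as is the function name `count_pronouns`.

```python
import re

MALE_PRONOUNS = {"he", "him", "his"}
FEMALE_PRONOUNS = {"she", "her", "hers"}


def count_pronouns(text):
    """Count whole-word, case-insensitive occurrences of the listed pronouns."""
    # Split into alphabetic tokens so that 'the' or 'this' never match 'he' or 'his'.
    tokens = re.findall(r"[a-z]+", text.lower())
    return {
        "male": sum(token in MALE_PRONOUNS for token in tokens),
        "female": sum(token in FEMALE_PRONOUNS for token in tokens),
    }


sample = "She gave him her book. He thanked her; the book was hers, not his."
pronoun_count = count_pronouns(sample)
print(pronoun_count)
```

The result has the same shape as the 'pronoun_count' entry in the per-book dictionaries.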


University of Southern California


Natural Language Processing, Gender Disparity, Gender Gap, Book Review, English Literature