Balinese Story Texts Dataset - Characters, Aliases, and their Classification
Description
This dataset consists of 120 Balinese story texts (also known as Satua Bali) that have been annotated for character analysis, including character identification, alias clustering, and classification of characters as protagonist or antagonist. The labeling was carried out by two native Balinese speakers who are fluent in reading Balinese story texts. One of them is an expert in sociolinguistics and macrolinguistics. Reliability and level of agreement were measured with Cohen's kappa coefficient, the Jaccard similarity coefficient, and the F1-score, all of which show almost perfect agreement (> 0.81).
There are four main folders, each serving a different character analysis purpose:
1. First Dataset (charsNamedEntity): 89,917 tokens annotated with five character named entity labels (ANM, ADJ, PNAME, GODS, OBJ) for character named entity recognition
2. Second Dataset (charsExtraction): 6,634 annotated sentences for character identification at the sentence level
3. Third Dataset (charsAliasClustering): 930 lists of character groups from the 120 story texts for alias clustering
4. Fourth Dataset (charsClassification): 848 lists of character groups that have been filtered into two groups (Protagonist and Antagonist)
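The three agreement measures named in the description can be sketched in plain Python as follows. This is a minimal illustration, not the authors' evaluation script: the function names, the sample labels, and the placeholder character names are all hypothetical, and the description does not say which measure was applied to which of the four datasets.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[lab] * count_b[lab] for lab in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def jaccard(items_a, items_b):
    """Jaccard similarity between two sets of items."""
    a, b = set(items_a), set(items_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def f1_score(reference, predicted):
    """F1-score, treating one annotator's set as the reference."""
    ref, pred = set(reference), set(predicted)
    true_positives = len(ref & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical token labels from two annotators (label names from the dataset).
annotator_1 = ["ANM", "PNAME", "OBJ", "ANM"]
annotator_2 = ["ANM", "PNAME", "OBJ", "OBJ"]
kappa = cohen_kappa(annotator_1, annotator_2)

# Hypothetical character sets for one story (placeholder names).
chars_1 = {"char_a", "char_b", "char_c"}
chars_2 = {"char_b", "char_c", "char_d"}
jac = jaccard(chars_1, chars_2)
f1 = f1_score(chars_1, chars_2)
```

Under the common Landis and Koch reading, kappa values above 0.81 are described as almost perfect agreement, which matches the threshold quoted above.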
Steps to reproduce
The titles and text contents of the 120 story texts were obtained by web scraping two websites [1], [2] using the BeautifulSoup4 package and the Python programming language. The raw dataset was then preprocessed to remove irrelevant text. The preprocessed dataset was annotated in three stages: pilot annotation, independent annotation, and complete annotation. The level of agreement and the reliability of the two annotators' results were measured at the pilot and independent annotation stages. Once the agreement threshold was met, the remaining 96 story texts that had not yet been annotated were divided into two subsets of 48 texts each, and each subset was annotated independently by one of the two annotators. The gold-standard annotation results for the four datasets are published in this repository.
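The division of the remaining 96 texts into two subsets of 48 could be reproduced along the following lines. This is only a sketch under assumptions: the description does not say how texts were assigned to annotators, so the seeded random shuffle, the function name split_for_annotators, and the story identifiers are all hypothetical.

```python
import random


def split_for_annotators(text_ids, seed=0):
    """Shuffle the ids reproducibly and split them into two equal halves."""
    ids = list(text_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split repeatable
    half = len(ids) // 2
    return ids[:half], ids[half:]


# Hypothetical ids for the 96 texts left after the first two annotation stages.
remaining = [f"story_{i:03d}" for i in range(25, 121)]
subset_a, subset_b = split_for_annotators(remaining)
```

Fixing the seed means the same assignment can be regenerated later, which helps when tracing an annotation back to the annotator who produced it.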