Balinese Story Texts Dataset - Characters, Aliases, and their Classification

Published: 25 March 2024 | Version 3 | DOI: 10.17632/h2tf5ymcp9.3

Description

This dataset consists of 120 Balinese story texts (also known as Satua Bali) that have been annotated for narrative text analysis, including character identification, alias clustering, and character classification into protagonist or antagonist. The labeling was done by two native Balinese speakers who are fluent in reading Balinese story texts; one of them is an expert in sociolinguistics and macrolinguistics. Reliability and level of agreement were measured with Cohen's kappa coefficient, the Jaccard similarity coefficient, and the F1-score, and all three show almost perfect agreement (>0.81).

The dataset has four main folders, each serving a different narrative text analysis purpose:

1. First Dataset (charsNamedEntity): 89,917 annotated tokens with five character named entity labels (ANM, ADJ, PNAME, GODS, OBJ) for character named entity recognition
2. Second Dataset (charsExtraction): 6,634 annotated sentences for character identification at the sentence level
3. Third Dataset (charsAliasClustering): 930 lists of character groups from the 120 story texts for alias clustering
4. Fourth Dataset (charsClassification): 848 lists of character groups that have been classified into two groups (Protagonist and Antagonist)
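The agreement measures named in the description can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the toy label sequences, and the placeholder character names are assumptions made for illustration, and only the label names (ANM, ADJ, PNAME, GODS, OBJ) and the 0.81 threshold come from the description.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def jaccard(items_a, items_b):
    """Jaccard similarity between two sets (e.g. extracted characters)."""
    a, b = set(items_a), set(items_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def f1_between(items_a, items_b):
    """F1-score of annotator B's set, treating annotator A's set as reference."""
    a, b = set(items_a), set(items_b)
    if not a and not b:
        return 1.0
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(b)
    recall = overlap / len(a)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical, not taken from the dataset).
tokens_a = ["PNAME", "ANM", "OBJ", "OBJ", "GODS", "ADJ"]
tokens_b = ["PNAME", "ANM", "OBJ", "PNAME", "GODS", "ADJ"]
kappa = cohen_kappa(tokens_a, tokens_b)

chars_a = {"char_1", "char_2", "char_3"}
chars_b = {"char_2", "char_3", "char_4"}
jac = jaccard(chars_a, chars_b)
f1 = f1_between(chars_a, chars_b)

THRESHOLD = 0.81  # threshold quoted in the description
meets_threshold = all(score > THRESHOLD for score in (kappa, jac, f1))
```

Which measure applies to which of the four folders is not stated here; the sketch only shows how each coefficient is computed for a pair of annotators.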

Steps to reproduce

We collected 120 story texts, either by manual book digitization [1] or by web scraping two websites [2,3] with the BeautifulSoup4 package in the Python programming language. Only the title and content of each story text were extracted. The raw dataset was then preprocessed to remove irrelevant text.

The preprocessed dataset was annotated in three stages: pilot annotation, independent annotation, and complete annotation. Two annotators who are fluent in Balinese (one of whom is a linguistics expert) annotated the dataset. The level of agreement and the reliability of the two annotators' results were measured at the pilot and independent annotation stages. After the threshold (>0.81) was met, the remaining 96 story texts that had not yet been annotated were divided into two subsets of 48 texts each, and each subset was annotated independently by one of the two annotators. The gold-standard annotation results for these four datasets are published in this repository.

References:

[1] I. N. Suwija, I. M. Darmada, and I. N. R. Mulyawan, Kumpulan Satua (Dongeng Rakyat Bali). Denpasar: Pelawa Sari, 2019.
[2] “Kumpulan Satua Bali.” Accessed: Jan. 4, 2023. [Online]. Available: https://satua-bali.blogspot.com/
[3] “Kumpulan Daftar Contoh Satua Bali.” Accessed: Jan. 3, 2023. [Online]. Available: https://msatuabali.blogspot.com/
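The final step of the annotation procedure, dividing the remaining 96 story texts into two subsets of 48, could be sketched as follows. How the split was actually made is not described, so the seeded random shuffle, the function name `split_for_annotators`, and the story identifiers are all assumptions for illustration.

```python
import random


def split_for_annotators(story_ids, seed=0):
    """Split story ids into two equal, disjoint halves, one per annotator.

    Assumption: a seeded random shuffle; the source does not say how the
    texts were assigned to the two annotators.
    """
    ids = list(story_ids)
    if len(ids) % 2 != 0:
        raise ValueError("need an even number of stories for two equal halves")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]


# Hypothetical ids for the 96 texts left after the pilot and independent stages.
remaining = [f"story_{i:03d}" for i in range(25, 121)]
subset_a, subset_b = split_for_annotators(remaining)
```

Fixing the seed means the same assignment can be regenerated later, which helps when tracing which annotator produced which gold-standard file.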

Institutions

Universitas Udayana, Institut Teknologi Sepuluh Nopember

Categories

Computer Science, Algorithms, Information Retrieval, Natural Language Processing, Machine Learning, Narrative Analysis, Text Mining

Licence