Semantic Role Labeling Datasets for Crisis Event
Description
This dataset is a collection of Indonesian-language tweets designed to support research on Semantic Role Labeling (SRL) and Named Entity Recognition (NER). It combines argument labels for SRL with entity labels for NER, both of which are central to information extraction. The SRL labels identify semantic roles in a crisis report, such as who the victims are, where the event took place, what caused it, and what its impact was. The NER labels mark entities such as the names of places and organizations involved.
The data was collected through the Twitter API using relevant keywords and covers tweets posted from January 2018 to December 2023 about four types of crisis event in Indonesia: fires, accidents, floods, and earthquakes. Two experienced annotators carried out the annotation: a disaster management expert and a doctoral student in computer science. Inter-annotator agreement was measured with Fleiss' Kappa and exceeded 0.80 for every label, indicating consistent labeling and high data quality.
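As a rough illustration of the agreement measure mentioned above, the following Python sketch computes Fleiss' Kappa from per-item label lists. The function name `fleiss_kappa`, the label names, and the toy ratings are assumptions made for this example; they are not the dataset's annotations or the authors' evaluation script.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Compute Fleiss' Kappa.

    ratings: list of items, each a list holding one label per rater.
    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(ratings)
    if n_items == 0:
        raise ValueError("ratings must contain at least one item")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of rater pairs that agree on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items
    total = n_items * n_raters
    # Agreement expected by chance from the overall label distribution.
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one label is used")
    return (observed - expected) / (1.0 - expected)


# Toy ratings from two hypothetical annotators (illustrative labels only).
toy_ratings = [
    ["VICTIM", "VICTIM"],
    ["LOCATION", "LOCATION"],
    ["VICTIM", "LOCATION"],
    ["LOCATION", "LOCATION"],
]
kappa = fleiss_kappa(toy_ratings)
```

A value of 1.0 means perfect agreement and 0 means agreement no better than chance, so the reported scores above 0.80 sit near the top of that range.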
Files
Steps to reproduce
- Information extraction for SRL and NER tasks:
  1. Use the argument labels as targets to train a model for the SRL task, and the entity labels as targets to train a model for the NER task.
  2. Both the argument labels and the entity labels are available in the labeled-data.csv file, which is the final output of the annotation process for this dataset.
- If you have difficulty using this dataset, full details of the methods, data structures, and annotation procedures are given in our draft paper "Annotated Data for Semantic Role Labeling of Crisis Events in Indonesian Tweets", which includes additional guidance and context.
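The steps above could begin with something like the following Python sketch, which groups token rows into per-sentence sequences with parallel SRL and NER targets. The layout of labeled-data.csv is not described here, so the column names (`sentence_id`, `token`, `srl_label`, `ner_label`), the tag values, and the in-memory sample are all assumptions for illustration; check the real header and adjust the arguments accordingly.

```python
import csv
import io
from itertools import groupby


def read_sequences(handle, id_col="sentence_id", token_col="token",
                   srl_col="srl_label", ner_col="ner_label"):
    """Group token rows into sentences with parallel SRL and NER tags.

    All column names are hypothetical defaults. Rows that belong to the
    same sentence are assumed to be consecutive in the file.
    """
    reader = csv.DictReader(handle)
    sequences = []
    for _, rows in groupby(reader, key=lambda row: row[id_col]):
        rows = list(rows)
        sequences.append({
            "tokens": [row[token_col] for row in rows],
            "srl": [row[srl_col] for row in rows],  # targets for the SRL model
            "ner": [row[ner_col] for row in rows],  # targets for the NER model
        })
    return sequences


# Made-up sample in the assumed layout; not taken from the dataset.
SAMPLE = """sentence_id,token,srl_label,ner_label
1,Banjir,B-EVENT,O
1,di,O,O
1,Jakarta,B-LOCATION,B-LOC
2,Gempa,B-EVENT,O
2,Lombok,B-LOCATION,B-LOC
"""

sequences = read_sequences(io.StringIO(SAMPLE))
```

To read the actual file, pass an open handle instead of the sample, for example `open("labeled-data.csv", newline="", encoding="utf-8")`. The resulting token and tag lists can then be fed to whichever sequence-labeling framework is used for training.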
Institutions
Categories
Funding
the Ministry of Education and Culture of the Republic of Indonesia
038/E5/PG.02.00.PL/2024