Indonesian Biodiversity-related Tweets Including Health, Food Security, and Environmental Management Issues for Sentiment Analysis
Description
The dataset was collected through the Twitter API using around 30 biodiversity-related keywords and covers tweets dated from January 2020 to March 2023. The data was then filtered to remove irrelevant content, including non-Indonesian text, tweets unrelated to biodiversity, spam, and duplicate entries. Eighteen independent annotators manually assigned sentiment labels to the dataset. They comprised twelve researchers and engineers specializing in natural language processing (two with Ph.D. degrees, nine with MSc degrees, and one with a BSc degree), four lecturers, and two natural language processing experts, each holding a Ph.D. or MSc degree. Sentiments were divided into three classes, and the final class label was determined by majority vote.
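The majority-voting rule described above can be sketched as follows. This is a minimal illustration, not the authors' labeling code: the class names in the usage note are hypothetical, and returning `None` when no label has a strict majority is an assumption, since the description does not say how ties were resolved.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(labels: Sequence[str]) -> Optional[str]:
    """Return the label chosen by a strict majority of annotators, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    # Strict majority: more than half of the annotators must agree.
    # Assumption: a tie (e.g. three different labels) yields None.
    return label if count > len(labels) / 2 else None
```

For example, with hypothetical class names, `majority_label(["positive", "positive", "negative"])` returns `"positive"`, while three mutually different labels return `None`.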
Steps to reproduce
* The data can be collected again by using the tweet IDs in the file biodiversity_raw.csv.
* The file biodiversity_labeled.csv contains the 1st annotator label, the 2nd annotator label, the 3rd annotator label, and the final label, so users can compare their own labels with ours.
* For model building, the best model in our experiments was the one built on IndoBert Tweet, which can be downloaded from the Hugging Face site.
* We are drafting a paper titled "Twitter Dataset on Public Sentiments Towards Biodiversity Policy in Indonesia." Users who run into problems may refer to this paper.
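The label comparison mentioned in the steps above could look roughly like this. It is a sketch under stated assumptions: the column names `tweet_id` and `final_label` are hypothetical (the actual headers of biodiversity_labeled.csv are not given here), and a small in-memory sample stands in for the real file.

```python
import csv
import io


def agreement_rate(rows, own_labels, id_col="tweet_id", final_col="final_label"):
    """Share of tweets where the user's label equals the dataset's final label.

    id_col and final_col are hypothetical column names; adjust them to the
    real headers of biodiversity_labeled.csv.
    """
    compared = 0
    matched = 0
    for row in rows:
        own = own_labels.get(row[id_col])
        if own is None:
            continue  # skip tweets the user did not label
        compared += 1
        if own == row[final_col]:
            matched += 1
    return matched / compared if compared else 0.0


# In-memory stand-in for the labeled file (made-up rows, not real data).
sample_csv = io.StringIO(
    "tweet_id,final_label\n"
    "1,positive\n"
    "2,negative\n"
    "3,neutral\n"
)
rows = list(csv.DictReader(sample_csv))
my_labels = {"1": "positive", "2": "neutral"}  # the user's own labels
rate = agreement_rate(rows, my_labels)
```

To run this on the real file, replace the `io.StringIO` sample with `open("biodiversity_labeled.csv", newline="", encoding="utf-8")` and set the column names to match its header row.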
Funding
University of Indonesia: NKB-104/UN2.RST/HKP.05.00/2022
National Research and Innovation Agency: 1/III.6/HK/2023