EmoTweetID: Indonesian Emotion Tweet Dataset

Published: 4 December 2025 | Version 5 | DOI: 10.17632/jzgnjsff9f.5
Contributors:
Kuncahyo Setyo Nugroho, Fitra Abdurrachman Bachtiar, Wayan Firdaus Mahmudy, Matthew Martianus Henry, Mahmud Isnan, Gusti Pangestu, Bens Pardamean

Description

The EmoTweetID dataset is a publicly available resource of Indonesian tweets collected from X (formerly Twitter) using emotion-related keywords. The dataset consists of three main components:

  1. EmoTweetID-Corpus.csv: 3,126,987 unlabeled tweets for unsupervised tasks such as word embedding construction.
  2. EmoTweetID-Lexicon.csv: 2,243 tweets automatically annotated using the Indonesian NRC EmoLex.
  3. EmoTweetID-Human.csv: 2,243 tweets manually annotated by three psychology students, with inter-annotator agreement measured using Cohen’s and Fleiss’ Kappa.

Both annotated files (EmoTweetID-Lexicon.csv and EmoTweetID-Human.csv) provide labels following Ekman’s six basic emotions: anger, disgust, fear, joy, sadness, and surprise.

Additionally, two pre-trained word embedding models (Word2Vec and FastText) trained on the corpus are provided as TweetID-Word2Vec.zip and TweetID-FastText.zip for downstream NLP tasks.

All code used to construct the dataset is available in the GitHub repository: https://github.com/ksnugroho/EmoTweetID

The dataset is intended as a benchmark for affective computing and natural language processing in Indonesian, supporting research in emotion recognition, social media analysis, and the development of empathetic AI systems.
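For readers unfamiliar with the agreement measures mentioned above, the following is a minimal sketch of Cohen's kappa (two annotators) and Fleiss' kappa (three or more annotators) in plain Python. It is not the authors' code: the function names `cohen_kappa` and `fleiss_kappa` and the toy labels are assumptions made for illustration only, and the column layout of EmoTweetID-Human.csv is not assumed here.

```python
from collections import Counter

# Ekman's six basic emotions, as used for the labels in the annotated files.
EMOTIONS = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in EMOTIONS) / (n * n)
    return (observed - expected) / (1 - expected)


def fleiss_kappa(ratings, categories=EMOTIONS):
    """Fleiss' kappa; `ratings` holds one list of labels per item,
    with the same number of raters for every item."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]
    # Per-item agreement: share of rater pairs that agree on the item.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    mean_agreement = sum(per_item) / n_items
    # Chance agreement from the pooled category proportions.
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    chance = sum(p ** 2 for p in proportions)
    return (mean_agreement - chance) / (1 - chance)


# Toy labels (hypothetical, NOT taken from the dataset).
ann1 = ["joy", "joy", "anger", "fear", "sadness", "joy"]
ann2 = ["joy", "joy", "anger", "fear", "joy", "joy"]
ann3 = ["joy", "sadness", "anger", "fear", "sadness", "joy"]

print("Cohen's kappa (ann1 vs ann2):", cohen_kappa(ann1, ann2))
print("Fleiss' kappa (three annotators):",
      fleiss_kappa([list(item) for item in zip(ann1, ann2, ann3)]))
```

Both functions divide by one minus the chance agreement, so they are undefined when every annotator uses a single label throughout; a production implementation would need to handle that case.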

Institutions

  • Bina Nusantara University
  • Universitas Brawijaya

Categories

Natural Language Processing, Affective Computing

Licence