ArPanEmo: An Open-Source Dataset for Fine-Grained Emotion Recognition in Arabic Online Content during COVID-19 Pandemic.
The dataset consists of 11,128 online posts, each manually labelled with one of ten emotion categories or as neutral, with an inter-annotator agreement (Fleiss’ kappa) of 0.71. The ten emotions include the six basic human emotions originally proposed by Ekman (1999): anger, disgust, fear, happiness, sadness, and surprise. We added four further emotions that we deemed pertinent to the pandemic period: anticipation, confusion, optimism, and pessimism. The number of instances is relatively balanced across the emotion categories. ArPanEmo is unique in that it focuses on a specific dialect, namely Saudi, and covers topics related to healthcare and the ways in which the COVID-19 pandemic has affected various aspects of life. The online posts were collected from three distinct online sources: YouTube comments, online newspaper comments, and Twitter. For the part of the ArPanEmo corpus collected from Twitter, we release only the tweet IDs along with their annotations, because Twitter’s Terms of Service restrict the distribution of tweet content. Researchers can use the tweet IDs to retrieve the tweet content when conducting studies on the dataset. To ensure reproducibility, the ArPanEmo dataset comes with fixed, predefined training and test portions. The dataset is formatted as CSV, where each row represents an online post collected from one of the three aforementioned sources. Each file has three columns: number, post, and label. The number column holds the tweet ID; for a YouTube or online newspaper comment, it is set to 0. The post column contains the text of the collected post; for tweets, the text is replaced with three dots (...). The label column gives the emotion label assigned to the post.
*** NOTE *** Version 2 of the ArPanEmo dataset differs from version 1 only in that it includes the correct training set file (ArPanEmo_train.csv), whereas in version 1, the training set file (ArPanEmo_train.csv) was mistakenly a duplicate of the test set file (ArPanEmo_test.csv).
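The CSV layout described above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the column names number, post, and label come from the description, but the presence of a header row with exactly those names, the file encoding, and the exact label strings are assumptions that should be checked against the released files. The function names read_arpanemo and label_counts are hypothetical, and the sample rows are invented for illustration rather than taken from the dataset.

```python
import csv
import io
from collections import Counter

# Per the description, tweet text is replaced with three dots.
PLACEHOLDER = "..."


def read_arpanemo(stream):
    """Split rows into posts that carry text and tweets that carry only an ID.

    Assumes a header row naming the columns number, post, and label.
    Tweet IDs are kept as strings so that long IDs are never rounded.
    """
    with_text, tweet_ids = [], []
    for row in csv.DictReader(stream):
        number = row["number"].strip()
        post = row["post"].strip()
        label = row["label"].strip()
        if number != "0" and post == PLACEHOLDER:
            # Twitter row: only the ID and the annotation are released.
            tweet_ids.append((number, label))
        else:
            # YouTube or online newspaper comment: the text is included.
            with_text.append((post, label))
    return with_text, tweet_ids


def label_counts(with_text, tweet_ids):
    """Count how many rows carry each label, across both groups."""
    return Counter(label for _, label in with_text + tweet_ids)


# Invented sample rows in the assumed layout (not real dataset content).
SAMPLE = """number,post,label
0,sample comment text,optimism
1234567890123456789,...,fear
0,another sample comment,neutral
"""

with_text, tweet_ids = read_arpanemo(io.StringIO(SAMPLE))
counts = label_counts(with_text, tweet_ids)
```

To apply the same function to a released file, open it in text mode, for example open("ArPanEmo_train.csv", encoding="utf-8", newline=""), and pass the file object in place of the in-memory sample; the UTF-8 encoding is an assumption. The collected tweet IDs can then be passed to whichever tweet lookup tool is available to retrieve the missing text.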