KurdiSent: A Corpus For Kurdish Sentiment Analysis

Published: 3 February 2023 | Version 1 | DOI: 10.17632/3yrkswy6ph.1
Contributor:
Soran Badawi

Description

Kurdish is regarded as a less-resourced language. It is spoken by an estimated 30-40 million people worldwide. Its alphabet has 33 letters, most of which are similar to those of the Arabic script. Kurdish has two major dialects, Sorani and Badini. The dataset is a collection of texts written in the Sorani dialect. It contains tweets and comments from major social media platforms: Twitter, Facebook and YouTube. For security reasons, and in accordance with the policies of Twitter, Facebook and YouTube, we removed user identities. We collected tweets and comments that were published during the coronavirus pandemic. The tweets and comments are raw texts, and their content covers a wide range of topics, including politics, sports, entertainment and social life.

Data Labeling

Facepager was used to crawl the comments from Facebook and YouTube, and the Twitter developer platform was used to mine the tweets. The dataset was annotated manually by three native Kurdish speakers. The annotators were asked to assign a class and a category to each text. The classes are positive, negative and neutral, and the categories are news, technology, art, social and health. Texts for which at least two annotators agreed on a specific label and category were regarded as conflict-free and accepted for further processing. Texts on which all three annotators disagreed were removed from the dataset. The doccano annotation tool was used to help the annotators label each text one by one.
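The acceptance rule for annotated texts can be illustrated with a short Python sketch. It is not the authors' code: the function names `majority` and `resolve` are hypothetical, and it assumes that agreement on the class and agreement on the category are checked independently, which the description does not state explicitly.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def majority(values: Sequence[str], min_votes: int = 2) -> Optional[str]:
    """Return the value chosen by at least `min_votes` annotators, else None."""
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if count >= min_votes else None


def resolve(
    sentiments: Sequence[str], categories: Sequence[str]
) -> Optional[Tuple[str, str]]:
    """Accept a text only if both its class and its category reach a majority.

    Assumption: the two agreements are evaluated separately. A text with no
    majority on either field is treated as a conflict and dropped (None).
    """
    sentiment = majority(sentiments)
    category = majority(categories)
    if sentiment is None or category is None:
        return None
    return sentiment, category


# Two of three annotators agree on both fields: the text is kept.
kept = resolve(["positive", "positive", "neutral"], ["news", "art", "news"])

# All three annotators disagree on the class: the text is dropped.
dropped = resolve(["positive", "negative", "neutral"], ["news", "news", "news"])
```

With three annotators and a threshold of two votes, at most one value per field can reach the threshold, so the result is unambiguous.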

Files

Categories

Machine Learning, Deep Learning, Sentiment Analysis

Licence