Datasets Comparison
Version 1
Sorani Tweet Sent Dataset (STSD)
Description
Kurdish belongs to the Indo-Iranian branch of the Indo-European language family and is closely related to Persian. It is spoken by approximately 40 million people living at the crossroads of Iran, Turkey, Iraq, and Syria. Kurdish is recognized as an official language in Iraq. The language comprises several dialects, of which Central Kurdish (Sorani) and Northern Kurdish (Kurmanji) are the main ones. Smaller dialects include Gorani (Hawrami), spoken in parts of Iraq and Iran, and Zazaki, spoken in Turkey. Historically, Kurdish has been written in several scripts, including Cyrillic, Armenian, Latin, and Arabic. The alphabet used for Sorani, the dialect this dataset focuses on, has 36 letters representing vowels and consonants. Despite its rich linguistic heritage and large number of speakers, Kurdish still lacks many essential natural language processing (NLP) resources, including comprehensive datasets.
Against this background, the Kurdish Sentiment Analysis Dataset contributes to natural language processing for low-resource languages, focusing specifically on the Sorani dialect of Kurdish. The initial collection consisted of 30,009 texts gathered from Twitter using the Twitter API 2.0. After processing and refinement, the final dataset contains 24,668 texts (about 82% of those collected), providing a foundation for a range of NLP tasks. The data was collected during the global COVID-19 pandemic, so a significant proportion of the tweets concern COVID-19-related topics and reflect the societal concerns and discussions of that period.
Each tweet was manually labeled by three human annotators along four dimensions: subjectivity, sentiment, offensiveness, and the presence of a specific target. This labeling makes the dataset suitable for training models for tasks such as sentiment analysis, subjectivity detection, and offensive language identification. The dataset fills a gap in language resources for Kurdish, especially the Sorani dialect. Because Kurdish is a low-resource language, the large datasets needed to develop advanced NLP models have historically been lacking. This dataset therefore opens new avenues for research and development in Kurdish language processing and may enable better language technologies for Kurdish speakers.
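As an illustration of how labels from three annotators across the four dimensions might be consolidated, the sketch below applies a strict majority vote per dimension. The description does not state how annotator disagreements were resolved, so the majority rule, the field names, and the label values used here are all assumptions made for the example and may not match the actual files.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(votes: Sequence[str]) -> Optional[str]:
    """Return the label chosen by a strict majority of annotators, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    # Require more than half of the votes; otherwise treat as unresolved.
    return label if count > len(votes) / 2 else None


# Hypothetical votes for one tweet: three annotators, four dimensions.
# Dimension names follow the description; label values are invented.
tweet_votes = {
    "subjectivity": ["subjective", "subjective", "objective"],
    "sentiment": ["negative", "negative", "negative"],
    "offensiveness": ["not_offensive", "offensive", "not_offensive"],
    "target": ["yes", "no", "unsure"],
}

consolidated = {dim: majority_label(v) for dim, v in tweet_votes.items()}
print(consolidated)
```

A dimension with no strict majority (here, the invented three-way split on "target") is returned as None so that such cases can be reviewed or excluded rather than silently assigned a label.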
Categories
Natural Language Processing, Machine Learning, Sentiment Analysis
Licence
Creative Commons Attribution 4.0 International
Version 2
Sorani Tweet Sent Dataset (STSD)
Description
Kurdish belongs to the Indo-Iranian branch of the Indo-European language family and is closely related to Persian. It is spoken by approximately 40 million people living at the crossroads of Iran, Turkey, Iraq, and Syria. Kurdish is recognized as an official language in Iraq. The language comprises several dialects, of which Central Kurdish (Sorani) and Northern Kurdish (Kurmanji) are the main ones. Smaller dialects include Gorani (Hawrami), spoken in parts of Iraq and Iran, and Zazaki, spoken in Turkey. Historically, Kurdish has been written in several scripts, including Cyrillic, Armenian, Latin, and Arabic. The alphabet used for Sorani, the dialect this dataset focuses on, has 36 letters representing vowels and consonants. Despite its rich linguistic heritage and large number of speakers, Kurdish still lacks many essential natural language processing (NLP) resources, including comprehensive datasets.
Against this background, the Kurdish Sentiment Analysis Dataset contributes to natural language processing for low-resource languages, focusing specifically on the Sorani dialect of Kurdish. The initial collection consisted of 30,009 texts gathered from Twitter using the Twitter API 2.0. After processing and refinement, the final dataset contains 24,668 texts (about 82% of those collected), providing a foundation for a range of NLP tasks. The data was collected during the global COVID-19 pandemic, so a significant proportion of the tweets concern COVID-19-related topics and reflect the societal concerns and discussions of that period.
Each tweet was manually labeled by three human annotators along four dimensions: subjectivity, sentiment, offensiveness, and the presence of a specific target. This labeling makes the dataset suitable for training models for tasks such as sentiment analysis, subjectivity detection, and offensive language identification. The dataset fills a gap in language resources for Kurdish, especially the Sorani dialect. Because Kurdish is a low-resource language, the large datasets needed to develop advanced NLP models have historically been lacking. This dataset therefore opens new avenues for research and development in Kurdish language processing and may enable better language technologies for Kurdish speakers.
Categories
Natural Language Processing, Machine Learning, Sentiment Analysis
Licence
Creative Commons Attribution 4.0 International