KuSarcasm: Automated Kurdish Sorani Sarcasm Dataset (KSSD)
Description
This study presents KuSarcasm, the automated Kurdish Sorani Sarcasm Dataset (KSSD). KuSarcasm is a comprehensive dataset for detecting sarcasm in Kurdish Sorani, a low-resource language with rich morphological complexity and limited Natural Language Processing (NLP) support. The dataset was constructed through a multi-stage data collection and annotation process guided by consultation with linguists. Initial data was sourced from a wide range of Kurdish cultural materials, including proverbs, poems, and idiom texts extracted from Sekhurma Magazine, digital publications, and online repositories. Kurdish language experts, editors, and scholars were consulted extensively to establish the annotation rules and to refine the dataset's cultural and contextual relevance. Data acquisition combined manual and automated methods. Publicly available texts were extracted using optical character recognition (OCR). Additional data was gathered through web scraping, manual data entry, and structured queries, resulting in over 16,000 text entries. These texts were then passed through a preprocessing pipeline that included deduplication, normalization, and noise reduction. Automatic annotation was carried out using a custom hybrid method that combines multilingual sentiment classification with Multilingual Bidirectional Encoder Representations from Transformers (mBERT) and semantic similarity scoring with Sentence-BERT (SBERT). This rule-guided annotation strategy relied on over 100 predefined linguistic patterns to distinguish sarcastic from non-sarcastic expressions based on both emotional polarity and semantic proximity. The resulting dataset is annotated for binary sarcasm classification and includes metadata such as source, matched rule, and sentiment category.
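As an illustration of the preprocessing steps named above (deduplication, normalization, and noise reduction), the following is a minimal sketch and not the authors' pipeline. The character mappings, the URL pattern, and the function names `normalize` and `preprocess` are assumptions introduced here; the actual rules used for KuSarcasm are not specified in this description.

```python
import re
import unicodedata

# Assumed mappings: Arabic-script variants that are often unified in Sorani text.
CHAR_MAP = str.maketrans({
    "\u0643": "\u06A9",  # Arabic kaf -> keheh
    "\u064A": "\u06CC",  # Arabic yeh -> Farsi yeh
    "\u0640": "",        # tatweel (elongation mark) dropped
})

URL_RE = re.compile(r"https?://\S+|www\.\S+")
SPACE_RE = re.compile(r"\s+")
ZWNJ = "\u200c"  # zero-width non-joiner, meaningful in Sorani orthography


def normalize(text):
    """Normalize one entry: Unicode form, character variants, URLs, whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(CHAR_MAP)
    text = URL_RE.sub(" ", text)      # noise reduction: drop URLs
    text = SPACE_RE.sub(" ", text)    # collapse runs of whitespace
    # Drop control/format characters, but keep the ZWNJ.
    text = "".join(
        ch for ch in text
        if ch == ZWNJ or not unicodedata.category(ch).startswith("C")
    )
    return text.strip()


def preprocess(entries):
    """Normalize every entry, then drop empty strings and exact duplicates."""
    seen = set()
    cleaned = []
    for raw in entries:
        text = normalize(raw)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned
```

Deduplicating after normalization, rather than before, means entries that differ only in spacing or character variants collapse into a single record.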
Given its depth, diversity, and cultural alignment, KuSarcasm holds strong reuse potential for researchers working in NLP for underrepresented languages, sentiment analysis, and computational linguistics. It also offers a valuable foundation for developing and benchmarking deep learning models in low-resource settings.
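The hybrid annotation described above combines a sentiment label with a semantic similarity score against predefined patterns. The sketch below shows one possible reading of that idea under stated assumptions: embeddings and sentiment labels are taken as already computed (for example by SBERT and mBERT models), and the `Rule` structure, the `annotate` function, the polarity-match condition, and the 0.7 threshold are all hypothetical. It is not the authors' implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical rule: a reference pattern plus the sentiment it expects."""
    name: str
    embedding: list  # vector of the pattern's reference phrase
    polarity: str    # e.g. "positive", "negative", "neutral"


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def annotate(text_embedding, sentiment, rules, threshold=0.7):
    """Label a text as sarcastic (1) if its closest rule is similar enough
    and the rule's expected polarity matches the predicted sentiment."""
    best_rule, best_sim = None, -1.0
    for rule in rules:
        sim = cosine(text_embedding, rule.embedding)
        if sim > best_sim:
            best_rule, best_sim = rule, sim

    matched = (
        best_rule is not None
        and best_sim >= threshold          # assumed similarity cut-off
        and best_rule.polarity == sentiment
    )
    return {
        "label": 1 if matched else 0,
        "matched_rule": best_rule.name if matched else None,
        "sentiment": sentiment,
        "similarity": best_sim,
    }
```

The returned dictionary mirrors the metadata fields the description mentions (matched rule and sentiment category) next to the binary label, so each annotation decision remains traceable to the rule that produced it.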
Files
Institutions
- Sulaimani Polytechnic University