KuSarcasm: Automated Kurdish Sorani Sarcasm Dataset (KSSD)

Name: KuSarcasm: Automated Kurdish Sorani Sarcasm Dataset (KSSD)
Creator: Shakhawan Hashim
Published: 2025-05-21T12:12:13.197Z
Keywords: Annotation, Deep Learning, Sentiment Analysis

Hashim, Shakhawan; Nabi, Rebwar

doi:10.17632/3kscrg5y4y.2

Published: 21 May 2025 | Version 2 | DOI: 10.17632/3kscrg5y4y.2

Contributors:

Shakhawan Hashim, Rebwar Nabi

Description

This study presents KuSarcasm, an automated Kurdish Sorani Sarcasm Dataset (KSSD). KuSarcasm is a comprehensive dataset developed for detecting sarcasm in Kurdish Sorani, a low-resource language with rich morphological complexity and limited Natural Language Processing (NLP) support. The dataset was constructed through a multi-stage data collection and annotation process guided by linguistic consultation and methodological rigor. Initial data was sourced from a wide range of Kurdish cultural materials, including proverbs, poems, and idiom texts extracted from Sekhurma Magazine, digital publications, and online repositories. Extensive consultations with Kurdish language experts, editors, and scholars were conducted to establish annotation rules and refine the dataset's cultural and contextual relevance.
Data acquisition combined manual and automated methods. Publicly available texts were extracted using optical character recognition (OCR). Additional data was gathered through web scraping, manual recording, and structured queries, resulting in over 16,000 text entries. These texts then went through an in-depth preprocessing pipeline that included deduplication, normalization, and noise reduction.
Automatic annotation was carried out using a custom hybrid method that combines multilingual sentiment classification by Multilingual Bidirectional Encoder Representations from Transformers (mBERT) with semantic similarity scoring by Sentence-BERT (SBERT). This rule-guided annotation strategy relied on over 100 predefined linguistic patterns to distinguish sarcastic from non-sarcastic expressions based on both emotional polarity and semantic proximity. The resulting dataset is annotated for binary sarcasm classification and includes metadata such as source, matched rule, and sentiment category.
Given its depth, diversity, and cultural alignment, KuSarcasm holds strong reuse potential for researchers working in NLP for underrepresented languages, sentiment analysis, and computational linguistics. It also offers a valuable foundation for developing and benchmarking deep learning models in low-resource settings.
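
The preprocessing and rule-guided annotation steps described above can be illustrated with a minimal sketch. This is not the authors' implementation. The character map, the URL filter, the rule fields (`id`, `pattern`, `sentiment`, `label`), the 0.7 threshold, and the default label are all hypothetical, and `sentiment_fn` and `embed_fn` are placeholders for whatever mBERT sentiment classifier and SBERT encoder are actually used.

```python
import math
import re
import unicodedata

# Hypothetical map: Arabic kaf and yeh replaced by the forms used in Sorani text.
CHAR_MAP = str.maketrans({"\u0643": "\u06A9", "\u064A": "\u06CC"})


def normalize(text):
    """Unify Unicode forms, map characters, drop URLs, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).translate(CHAR_MAP)
    text = re.sub(r"https?://\S+", " ", text)  # assumed noise filter
    return re.sub(r"\s+", " ", text).strip()


def preprocess(entries):
    """Normalize every entry and keep the first copy of each distinct text."""
    seen, cleaned = set(), []
    for raw in entries:
        text = normalize(raw)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def annotate(text, sentiment_fn, embed_fn, rules, threshold=0.7):
    """Label a text from its sentiment and its closest matching rule pattern."""
    sentiment = sentiment_fn(text)  # stand-in for the sentiment classifier
    vector = embed_fn(text)         # stand-in for the sentence encoder
    best_rule, best_score = None, 0.0
    for rule in rules:
        if rule["sentiment"] != sentiment:
            continue  # a rule applies only when the polarity agrees
        score = cosine(vector, embed_fn(rule["pattern"]))
        if score > best_score:
            best_rule, best_score = rule, score
    matched = best_rule is not None and best_score >= threshold
    return {
        "text": text,
        "sentiment": sentiment,
        "matched_rule": best_rule["id"] if matched else None,
        # Assumption: texts that match no rule default to non-sarcastic.
        "label": best_rule["label"] if matched else "non-sarcastic",
    }
```

Passing the models in as functions keeps the decision logic separate from the models, so the sketch can be exercised with toy stand-ins before any real classifier or encoder is plugged in. The returned dictionary mirrors the metadata fields named in the description (matched rule and sentiment category) alongside the binary label.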

Institutions

Sulaimani Polytechnic University
