Wata

Published: 5 June 2026 | Version 1 | DOI: 10.17632/yvcxg7dnpw.1
Contributors:

Description

Wata is a semantic text classification dataset for the Sorani Kurdish dialect. It contains 15,052 annotated text samples collected from publicly accessible social media platforms, including Facebook, Telegram, and TikTok. Each sample is assigned to one of five semantic categories: Normal (0), Romantic (1), Advice (2), Threat (3), and Hate Speech (4). The dataset is provided in CSV format with two columns: text, which holds the Sorani Kurdish text, and label, which holds the corresponding category identifier. Preprocessing included noise removal, spelling verification, normalization, orthographic standardization, linguistic review, and annotation, with the aim of improving consistency and usability for Natural Language Processing (NLP) research. The dataset was created to support semantic text classification and related NLP tasks in Sorani Kurdish, which is low-resource and has few publicly available benchmark datasets. It contains no missing values and no duplicate text entries. The dataset may be used for academic and non-commercial research purposes.
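The structure described above (a text column, a label column, five label identifiers, no missing values, no duplicate texts) can be checked with a few lines of Python. The following is a minimal sketch rather than an official loader: the function names read_rows and summarize are hypothetical, and the in-memory sample uses placeholder strings, not actual dataset content.

```python
import csv
import io
from collections import Counter

# Label identifiers as listed in the dataset description.
LABELS = {0: "Normal", 1: "Romantic", 2: "Advice", 3: "Threat", 4: "Hate Speech"}


def read_rows(fileobj):
    """Read (text, label) pairs from a CSV that has 'text' and 'label' columns."""
    rows = []
    for record in csv.DictReader(fileobj):
        label = int(record["label"])
        if label not in LABELS:
            raise ValueError(f"unexpected label: {label}")
        rows.append((record["text"], label))
    return rows


def summarize(rows):
    """Count rows, empty texts, duplicate texts, and samples per category."""
    texts = [text for text, _ in rows]
    counts = Counter(label for _, label in rows)
    return {
        "n_rows": len(rows),
        "n_empty": sum(1 for text in texts if not text.strip()),
        "n_duplicates": len(texts) - len(set(texts)),
        "per_label": {LABELS[k]: v for k, v in sorted(counts.items())},
    }


# Placeholder rows for illustration only (not taken from the dataset).
sample = io.StringIO("text,label\nsample a,0\nsample b,4\nsample c,0\n")
summary = summarize(read_rows(sample))
print(summary)
```

To run the same checks on the downloaded file, pass an open file handle instead of the sample, for example open("wata.csv", encoding="utf-8", newline=""), where the file name is an assumption. If the description holds, n_rows should be 15,052 and both n_empty and n_duplicates should be 0.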

Files

Categories

Computer Science, Artificial Intelligence, Natural Language Processing

Licence