Wata

Name: Wata
Creator: Ibrahim Hasan
Published: 2026-06-05T21:19:40.560Z
Keywords: Computer Science, Artificial Intelligence, Natural Language Processing

Hasan, Ibrahim; Jamal Abdulhameed Al-Atroshi, Salar

doi:10.17632/yvcxg7dnpw.1

Published: 5 June 2026 | Version 1 | DOI: 10.17632/yvcxg7dnpw.1

Description

The Wata dataset is a semantic text classification dataset for the Sorani dialect of Kurdish. It contains 15,052 annotated text samples collected from publicly accessible social media platforms, including Facebook, Telegram, and TikTok. Each sample is assigned to one of five semantic categories: Normal (0), Romantic (1), Advice (2), Threat (3), and Hate Speech (4). The dataset is provided in CSV format with two columns: text, which holds the Sorani Kurdish text, and label, which holds the corresponding category identifier. Preprocessing included noise removal, spelling verification, normalization, orthographic standardization, linguistic review, and annotation, all intended to improve consistency and usability for Natural Language Processing (NLP) research. The dataset was created to support semantic text classification and related NLP tasks in Sorani Kurdish, a low-resource language with few publicly available benchmark datasets. It contains no missing values and no duplicate text entries, and it may be used for academic and non-commercial research purposes.
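The two-column layout and label mapping described above can be checked with a short script. The following is a minimal sketch, not the authors' tooling: the function names `read_rows` and `summarize` are hypothetical, the sample rows are English placeholders rather than real dataset entries, and a header row with the column names text and label is assumed.

```python
import csv
import io
from collections import Counter

# Label mapping as stated in the dataset description.
LABELS = {0: "Normal", 1: "Romantic", 2: "Advice", 3: "Threat", 4: "Hate Speech"}


def read_rows(handle):
    """Parse CSV rows with 'text' and 'label' columns into (text, label_id) pairs."""
    reader = csv.DictReader(handle)
    return [(record["text"], int(record["label"])) for record in reader]


def summarize(rows):
    """Return per-category counts plus counts of duplicate and empty texts."""
    unknown = sorted({label for _, label in rows if label not in LABELS})
    if unknown:
        raise ValueError(f"unexpected label ids: {unknown}")
    counts = Counter(LABELS[label] for _, label in rows)
    text_counts = Counter(text for text, _ in rows)
    # Every occurrence beyond the first of a given text counts as a duplicate.
    duplicates = sum(n - 1 for n in text_counts.values() if n > 1)
    empty = sum(1 for text, _ in rows if not text.strip())
    return {"counts": dict(counts), "duplicates": duplicates, "empty": empty}


# Placeholder rows (not real dataset content) so the sketch runs without the file.
SAMPLE = "text,label\nsample one,0\nsample two,4\nsample one,0\n"
sample_summary = summarize(read_rows(io.StringIO(SAMPLE)))
print(sample_summary)
```

To run the same check on the downloaded file, open it with `open(path, encoding="utf-8", newline="")` and pass the handle to `read_rows`; the path and the UTF-8 encoding are assumptions, since the description does not state either. If the description holds, the summary should cover 15,052 rows and report zero duplicates and zero empty texts.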
