A Dataset for the Classification of Different Kurdish Dialects

Published: 1 August 2023 | Version 1 | DOI: 10.17632/srkp2j4v93.1
Karwan Mahdi Rawf


Kurdish is an Indo-Iranian language spoken mainly by people of Kurdish descent in Turkey, Iraq, Iran, and Syria. It comprises several regional dialects. The most widely spoken is Northern Kurdish (also known as Kurmanji or Badini), while Central Kurdish (also known as Sorani) is spoken in parts of Iraq and Iran. Hawrami, a distinct Kurdish dialect often referred to as Gorani, is the primary language of the Hawraman area, which spans portions of western Iran and northeastern Iraq. Although the dialects differ in pronunciation, vocabulary, and some points of grammar, they share a common core. The Kurdish language remains an essential component of Kurdish identity and cultural heritage, and it plays an important role in protecting and promoting the distinct cultural identity of the Kurdish people. Language recognition and dialect recognition are closely related tasks in linguistics and natural language processing. A good dataset for Kurdish dialect recognition supports dialect identification and classification, natural language processing applications for Kurdish, preservation of Kurdish linguistic heritage, cultural insights, customized content and services for users, empowerment of local businesses, and benchmarking of dialect recognition systems.
The dataset was collected over several months by members of the teaching staff of the Computer Science Department at the University of Halabja. The established policies, procedures, and guidelines were followed at every stage of data collection, which included taking the ages and genders of the speakers into account. The recordings are taken from a variety of TV programmes and TV interviews broadcast on Speda TV, NRT, and GK Sat.
The dataset contains 2000 samples of the Sorani dialect, 2000 samples of the Badini dialect, and 2000 samples of the Hawrami dialect. Each sample is exactly one second long, so the total duration of the dataset is 6000 s. The proposed labeling procedure consists of two consecutive stages. First, the recordings of each dialect (Sorani, Badini, and Hawrami) are sorted into separate directories. Second, the files inside each directory are numbered from 1 to 2000 with a dialect prefix: s1 to s2000 for Sorani, b1 to b2000 for Badini, and h1 to h2000 for Hawrami.
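The naming scheme described above can be sketched in Python as follows. This is only an illustrative sketch and not part of the published dataset: the prefixes s, b, and h, the count of 2000 per dialect, and the one-second duration come from the description, while the directory names, the ".wav" extension, and the helper names (expected_names, label_from_stem, build_index) are assumptions made for the example.

```python
import re
from pathlib import Path

# Prefixes and counts follow the dataset description.
# Directory names and the ".wav" extension are assumptions.
PREFIX_TO_DIALECT = {"s": "Sorani", "b": "Badini", "h": "Hawrami"}
SAMPLES_PER_DIALECT = 2000
SAMPLE_SECONDS = 1

STEM_PATTERN = re.compile(r"([sbh])([1-9][0-9]*)")


def expected_names(prefix, count=SAMPLES_PER_DIALECT):
    """Return the file stems prefix1 .. prefixN, e.g. s1 .. s2000."""
    return [f"{prefix}{i}" for i in range(1, count + 1)]


def label_from_stem(stem):
    """Map a file stem such as 'b17' to its dialect name."""
    match = STEM_PATTERN.fullmatch(stem)
    if match is None or int(match.group(2)) > SAMPLES_PER_DIALECT:
        raise ValueError(f"unexpected file name: {stem!r}")
    return PREFIX_TO_DIALECT[match.group(1)]


def build_index(root, extension=".wav"):
    """Collect (path, dialect) pairs for files found under
    root/<Dialect>/<stem><extension> (hypothetical layout)."""
    pairs = []
    for prefix, dialect in PREFIX_TO_DIALECT.items():
        for stem in expected_names(prefix):
            path = Path(root) / dialect / f"{stem}{extension}"
            if path.exists():
                pairs.append((path, dialect))
    return pairs


# Total duration implied by the description: 3 dialects x 2000 samples x 1 s.
total_seconds = len(PREFIX_TO_DIALECT) * SAMPLES_PER_DIALECT * SAMPLE_SECONDS
```

If the downloaded archive uses a different folder layout or audio format, only the path construction in build_index would need to change; the prefix-to-dialect mapping follows the description directly.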



University of Halabja


Linguistics, Speech Analysis, Dialect, Speech Identification, Recognition, Dialectology