Kurdish News Dataset Headlines (KNDH) through Multiclass Classification

Published: 13 January 2023 | Version 2 | DOI: 10.17632/kb7vvkg2th.2


Data Description

The Kurdish language belongs to the Indo-Iranian branch of the Indo-European language family and is a close relative of Persian. Its speakers live in the region where Iran, Turkey, Iraq, and Syria meet. Kurdish is one of the official languages of Iraq and has regional status in Iran. The language has 40 million speakers [2,11]. Central Kurdish (Sorani) and Northern Kurdish (Kurmanji) are two of the main dialects of the Kurdish language [3]. There are also minor dialects, such as Gorani (Hawrami), spoken in some areas of Iraq and Iran, and Zazaki, which is used in Turkey [4]. Historically, several scripts have been used to write Kurdish, namely Cyrillic, Armenian, Latin, and Arabic. The dataset is in the Sorani dialect, whose alphabet has 36 letters, comprising vowels and consonants [5], as shown in Table 1.

Dataset Labeling

Dataset labeling significantly affects machine learning and deep learning tools. Datasets can be labeled in three ways. The first is manual labeling, in which humans read and understand the texts. The second is automatic labeling, which uses pre-trained annotation models to annotate the text. The third, semi-automatic labeling, combines human and automatic labeling. This work uses automatic labeling, so the annotation process is independent of human effort. ParsHub's automatic category extraction determines the category under which each news item was published. In other words, it uses the tags written under each news headline.
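The tag-based labeling described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the record layout, the field names `headline` and `tag`, the sample values, and the helper functions are all hypothetical, and the sketch assumes only that each scraped headline arrives together with the category tag shown under it.

```python
from collections import Counter

# Hypothetical scraped records: each pairs a headline with the category tag
# found under it on the news site. Field names and values are illustrative.
records = [
    {"headline": "headline A", "tag": "Sport"},
    {"headline": "headline B", "tag": " health "},
    {"headline": "headline C", "tag": "sport"},
    {"headline": "headline D", "tag": ""},
]


def label_from_tag(tag):
    """Normalize a raw tag into a class label, or None if the tag is empty."""
    cleaned = tag.strip().lower()
    return cleaned or None


def label_records(rows):
    """Attach a label to every row that carries a usable tag."""
    labeled_rows = []
    for row in rows:
        label = label_from_tag(row["tag"])
        if label is not None:
            labeled_rows.append({"headline": row["headline"], "label": label})
    return labeled_rows


labeled = label_records(records)

# Per-class counts give a quick check of class balance before training.
class_counts = Counter(row["label"] for row in labeled)
print(class_counts)
```

Rows without a usable tag are dropped rather than guessed, which keeps the labels free of human judgment, in line with the automatic approach described above.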



University of Halabja


Machine Learning, Kurd, Text Mining