A Vast Dataset for Kurdish Digits and Isolated Characters Recognition
Description
Kurdish dialects are spoken across four nation-states in the Middle East, and only one dialect, Sorani, has official status in one of them. Most Kurdish-speaking regions are located in Turkey, Iraq, Iran, and Syria. By some estimates, more than 30 million people speak Kurdish. Central Kurdish (Sorani), one of the two main dialects, is spoken by an estimated 9 to 10 million people. It is mostly written with a 35-character modified Arabic/Persian alphabet, which includes characters that have recently been replaced: for example, (ك) is no longer used in Kurdish and has been replaced with (ک).

This work presents two large datasets of Central Kurdish handwriting: K-ZHMARA for digits and K-PIT for isolated characters. K-ZHMARA contains 70,000 images of Kurdish digits, 7,000 for each of the 10 digits, collected on printed A4 forms with a 10 × 10 grid. K-PIT contains 245,000 images covering all 35 Kurdish characters, 7,000 for each character, collected on printed A4 forms with a 12 × 10 grid. Together, the two datasets contain 315,000 images.

Each completed form was scanned and then processed in Python: the handwritten samples were segmented, cropped, resized, binarized, and inverted using edge detection and other image processing techniques. The forms were filled out by volunteers, most of them students from the University of Halabja and from primary and preparatory schools in the Halabja governorate. The datasets are suitable for optical recognition of isolated handwritten Kurdish digits and characters.

Labeling and organizing: Each image is labeled with an ID number, and each numbered folder in a dataset represents a single digit or character.
For example, folder 02 in the K-PIT dataset holds the letter Alef (ا), and folder 03 in the K-ZHMARA dataset holds the digit three (٣). Each digit or character is stored in a folder named with its ID, and each folder contains 6,000 images of that digit or character for training and 1,000 for testing.
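
The folder-per-class layout described above lends itself to a simple indexing step before training. The following is a minimal sketch, not part of the datasets' own tooling: the function names `index_dataset` and `images_per_class`, the list of image extensions, and the way the training and testing images are arranged inside each ID folder are all assumptions. To stay neutral about that arrangement, the sketch takes the class ID from the nearest enclosing folder whose name is numeric.

```python
from collections import Counter
from pathlib import Path

# Assumption: the images use one of these extensions; adjust to the actual files.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp"}


def class_id_for(path, root):
    """Return the ID of the nearest enclosing numeric folder, or None."""
    folders = path.relative_to(root).parts[:-1]  # drop the file name
    for name in reversed(folders):
        if name.isdigit():
            return int(name)  # e.g. "02" -> 2
    return None


def index_dataset(root):
    """Return a sorted list of (image_path, class_id) pairs found under root."""
    root = Path(root)
    samples = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in IMAGE_EXTENSIONS:
            continue  # skip folders and non-image files
        class_id = class_id_for(path, root)
        if class_id is not None:  # ignore images outside any ID folder
            samples.append((path, class_id))
    return samples


def images_per_class(samples):
    """Count how many images were found for each class ID."""
    return Counter(class_id for _, class_id in samples)
```

Calling `images_per_class(index_dataset(...))` on a local copy gives a quick sanity check against the figures stated in the description: 7,000 images per ID folder, across 10 folders for K-ZHMARA and 35 for K-PIT.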