A Comprehensive Dataset of Complete Kurdish Handwritten Characters and Digits
Description
Kurdish is written in two scripts: a Latin-based script in Turkey and Syria, and an Arabic-based script in Iraq and Iran. This dataset covers only Sorani (Central Kurdish), which uses a modified Arabic script and is written mainly with 34 characters distributed over four different forms. Because the Sorani script is cursive, the shape of each letter changes with its position in a word: isolated, initial, medial, or final. The dataset is intended to cover all of these character forms.
This work introduces two datasets: one covering the full set of Kurdish characters in the Central Kurdish script, and another dedicated to handwritten digits. To collect handwritten samples systematically, each character set was printed on A4 sheets arranged in a grid of 12 × 14 cells. Eighty-four copies of the sets were prepared for each letter and digit so that the data would be varied, and volunteers were given clear instructions. Once the sheets were filled out, they were reviewed for accuracy and then scanned with high-quality equipment to capture the handwriting details. After scanning, Python scripts were used to segment, extract, and resize each character for further analysis.
A total of 123,984 samples were collected, primarily from volunteer students and staff at various universities in Erbil, which adds to the diversity of the dataset. The dataset includes handwritten images of Central Kurdish characters in their isolated, initial, medial, and final forms, along with the digits 0 to 9. Each character form and each digit has 1,008 samples. During processing, every letter or digit was cropped into a separate image, assigned a unique ID, and saved in a dedicated folder. All images were standardized to the same size.
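The sample counts can be cross-checked with a little arithmetic, and the same numbers can be used to confirm that a downloaded copy is complete. The sketch below is illustrative only: it assumes each class is written 12 times on each of the 84 form sets, that the extracted dataset has one subfolder per class, and that the images use the ".jpg" extension. The function names and the folder layout are hypothetical, and the class count of 123 is derived from the published totals rather than stated in the description.

```python
from pathlib import Path

# Figures taken from the dataset description.
POSITIONS_PER_SET = 12      # copies of a class on one form set (assumed)
FORM_SETS = 84              # number of form sets
TOTAL_SAMPLES = 123_984     # total images reported

# Derived values: samples per class and the implied number of classes.
SAMPLES_PER_CLASS = POSITIONS_PER_SET * FORM_SETS
NUM_CLASSES, REMAINDER = divmod(TOTAL_SAMPLES, SAMPLES_PER_CLASS)


def count_images(root, suffix=".jpg"):
    """Return {folder name: image count} for each class folder under root.

    Assumes a hypothetical layout with one subfolder per class.
    """
    root = Path(root)
    return {
        folder.name: sum(
            1 for f in folder.iterdir() if f.suffix.lower() == suffix
        )
        for folder in sorted(root.iterdir())
        if folder.is_dir()
    }


def find_incomplete(counts, expected=SAMPLES_PER_CLASS):
    """Return only the folders whose image count differs from expected."""
    return {name: n for name, n in counts.items() if n != expected}


if __name__ == "__main__":
    print("samples per class:", SAMPLES_PER_CLASS)
    print("implied classes:", NUM_CLASSES, "remainder:", REMAINDER)
```

Calling `find_incomplete(count_images("path/to/dataset"))` on a local copy would list any class folder that does not hold the expected number of images; an empty result suggests the copy is complete under these assumptions.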
Assigning Labels and Arrangements
Each image is labeled with three numbers separated by underscores. The first number is the letter's ID, based on its alphabetical order. The second number is the form set, one of 84 sets representing different groups of writers. The third number is the character's or digit's position in the form, from 1 to 12 in left-to-right order. For example, an image labeled "37_13_5.jpg" is the initial form of Pe (پـ), comes from form set 13, and occupies the fifth position in the form. Each letter or digit is stored in its own folder containing 1,008 images.
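The file-naming scheme can be decoded in a few lines of Python. The following is a minimal sketch, not the authors' own tooling: the names `SampleLabel` and `parse_label` are hypothetical, and the range check on the form set assumes the sets are numbered 1 to 84, which the description does not state explicitly.

```python
from pathlib import Path
from typing import NamedTuple


class SampleLabel(NamedTuple):
    class_id: int   # letter or digit ID (first number)
    form_set: int   # form set the sample came from (second number)
    position: int   # position within the form, 1 to 12 (third number)


def parse_label(filename: str) -> SampleLabel:
    """Split a name such as '37_13_5.jpg' into its three numeric fields."""
    stem = Path(filename).stem          # drop the directory and extension
    parts = stem.split("_")
    if len(parts) != 3:
        raise ValueError(f"expected three fields in {filename!r}")
    # int() raises ValueError if a field is not numeric.
    class_id, form_set, position = (int(p) for p in parts)
    # Assumption: form sets are numbered 1..84.
    if not 1 <= form_set <= 84:
        raise ValueError(f"form set out of range in {filename!r}")
    if not 1 <= position <= 12:
        raise ValueError(f"position out of range in {filename!r}")
    return SampleLabel(class_id, form_set, position)


if __name__ == "__main__":
    print(parse_label("37_13_5.jpg"))
```

Because the second field identifies the form set, it could also be used to split the data by writer group, so that samples from the same set do not appear in both training and test partitions.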