A Vast Dataset for Kurdish Digits and Isolated Characters Recognition

Name: A Vast Dataset for Kurdish Digits and Isolated Characters Recognition
Creator: Peshraw Ahmed Abdalla
Published: 2022-12-22T11:47:57.801Z
Keywords: Optical Character Recognition, Handwriting Recognition, Image Database

Abdalla, Peshraw Ahmed; Jabar, Abdalla Taha; Salam, Ali Abdalla; Hama Amin, Hedi Hamid

doi:10.17632/zb66pp7vjh.1

Published: 22 December 2022 | Version 1 | DOI: 10.17632/zb66pp7vjh.1

Contributors:

Peshraw Ahmed Abdalla, Abdalla Taha Jabar, Ali Abdalla Salam, Hedi Hamid Hama Amin

Description

Kurdish dialects are spoken across four main nation-states in the Middle East, and only one dialect, Sorani, has official status in one of them. The majority of Kurdish-speaking regions are located in Turkey, Iraq, Iran, and Syria. According to estimates, more than 30 million people speak Kurdish. Central Kurdish (Sorani), one of the two main dialects, is spoken by an estimated 9 to 10 million people. It is mostly written in a modified Arabic/Persian alphabet of 35 characters. Some characters have recently been replaced; for example, (ك) is no longer used in Kurdish and has been replaced with (ک).

This work presents two large datasets of handwritten Central Kurdish digits and isolated characters, named K-ZHMARA and K-PIT. The first dataset, K-ZHMARA, contains 70,000 images of Kurdish digits, 7,000 per digit; the data were collected on printed A4 forms with a 10 × 10 grid. The second dataset, K-PIT, contains 245,000 images covering all Kurdish characters, 7,000 per character; its data were collected on printed A4 forms with a 12 × 10 grid. Together, the two datasets contain 315,000 images. Each completed form was scanned and then, using Python with edge detection and image processing techniques, segmented, cropped, resized, binarized, and inverted. The forms were filled out by volunteers, most of them students from the University of Halabja and from primary and preparatory schools in the Halabja governorate. These datasets are suitable for recognition of isolated handwritten Kurdish digits and characters.

Labeling and organizing: Each image is labeled with an ID number, and each numbered folder in a dataset represents a single digit or character. For example, folder 02 in the K-PIT dataset holds the letter Alef (ا), and folder 03 in the K-ZHMARA dataset holds the digit three (٣). Each digit or character is stored in a folder named with its ID, and each folder contains 6,000 images of that letter or digit for training and 1,000 images for testing.
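
The preprocessing described above (segmenting a scanned form into grid cells, then binarizing and inverting each cell) can be illustrated with a minimal sketch. This is not the authors' code: it assumes a perfectly aligned, uniform grid and a fixed threshold of 128 in place of the edge detection mentioned in the description, it omits the cropping and resizing steps, and the names `split_into_cells` and `binarize_and_invert` are hypothetical.

```python
from typing import List

# An image is a list of rows of 8-bit grayscale values (0 = black, 255 = white).
Image = List[List[int]]


def split_into_cells(page: Image, rows: int, cols: int) -> List[Image]:
    """Cut a scanned form into rows x cols equally sized cells, row by row.

    Assumption: the grid is uniform and fills the whole page; the original
    pipeline located the cells with edge detection instead.
    """
    cell_h = len(page) // rows
    cell_w = len(page[0]) // cols
    cells = []
    for r in range(rows):
        for c in range(cols):
            cell = [row[c * cell_w:(c + 1) * cell_w]
                    for row in page[r * cell_h:(r + 1) * cell_h]]
            cells.append(cell)
    return cells


def binarize_and_invert(cell: Image, threshold: int = 128) -> Image:
    """Map dark ink to 255 and light paper to 0 (threshold is an assumption)."""
    return [[255 if px < threshold else 0 for px in row] for row in cell]


# Toy 4 x 4 "page" holding a 2 x 2 grid, with one dark pixel in the
# top-left cell and one in the bottom-right cell.
page = [
    [0, 255, 255, 255],
    [255, 255, 255, 255],
    [255, 255, 255, 255],
    [255, 255, 255, 0],
]
cells = [binarize_and_invert(c) for c in split_into_cells(page, rows=2, cols=2)]
print(len(cells))  # 4
print(cells[0])    # [[255, 0], [0, 0]]
```

For a K-ZHMARA form the same call would use `rows=10, cols=10`, giving 100 cells per sheet. A real pipeline would more likely rely on an image library for thresholding and resizing.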

Institutions

University of Halabja
