An extensive dataset of Handwritten Central Kurdish Isolated characters.

Published: 22-03-2021 | Version 2 | DOI: 10.17632/f8z9jts5nb.2
Contributors:
Rebin M. Ahmed,
Tarik Rashid,
Polla Fattah

Description

Data collection: The first step in building a database is finding a suitable source of data. Here, the main goal was to collect images of Kurdish handwritten characters written by many writers, so a form was designed for this purpose (Figure 1). Each form covers one letter of the alphabet, printed in the top right corner, and has 125 empty blocks. The writers were asked to write each letter three times, in three empty blocks. The total number of writers is 390. The forms were distributed among two main categories of writers: the academic staff of the Information Technology department at Tishk International University, and university students of the University of Kurdistan-Hawler, Salahaddin University, and Tishk International University, as shown in Table 2.
In total there were ten sets of forms, each set with 35 forms for 35 different letters. At first, we decided that nine sets, which would give us at least 1100 images for each letter, were the best option for the time that we had. However, there were some problems with the collection process. The first prints of the forms contained mix-ups: in Set 2, for instance, there were two forms for the letter (چ) and none for (ج). Since we printed and distributed the forms at the same time, we were not aware of this problem until the pre-processing stage. This created an inconsistency in the number of samples per letter: for example, by the 9th set we had 504 images of the letter (ڤ), far fewer than the other letters, which had at least 1000 images each.
So we decided to add a 10th set as a complement to the other sets. It contained only the letters that were missing in the first nine sets of forms, which were (ز،ژ،ش،غ،ڤ،ق،ک،ل،ن،ی), as detailed in Table 3. In Table 3, the first column is the letter and columns 2-11 give the number of images gathered in each set; the first row is the header and rows 2-36 are the letters; the last row and last column give the totals for each set and each letter, respectively.
Labeling and Organizing: Each image is labeled with three numbers separated by underscores. The first number is the ID of the letter according to its position in alphabetical order, as shown in Table 4. The second number is the number of the set of forms; there were 10 sets, each given to a specific group of writers. The third number is the order of the character in the form, which is between 1 and 126. For example, in the label 02_01_94.jpg, 02 is the ID of the letter, in this case Alef (١); 01 is the set number, here Set 1, which was given to 4th-grade students of the Information Technology department at Tishk International University; and 94 is the position of that image in the form. The images of each letter are stored in a folder named with the letter's ID, with each folder containing approximately 1134 images of that letter.
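
The labeling scheme described above can be decoded programmatically. The following is a minimal Python sketch, not part of the dataset itself: the names `ImageLabel`, `parse_label`, and `LABEL_PATTERN` are hypothetical, and the range checks (sets 1 to 10, positions 1 to 126) are assumptions taken from the description rather than verified against every file.

```python
import re
from typing import NamedTuple

# Assumed file-name shape, based on the example "02_01_94.jpg":
# <letter ID>_<set number>_<position in form>.jpg
LABEL_PATTERN = re.compile(r"^(\d+)_(\d+)_(\d+)\.jpg$", re.IGNORECASE)


class ImageLabel(NamedTuple):
    """Hypothetical container for the three numbers in an image label."""
    letter_id: int    # ID of the letter (see Table 4 of the description)
    set_number: int   # which of the 10 sets of forms the image came from
    position: int     # order of the character within the form


def parse_label(filename: str) -> ImageLabel:
    """Split a label such as '02_01_94.jpg' into its three numeric parts."""
    match = LABEL_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not a valid image label: {filename!r}")
    letter_id, set_number, position = (int(part) for part in match.groups())
    # Ranges assumed from the description: 10 sets, positions 1 to 126.
    if not 1 <= set_number <= 10:
        raise ValueError(f"set number out of range: {set_number}")
    if not 1 <= position <= 126:
        raise ValueError(f"position out of range: {position}")
    return ImageLabel(letter_id, set_number, position)


if __name__ == "__main__":
    print(parse_label("02_01_94.jpg"))
```

The letter ID is left unchecked here because the ID-to-letter mapping is given in Table 4, which the sketch does not reproduce.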

Files