KHLD: A Large-Scale Benchmark of the Kurdish Handwritten Lines Dataset for Low-Resource Central Kurdish (Sorani)

Published: 22 April 2026 | Version 2 | DOI: 10.17632/ffjb8kfb7m.2
Contributors: Tarik A. Rashid

Description

The Kurdish Handwritten Lines Dataset (KHLD) is a large-scale image dataset intended to support handwritten text recognition (HTR) and optical character recognition (OCR) for Central Kurdish (Sorani dialect). It addresses a resource gap for Kurdish, a language with an estimated 30-50 million speakers, by offering a high-quality, varied set of handwritten line images for training and evaluating machine learning systems. The dataset consists of 47,944 handwritten line images in 4,802 folders, each holding 10 handwriting variations of the same Kurdish sentence. Sentences are 4-7 words long and were drawn from websites, books, articles and social media to cover both formal and everyday language. All writers were native Sorani Kurdish speakers (ages 15-55) recruited at universities and institutes in Erbil, Kurdistan Region, Iraq, and wrote on standardised A4 paper in black or blue ink. Pages were scanned at 600-1,200 DPI, converted to JPEG, and pre-processed by grayscale conversion and resizing. Line segmentation was performed with a fine-tuned YOLOv8 model, with an accuracy of more than 95%, complemented by OpenCV-based template matching. The images are accompanied by a structured XLSX metadata file that contains the typed Kurdish sentence text and the line ID of each sample. The dataset also captures the orthographic peculiarities of Central Kurdish, such as characters specific to its script (e.g., ڤ, ڕ, ڵ, ێ, ۆ) that are not found in Arabic or Persian.

Organization contributor: Artificial Intelligence and Innovation Center (AIIC), University of Kurdistan Hewler, Erbil, KR, Iraq.
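As a rough illustration of how the folder organisation and the Sorani-specific characters described above might be handled in practice, the following is a minimal Python sketch using only the standard library. The directory layout (`<root>/<sentence_folder>/<image>.jpg`) and the function names `index_line_images`, `mean_images_per_folder` and `count_sorani_specific` are assumptions made for illustration and are not documented by the dataset. The character set covers only the five letters named in the description, which are given there as examples and may not be exhaustive.

```python
from collections import Counter
from pathlib import Path

# The five letters named in the description, written as escapes to avoid
# bidirectional-text rendering issues: ڤ ڕ ڵ ێ ۆ
# (the description lists them as examples, so this set may be incomplete).
SORANI_SPECIFIC = frozenset("\u06a4\u0695\u06b5\u06ce\u06c6")


def index_line_images(root):
    """Map each sentence folder name to its sorted list of JPEG line images.

    Assumes a hypothetical layout: <root>/<sentence_folder>/<image>.jpg
    """
    index = {}
    for folder in sorted(Path(root).iterdir()):
        if folder.is_dir():
            index[folder.name] = sorted(
                p for p in folder.iterdir()
                if p.suffix.lower() in {".jpg", ".jpeg"}
            )
    return index


def mean_images_per_folder(index):
    """Average number of line images per sentence folder (0.0 if empty)."""
    if not index:
        return 0.0
    return sum(len(images) for images in index.values()) / len(index)


def count_sorani_specific(text):
    """Count occurrences of the Sorani-specific letters in a transcription."""
    return Counter(ch for ch in text if ch in SORANI_SPECIFIC)
```

With the published totals, 47,944 images over 4,802 folders gives an average of about 9.98 images per folder, so a check such as `mean_images_per_folder(index_line_images("KHLD"))` (where `"KHLD"` is a placeholder path) could help confirm that a download is complete and show which folders, if any, hold fewer than 10 variations.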

Files

Categories

Artificial Intelligence, Computer Vision, Handwriting Recognition, Natural Language Processing, Big Data, Meta-Analysis, Documentary Analysis, Pattern Recognition, Language, Meta Dataset, Transformer-Based Deep Learning, Document Layout Analysis

Licence