FIPU-OCR-CHAR: Font-Invariant Printed Urdu Character Dataset

Name: FIPU-OCR-CHAR: Font-Invariant Printed Urdu Character Dataset
Creator: Fauzia Yasir
Published: 2025-12-05T00:06:28.342Z
Keywords: Optical Character Recognition, Natural Language Processing, Urdu Language, Deep Learning

Yasir, Fauzia; Kazmi, Majida; Kidwai, Hamza Munir; Jawwad, Yousuf; Qazi, Saad Ahmed

doi:10.17632/9cdk8y89v6.1

Published: 5 December 2025 | Version 1 | DOI: 10.17632/9cdk8y89v6.1

Description

The FIPU-OCR-CHAR dataset is a large-scale, font-invariant corpus of printed Urdu characters designed to support research in optical character recognition, font generalization, and script analysis. The dataset contains 337,680 labeled images across 48 Urdu character classes: 38 letters and 10 numerals. Each character was rendered in 201 diverse Urdu font styles and further transformed using 34 augmentation operations to simulate real-world printing, scanning, and distortion conditions. Images were rendered as 28×28 PNG files with 24-bit depth. Preliminary experiments with ResNet-34 suggest that high font diversity and augmentation variety improve the robustness and generalization capability of OCR models. The dataset was generated programmatically from font-rendered characters and processed through controlled augmentation pipelines, producing consistent and balanced samples suitable for training, validation, and benchmarking. It is intended as a foundational resource for building and evaluating deep learning models for Urdu OCR and as a baseline for future character-, word-, and line-level datasets.
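The stated counts can be cross-checked with a short calculation. The sketch below is not part of the dataset or its tooling. It assumes, which the description does not say explicitly, that every class/font pair keeps its unaugmented render alongside the 34 augmented variants, giving 35 images per pair. Under that assumption the product matches the reported total, and the per-class count follows from the description's claim of balanced samples. All names in the code are illustrative.

```python
# Counts taken from the dataset description.
NUM_CLASSES = 38 + 10        # letters + numerals
NUM_FONTS = 201
NUM_AUGMENTATIONS = 34
REPORTED_TOTAL = 337_680

# Assumption (not stated in the description): the unaugmented render is
# kept in addition to the 34 augmented variants of each class/font pair.
VARIANTS_PER_RENDER = NUM_AUGMENTATIONS + 1


def expected_total(classes: int, fonts: int, variants: int) -> int:
    """Number of images if every class appears in every font and variant."""
    return classes * fonts * variants


total = expected_total(NUM_CLASSES, NUM_FONTS, VARIANTS_PER_RENDER)
per_class = NUM_FONTS * VARIANTS_PER_RENDER

print("expected total:", total)
print("matches reported total:", total == REPORTED_TOTAL)
print("images per class (if balanced):", per_class)
```

Counting only the 34 augmented variants per pair does not reproduce the reported total, which is why the sketch includes the original render.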
