FIPU-OCR-CHAR: Font-Invariant Printed Urdu Character Dataset
Description
The FIPU-OCR-CHAR dataset is a large-scale, font-invariant corpus of printed Urdu characters designed to support research in optical character recognition, font generalization, and script analysis. The dataset contains 337,680 labeled images across 48 Urdu character classes: 38 letters of the alphabet and 10 numerals. Each character was rendered in 201 diverse Urdu font styles and further transformed using 34 augmentation operations to simulate real-world printing, scanning, and distortion conditions. Images are stored as 28×28 PNG files with 24-bit depth. Preliminary experiments with ResNet-34 suggest that high font diversity and augmentation variety improve the robustness and generalization capability of OCR models. The dataset was generated programmatically from font-rendered characters and processed through controlled augmentation pipelines, producing consistent and balanced samples suitable for training, validation, and benchmarking. It is intended as a foundational resource for building and evaluating deep learning models for Urdu OCR and as a baseline for future character-, word-, and line-level datasets.
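The figures quoted in the description can be checked against each other with a few lines of arithmetic. The sketch below is illustrative only. It assumes, which the description does not state, that each font-rendered character is kept once in unaugmented form alongside its 34 augmented versions. Under that assumption the counts reproduce the reported total exactly, with every class receiving the same number of images. The variable names are hypothetical and do not come from the dataset or its README.

```python
# Figures quoted in the dataset description.
NUM_CLASSES = 38 + 10        # 38 letters + 10 numerals = 48 classes
NUM_FONTS = 201              # font styles per character
NUM_AUGMENTATIONS = 34       # augmentation operations per rendered character
REPORTED_TOTAL = 337_680     # total labeled images reported

# ASSUMPTION (not stated in the description): each rendered character is
# stored once unaugmented in addition to its 34 augmented versions.
variants_per_render = NUM_AUGMENTATIONS + 1

# Images per class if every class is rendered in every font.
images_per_class = NUM_FONTS * variants_per_render

# Total images across all classes.
total_images = NUM_CLASSES * images_per_class

print("images per class:", images_per_class)
print("total images:", total_images)
print("matches reported total:", total_images == REPORTED_TOTAL)
```

If the README describes a different breakdown, for example per-class exclusions or a different treatment of the unaugmented renders, the constants above should be adjusted to match it.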
Files
Steps to reproduce
See the README file.
Institutions
- NED University of Engineering and Technology