RAVI: Synthetic Urdu Text Image Dataset for OCR

Name: RAVI: Synthetic Urdu Text Image Dataset for OCR
Creator: Hafsa Hafeez Siddiqui
Published: 2025-06-16T23:11:39.543Z
Keywords: Artificial Intelligence, Computer Vision, Data Science, Optical Character Recognition, Natural Language Processing

Siddiqui, Hafsa Hafeez; Yamann, Syed; Iftikhar, Muneeza

doi:10.17632/mhy5vxnths.1

Published: 16 June 2025 | Version 1 | DOI: 10.17632/mhy5vxnths.1

Contributors:

Hafsa Hafeez Siddiqui, Syed Yamann, Muneeza Iftikhar

Description

The RAVI dataset is a synthetic image dataset designed to support the development and training of Urdu OCR (Optical Character Recognition) models. It consists of 99,000 images of 256x256 pixels, each containing a single Urdu word rendered in black text on a white background. Each image is labeled with its corresponding Urdu word, enabling both supervised training and evaluation of word-level OCR systems.

The text is rendered in the "Jameel Noori Nastaleeq" font, a widely used Nastaliq-style Urdu font, at font size 40. The dataset is organized into subfolders corresponding to the letters of the Urdu alphabet, which simplifies categorization, retrieval, and per-letter evaluation of model performance.

The dataset is aimed at researchers and developers working on CNN-based OCR systems for printed Urdu text, and it may also support future work on handwritten text recognition. It can serve as a benchmark for word-level OCR models, sequence prediction architectures, and other deep learning applications in low-resource languages.

Key Features:
- 99,000 annotated images
- Image resolution: 256x256 pixels
- Black Urdu text on white background
- Font: Jameel Noori Nastaleeq, size 40
- Organized alphabetically by Urdu letters
- Suitable for training, validation, and benchmarking of OCR systems
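Because the images are grouped into per-letter subfolders and labeled by word, the dataset can be indexed with standard-library Python alone. The sketch below is illustrative only: the `.png` extension, the assumption that each file name (minus extension) is the word label, and the helper names `index_dataset` and `per_letter_counts` are hypothetical and not confirmed by the dataset's authors.

```python
from collections import Counter
from pathlib import Path


def index_dataset(root, pattern="*.png"):
    """Return (image_path, label, letter_folder) for every image under root."""
    samples = []
    for path in sorted(Path(root).rglob(pattern)):
        # Assumption: the file name stem is the Urdu word label and the
        # parent folder name is the word's starting letter.
        samples.append((path, path.stem, path.parent.name))
    return samples


def per_letter_counts(samples):
    """Count how many images fall under each letter folder."""
    return Counter(letter for _, _, letter in samples)
```

The per-letter counts can then be used to check class balance or to report recognition accuracy separately for each starting letter.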

Steps to reproduce

Data Collection Methodology

The dataset was synthetically generated using a controlled, reproducible workflow in Python, tailored to Urdu word-level OCR tasks. The complete data generation pipeline is outlined below.

1. Language Script Handling
Library Used: arabic_reshaper
The Arabic Reshaper library was used to render Urdu text correctly. Urdu uses a modified form of the Arabic script with context-sensitive character shapes (initial, medial, final, isolated), so reshaping was necessary to display connected Urdu words correctly. This is a crucial preprocessing step for any script written in the Nastaliq style.

2. Text-to-Image Rendering
Font Used: Jameel Noori Nastaleeq, one of the most widely used and visually natural fonts for Urdu script. It was installed manually and accessed using Python's PIL (Python Imaging Library).
Font Size: 40 pt
Image Dimensions: 256 × 256 pixels
Text Color: Black (RGB(0,0,0))
Background Color: White (RGB(255,255,255))
Rendering Library: Python Imaging Library (PIL) / Pillow
Each reshaped word was rendered at the center of a white canvas using PIL.

3. Dataset Organization
Data Volume: 99,000 images
Labeling Protocol: Each image is labeled with the corresponding Urdu word. File names match the target label, and a metadata file (e.g., CSV or JSON) maps each image to its textual annotation.
Folder Structure: Images are grouped into directories based on the first letter of each word (according to the Urdu alphabet). This structure supports better organization and analysis (e.g., per-letter performance).

4. Word List Source
A curated list of frequently used Urdu words was compiled, grouped by starting letter of the Urdu alphabet. Care was taken to ensure:
- Natural, commonly spoken vocabulary.
- Coverage across the Urdu alphabet for linguistic diversity.

5. Environment & Tools
Programming Language: Python 3.x
Libraries & Packages: arabic_reshaper, Pillow (PIL), matplotlib, numpy (for visualization and batch generation)
Platform: Windows 10 / Ubuntu Linux (cross-compatible)
Execution Method: Jupyter Notebook / Python scripts for batch processing

6. Reproducibility
The entire dataset generation process is deterministic and script-driven. To reproduce the dataset:
- Install the required Python libraries.
- Load or define a list of Urdu words.
- Apply arabic_reshaper to each word.
- Render the reshaped text into images using PIL and save them with appropriate labels.
- Organize the images into folders based on initial Urdu characters.
Optionally, the generation scripts and configuration files can be shared to allow other researchers to generate additional variations (e.g., different fonts, colors, or sizes) for robustness testing.
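The reproduction steps can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the authors' script: the helper names (`centered_origin`, `output_path`, `generate_dataset`), the `.png` extension, the `labels.csv` file name and its columns, and the use of the word's first character as the folder name are all hypothetical. The source lists only arabic_reshaper for script handling. Depending on Pillow's text layout engine, right-to-left ordering may need an additional reordering step, so the sketch accepts an optional `to_visual` callable (for example, python-bidi's `get_display`), which is an addition not described in the source.

```python
import csv
from pathlib import Path


def centered_origin(canvas_size, bbox):
    """Draw position that centers a text bounding box on the canvas."""
    left, top, right, bottom = bbox
    x = (canvas_size[0] - (right - left)) // 2 - left
    y = (canvas_size[1] - (bottom - top)) // 2 - top
    return x, y


def output_path(root, word, ext=".png"):
    # Assumption: folder = first character of the word, file name = the word.
    return Path(root) / word[0] / f"{word}{ext}"


def generate_dataset(words, font_path, root, size=40, canvas=(256, 256),
                     to_visual=lambda text: text):
    """Render each word as black text on a white canvas and write a label file."""
    # Third-party imports are local so the helpers above need no extra packages.
    import arabic_reshaper
    from PIL import Image, ImageDraw, ImageFont

    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    font = ImageFont.truetype(str(font_path), size)
    rows = []
    for word in words:
        # Reshape to contextual letter forms, then optionally reorder for display.
        shaped = to_visual(arabic_reshaper.reshape(word))
        image = Image.new("RGB", canvas, (255, 255, 255))
        draw = ImageDraw.Draw(image)
        bbox = draw.textbbox((0, 0), shaped, font=font)
        draw.text(centered_origin(canvas, bbox), shaped, font=font, fill=(0, 0, 0))
        path = output_path(root, word)
        path.parent.mkdir(parents=True, exist_ok=True)
        image.save(path)
        # The label is the original word, not the reshaped string.
        rows.append((path.relative_to(root).as_posix(), word))
    with open(root / "labels.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["image", "label"])
        writer.writerows(rows)
    return rows
```

A call such as `generate_dataset(["کتاب"], "JameelNooriNastaleeq.ttf", "ravi_out")` would write one image plus the label file. The font file name here is a placeholder for wherever the manually installed font resides.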

Institutions

Bahria University - Karachi Campus
