BPS2025: A Demographically Focused Dataset of Handwritten Bangla Primary Script for Early Writer Recognition
Description
Current benchmarks for handwritten Bangla character recognition (HCR) lack data representing the handwriting of young, primary school-aged learners. We hypothesize that this demographic gap limits the development of robust Optical Character Recognition (OCR) and deep learning models for educational applications. The Bangla Primary Script 2025 (BPS2025) dataset was created to address this gap, on the premise that a large-scale, demographically focused dataset capturing the formative and variable handwriting styles of children is essential for advancing Bangla HCR research and its real-world applications in ed-tech.

The BPS2025 dataset provides a curated collection of isolated handwritten Bangla characters and numerals from 500 primary school students in Bangladesh, aged 7 to 12 (Grades 2 to 5). It contains:
- 24,420 images across 60 balanced classes.
- 50 basic characters (11 vowels and 39 consonants).
- 10 digits (0-9).
- Comprehensive demographic metadata, including age, gender, grade, and district.

The data captures the distinct stylistic variations, developmental inconsistencies, and formative errors characteristic of early handwriting acquisition, which are not represented in existing datasets based on adult handwriting.

Notable Findings:
- Demographic Specificity: To our knowledge, BPS2025 is the first large-scale dataset exclusively focused on the handwriting of young Bangla learners.
- Structural Integrity: The dataset was compiled with a focus on authenticity. No artificial data augmentation was applied, so the samples retain the genuine quality and characteristics of the original handwriting.
- Pre-processed Readiness: The data is offered in two versions to facilitate immediate use: (i) raw scanned images and (ii) processed images cleaned via a standardized 5-stage pre-processing pipeline (including binarization, noise removal, and normalization). Both versions are organized into 60 folders (00-59) by class label for straightforward integration into ML workflows.
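As a quick sanity check on the folder layout described above, the following minimal Python sketch counts the images in each class folder. The root path, the file extensions, and the function name `count_images_per_class` are assumptions made for illustration; they are not specified by the dataset description. If the 60 classes are exactly balanced, 24,420 / 60 works out to 407 images per class, but the description does not state that the counts are exactly equal, so treat that figure as an expectation to verify.

```python
from collections import Counter
from pathlib import Path

# Assumed image extensions; adjust to match the files actually shipped.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def count_images_per_class(root):
    """Return a Counter mapping each numeric class folder name to its image count."""
    counts = Counter()
    for class_dir in sorted(Path(root).iterdir()):
        # Only consider class folders such as "00" ... "59".
        if class_dir.is_dir() and class_dir.name.isdigit():
            counts[class_dir.name] = sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


if __name__ == "__main__":
    # Hypothetical location of one extracted version of the dataset.
    root = Path("BPS2025/raw")
    if root.exists():
        counts = count_images_per_class(root)
        print(f"{len(counts)} class folders, {sum(counts.values())} images in total")
        print("min per class:", min(counts.values(), default=0))
        print("max per class:", max(counts.values(), default=0))
```

Comparing the minimum and maximum per-class counts is a simple way to confirm how closely the download matches the stated class balance.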
Data Interpretation and Usage:
The BPS2025 dataset is a specialized resource designed to train and benchmark models for Bangla HCR, particularly in educational contexts. Researchers and developers can use this data to:
- Develop OCR, deep learning (e.g., CNN, Transformer), and transfer learning models, and test their robustness against the challenging variability of children's handwriting.
- Investigate correlations between handwriting patterns and demographic factors such as age and gender at an early learning stage.
- Address the problem of data scarcity for a critical demographic, thereby improving the accuracy and fairness of automated recognition systems in real-world applications such as digital learning platforms and automated grading technologies.

The label mapping is simple: folders 00–49 contain basic Bangla characters, and folders 50–59 contain digits. The processed data is further partitioned into training, validation, and test sets to enable immediate experimentation and reproducible research.
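The folder-to-label mapping can be expressed in a few lines of Python. This is a minimal sketch: the function name `describe_label` is hypothetical, and it assumes that folders 50 to 59 hold the digits 0 to 9 in ascending order, which the description implies but does not state. The order of characters within folders 00 to 49 is not specified here, so the sketch returns only the position within that block.

```python
NUM_CHARACTER_CLASSES = 50  # folders 00-49: basic Bangla characters
NUM_CLASSES = 60            # folders 00-59 in total


def describe_label(folder_name):
    """Map a class folder name such as "07" or "53" to (kind, index).

    kind is "character" for folders 00-49 and "digit" for folders 50-59.
    For digits, index is the assumed digit value (folder number minus 50).
    """
    label = int(folder_name)
    if not 0 <= label < NUM_CLASSES:
        raise ValueError(f"unexpected class folder: {folder_name!r}")
    if label < NUM_CHARACTER_CLASSES:
        return ("character", label)
    # Assumption: digit folders are ordered 0-9.
    return ("digit", label - NUM_CHARACTER_CLASSES)


if __name__ == "__main__":
    for name in ("00", "49", "50", "59"):
        print(name, describe_label(name))
```

Keeping the integer folder number as the training label and using a helper like this only for reporting avoids baking the assumed digit order into a model.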
Files
Institutions
- Uttara University