CBD2023: A Hypercomplex Bangla Handwriting Character Recognition Dataset for Hierarchical Class Expansion Using Deep Learning

Published: 14 November 2023 | Version 5 | DOI: 10.17632/p8988t5cwg.5
jabed omor bappi


This dataset comprises approximately 80,000 organized Bangla character images and is intended as a resource for research in Bangla character recognition. It covers 583 distinct character classes, including numerical characters, and can support work on machine learning algorithms, deep neural networks, and comparative studies in the field.

To generate the dataset, Bangla words were first written on A4-sized pages. Photocopies of the text were distributed to individuals, primarily students from Nazirhat Collegiate High School and Nazirhat College. Each participant reproduced the text from the photocopy on another A4-sized sheet.

The dataset is distributed as a zip file containing two main folders: "traindata" and "testdata."

Traindata:

Together Folder: This folder contains all training images in a single directory, without labels. The image names and their labels are stored in a CSV file called "full_df_train.csv." The CSV file has three columns: "image_name," "Label," and "long_label." The "Label" column assigns each image to a broad group such as 'consonant,' 'vowel,' 'compound,' 'number,' 'kar-fola,' etc. The "long_label" column gives the fine-grained label: one of 583 individual classes, denoted by numbers such as 1, 2, 3, 4, etc.

Total_datafinal Folder: This folder is independent of the "Together" folder and contains 583 subfolders, one per class. Each subfolder holds the images of the class it is named after, so the labels here come from the folder structure rather than from a CSV file, which suits different use cases. The "long_label" values in the CSV file for the "Together" folder and the subfolder names in this folder are identical.

Testdata:

Test_Datatogether Folder: Like the "Together" folder in the training data, this folder contains images without labels. The image names and their labels are stored in an accompanying CSV file.

Test_Data Folder: This folder is independent and mirrors the structure of the "Total_datafinal" folder in the training set. It consists of 583 subfolders, one per class, with images stored accordingly.

Researchers can therefore use either the folder-labeled version or the flat version with CSV labels for Bangla character recognition tasks, depending on their requirements.
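The two layouts described above can be read with a few lines of Python. The following is a minimal sketch rather than an official loader: the function names are hypothetical, the CSV encoding is assumed to be UTF-8, and the paths in the trailing comment are assumptions about where the files sit after unzipping. Only the column names "image_name," "Label," and "long_label" and the folder names come from the description.

```python
import csv
from pathlib import Path


def read_label_csv(csv_path):
    """Return (image_name, Label, long_label) tuples from a label CSV.

    Column names follow the dataset description; the encoding is an assumption.
    """
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        reader = csv.DictReader(handle)
        return [
            (row["image_name"], row["Label"], row["long_label"])
            for row in reader
        ]


def index_class_folders(root):
    """Map each class subfolder name to the sorted file names inside it."""
    root = Path(root)
    return {
        sub.name: sorted(p.name for p in sub.iterdir() if p.is_file())
        for sub in sorted(root.iterdir())
        if sub.is_dir()
    }


def compare_class_sets(rows, folder_index):
    """Return (classes only in the CSV, classes only on disk), both sorted."""
    csv_classes = {long_label for _, _, long_label in rows}
    disk_classes = set(folder_index)
    return (
        sorted(csv_classes - disk_classes),
        sorted(disk_classes - csv_classes),
    )


# Example usage (paths are assumptions; adjust them to the unzipped layout):
#   rows = read_label_csv("traindata/full_df_train.csv")
#   folders = index_class_folders("traindata/Total_datafinal")
#   only_csv, only_disk = compare_class_sets(rows, folders)
```

Since the description states that the "long_label" values and the subfolder names are identical, both lists returned by compare_class_sets should be empty on an intact download; a non-empty result would point to an incomplete extraction or a path mismatch.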



Image Analysis