EACS Dataset: A Real-World Academic Admit Card Dataset from Bangladeshi Educational Institutions for OCR and Document Verification Research
Description
The EACS Dataset (Exam Admit Card Dataset) is a specialized image-based dataset developed to address challenges in document verification and automated data entry within educational administrative systems. The dataset was collected over a four-year period (2022–2025) from the operational ERP platform (NationalSchoolIntra) of a Bangladeshi educational institution located in Chittagong. It comprises 4,407 unique admit card images, each standardized to a resolution of 800 × 460 pixels and stored in JPG format to balance visual clarity against computational cost in machine learning applications.
A key characteristic of the dataset is its structural complexity. Each admit card contains nine distinct textual fields: Student Name, Father’s Name, Mother’s Name, Roll Number, Student ID, Class, Session, Group, and Semester. The dataset captures real-world variability in font styles (predominantly serif fonts such as Times New Roman), alignment inconsistencies, and layout variations, making it particularly suitable for evaluating the robustness of Optical Character Recognition (OCR) systems and Document AI models.
To facilitate supervised learning and benchmarking, a corresponding CSV metadata file is provided, containing manually verified ground-truth annotations for each field. This enables precise evaluation at both the character level and the word level.
To ensure privacy protection and ethical compliance, all sensitive personal identifiers (such as phone numbers, email addresses, photographs, and other confidential attributes) have been removed or anonymized. Additionally, personal names and identifiers (student, father, and mother names, student ID, and roll number) have been systematically randomized while preserving their structural and linguistic characteristics, so that no real individual can be identified and the dataset remains useful for OCR research.
The dataset is intended for academic and non-commercial research, particularly in the areas of document analysis, OCR benchmarking, and automated verification systems. By making this dataset publicly available, we aim to support the development of secure, AI-driven solutions for educational document processing and fraud detection.
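
As an illustration of how the CSV ground truth described above might be consumed, the following is a minimal sketch using only the Python standard library. The nine fields are taken from the description, but the column names (including `filename`), the helper `load_annotations`, and the sample row are assumptions made for this example; check the actual header of the metadata file and adjust them before use.

```python
import csv
import io

# Hypothetical column names: the nine fields come from the description,
# but their spelling in the real CSV header and the "filename" column
# are assumptions made for this sketch.
FIELDS = [
    "student_name", "father_name", "mother_name", "roll_number",
    "student_id", "class", "session", "group", "semester",
]


def load_annotations(csv_file):
    """Read ground-truth rows into a dict keyed by image filename."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [c for c in ["filename", *FIELDS] if c not in header]
    if missing:
        raise ValueError(f"metadata file lacks expected columns: {missing}")
    annotations = {}
    for row in reader:
        # Strip stray whitespace so later string comparisons are fair.
        annotations[row["filename"]] = {f: (row[f] or "").strip() for f in FIELDS}
    return annotations


# Tiny in-memory stand-in for the real metadata file (values are made up).
sample = io.StringIO(
    "filename,student_name,father_name,mother_name,roll_number,"
    "student_id,class,session,group,semester\n"
    "card_0001.jpg,A. Rahman,B. Rahman,C. Begum,101,"
    "S-2022-001,XI,2022-2023,Science,1st\n"
)
annotations = load_annotations(sample)
```

With the real file, the `io.StringIO` stand-in would be replaced by `open(path, newline="", encoding="utf-8")`, where `path` points to the downloaded CSV.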
Steps to reproduce
1. Download the EACS dataset, including admit card images and the corresponding CSV metadata file.
2. Load the dataset into a preferred programming environment (e.g., Python with libraries such as OpenCV, NumPy, and OCR frameworks).
3. Preprocess the images if necessary (e.g., resizing, grayscale conversion, binarization, or noise reduction).
4. Apply an Optical Character Recognition (OCR) model (e.g., Tesseract, EasyOCR, or other Document AI frameworks) to extract textual information from the admit card images.
5. Match the extracted text fields with the ground-truth annotations provided in the CSV file.
6. Evaluate performance using metrics such as character-level accuracy, word-level accuracy, or field-level matching.
7. Conduct robustness testing by applying transformations such as rotation, blur, noise, or lighting variations to assess model performance under real-world conditions.
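
The matching and evaluation steps (5 and 6) can be sketched as follows. This is one possible formulation rather than a prescribed protocol: it assumes character-level and word-level accuracy are defined as one minus the normalized Levenshtein edit distance, and field-level matching as exact string equality after trimming whitespace. The function names, field keys, and sample values are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def accuracy(pred, truth):
    """1 - (edit distance / reference length), clipped at 0."""
    if len(truth) == 0:
        return 1.0 if len(pred) == 0 else 0.0
    return max(0.0, 1.0 - levenshtein(pred, truth) / len(truth))


def score_card(predicted, ground_truth):
    """Score every ground-truth field of one admit card."""
    results = {}
    for field, truth in ground_truth.items():
        pred = predicted.get(field, "")
        results[field] = {
            "char_acc": accuracy(pred, truth),
            "word_acc": accuracy(pred.split(), truth.split()),
            "exact": pred.strip() == truth.strip(),
        }
    return results


# Made-up values: the OCR output misreads one character in the name.
truth = {"student_name": "Karim Uddin", "roll_number": "1024"}
ocr_output = {"student_name": "Karim Uddln", "roll_number": "1024"}
scores = score_card(ocr_output, truth)
```

Here the misread name scores 10/11 at the character level but only 0.5 at the word level, which shows why reporting both granularities alongside field-level matching is informative.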