Tamaulipas Multiple-Choice-Question Exam Image Dataset for Optical Mark Recognition Research

Published: 18 March 2026 | Version 1 | DOI: 10.17632/djmynjwjpy.1
Contributors:
Yahir Hernandez-Mier

Description

This dataset was gathered for research on Optical Mark Recognition (OMR), where the objective is to automatically detect the answers selected on a multiple-choice exam. OMR is the first step in the automatic grading of Multiple Choice Question (MCQ) exams. The dataset contains 5721 scanned images of four-choice exam answer sheets completed by high school students. Each image is accompanied by a text file containing the human-observed labels for each item. The exams were administered in 2024 to 10th-, 11th-, and 12th-grade students at 42 high schools in Tamaulipas, Mexico. To protect the students' privacy, we developed an anonymization process based on geometric image processing. Of the answer sheets, 3669 from the 10th and 12th grades contain 90 items each, while 2052 from the 11th grade contain 100 items each, totaling 535,020 items. The variety of marking styles, together with noise and artifacts introduced by human and digitization errors, makes this dataset valuable for designing OMR algorithms, whether based on machine learning or classical image processing, for real-life automatic MCQ exam grading.
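
Since each image is accompanied by a text file of labels, a first step in using the dataset is usually to pair the two. The following is a minimal sketch of that pairing, not the authors' tooling. It assumes that a label file shares its image's file name with a .txt extension and sits in the same folder, and that images use one of the listed extensions; neither detail is stated in this description, and the function name pair_images_with_labels is hypothetical.

```python
from pathlib import Path

# Assumed image extensions; the dataset description does not state the format.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def pair_images_with_labels(root):
    """Return (image_path, label_path) pairs found under root.

    Only images that have a same-named .txt file next to them are returned.
    """
    root = Path(root)
    pairs = []
    for image_path in sorted(root.rglob("*")):
        if not image_path.is_file():
            continue
        if image_path.suffix.lower() not in IMAGE_EXTS:
            continue
        # Assumption: the label file has the same stem as the image.
        label_path = image_path.with_suffix(".txt")
        if label_path.is_file():
            pairs.append((image_path, label_path))
    return pairs
```

If these assumptions match the actual file layout, the number of pairs returned for the full dataset should equal the 5721 answer sheets reported above; a different count would suggest the naming convention differs from the one assumed here.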

Categories

Image Processing, Machine Learning, Academic Assessment
