Thai–English Multiscript Text Image (TEMS) Dataset
Description
The Thai-English Multiscript Text Image (TEMS) Dataset is introduced as a benchmark dataset for multilingual text recognition in natural scene images. The dataset comprises 5,000 cropped text images systematically extracted from 1,625 high-resolution natural scene photographs captured using smartphone cameras. Text regions were collected from diverse real-world environments, including billboards, commercial storefronts, road signs, menus, packaging, and publication covers, representing realistic visual conditions commonly encountered in practical applications. The TEMS dataset includes 161 unique character classes, comprising Thai script characters (44 consonants, 20 vowels, 4 tone marks, 3 punctuation marks, and 10 Thai numerals), Roman characters (52 uppercase and lowercase letters), Arabic numerals (10 digits), and 18 special characters.
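The class breakdown above can be checked with a short tally. This is only an arithmetic sketch: the counts are taken from the description, while the dictionary keys are illustrative names and not identifiers used by the dataset.

```python
# Character class counts as stated in the dataset description.
# The key names are illustrative labels, not part of the dataset itself.
CLASS_COUNTS = {
    "thai_consonants": 44,
    "thai_vowels": 20,
    "thai_tone_marks": 4,
    "thai_punctuation": 3,
    "thai_numerals": 10,
    "roman_letters": 52,      # 26 uppercase + 26 lowercase
    "arabic_numerals": 10,
    "special_characters": 18,
}

# Subtotal for the Thai script groups only.
thai_total = sum(
    count for name, count in CLASS_COUNTS.items() if name.startswith("thai_")
)

# Grand total across all groups.
total = sum(CLASS_COUNTS.values())

print(f"Thai classes: {thai_total}")  # 81
print(f"All classes:  {total}")       # 161
```

The eight groups sum to the 161 unique character classes reported for the dataset, of which 81 are Thai.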
Steps to reproduce
1. Original Image Acquisition and Analysis
The dataset originated from 1,625 natural scene photographs captured using smartphone cameras. Analysis of the source images identified 161 unique character classes, comprising Thai consonants, vowels, tone marks, punctuation marks, Thai numerals, Roman characters, Arabic numerals, and special characters.

2. Text Region Extraction and Standardization
From the 1,625 source photographs, 5,000 cropped text images were systematically generated by extracting regions containing complete and clearly visible text. Each cropped image was stored in JPEG format, and no external annotation files, such as XML files, were provided.

3. Dataset Labeling via File Naming
To maintain a consistent dataset structure, text annotations were embedded directly into the filenames. Each image file follows a standardized naming convention that combines a unique identifier with the corresponding text string appearing in the image.
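Because the labels live in the filenames, a loader has to split each name into its identifier and its text string. The description does not state the exact separator, so the sketch below assumes a hypothetical pattern of `<identifier>_<text>.jpg`, split at the first underscore. The function names, the separator, and the example filenames are all assumptions for illustration; adjust `sep` to match the actual files.

```python
from pathlib import Path, PurePath
from typing import Dict, Tuple


def parse_label(filename: str, sep: str = "_") -> Tuple[str, str]:
    """Split a filename into (identifier, text label).

    ASSUMPTION: names look like '<identifier><sep><text>.jpg'. The real
    separator is not documented in the dataset description.
    """
    stem = PurePath(filename).stem          # drop only the final extension
    identifier, found, text = stem.partition(sep)  # split at the first separator
    if not found or not text:
        raise ValueError(f"no label found in filename: {filename!r}")
    return identifier, text


def load_labels(directory: str, sep: str = "_") -> Dict[str, str]:
    """Map each JPEG path in `directory` to the text label in its name."""
    labels = {}
    for path in sorted(Path(directory).glob("*.jpg")):
        _, text = parse_label(path.name, sep)
        labels[str(path)] = text
    return labels


if __name__ == "__main__":
    # Hypothetical filenames, used only to show the parsing behaviour.
    print(parse_label("00001_Coffee Shop.jpg"))
    print(parse_label("00002_กาแฟ.jpg"))
```

Splitting at the first separator only keeps any later underscores inside the label text, which matters if a sign itself contains that character.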
Institutions
- Mahasarakham University, Maha Sarakham, Maha Sarakham
Funders
- Mahasarakham University, Maha Sarakham