DevaCAPTCHA-Audio: A Multimodal Devanagari CAPTCHA Dataset with Synthetic, Noisy, and Natural Speech
Description
This dataset presents DevaCAPTCHA, a multimodal Devanagari CAPTCHA dataset developed to support research in secure authentication, pattern recognition, and multimodal machine learning. It comprises 10 Devanagari numerals (0–9) and 23 selected characters, organized into three CAPTCHA classes: (i) character-only, (ii) digit-only, and (iii) alphanumeric combinations of characters and digits, with sequence lengths of 5, 6, and 7 symbols. Each CAPTCHA instance consists of a digitally rendered, handwritten-style text image (JPEG) paired with a corresponding audio file (MP3), enabling a direct mapping between the visual and acoustic modalities.

The dataset incorporates three complementary audio conditions: (i) synthetic clean speech generated using Google Text-to-Speech (gTTS), (ii) noise-augmented speech created by adding background noise to the gTTS samples, and (iii) natural human speech recordings. The natural audio component is derived from an independently developed Devanagari speech dataset, publicly available on the IEEE DataPort and Mendeley Data repositories, and adapted here to construct realistic audio CAPTCHA sequences. It introduces variability in pronunciation, speaker characteristics, and environmental conditions, which improves real-world applicability.

The combination of handwritten-style CAPTCHA images, multiple audio conditions, and variable-length sequences makes automated recognition harder, so the dataset serves as a challenging benchmark for machine learning-based CAPTCHA-solving systems and bots. A structured metadata file in CSV format is provided to make the dataset easier to use, and a Python-based preprocessing pipeline ensures consistency and scalability.
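As a rough illustration of how sequence length and symbol set affect the size of the label space, the sketch below counts the distinct strings possible in each class. It assumes symbols are drawn independently with repetition allowed, which the description does not state, and it counts a combination CAPTCHA as any string containing at least one character and at least one digit. The function name `sequence_counts` is hypothetical; the class labels reuse the folder names HC, HD, and HC-HD from the usage instructions.

```python
# Counts of possible CAPTCHA strings per class and length.
# Assumption: symbols may repeat and are chosen independently.
N_DIGITS = 10   # Devanagari numerals 0-9
N_CHARS = 23    # selected Devanagari characters
LENGTHS = (5, 6, 7)


def sequence_counts(length):
    """Return the number of distinct strings of the given length per class."""
    chars_only = N_CHARS ** length
    digits_only = N_DIGITS ** length
    # Strings over all 33 symbols, minus those made of only one symbol type.
    mixed = (N_CHARS + N_DIGITS) ** length - chars_only - digits_only
    return {"HC": chars_only, "HD": digits_only, "HC-HD": mixed}


for n in LENGTHS:
    print(n, sequence_counts(n))
```

These figures are upper bounds on label diversity under the stated assumption, not the number of samples actually present in the dataset.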
Steps to reproduce
Dataset Usage Instructions
This dataset contains a multimodal Devanagari CAPTCHA collection consisting of text images and corresponding audio files. The dataset is organized into structured folders for ease of use.

📁 Dataset Structure
• The dataset is divided into three main categories:
  1. Character CAPTCHA (folder name HC)
  2. Digit CAPTCHA (folder name HD)
  3. Character–Digit Combination CAPTCHA (folder name HC-HD)
• Each category is further grouped by CAPTCHA length:
  o Length 5
  o Length 6
  o Length 7
• Each sample includes:
  o A JPEG image file (text CAPTCHA)
  o A corresponding MP3 audio file (audio CAPTCHA)

🔊 Audio Types
The dataset includes three types of audio:
  1. Clean Audio (generated using gTTS)
  2. Noisy Audio (gTTS with background noise)
  3. Natural Audio (human-recorded speech)

⚙️ How to Use
1. Match each image file with its corresponding audio file using the filename.
2. Use the dataset for:
  o CAPTCHA recognition tasks
  o OCR (Optical Character Recognition)
  o ASR (Automatic Speech Recognition)
  o Multimodal learning (text + audio)
3. The dataset can be used directly in machine learning frameworks such as TensorFlow or PyTorch.
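The filename-matching step can be sketched as follows. This is a minimal example under stated assumptions, not the dataset's own pipeline: it assumes an image and its audio files share the same filename stem (for example `x.jpg` and `x.mp3`), that image extensions are `.jpg` or `.jpeg`, and that stems are unique within the folder passed in. The function name `pair_by_stem` is hypothetical. Because the exact subfolder layout for lengths and audio types is not described here, the sketch searches recursively and collects every audio file that shares a stem with an image.

```python
from collections import defaultdict
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg"}  # assumed image extensions
AUDIO_EXTS = {".mp3"}


def pair_by_stem(root):
    """Map each filename stem to (image_path, [audio_paths]) found under root.

    Assumption: an image and its audio share the same stem, and stems are
    unique within root. Images without any matching audio are skipped.
    """
    images = {}
    audio = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        if suffix in IMAGE_EXTS:
            images[path.stem] = path
        elif suffix in AUDIO_EXTS:
            audio[path.stem].append(path)
    return {
        stem: (image, audio[stem])
        for stem, image in images.items()
        if stem in audio
    }
```

If stems repeat across categories, calling the function once per category or length folder (for example on the HC folder alone) avoids collisions. The resulting pairs can then be wrapped in a TensorFlow or PyTorch dataset class.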
Funders
- KAVAYITRI BAHINABAI CHAUDHARI NORTH MAHARASHTRA UNIVERSITY, JALGAON
  Grant ID: Kavayitri Bahinabai Chaudhari North Maharashtra University, Jalgaon, India, under the Vice Chancellor Research Motivation Scheme (VCRMS)