Devanagari CAPTCHA Dataset of 1 Million Images: A Challenge Test
CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart. It is a test designed so that humans can complete it successfully while current computer systems cannot, and it is used in many applications to distinguish human users from machines. Text-based CAPTCHAs are the most common type used on websites. Because most text-based CAPTCHAs use English letters, they are difficult to pass for rural residents who read only their native languages. Devanagari characters are more complex than the English letters and numerals used in standard CAPTCHAs, which makes machine recognition much more difficult. Many official websites in India present information in Devanagari, yet they do not use Devanagari CAPTCHAs. We have therefore created a new text-based CAPTCHA in Devanagari script.
The CAPTCHAs are built from 88 character images: printed (computer font) and handwritten versions of 34 Devanagari letters and 10 digits, i.e. 44 + 44 = 88. General CAPTCHA generation principles are followed, and noise is added to each image using digital image processing techniques so that the CAPTCHA is harder to recognize and not easily broken. Each CAPTCHA image is 250 × 90 pixels.
Four character sets are used: Printed Alphabet (34), Handwritten Alphabet (34), Printed Digit (10), and Handwritten Digit (10). Eleven classes were generated from combinations of these four sets. CAPTCHA strings have a length of 5, 6, or 7 characters, so each class has three subclasses, one per string length, giving 11 classes × 3 subclasses = 33 subclasses, i.e. 33 types of CAPTCHA image. For each class, 10,000 CAPTCHA images were created in Python, so the Devanagari CAPTCHA dataset contains 11 × 10,000 = 110,000 images (written 1,10,000 in Indian digit grouping, i.e. one lakh ten thousand). Passing a test that requires identifying Devanagari characters is difficult for automated systems.
The dataset is intended for researchers investigating CAPTCHA recognition in this area, and it can help them design OCR systems that recognize and break Devanagari CAPTCHAs.
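The class, subclass, and image counts given in the description can be checked with a short calculation. This is only a minimal sketch; the variable names are illustrative and are not taken from the authors' code.

```python
# Sizes of the four character sets named in the description.
CHARACTER_SETS = {
    "Printed Alphabet": 34,
    "Handwritten Alphabet": 34,
    "Printed Digit": 10,
    "Handwritten Digit": 10,
}
STRING_LENGTHS = (5, 6, 7)   # CAPTCHA string lengths
NUM_CLASSES = 11             # classes built from the four sets
IMAGES_PER_CLASS = 10_000    # images generated per class

CHARACTER_IMAGES = sum(CHARACTER_SETS.values())   # 44 printed + 44 handwritten
SUBCLASSES = NUM_CLASSES * len(STRING_LENGTHS)    # one subclass per string length
TOTAL_IMAGES = NUM_CLASSES * IMAGES_PER_CLASS

print(f"character images: {CHARACTER_IMAGES}")
print(f"subclasses:       {SUBCLASSES}")
print(f"total images:     {TOTAL_IMAGES:,}")
```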
Steps to reproduce
Devanagari character set used (selected symbols of the Devanagari character set):
Numerals (10): ० १ २ ३ ४ ५ ६ ७ ८ ९
Vowels (4): अ इ उ ए
Consonants (30): क ख ग घ च छ ज झ ट ड ढ ण त थ द ध न प ब भ म य र ल व श ष स ह ळ
Classes: 11 classes of character set, each formed from a combination of PA-34 (printed alphabet), PD-10 (printed digits), HA-34 (handwritten alphabet), and HD-10 (handwritten digits).
Uploaded files:
1. 33 CSV files are uploaded in a new folder.
2. 33 ZIP files of CAPTCHA image files are uploaded in a new folder.
Design: Standard guidelines for CAPTCHA design were followed, and image processing techniques were used.
Experimental environment: The project was implemented on the Jupyter platform in a Windows environment using the Python language, with its version 3.0.0 dated 20 Feb 2020.
Computer hardware requirement: Processor: Intel(R) Core(TM) i5 or newer, CPU @ 2.20 GHz; 8 GB RAM; 4 GB NVIDIA GeForce GTX GPU. System type: 64-bit Windows operating system.
Software requirement: Python, TensorFlow, Keras. Python was selected for implementation because it required less time to execute the code than MATLAB. The following libraries and tools were used from Python: NumPy, Pandas, Scikit-learn (sklearn), TensorFlow, Keras, OpenCV, PyGame, PyTorch, and Tesseract OCR.
OpenCV (Open Source Computer Vision Library) techniques used: Morphing, i.e. merging different pictures through a smooth transition to create a new one. OpenCV functions were used to read, write, display, resize, translate, scale, and rotate images.
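As a rough illustration of how label strings for one subclass might be drawn from the character set listed above and recorded in a CSV file, consider the sketch below. It is not the authors' generation script: sampling with replacement, the CSV column names, and the file-name pattern are all assumptions, and rendering the actual images (glyph compositing, noise, OpenCV transforms) is left out.

```python
import csv
import io
import random

# Character set as listed in the steps above.
DIGITS = "० १ २ ३ ४ ५ ६ ७ ८ ९".split()
VOWELS = "अ इ उ ए".split()
CONSONANTS = "क ख ग घ च छ ज झ ट ड ढ ण त थ द ध न प ब भ म य र ल व श ष स ह ळ".split()
ALPHABET = VOWELS + CONSONANTS  # 4 vowels + 30 consonants = 34 letters


def sample_label(pool, length, rng):
    """Draw `length` symbols from `pool` (with replacement, an assumption)."""
    return "".join(rng.choice(pool) for _ in range(length))


def write_labels(pool, length, count, rng, prefix="captcha"):
    """Return CSV text with one (file name, label) row per CAPTCHA.

    The column names and the file-name pattern are hypothetical.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["filename", "label"])
    for index in range(count):
        name = f"{prefix}_{length}_{index:05d}.png"
        writer.writerow([name, sample_label(pool, length, rng)])
    return buffer.getvalue()


rng = random.Random(0)  # fixed seed so repeated runs give the same labels
csv_text = write_labels(ALPHABET + DIGITS, length=6, count=3, rng=rng)
print(csv_text)
```

Restricting `pool` to `ALPHABET` or to `DIGITS` alone, and setting `length` to 5, 6, or 7, would give label files for other character-set and string-length combinations under the same assumptions.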