Published: 28 March 2023 | Version 1 | DOI: 10.17632/6zcpgchvmx.1
Naga Sathvika N, Pradeepa S


Tamil speakers make up roughly 6% of India's population, making them one of the country's larger linguistic groups. Tamil is also one of the official languages of Sri Lanka, where it is spoken by roughly a quarter of the population. In 2004 Tamil was declared a classical language of India, meaning that it met three criteria: its origins are ancient, it has an independent tradition, and it possesses a considerable body of ancient literature. In the early 21st century, more than 66 million people were Tamil speakers.

Tamil handwritten character recognition addresses the demand for digitizing ancient Tamil writing preserved in manuscripts, inscriptions, palm leaves, and copperplates. One major challenge in Tamil OCR is the lack of a complete and balanced dataset covering every character of the alphabet as written in different periods. This dataset was developed to meet that need, using handwriting samples acquired from B.Tech students of SASTRA Deemed University.

The dataset was prepared in the following steps:
1) Collect handwritten samples in the Tamil language from B.Tech students of SASTRA University.
2) Remove the borders with Python code: convert each bordered recording page to its negative, remove the vertical and horizontal lines, and convert the result back to obtain a page without borders.
3) Split each page into equal rows and columns with the PineTools Image Splitter, in this case 19 rows and 13 columns (247 cells per page).
4) Convert the images to PNG format at 300 x 300 x 3 (width x height x colour channels) with RedKetchup tools.
5) Rename the folders in bulk with Bulk Rename Utility, so that each character maps clearly to its folder number.
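Steps 3 and 4 above were carried out with online tools, but the same operation can be sketched in Python for anyone who wants to reproduce it on their own scans. The following is a minimal illustration, not the authors' code: the function names `grid_boxes` and `split_page`, the output file naming, and the use of the Pillow library are all assumptions introduced here. It computes 19 x 13 equal crop boxes for a page and, optionally, saves each cell as a 300 x 300 RGB PNG.

```python
from pathlib import Path

ROWS, COLS = 19, 13      # grid size stated in the description
CELL_SIZE = (300, 300)   # output width and height in pixels


def grid_boxes(width, height, rows=ROWS, cols=COLS):
    """Return (left, upper, right, lower) crop boxes in row-major order.

    Integer division spreads any remainder across the cells, so the
    boxes always tile the full page without gaps or overlap.
    """
    boxes = []
    for r in range(rows):
        upper = r * height // rows
        lower = (r + 1) * height // rows
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            boxes.append((left, upper, right, lower))
    return boxes


def split_page(page_path, out_dir):
    """Crop every grid cell, resize it, and save it as a 3-channel PNG.

    Hypothetical helper: assumes Pillow is installed and that the page
    borders have already been removed (step 2).
    """
    from PIL import Image  # imported lazily; only needed for file I/O

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with Image.open(page_path) as page:
        rgb = page.convert("RGB")  # three colour channels
        for index, box in enumerate(grid_boxes(*rgb.size), start=1):
            cell = rgb.crop(box).resize(CELL_SIZE)
            cell.save(out_dir / f"{index:03d}.png")
```

Calling `split_page("page_01.png", "cells/")` on a hypothetical scan would write 247 files, `001.png` through `247.png`, in row-major order. Mapping those indices to character folders (step 5) depends on how the recording sheet was laid out, which the description does not specify.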



Shanmugha Arts Science Technology and Research Academy School of Computing


Optical Character Recognition, Affective Computing, Image Classification


Tamil Virtual Academy