Dataset for Bangla Text Detection and Recognition

Published: 20 June 2023 | Version 1 | DOI: 10.17632/n57phs3k4t.1
Contributors:
Redwan Ahmed Rizvee, Md. Rezaul Karim

Description

This dataset can be used for Bangla text detection and recognition. It consists of two folders and one .xlsx file. "Image Folder" contains the images of all the text documents. "Word Folder" contains the text of each image in .docx format, stored under the same relative path. For example, the text image "Image Folder/a/b.jpg" has a corresponding text file "Word Folder/a/b.docx". In total, there are 1166 parallel documents of images and corresponding texts. The images are taken from PDFs containing typed Bangla text collected from various sources, such as novels, stories, and educational books. The "Typing List.xlsx" file lists the names of the parallel jpg and docx files. While preparing the docx files, the alignment of the text was maintained to keep it similar to the text in the corresponding image. The goal of this data collection is to gather sufficient data to train machine learning models that can scan Bangla documents and extract their text on the fly while preserving spacing and alignment.
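The pairing rule described above (same relative path, different top-level folder and extension) can be sketched in a few lines of Python. This is a minimal illustration, not part of the dataset: the function name `pair_documents` and the `root` argument are hypothetical, and it assumes the archive has been extracted so that "Image Folder" and "Word Folder" sit side by side and that image files use a lowercase `.jpg` extension.

```python
from pathlib import Path


def pair_documents(root, image_dir="Image Folder", word_dir="Word Folder"):
    """Pair each image with its .docx transcription by relative path.

    Hypothetical helper: `root` is assumed to be the directory that
    contains both "Image Folder" and "Word Folder".
    Returns (pairs, missing), where `pairs` is a list of
    (image_path, docx_path) tuples and `missing` lists the images
    for which no matching .docx file was found.
    """
    root = Path(root)
    image_root = root / image_dir
    word_root = root / word_dir

    pairs, missing = [], []
    # rglob walks all subfolders, e.g. "Image Folder/a/b.jpg".
    for image_path in sorted(image_root.rglob("*.jpg")):
        relative = image_path.relative_to(image_root)
        # "a/b.jpg" under the image folder maps to "a/b.docx" under the word folder.
        docx_path = (word_root / relative).with_suffix(".docx")
        if docx_path.is_file():
            pairs.append((image_path, docx_path))
        else:
            missing.append(image_path)
    return pairs, missing
```

Calling `pair_documents("path/to/dataset")` on a complete copy should, if the description holds, return 1166 pairs and an empty `missing` list. A non-empty `missing` list would point to files that were not extracted or that use a different extension or naming than assumed here.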

Categories

Document Analysis, Optical Character Recognition, Natural Language Processing, Text Extraction, Bengali Language

Licence