Dataset for Bangla Text Detection and Recognition

Name: Dataset for Bangla Text Detection and Recognition
Creator: Redwan Ahmed Rizvee
Published: 2023-06-20T06:09:12.406Z
Keywords: Document Analysis, Optical Character Recognition, Natural Language Processing, Text Extraction, Bengali Language

Rizvee, Redwan Ahmed; Karim, Md. Rezaul; Islam, Md. Ashraful

doi:10.17632/n57phs3k4t.1


Published: 20 June 2023 | Version 1 | DOI: 10.17632/n57phs3k4t.1

Contributors:

Redwan Ahmed Rizvee, Md. Rezaul Karim, Md. Ashraful Islam

Description

This dataset can be used for Bangla text detection and recognition. It consists of two folders and one .xlsx file. "Image Folder" contains the images of all the text documents. "Word Folder" contains the text of each image as a parallel .docx file. So, for a text image "Image Folder/a/b.jpg", there is a corresponding text file "Word Folder/a/b.docx". In total, there are 1166 parallel documents of images and corresponding texts. The images are taken from PDFs containing typed Bangla text collected from various sources such as novels, stories, and educational books. The "Typing List.xlsx" file lists the names of the parallel jpg and docx files. While the docx files were being prepared, the alignment was kept similar to that of the text in the corresponding image. The goal of this data collection effort is to gather enough data to train machine learning models that can scan Bangla documents and extract their text on the fly while preserving spacing and alignment.
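
The parallel naming convention described above can be sketched in Python as follows. This is a minimal illustration and not part of the dataset: the function names `docx_path_for` and `find_pairs` are hypothetical, and the sketch assumes that the folder names are exactly "Image Folder" and "Word Folder" and that every image uses a lowercase ".jpg" extension.

```python
from pathlib import Path, PurePosixPath

# Folder names as given in the description (assumed to match the archive exactly).
IMAGE_ROOT = "Image Folder"
WORD_ROOT = "Word Folder"


def docx_path_for(image_path: str) -> str:
    """Map 'Image Folder/a/b.jpg' to 'Word Folder/a/b.docx' (hypothetical helper)."""
    path = PurePosixPath(image_path)
    if not path.parts or path.parts[0] != IMAGE_ROOT:
        raise ValueError(f"expected a path under {IMAGE_ROOT!r}: {image_path!r}")
    # Keep the sub-folder and file name, swap the root folder and the extension.
    relative = path.relative_to(IMAGE_ROOT)
    return str(PurePosixPath(WORD_ROOT) / relative.with_suffix(".docx"))


def find_pairs(dataset_root: str) -> list[tuple[Path, Path]]:
    """Collect (image, docx) pairs that both exist under dataset_root."""
    root = Path(dataset_root)
    pairs = []
    # Assumes a lowercase '.jpg' extension; adjust the pattern if the files differ.
    for image in sorted((root / IMAGE_ROOT).rglob("*.jpg")):
        relative = image.relative_to(root / IMAGE_ROOT)
        docx = root / WORD_ROOT / relative.with_suffix(".docx")
        if docx.exists():
            pairs.append((image, docx))
    return pairs


print(docx_path_for("Image Folder/a/b.jpg"))  # Word Folder/a/b.docx
```

If the stated layout holds, calling `find_pairs` on the extracted dataset directory should return one pair per parallel document, which gives a quick way to check the count against the figure quoted in the description.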
