BN-HTRd: A Benchmark Dataset for Document Level Offline Bangla Handwritten Text Recognition (HTR)

Published: 21 April 2023 | Version 4 | DOI: 10.17632/743k6dm543.4

Description

We introduce a new dataset for offline Handwritten Text Recognition (HTR) from images of Bangla script, comprising word-, line-, and document-level annotations. The BN-HTRd dataset is based on the BBC Bangla News corpus, which served as the ground-truth text for the handwriting. Our dataset contains a total of 786 full-page images collected from 150 different writers. With 1,08,181 instances of handwritten words, distributed over 14,383 lines and 23,115 unique words, this is currently the 'largest and most comprehensive dataset' in this field.

We also provide bounding box annotations (YOLO format) for the segmentation of words/lines and ground-truth annotations for the full text, along with the segmented images and their positions. The contents of our dataset come from diverse news categories, and annotators of different ages, genders, and backgrounds contributed, giving variability in writing styles. The BN-HTRd dataset can be adopted as a basis for various handwriting classification tasks such as end-to-end document recognition, word-spotting, word/line segmentation, and so on.

The statistics of the original dataset are given below:
-------------------------------------------------
Number of writers = 150
Total number of images = 786
Total number of lines = 14,383
Total number of words = 1,08,181
Total number of unique words = 23,115
Total number of punctuation = 7,446
Total number of characters = 5,74,203
-------------------------------------------------

# From v3.0, we are also providing automatic bounding box annotations (YOLO format) of 805 document images containing words/lines.

The statistics of the automatic annotations are given below:
-------------------------------------------------
Number of writers = 87
Total number of images = 805
Total number of lines = 14,836
Total number of words = 1,06,135
-------------------------------------------------
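
The description mentions bounding box annotations in YOLO format. As a minimal sketch, assuming the annotation files follow the common YOLO text convention (one box per line: `class x_center y_center width height`, with coordinates normalised to the image size), the snippet below converts such lines to pixel coordinates for cropping words or lines. The function names and the sample values are hypothetical and not taken from the dataset. The actual file layout should be checked against the downloaded files.

```python
from typing import List, Tuple

# (class_id, left, top, right, bottom) in pixel coordinates
Box = Tuple[int, int, int, int, int]


def parse_yolo_line(line: str, img_w: int, img_h: int) -> Box:
    """Convert one YOLO annotation line to a pixel-space box.

    Assumes the common convention: class x_center y_center width height,
    all four coordinates normalised to [0, 1].
    """
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    class_id = int(parts[0])
    xc, yc, w, h = (float(p) for p in parts[1:])

    # Centre/size -> corner coordinates, scaled to pixels.
    left = round((xc - w / 2) * img_w)
    top = round((yc - h / 2) * img_h)
    right = round((xc + w / 2) * img_w)
    bottom = round((yc + h / 2) * img_h)

    # Clamp to the image so the box can be used directly for cropping.
    left, right = max(0, left), min(img_w, right)
    top, bottom = max(0, top), min(img_h, bottom)
    return (class_id, left, top, right, bottom)


def parse_yolo_annotations(text: str, img_w: int, img_h: int) -> List[Box]:
    """Parse the full contents of an annotation file, skipping blank lines."""
    return [
        parse_yolo_line(line, img_w, img_h)
        for line in text.splitlines()
        if line.strip()
    ]


if __name__ == "__main__":
    # Hypothetical annotation content for an 800 x 1000 pixel page.
    sample = "0 0.5 0.5 0.25 0.5\n0 0.25 0.125 0.5 0.25\n"
    for box in parse_yolo_annotations(sample, img_w=800, img_h=1000):
        print(box)
```

The resulting `(left, top, right, bottom)` tuples can be passed to an image library's crop routine to extract individual word or line images.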

Steps to reproduce

# See the paper (links below):
1) https://arxiv.org/abs/2206.08977
2) https://www.routledge.com/Computer-Vision-and-Image-Analysis-for-Industry-40/Siddique-Arefin-Ahad-Dewan/p/book/9781032164168

# To reproduce and compare the results with our future benchmarks:
-- Split dataset for model training: https://huggingface.co/datasets/shaoncsecu/BN-HTRd_Splitted
-- Current benchmark: https://paperswithcode.com/dataset/bn-htrd

Institutions

University of Chittagong, Universitat Politecnica de Catalunya, Premier University

Categories

Handwriting Recognition, Document Imaging, Annotation, Image Acquisition, Image Segmentation, Bengali Language, Word Recognition, Textual Analysis

Licence