PromitoLipi: A versatile offline dataset of handwritten Bangla words and paragraphs
Description
The PromitoLipi dataset comprises two datasets, PromitoLipi1.1 and PromitoLipi2.1. PromitoLipi1.1 contains 80 single-page handwritten paragraphs, totaling 7231 words, written by individuals of different personalities and ages. Most of the paragraphs are coherent and can be used for handwritten line/word segmentation, paragraph recognition, and multimodal natural language processing tasks such as document summarization, sentiment analysis, and content extraction. Some paragraphs are incoherent (random, unrelated words written one after another) and are primarily useful for segmentation-based tasks.
PromitoLipi2.1 contains 9830 open-vocabulary word images, comprising 24050 consonant/vowel/diacritic/number/conjunct/punctuation instances, together with their corresponding annotation files. It can be used for handwritten character segmentation and word recognition. For 70% of the words, writers of different ages from different areas were asked to write random words on paper. The remaining 30% were collected from different CMATERdb datasets to add variety to the dataset. The word images were then segmented and binarized (background pixels in white, foreground/text pixels in black) so that each image contains nothing besides the word text, which may span a single class or multiple classes.
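To illustrate the binarization convention described above (white background, black text), here is a minimal sketch in plain Python. It uses Otsu's thresholding, which is an assumption made for illustration only; the description does not state which binarization method was actually applied. The function names `otsu_threshold` and `binarize` are hypothetical, and the input is assumed to be a grayscale image given as rows of 8-bit integer values.

```python
def otsu_threshold(pixels):
    """Return an Otsu threshold (0-255) for rows of 8-bit grayscale values.

    Assumption: Otsu's method is used here only as an example; the dataset
    description does not name the binarization method.
    """
    # Build the 256-bin intensity histogram.
    hist = [0] * 256
    for row in pixels:
        for value in row:
            hist[value] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels with value <= t
    sum0 = 0    # intensity sum of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so the split is not meaningful
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor.
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def binarize(pixels):
    """Map pixels above the threshold to white (255) and the rest to black (0)."""
    t = otsu_threshold(pixels)
    return [[255 if value > t else 0 for value in row] for row in pixels]
```

With dark ink on a light page, the ink pixels fall at or below the threshold and become black (0), while the paper becomes white (255), matching the convention stated for the word images.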
Files
Steps to reproduce
For PromitoLipi1.1, the paragraphs were written using regular stationery products. Different colors of paper (e.g., white, black, blue, red, yellow, and orange) were used, along with pencils and different types of gel pens, glitter pens, and ball-point pens in different colors (e.g., black, blue, orange, green, red, and pink). Writers were advised to write on a random topic. The handwriting was then captured using scanners and smartphone cameras. Each captured image was cropped, but in some images shadow effects and background elements (e.g., tabletop, floor) were deliberately preserved, since illumination and background issues are expected in smartphone-captured pictures. The converted.zip file contains a converted version of the dataset produced using the referenced script. The PromitoLipi1.1 folder contains the paragraph images of the dataset.
For PromitoLipi2.1, the words were likewise written using regular stationery products. Writers were advised to write random words on paper. The handwriting was then captured using scanners and smartphone cameras, and the images were preprocessed and labeled accordingly. The PromitoLipi2.1 folder contains the preprocessed word images and their corresponding annotation files.
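Since the PromitoLipi2.1 folder holds word images alongside their annotation files, a first step when loading it might be to pair each image with its annotation. The sketch below is a rough illustration under stated assumptions: the file extensions in `IMAGE_EXTS` and `ANNOTATION_EXTS` are guesses, and it assumes an annotation file shares its base name with its image. Neither is confirmed by the description, so check the actual folder contents first. The function name `pair_images_with_annotations` is hypothetical.

```python
from pathlib import Path

# Assumed extensions; the dataset description does not specify the formats.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".bmp"}
ANNOTATION_EXTS = {".txt", ".json", ".xml"}


def pair_images_with_annotations(root):
    """Pair image files with annotation files that share the same base name.

    Assumption: an annotation lives next to its image and differs only in
    its extension (e.g. word_001.png and word_001.txt).
    Returns (paired, unpaired_images).
    """
    images, annotations = {}, {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        ext = path.suffix.lower()
        key = path.with_suffix("")  # full path without the extension
        if ext in IMAGE_EXTS:
            images[key] = path
        elif ext in ANNOTATION_EXTS:
            annotations[key] = path

    paired = [(images[k], annotations[k]) for k in sorted(images) if k in annotations]
    unpaired = [images[k] for k in sorted(images) if k not in annotations]
    return paired, unpaired
```

A call such as `pair_images_with_annotations("PromitoLipi2.1")` would return the matched pairs plus any images without an annotation, which makes a mismatch in the assumed naming scheme easy to spot.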