Arabic Handwritten Alphabets, Words and Paragraphs Per User (AHAWP)

Published: 6 December 2021 | Version 1 | DOI: 10.17632/2h76672znt.1
Contributor:
Majid Khan

Description

The dataset contains 65 different Arabic alphabets (covering the beginning, middle, end and regular forms of the alphabets), 10 different Arabic words (that together encompass all Arabic alphabets) and 3 different paragraphs. The dataset was collected anonymously from 82 different users. Each user was asked to write each alphabet and word 10 times. A userid uniquely but anonymously identifies the writer of each alphabet, word and paragraph. In total, the dataset consists of 53199 alphabet images, 8144 word images and 241 paragraph images.

- The file "isolated_alphabets_per_alphabet.zip" contains 53199 Arabic alphabets organized into one folder per alphabet variation.
- The file "isolated_alphabets_per_user.zip" contains 53199 Arabic alphabets organized into one folder per user.
- The file "isolated_words_per_user.zip" contains 8144 Arabic words organized into one folder per user.
- The file "paragraphs_per_user.zip" contains 241 Arabic paragraphs organized into one folder per user.
- The file "raw_dataset.zip" contains all user input forms in raw (unprocessed) format.
- The scripts folder contains the Python scripts used to extract and preprocess alphabets, words and sentences from the raw input forms.
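Because each archive is organized as one folder per user or per alphabet variation, a quick sanity check after unzipping is to count the images in each folder. The following is a minimal sketch, not part of the dataset's own scripts. The function name, the extraction path in the usage note and the set of image file extensions are assumptions made for illustration.

```python
from pathlib import Path

# Assumption: the extracted images use one of these extensions.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def count_images_per_folder(root):
    """Map each immediate subfolder of root to the number of image files under it."""
    counts = {}
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue  # ignore stray files next to the folders
        counts[folder.name] = sum(
            1
            for path in folder.rglob("*")
            if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts
```

For example, calling count_images_per_folder on the directory where "isolated_words_per_user.zip" was extracted (a location chosen by the reader) and summing the returned values should give a total that can be compared with the 8144 word images stated above.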

Steps to reproduce

Data was collected from users using templates for alphabets, words and sentences (as shown in the raw_data folder). The user-submitted forms were color scanned at 300 dpi, resulting in images with a resolution of 2480 × 3507 pixels. These scanned images are provided as raw data in the folder named "raw_data" in the repository. The dataset was then extracted from the scanned images using Python scripts, which are provided with the dataset in the root directory. A brief description of the scripts follows:

- 1a_alphabet_extractor_per_alphabet.py: This script extracts alphabets from the scanned JPEG images and organizes them in a folder structure with one folder per alphabet, containing that alphabet as written by all the users. Each file name has the format "userid_alphabetName_variationName_index", where index increases sequentially for each alphabet extracted from a page.
- 1a_alphabet_extractor_per_user.py: This script extracts alphabets from the scanned JPEG images and organizes them in a folder structure with one folder per user, containing all the alphabets written by that user.
- 2a_alphabets_pre_processing.py: This script pre-processes the extracted alphabets. It crops each alphabet from the center of the image (excluding 20 pixels on each side) to remove any borders surrounding the extracted alphabet. The whitespace surrounding the written alphabet is then removed. The resulting image is converted to grayscale and scaled to a height of 128 pixels, keeping the aspect ratio intact so that the handwriting does not get distorted.
- 1w_word_extractor_per_user.py: This script extracts words from the scanned JPEG images and organizes them in a folder structure with one folder per user, containing all the words written by that user.
- 2w_words_pre_processing.py: This script pre-processes the extracted words. It crops each word from the center of the image (excluding 5 pixels on each side) to remove any borders surrounding the extracted word. The whitespace surrounding the written word is then removed. The resulting image is converted to grayscale and scaled to a height of 128 pixels, keeping the aspect ratio intact so that the handwriting does not get distorted.
- 1p_paragraph_extractor_per_user.py: This script extracts paragraphs from the scanned JPEG images and organizes them in a folder structure with one folder per user, containing all the paragraphs written by that user.
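The alphabet file name format "userid_alphabetName_variationName_index" can be split back into its fields when building labels for a recognition or writer-identification task. The sketch below is illustrative only. It assumes that none of the four fields contains an underscore and that index is an integer, neither of which is stated in the description. The function and type names, and the file name in the usage note, are hypothetical.

```python
from pathlib import Path
from typing import NamedTuple


class AlphabetSample(NamedTuple):
    user_id: str
    alphabet: str
    variation: str
    index: int


def parse_alphabet_filename(filename):
    """Split 'userid_alphabetName_variationName_index.<ext>' into its fields.

    Assumption: no field contains an underscore, so the stem has exactly
    four underscore-separated parts.
    """
    stem = Path(filename).stem  # drop any directory and the extension
    parts = stem.split("_")
    if len(parts) != 4:
        raise ValueError(f"unexpected file name format: {filename!r}")
    user_id, alphabet, variation, index = parts
    return AlphabetSample(user_id, alphabet, variation, int(index))
```

With a made-up file name such as "user001_alif_begin_3.png", the function returns user_id "user001", alphabet "alif", variation "begin" and index 3. If real file names in the archives break the underscore assumption, the split logic would need to be adjusted.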
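The geometry of the pre-processing steps described above (drop a fixed margin, trim the surrounding whitespace, scale to a height of 128 pixels while keeping the aspect ratio) can be sketched without any imaging library. This is not the code from the provided scripts. It assumes the image is already available as rows of grayscale values (0 for black, 255 for white) and that any pixel darker than a threshold counts as ink. The function names and the threshold value are assumptions for illustration.

```python
def content_box(pixels, margin, ink_threshold=128):
    """Return (left, top, right, bottom) of the ink found inside the margin.

    pixels is a list of rows of grayscale values; right and bottom are
    exclusive. Returns None when no pixel inside the margin is darker than
    ink_threshold (an assumed cut-off, not taken from the dataset scripts).
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    rows = range(margin, height - margin)  # skip `margin` pixels top and bottom
    cols = range(margin, width - margin)   # skip `margin` pixels left and right
    ink_rows = [y for y in rows if any(pixels[y][x] < ink_threshold for x in cols)]
    ink_cols = [x for x in cols if any(pixels[y][x] < ink_threshold for y in rows)]
    if not ink_rows or not ink_cols:
        return None
    return (ink_cols[0], ink_rows[0], ink_cols[-1] + 1, ink_rows[-1] + 1)


def scaled_size(width, height, target_height=128):
    """Return (new_width, target_height), preserving the aspect ratio."""
    new_width = max(1, round(width * target_height / height))
    return (new_width, target_height)
```

The box returned by content_box (with margin 20 for alphabets or 5 for words, as described above) and the size returned by scaled_size could then be passed to the crop and resize functions of whichever imaging library is used. Scaling the width by the same factor as the height is what keeps the handwriting undistorted.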

Institutions

Prince Mohammad Bin Fahd University

Categories

Document Analysis, Handwriting Recognition, Handwriting

License