POS Tagging on Handwritten Sindhi Sentences

Published: 30 December 2024| Version 1 | DOI: 10.17632/phk66sgmp5.1
Contributors:

Description

This dataset consists of high-resolution images of handwritten Sindhi sentences, meticulously curated for tasks such as Part-of-Speech (POS) tagging and Named Entity Recognition (NER). The dataset aims to facilitate research and development in natural language processing (NLP) and optical character recognition (OCR) for low-resource languages like Sindhi.

Key Features:
Language: Sindhi (script-based with unique linguistic characteristics).
Dataset Size: Contains 1000+ labeled images with diverse handwriting styles.
Annotations: Each image is manually annotated for POS tagging and NER tasks, ensuring high accuracy.
Applications: Suitable for training and evaluating machine learning models in NLP, OCR, and language understanding.
Diversity: Includes variations in sentence length, word structure, and handwriting styles to mimic real-world scenarios.
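
For readers planning to use the images for POS tagging or NER, the following is a minimal sketch of how one annotated sample might be held in memory. The schema is an assumption for illustration only: the class name, field names, file name, and tag labels are hypothetical and do not describe the dataset's actual file layout or tag set.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AnnotatedSentence:
    """One handwritten-sentence image with token-level labels (hypothetical schema)."""
    image_path: str       # hypothetical file name, not taken from the dataset
    tokens: List[str]     # words of the sentence, in reading order
    pos_tags: List[str]   # one POS tag per token
    ner_tags: List[str]   # one NER tag per token ("O" marks a non-entity)

    def __post_init__(self) -> None:
        # Every token needs exactly one POS tag and exactly one NER tag.
        if not (len(self.tokens) == len(self.pos_tags) == len(self.ner_tags)):
            raise ValueError("tokens, pos_tags and ner_tags must have equal length")

    def tagged(self) -> List[Tuple[str, str, str]]:
        """Return (token, POS tag, NER tag) triples."""
        return list(zip(self.tokens, self.pos_tags, self.ner_tags))


# Placeholder tokens stand in for Sindhi words; the tag names are illustrative.
sample = AnnotatedSentence(
    image_path="sentence_0001.png",
    tokens=["w1", "w2", "w3"],
    pos_tags=["NOUN", "NOUN", "VERB"],
    ner_tags=["B-PER", "O", "O"],
)
```

Checking the label lengths when a record is built catches misaligned annotations early, before they reach a tagging model.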

Files

Steps to reproduce

We collected this dataset from writers with a range of handwriting styles, from school students to university students. We asked them to write the sentences however they wished, then captured images of the handwriting and applied preprocessing steps to produce the dataset.

Institutions

University of Sindh

Categories

Computer Vision, Optical Character Recognition, Handwriting Recognition, Annotation, Natural Language Processing, Machine Learning, Artificial Intelligence Programming Language, First Language Use in Language Learning

Licence