POS Tagging on Handwritten Sindhi Sentences
Description
This dataset consists of high-resolution images of handwritten Sindhi sentences, curated for tasks such as Part-of-Speech (POS) tagging and Named Entity Recognition (NER). It is intended to support research and development in natural language processing (NLP) and optical character recognition (OCR) for low-resource languages such as Sindhi.
Key Features:
Language: Sindhi (script-based, with unique linguistic characteristics).
Dataset Size: More than 1,000 labeled images covering diverse handwriting styles.
Annotations: Each image is manually annotated for the POS tagging and NER tasks to keep label accuracy high.
Applications: Suitable for training and evaluating machine learning models for NLP, OCR, and language understanding.
Diversity: Includes variation in sentence length, word structure, and handwriting style to reflect real-world conditions.
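To make the annotation description concrete, the following is a minimal sketch of how one labeled image could be represented in code. The description does not state the annotation file format, the POS tagset, or the NER scheme, so everything here is an assumption for illustration: the class name AnnotatedSentence, the file name, the Universal-POS-style tags, and the BIO-style NER tags are all hypothetical, and the tokens are placeholders rather than real Sindhi words.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class AnnotatedSentence:
    """One hypothetical record: an image plus token-level labels."""
    image_path: str        # assumed file name, not taken from the dataset
    tokens: List[str]      # transcribed words of the sentence
    pos_tags: List[str]    # one POS tag per token (tagset is an assumption)
    ner_tags: List[str]    # one NER tag per token (BIO scheme is an assumption)

    def __post_init__(self) -> None:
        # Token-level labels only make sense if every token has exactly one tag.
        if not (len(self.tokens) == len(self.pos_tags) == len(self.ner_tags)):
            raise ValueError("tokens, pos_tags and ner_tags must have equal length")


def pos_tag_counts(records: List[AnnotatedSentence]) -> Counter:
    """Count how often each POS tag occurs across all records."""
    counts: Counter = Counter()
    for record in records:
        counts.update(record.pos_tags)
    return counts


# Placeholder tokens stand in for real Sindhi words.
example = AnnotatedSentence(
    image_path="sentence_0001.png",
    tokens=["w1", "w2", "w3"],
    pos_tags=["PROPN", "NOUN", "VERB"],
    ner_tags=["B-PER", "O", "O"],
)
print(pos_tag_counts([example]))
```

Checking that the three lists have equal length at construction time catches misaligned annotations early, before they reach a model.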
Steps to reproduce
We collected this dataset from writers with a variety of handwriting styles, ranging from school students to university students. Participants were asked to write the sentences however they wished. We then captured images of the handwritten sentences and applied preprocessing steps to produce the dataset.
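The steps above do not say which preprocessing operations were applied. Purely as an illustration, the sketch below shows two operations commonly used on handwriting images: grayscale conversion and binarization with Otsu's threshold. It is not the authors' pipeline. The function names are hypothetical, and the image is assumed to be a nested list of (R, G, B) tuples so that the example needs no external libraries.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def to_grayscale(image: List[List[Pixel]]) -> List[List[int]]:
    """Convert RGB pixels to luminance using the ITU-R BT.601 weights."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in image
    ]


def otsu_threshold(gray: List[List[int]]) -> int:
    """Return the threshold that maximizes between-class variance (Otsu)."""
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += hist[level]
        sum_bg += level * hist[level]
        if weight_bg == 0:
            continue  # no pixels at or below this level yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the lower class
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(gray: List[List[int]], threshold: int) -> List[List[int]]:
    """Map pixels above the threshold to 255 (paper) and the rest to 0 (ink)."""
    return [[255 if value > threshold else 0 for value in row] for row in gray]


# Tiny synthetic image: dark ink strokes on light paper.
ink, paper = (20, 20, 20), (240, 240, 240)
sample = [[ink, paper], [paper, ink]]
gray = to_grayscale(sample)
binary = binarize(gray, otsu_threshold(gray))
print(binary)
```

In practice an image library would handle loading and resizing; the pure-Python version is used here only to keep the sketch self-contained.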