Oromo Auto-Grammar Dataset

Name: Oromo Auto-Grammar Dataset
Creator: Ebisa Gemechu
Published: 2023-05-03T06:53:35.951Z
Keywords: Statistical Natural Language Processing, Natural Language Generation

Gemechu, Ebisa; Ramasubramanian, Kanagachidambaresan

doi:10.17632/n5wg3mbp9r.1

Published: 3 May 2023 | Version 1 | DOI: 10.17632/n5wg3mbp9r.1

Contributors:

Ebisa Gemechu, Kanagachidambaresan Ramasubramanian

Description

This contribution is a novel dataset called Oromo-grammar-dataset, prepared with a custom Python algorithm. The input was a 200 KB sample of raw text (about 100 pages) collected from online sources, from which the algorithm automatically generated a grammar-aware dataset for the Oromo language. The method can be reproduced for other languages after a systematic analysis of the target language and slight modifications to the affix structures it uses. The output of the software is a grammar-rich dataset applicable to modern NLP tasks such as machine translation, sentence completion, and grammar and spell checking. The dataset can also help linguists and academics teach the grammatical structure of the language.
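
The description does not include the algorithm itself, so the following is only a minimal sketch of what affix-driven generation could look like, not the authors' implementation. The function name `generate_forms`, the `SUFFIXES` table, and the stems are all hypothetical placeholders and are not verified Oromo morphology. The sketch also assumes plain concatenation, whereas a real system would likely need to handle sound changes at the stem boundary.

```python
from itertools import product

# Hypothetical affix table: grammatical label -> suffix string.
# These are placeholder values, NOT verified Oromo morphology.
SUFFIXES = {
    "FORM_A": "a",
    "FORM_B": "te",
}


def generate_forms(stems, suffixes):
    """Pair every stem with every suffix and return labeled rows.

    Assumption: forms are built by simple concatenation. Adapting the
    sketch to another language would mean swapping the suffix table.
    """
    rows = []
    for stem, (label, suffix) in product(stems, suffixes.items()):
        rows.append({"stem": stem, "label": label, "form": stem + suffix})
    return rows


if __name__ == "__main__":
    # Placeholder stems; in practice these would be extracted from raw text.
    for row in generate_forms(["stem1", "stem2"], SUFFIXES):
        print(row)
```

Keeping the affixes in a separate table mirrors the claim above that porting the method to another language mainly requires changes to the affix structures.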
