Oromo Auto-Grammar Dataset

Published: 3 May 2023 | Version 1 | DOI: 10.17632/n5wg3mbp9r.1
Ebisa Gemechu


This contribution is a novel dataset called Oromo-grammar-dataset. The dataset was prepared with a custom Python algorithm from a 200 KB sample (about 100 pages of raw text) collected from online sources. In our use, the algorithm performed well at automatically generating a grammar-aware dataset for the Oromo language. The method can be reproduced for other languages after a systematic analysis of the target language and slight modifications to the affix structures the algorithm relies on. The output of the software is a grammar-rich dataset applicable to modern NLP applications such as machine translation, sentence completion, and grammar and spell checking. The dataset can also help linguists and academics teach the grammar structures of the language.
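The description does not publish the algorithm itself, so the following is only a minimal sketch of how an affix-driven annotation pass might look, not the authors' implementation. The function names (`split_suffix`, `annotate`), the record layout, and the two-entry suffix list are all hypothetical and chosen for illustration; a real affix inventory would come from the systematic analysis the description mentions.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical, deliberately tiny suffix list used only for illustration.
# It is NOT the affix inventory used to build the dataset.
EXAMPLE_SUFFIXES = ("oota", "tti")


def split_suffix(
    word: str,
    suffixes: Iterable[str] = EXAMPLE_SUFFIXES,
    min_stem: int = 2,
) -> Tuple[str, Optional[str]]:
    """Split a word into (remainder, suffix) using longest-match-first."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        # Only strip when enough characters remain to form a plausible stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)], suffix
    return word, None


def annotate(
    text: str, suffixes: Iterable[str] = EXAMPLE_SUFFIXES
) -> List[Dict[str, Optional[str]]]:
    """Turn whitespace-separated raw text into one record per token."""
    suffix_list = list(suffixes)
    records = []
    for token in text.lower().split():
        stem, suffix = split_suffix(token, suffix_list)
        records.append({"token": token, "stem": stem, "suffix": suffix})
    return records


if __name__ == "__main__":
    for record in annotate("Namoota manatti"):
        print(record)
```

Adapting a sketch like this to another language would mainly mean replacing the suffix list (and adding prefix handling where the language needs it), which is in line with the "slight modifications to affix structures" noted above.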



Statistical Natural Language Processing, Natural Language Generation