Nupe-English parallel corpus

Published: 13 December 2023 | Version 1 | DOI: 10.17632/k7dtv7k2hy.1
Umar Baba Umar


This is the first Nupe-English parallel corpus, accompanied by Nupe monolingual corpora, curated from diverse sources including poems, idioms, proverbs, and religious texts. The aim of this data collection is to make a culturally aware Nupe-English corpus available for NLP tasks such as machine translation.


Steps to reproduce

The Nupe-English Corpus Creation: Steps to Reproduce

Data Collection and Preparation: Gather diverse Nupe and English texts, including traditional literature (epics, folktales, proverbs), modern literature (novels, short stories, poetry), idioms, news articles, figurative language, and religious texts (Bible, Qur'an).

English-Nupe Parallel Text Construction using OCR: Employ an Optical Character Recognition (OCR) application to convert Nupe-English Bible and Qur'an texts into readable text format. Generate 20,000 sentences from the Bible and 6,600 sentences from the Qur'an. Manually translate 5,000 sentences from the Bible and 6,600 sentences from the Qur'an into Nupe. Repeat the process with additional PDFs, combining the resulting sentences into a Nupe-English corpus with 'English' and 'Nupe' columns.

Translating Existing French-English Corpus to Nupe-English: Remove the French column from an existing French-English corpus, leaving only the English column. Engage 5 native Nupe speakers to translate the English sentences into Nupe, creating a Nupe-English parallel text.

News Articles Extraction: Visit the Nigerian Television Authority (NTA) station in Bida to obtain translated news articles. Convert the hard copies into electronic formats using OCR. Develop a Nupe-EnglishExtractor to extract Nupe and English translations at the sentence level and merge them into a parallel corpus.
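The column-handling parts of these steps (dropping the French column, then pairing English sentences with their Nupe translations under 'English' and 'Nupe' columns) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the tooling actually used to build the dataset. The function names, the 'French' column name, and the placeholder Nupe strings are assumptions made for the example; only the 'English' and 'Nupe' column names come from the description above.

```python
import csv
import io

def drop_french_column(rows, keep="English"):
    """Return the non-empty English sentences from French-English rows.

    `rows` is an iterable of dicts, e.g. from csv.DictReader.
    Column names here are assumptions for illustration.
    """
    return [row[keep].strip() for row in rows if row.get(keep, "").strip()]

def build_parallel_rows(english, nupe):
    """Pair sentence-aligned English and Nupe lists into corpus rows."""
    if len(english) != len(nupe):
        raise ValueError(
            f"misaligned input: {len(english)} English vs {len(nupe)} Nupe sentences"
        )
    return [
        {"English": e.strip(), "Nupe": n.strip()}
        for e, n in zip(english, nupe)
        if e.strip() and n.strip()  # skip pairs with an empty side
    ]

def write_corpus(rows, handle):
    """Write rows as CSV with 'English' and 'Nupe' columns."""
    writer = csv.DictWriter(handle, fieldnames=["English", "Nupe"])
    writer.writeheader()
    writer.writerows(rows)

if __name__ == "__main__":
    # Toy French-English input; real data would be read with csv.DictReader.
    french_english = [
        {"French": "Bonjour", "English": "Hello"},
        {"French": "Merci", "English": "Thank you"},
    ]
    english = drop_french_column(french_english)
    # Placeholders stand in for translations supplied by native speakers.
    nupe = ["<Nupe translation 1>", "<Nupe translation 2>"]
    buffer = io.StringIO()
    write_corpus(build_parallel_rows(english, nupe), buffer)
    print(buffer.getvalue())
```

Raising an error on mismatched lengths is a deliberate choice in this sketch: a silent `zip` would truncate the longer list and could shift every later pair out of alignment, which is hard to detect in a parallel corpus after the fact.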


Abubakar Tafawa Balewa University


Natural Language Processing, Machine Translation