KenLumachiQuAD - A QA dataset for Kenyan Luhya Lumarachi dialect

Name: KenLumachiQuAD - A QA dataset for Kenyan Luhya Lumarachi dialect
Creator: Barack Wanjawa
Published: 2024-09-24T08:48:39.502Z
Keywords: Computer Science, Machine Learning, Information Extraction, Language, Low-Resource LLM

doi:10.17632/b6bybwnpxh.1

Published: 24 September 2024 | Version 1 | DOI: 10.17632/b6bybwnpxh.1

Contributor: Barack Wanjawa

Description

KenLumachiQuAD is the result of a project that annotated a total of 820 QA pairs based on 137 texts in the Lumarachi dialect of Luhya, a Kenyan language. The source texts come from the data collected by the Kenyan languages corpus, the Kencorpus project (https://kencorpus.maseno.ac.ke/corpus-datasets/) [1], which holds 483 Luhya Lumarachi texts in total. Each of the 137 selected texts was annotated with at least 5 QA pairs.

The KenLumachiQuAD QA dataset is available for download as a single CSV file. Each row of the CSV file gives the reference number of a source text and one QA pair associated with that text. The columns of the CSV file are:

‘Story_ID’: the source text from the Kencorpus project from which the QA pair is derived.
‘Q’: the question text.
‘A’: the answer text.

This QA dataset is a gold standard dataset annotated by human annotators who are native speakers of the language. It was formulated using the same modalities and quality assurance checks as a similar project for Kiswahili, another low-resource language [2].

The dataset is useful for testing machine learning QA systems for the low-resource language of Luhya, specifically the Lumarachi dialect, which is predominantly spoken in Western Kenya. A semantic network approach to the QA task, previously applied to Kiswahili [2], is currently being tested on this dataset to confirm whether it is applicable where there is too little training data (source texts) to train deep learning systems.

[1] Wanjawa, B., Wanzare, L., Indede, F., McOnyango, O., Ombui, E., & Muchemi, L. (2023). Kencorpus: A Kenyan Language Corpus of Swahili, Dholuo and Luhya for Natural Language Processing Tasks. Journal for Language Technology and Computational Linguistics, 36(2), 1–27.
[2] Wanjawa, B. W., Wanzare, L. D. A., Indede, F., McOnyango, O., Muchemi, L., & Ombui, E. (2023). KenSwQuAD—A Question Answering Dataset for Swahili Low-resource Language. ACM Transactions on Asian and Low-Resource Language Information Processing, 22(4), 1–20.
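The CSV layout described above (columns ‘Story_ID’, ‘Q’ and ‘A’, one QA pair per row) could be read along the following lines. This is only a minimal sketch: the column names come from the description, but the sample rows, the ‘Story_ID’ values and the helper name load_qa_pairs are placeholders invented for illustration, and the actual file name and encoding of the download are not stated here.

```python
import csv
import io
from collections import defaultdict


def load_qa_pairs(csv_file):
    """Group (question, answer) pairs by Story_ID from an open CSV file object.

    Hypothetical helper; assumes a header row with the columns
    Story_ID, Q and A as named in the dataset description.
    """
    pairs_by_story = defaultdict(list)
    for row in csv.DictReader(csv_file):
        pairs_by_story[row["Story_ID"]].append((row["Q"], row["A"]))
    return dict(pairs_by_story)


# Placeholder rows standing in for the real download; these are NOT
# actual dataset contents, and the ID format is an assumption.
sample = io.StringIO(
    "Story_ID,Q,A\n"
    "story_001,placeholder question 1,placeholder answer 1\n"
    "story_001,placeholder question 2,placeholder answer 2\n"
    "story_002,placeholder question 3,placeholder answer 3\n"
)

qa = load_qa_pairs(sample)
for story_id, pairs in qa.items():
    print(story_id, len(pairs))
```

For the real file, the in-memory sample would be replaced by something like open(path, newline="", encoding="utf-8"), where the path and the UTF-8 encoding are assumptions to be checked against the download. Grouping by ‘Story_ID’ also makes it easy to check the stated minimum of 5 QA pairs per source text.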
