Ethiopian Family Code QA Dataset

Published: 20 September 2024| Version 1 | DOI: 10.17632/hj8m6mff8c.1
Contributor:
Beimnet Bekele Guta

Description

This dataset contains collection of question-and-answer pairs that have been collected in two ways. The first method is to manually extract the question-and-answer pair from the revised family code of Ethiopia. The data generation process involves a review of each article of the family code of Ethiopia and generating questions and their answer for those question from the article they were extracted from. After the extraction each question-and-answer pair each of them was reviewed by people with domain knowledge to ensure the accuracy of them. Moreover, there was a second-round review to ensure the meaning accuracy of each pair. This dataset is created to fine tune llama-2 model to create a model that will be able to answer questions related with the revised family code of Ethiopia without utilization of any back and forth translation. The second method to generate the question answer pair is to use ChatGPT to generate the pairs. The English version of the Family Code articles was given to ChatGPT as an input, which in turn generated relevant questions and answers based on the content of the family code. These question-answer pairs were then translated from English to the Amharic using Google Translate, the translated dataset was manually reviewed by Amharic speaking team members to validate the quality of the translations and correct mistakes that were made during the translation process.

Files

Institutions

Addis Ababa Institute of Technology

Categories

Natural Language Processing, Machine Learning, Llama, Deep Learning, Large Language Model

Licence