Contextual Question-Answer Dataset for the Ethiopian Family Code

Published: 20 September 2024 | Version 1 | DOI: 10.17632/v4h2rscmwj.1
Contributor:
Beimnet Bekele Guta

Description

This dataset was created to fine-tune a Llama 2 model so that it can answer questions about the Revised Family Code of Ethiopia without relying on back-and-forth translation. It contains question-answer-context triples that were collected in two ways.

The first method is manual extraction from the Revised Family Code of Ethiopia. Each article of the code was reviewed, and questions were written together with answers drawn from the article in question. The context text from which each answer was derived is also included in the dataset. After extraction, each question-and-answer pair was reviewed by people with domain knowledge to ensure its accuracy. A second round of review then checked that the meaning of each pair was accurate.

The second method uses ChatGPT to generate the pairs. The English version of the Family Code articles was given to ChatGPT as input, and it generated relevant questions and answers based on their content. These question-answer pairs were then translated from English to Amharic using Google Translate. Amharic-speaking team members manually reviewed the translated dataset to validate the quality of the translations and correct mistakes introduced during translation. The context for each question-and-answer pair was added manually after a careful review process.
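As an illustration of how one question-answer-context entry might be prepared for fine-tuning, the sketch below joins the three parts into a single training string. The field names (`question`, `answer`, `context`), the `QARecord` class, the helper functions and the prompt template are all assumptions made for this example; they are not taken from the dataset files or from the contributor's training setup.

```python
from dataclasses import dataclass


@dataclass
class QARecord:
    """One dataset entry. Field names are hypothetical, not the dataset's actual column names."""
    question: str
    answer: str
    context: str


def is_complete(record: QARecord) -> bool:
    """Return True only if all three parts contain non-whitespace text."""
    return all(part.strip() for part in (record.question, record.answer, record.context))


def to_training_text(record: QARecord) -> str:
    """Join context, question and answer into one string (assumed template)."""
    return (
        f"### Context:\n{record.context.strip()}\n\n"
        f"### Question:\n{record.question.strip()}\n\n"
        f"### Answer:\n{record.answer.strip()}"
    )


if __name__ == "__main__":
    # Placeholder strings stand in for real Amharic text from the dataset.
    example = QARecord(
        question="<question text>",
        answer="<answer text>",
        context="<article text the answer was drawn from>",
    )
    if is_complete(example):
        print(to_training_text(example))
```

Keeping the context next to each pair, as the dataset does, lets the same entry serve either context-grounded training (as above) or question-only training by dropping the context section from the template.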

Institutions

Addis Ababa Institute of Technology

Categories

Natural Language Processing, Machine Learning, Llama, Deep Learning, Large Language Model

Licence