Amharic health question answering dataset
Description
The **AmHQA** dataset is an Amharic health question answering corpus curated to support research on natural language processing for low-resource languages and on medical question answering. The dataset consists of **1,600 question–answer pairs for training** and **400 pairs for testing**, all provided in **CSV format**. The content is written entirely in **Amharic** and is intended to facilitate the development and evaluation of extractive and neural question answering systems in the health domain. AmHQA is released under the **Creative Commons Attribution 4.0 International (CC BY 4.0) licence**, which permits use, distribution, and adaptation provided that appropriate attribution is given. Researchers using this dataset are requested to cite it as: *Bogale, B., et al. (2026). AmHQA: An Amharic Health Question Answering Dataset. Mendeley Data*. For further information or inquiries, please contact **[berhanubogale0101@gmail.com](mailto:berhanubogale0101@gmail.com)** (mobile: **0938282528**).
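As a rough illustration of how the CSV splits might be read, the sketch below loads question–answer pairs with Python's standard `csv` module. The column names `question` and `answer` are assumptions, since the description does not state the actual header names, and the sample row is placeholder text rather than data from the corpus.

```python
import csv
import io


def load_qa_pairs(source, question_col="question", answer_col="answer"):
    """Read question-answer pairs from an open CSV file object.

    The column names are hypothetical defaults; adjust them to match
    the headers in the released files.
    """
    reader = csv.DictReader(source)
    pairs = []
    for row in reader:
        question = (row.get(question_col) or "").strip()
        answer = (row.get(answer_col) or "").strip()
        # Skip rows where either field is missing or empty.
        if question and answer:
            pairs.append((question, answer))
    return pairs


# Tiny in-memory sample standing in for a real split (placeholder text).
sample = io.StringIO("question,answer\nጤና ምንድን ነው?,ጤና የአካል ደህንነት ነው።\n")
pairs = load_qa_pairs(sample)
print(len(pairs))
```

For the real files, the same function can be called on a handle opened with `open(path, encoding="utf-8", newline="")`; the row counts can then be compared against the 1,600 and 400 pairs stated above.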
Steps to reproduce
The dataset was constructed by translating selected question–answer pairs from the lavita/MedQuAD dataset into Amharic using Google Translate, followed by basic cleaning and normalization to ensure linguistic consistency.
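The exact cleaning and normalization rules are not specified above. The following is a minimal sketch of what such a step could look like, assuming Unicode NFC normalization, whitespace collapsing, and the folding of a few homophone Ethiopic letters that is common in Amharic text processing. The function name `normalize_amharic` and the character mapping are illustrative assumptions, not the authors' documented procedure.

```python
import re
import unicodedata

# Assumed homophone folding (base forms only); not taken from the dataset
# description. Each key is mapped to a commonly used equivalent letter.
HOMOPHONE_MAP = str.maketrans({
    "ሐ": "ሀ",
    "ኀ": "ሀ",
    "ሠ": "ሰ",
    "ዐ": "አ",
    "ፀ": "ጸ",
})


def normalize_amharic(text):
    """Apply a simple, assumed normalization to an Amharic string."""
    # Canonical Unicode composition so equivalent sequences compare equal.
    text = unicodedata.normalize("NFC", text)
    # Fold the selected homophone letters to one representative form.
    text = text.translate(HOMOPHONE_MAP)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


cleaned = normalize_amharic("  ሠላም   ዓለም ")
print(cleaned)
```

A fuller mapping would also cover the remaining vowel orders of each letter family; only the base forms are shown here to keep the sketch short.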
Institutions
- Bahir Dar University Institute of Technology