Amharic health question answering dataset

Name: Amharic health question answering dataset
Creator: Berhanu Bogale
Published: 2026-01-05T11:27:12.154Z
Keywords: Natural Language Processing

Bogale, Berhanu; Tegegne, Tesfa; Abate, Solomon Teferra; Belay, Gebeyehu

doi:10.17632/8cks7m5f8s.1

Published: 5 January 2026 | Version 1 | DOI: 10.17632/8cks7m5f8s.1

Contributors:

Berhanu Bogale, Tesfa Tegegne, Solomon Teferra Abate, Gebeyehu Belay

Description

The **AmHQA** dataset is an Amharic health question answering corpus curated to support research in low-resource language natural language processing and medical question answering. The dataset consists of **1,600 question–answer pairs for training** and **400 pairs for testing**, all provided in **CSV format**. The content is written entirely in **Amharic** and is intended to facilitate the development and evaluation of extractive and neural question answering systems in the health domain.

AmHQA is released under the **Creative Commons Attribution 4.0 International (CC BY 4.0) licence**, which allows use, distribution, and adaptation with appropriate attribution. Researchers using this dataset are requested to cite it as: *Bogale, B., et al. (2026). AmHQA: An Amharic Health Question Answering Dataset. Mendeley Data*.

For further information or inquiries, please contact **[berhanubogale0101@gmail.com](mailto:berhanubogale0101@gmail.com)** (mobile: **0938282528**).
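As a minimal sketch of how the two splits might be loaded and sanity-checked, the Python snippet below reads a CSV file with the standard library and counts its rows. The file names `train.csv` and `test.csv` and the column names `question` and `answer` are assumptions made for illustration; this page does not list the actual file or column names, so they would need to be adjusted to match the downloaded files.

```python
import csv
from pathlib import Path


def load_qa_pairs(path, question_col="question", answer_col="answer"):
    """Read question-answer pairs from a UTF-8 CSV file.

    The column names are hypothetical defaults, not documented by the dataset.
    """
    pairs = []
    # utf-8-sig tolerates an optional byte-order mark; newline="" is what csv expects.
    with open(path, encoding="utf-8-sig", newline="") as handle:
        reader = csv.DictReader(handle)
        for row in reader:
            pairs.append((row[question_col], row[answer_col]))
    return pairs


if __name__ == "__main__":
    # Hypothetical file names; the expected sizes come from the description.
    for name, expected in (("train.csv", 1600), ("test.csv", 400)):
        if Path(name).exists():
            pairs = load_qa_pairs(name)
            print(f"{name}: {len(pairs)} pairs (expected {expected})")
```

Reading with an explicit UTF-8 encoding matters here because the content is Amharic (Ethiopic script), which a platform default encoding may not decode correctly.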

Steps to reproduce

The dataset was constructed by translating selected question–answer pairs from the lavita/MedQuAD dataset into Amharic using Google Translate, followed by basic cleaning and normalization to ensure linguistic consistency.
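The exact cleaning and normalization steps are not specified here. Purely as an illustration of what "basic cleaning and normalization" could involve, the sketch below applies Unicode NFC normalization, collapses whitespace, and drops empty or duplicate pairs. These particular steps and the function names `clean_text` and `clean_pairs` are assumptions, not the authors' documented procedure, and the translation step itself is not shown.

```python
import re
import unicodedata


def clean_text(text):
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_pairs(pairs):
    """Clean each (question, answer) pair, dropping empty and duplicate rows."""
    seen = set()
    cleaned = []
    for question, answer in pairs:
        q, a = clean_text(question), clean_text(answer)
        if not q or not a:
            continue  # skip rows where a field ended up empty
        if (q, a) in seen:
            continue  # skip exact duplicates after cleaning
        seen.add((q, a))
        cleaned.append((q, a))
    return cleaned
```

Deduplicating after cleaning, rather than before, catches pairs that differ only in spacing or line breaks.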

Institutions

Bahir Dar University Institute of Technology
