Akan–English Maternal Health Parallel Text Corpus for Machine Translation
Description
This dataset contains a curated bilingual parallel corpus developed to support domain-specific neural machine translation (NMT) for maternal health communication between Akan and English. The corpus was constructed to address the scarcity of healthcare-specific parallel data for low-resource African languages, particularly Akan. The dataset comprises 20,101 cleaned English–Akan parallel sentence pairs, of which 12,100 pairs (60.2%) originate from maternal health content covering prenatal and postnatal care domains, and 8,006 pairs (39.8%) are drawn from general-domain sources to enhance linguistic diversity and model robustness. Maternal health topics represented include antenatal care, childbirth preparation, maternal mental health, nutrition, vaccination, medication use, lifestyle behaviours, preventive medicine, personal hygiene, and common pregnancy-related conditions.
Value of the Dataset
This dataset provides one of the first domain-specific maternal health resources for Akan that integrates both parallel text and aligned speech data, enabling research in neural machine translation, speech recognition, text-to-speech, and multimodal health communication systems. It supports the development of inclusive digital health tools such as maternal health chatbots and voice-based systems designed for Akan-speaking communities and other low-resource language contexts. This corpus was developed as part of the Ɔbaa Panin Project, which seeks to build a conversational maternal health chatbot in Akan. The Ɔbaa Panin Project is funded by Google Research.
Steps to reproduce
1. Define Scope and Domain Boundaries
   A. Restrict corpus development to maternal health communication.
   B. Cover both prenatal and postnatal care domains.
   C. Include subtopics such as:
      - Barriers and myths
      - Physical exercise
      - Mental health
      - Childbirth preparation
      - Medication and vaccination
      - Nutrition and dietary guidelines
      - Lifestyle behaviours
      - Maternal assessment
      - Preventive medicine
      - Hygiene
      - Common pregnancy-related conditions
2. Collect Source English Maternal Health Content
   A. Scrape or manually collect maternal-health-related content from verified international and local digital health websites.
   B. Extract relevant textual excerpts focusing on patient-facing information.
   C. Remove irrelevant, duplicated, or non-health-related content.
3. Generate Structured Question–Answer Pairs
   A. Design a prompt for a Large Language Model (e.g., ChatGPT-4o API).
   B. Use the OpenAI API to generate structured question–answer pairs from the collected excerpts.
   C. Process content in batches (50–100 excerpts per request).
   D. Ensure each output contains a clear maternal-health question and a corresponding answer.
4. Categorize Content by Maternal Health Domains
   A. Organize generated Q&A pairs according to WHO-recommended maternal health categories.
   B. Remove misclassified or irrelevant pairs.
5. Manually review all generated pairs to:
   A. Ensure contextual appropriateness for Ghana.
   B. Remove culturally irrelevant or inaccurate content.
   C. Eliminate redundancy.
6. Submit the refined English Q&A pairs to qualified medical experts to verify:
   A. Clinical correctness
   B. Terminology accuracy
   C. Public-health appropriateness
   Revise or remove items based on the expert feedback.
7. Translation into Akan
   A. Provide validated English pairs to Akan language experts (linguists).
   B. Use a web-based transcription app to:
      - Enter sentence-aligned translations
      - Provide 2 to 3 alternate translations of each question–answer pair in Akan
      - Maintain one-to-one parallel alignment
      - Enforce orthographic consistency
8. Augment with General-Domain Parallel Data
   A. Incorporate additional English–Akan parallel datasets from existing open resources.
   B. Combine the maternal-health corpus with the general-domain corpus to enhance linguistic coverage.
   C. Target approximate ratio:
      - ~60% maternal health
      - ~40% general domain
9. Perform the following preprocessing steps:
   A. Remove duplicate sentence pairs
   B. Standardize Akan orthography
   C. Normalize Akan-specific characters
   D. Remove non-Akan characters or replace appropriately
   E. Ensure sentence-level alignment
   F. Correct spacing and punctuation inconsistencies
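The batching in step 3 (50–100 excerpts per request) can be sketched as follows. This is only an illustration under stated assumptions: the helper names `batched` and `build_prompt` and the prompt wording are hypothetical and are not taken from the project's pipeline, and the actual API request is deliberately left out.

```python
from typing import Iterator, List


def batched(excerpts: List[str], batch_size: int = 50) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `batch_size` excerpts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(excerpts), batch_size):
        yield excerpts[start:start + batch_size]


def build_prompt(batch: List[str]) -> str:
    """Assemble one request body for a batch (hypothetical prompt wording)."""
    numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(batch, start=1))
    return (
        "For each numbered excerpt, write one maternal-health question "
        "and its answer.\n\n" + numbered
    )


if __name__ == "__main__":
    excerpts = [f"Excerpt {i}" for i in range(120)]
    for batch in batched(excerpts, batch_size=50):
        prompt = build_prompt(batch)
        # The prompt would be sent to the language model at this point.
        print(len(batch), len(prompt))
```

Keeping the batch size as a parameter makes it easy to stay within the 50–100 range while adjusting for request size limits.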
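The preprocessing in step 9 might look roughly like the sketch below. It is an illustration, not the project's actual cleaning script: the function names are hypothetical, the look-alike map (Greek epsilon replaced by the Akan open e, ɛ) is an assumed example of character normalization, and duplicates are detected case-insensitively as a design choice.

```python
import re
import unicodedata

# Hypothetical look-alike map: Greek epsilon typed in place of the Akan
# open e. Extend or drop this to match what the real data contains.
LOOKALIKES = str.maketrans({"\u03b5": "\u025b"})  # ε -> ɛ


def normalize(text: str) -> str:
    """Apply Unicode, character, spacing, and punctuation normalization."""
    text = unicodedata.normalize("NFC", text)        # one canonical form
    text = text.translate(LOOKALIKES)                # fix look-alike letters
    text = re.sub(r"\s+", " ", text).strip()         # collapse whitespace
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)     # no space before punctuation
    return text


def clean_pairs(pairs):
    """Normalize (English, Akan) pairs, drop empty sides and duplicates."""
    seen = set()
    cleaned = []
    for english, akan in pairs:
        english, akan = normalize(english), normalize(akan)
        if not english or not akan:
            continue  # an empty side breaks one-to-one alignment
        key = (english.casefold(), akan.casefold())
        if key in seen:
            continue  # duplicate after normalization
        seen.add(key)
        cleaned.append((english, akan))
    return cleaned


if __name__ == "__main__":
    sample = [
        ("How  are you ?", "Wo ho te s\u03b5n ?"),
        ("How are you?", "Wo ho te s\u025bn?"),
        ("", "x"),
    ]
    print(clean_pairs(sample))
```

Normalizing before deduplicating matters here: two pairs that differ only in spacing or in a look-alike character collapse to the same key and are counted once.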
Institutions
- University of Ghana, Greater Accra, Accra