MedQA-MA: Question Answering Dataset in Moroccan Arabic for the Healthcare Domain

Published: 9 July 2025 | Version 1 | DOI: 10.17632/v6gs7nsy9z.1
Contributors:

Description

MedQA-MA: A Moroccan Arabic Healthcare Question Answering Dataset

MedQA-MA is the first publicly available dataset dedicated to question answering (QA) in the Moroccan Arabic dialect within the healthcare domain. Developed to address the critical shortage of medical resources in low-resource dialectal Arabic, this dataset fills a vital gap in the field of clinical natural language processing (NLP) for North Africa.

MedQA-MA comprises over 114,000 carefully curated question-answer pairs, each labeled with one of 23 distinct medical specialties, including Psychiatry, Cardiology, Pediatrics, Dermatology, Oncology, Internal Medicine, Neurology, and more. The questions are formulated in Moroccan Darija, the most widely spoken dialect in Morocco, while the answers provide medically accurate, concise, and contextually relevant responses. The dataset unifies several previously fragmented sub-corpora and standardizes the labeling and taxonomy of specialties. For example, "General Medicine" and "General Practitioner" are merged into a single category to enhance consistency and usability.

This dataset holds high potential for a wide range of NLP and AI applications, particularly in underrepresented languages and domains. It can serve as a foundation for training and evaluating:

- Open-domain and closed-domain QA systems
- Dialogue agents for health advice
- Text classification and intent detection models
- Named Entity Recognition (NER) in medical context
- Multilingual and dialectal machine translation systems

Moreover, MedQA-MA supports research in fairness, robustness, and accessibility of AI in healthcare, particularly for Arabic-speaking communities that lack access to culturally and linguistically relevant NLP technologies. Its rich domain coverage makes it valuable not only for building Darija-specific models but also for enhancing generalization in multilingual medical systems.

In sum, MedQA-MA is a pioneering contribution to the Arabic NLP landscape. It empowers researchers, developers, and public health stakeholders to build intelligent systems that better reflect the linguistic realities of Moroccan users. As the first structured healthcare QA dataset in Moroccan Arabic, it opens the door for inclusive, high-impact advancements in AI for medicine in the Global South.
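
The specialty-label standardization mentioned in the description (merging "General Medicine" and "General Practitioner" into one category) could be reproduced along the following lines. This is only a minimal sketch, not the authors' actual pipeline. The field names (question, answer, specialty), the alias table, and the choice of "General Medicine" as the canonical label are all assumptions made for illustration, since this page does not document the file schema. The toy records are placeholders, not rows from the dataset.

```python
from collections import Counter

# Hypothetical alias table. Only the "General Medicine" / "General Practitioner"
# merge is mentioned in the description; which name is kept as canonical is an
# assumption made for this sketch.
SPECIALTY_ALIASES = {
    "general medicine": "General Medicine",
    "general practitioner": "General Medicine",
}


def normalize_specialty(label: str) -> str:
    """Collapse extra whitespace and map known aliases to one canonical label."""
    cleaned = " ".join(label.split())
    return SPECIALTY_ALIASES.get(cleaned.lower(), cleaned)


def specialty_counts(records) -> Counter:
    """Count records per canonical specialty (assumes a 'specialty' key)."""
    return Counter(normalize_specialty(r["specialty"]) for r in records)


# Placeholder records: the field names are assumed, the values are dummies.
records = [
    {"question": "<q1>", "answer": "<a1>", "specialty": "General Practitioner"},
    {"question": "<q2>", "answer": "<a2>", "specialty": "general medicine "},
    {"question": "<q3>", "answer": "<a3>", "specialty": "Cardiology"},
]

counts = specialty_counts(records)
for name, n in sorted(counts.items()):
    print(f"{name}: {n}")
```

Applied to the real files, a count like this would be one way to check that the merged taxonomy yields the 23 specialties stated above, once the actual column names are known.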

Files

Categories

Medical Assistant, Artificial Intelligence, Data Science, Health Care, Arabic Language, Medical Care in Morocco, Virtual Assistant, Large Language Model

Licence