AraSTEM

Published: 26 March 2025 | Version 2 | DOI: 10.17632/rn4zbzg8z2.2
Contributors:
Hassan Al Husseini

Description

AraSTEM is a dataset for evaluating the knowledge of large language models (LLMs) in STEM (Science, Technology, Engineering, and Mathematics) subjects in Arabic. It consists of multiple-choice questions that span a range of topics and difficulty levels and require models to demonstrate a deep understanding of scientific Arabic. Each entry records the question, its answer options, the correct answer, the subject, the difficulty level, and a link to the source.
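
As an illustration of how a multiple-choice entry of this kind might be used for evaluation, the following is a minimal Python sketch. The `MCQ` class, its field names, the lettered prompt layout, and the convention of storing the correct answer as option text are all assumptions made for this example; they are not taken from the released files, and the actual column names and answer encoding may differ.

```python
from dataclasses import dataclass

# Hypothetical record layout; field names are assumed, not the dataset's own.
@dataclass
class MCQ:
    question: str
    options: list[str]
    answer: str      # assumed to hold the text of the correct option
    subject: str
    level: str
    link: str

LETTERS = "ABCDEFGH"

def format_prompt(item: MCQ) -> str:
    """Render one question as a lettered multiple-choice prompt."""
    lines = [item.question]
    for letter, option in zip(LETTERS, item.options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer:")
    return "\n".join(lines)

def accuracy(items: list[MCQ], predictions: list[str]) -> float:
    """Fraction of items whose predicted letter matches the correct option."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    correct = 0
    for item, prediction in zip(items, predictions):
        gold_letter = LETTERS[item.options.index(item.answer)]
        # Compare only the first non-space character of the model output.
        if prediction.strip().upper()[:1] == gold_letter:
            correct += 1
    return correct / len(items)
```

Scoring on the first character of the model output is a deliberate simplification; a real evaluation harness would likely need stricter answer parsing.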

Steps to reproduce

AraSTEM data is collected from various Arabic-language sources. Content is manually extracted from Arabic PDF books that focus on STEM topics, and multiple-choice questions (MCQs) are scraped from Arabic-language online platforms dedicated to STEM education. Additionally, MCQs are manually extracted from Arabic web sources, and medical MCQ exams are digitized using Optical Character Recognition (OCR), followed by manual corrections to ensure accuracy. After the data is collected, it is organized into a structured format that includes the questions, options, correct answers, subjects, difficulty levels, and a resource link for each entry.
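
The final organization step described above could be sketched roughly as follows in Python. This is an illustrative assumption, not the authors' pipeline: the field names in `REQUIRED_FIELDS`, the helper names `clean_text` and `build_record`, the use of Unicode NFKC normalization to fold Arabic presentation forms that scraped or OCR output can contain, and the rule that the correct answer is stored as option text are all hypothetical choices for this example.

```python
import unicodedata

# Assumed field names for one structured entry (not confirmed by the dataset).
REQUIRED_FIELDS = ("question", "options", "answer", "subject", "level", "link")

def clean_text(text: str) -> str:
    """Normalize Unicode compatibility forms and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())

def build_record(raw: dict) -> dict:
    """Turn one raw collected row into a validated, structured entry."""
    missing = [field for field in REQUIRED_FIELDS if field not in raw]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    options = [clean_text(option) for option in raw["options"]]
    answer = clean_text(raw["answer"])
    if len(options) < 2:
        raise ValueError("an MCQ needs at least two options")
    if answer not in options:
        raise ValueError("correct answer must be one of the options")
    return {
        "question": clean_text(raw["question"]),
        "options": options,
        "answer": answer,
        "subject": clean_text(raw["subject"]),
        "level": clean_text(raw["level"]),
        "link": raw["link"].strip(),
    }
```

Rejecting rows whose answer is not among the options is one simple way to surface OCR or scraping errors for the manual correction pass.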

Institutions

American University of Beirut

Categories

Arabic Language, Science, Technology, Engineering and Mathematics, Large Language Model

Funding

AUB AI, Data Science, and Computing Hub

Licence