MASAQ: Morphologically-Analyzed and Syntactically-Annotated Quran Dataset

Published: 9 December 2024 | Version 6 | DOI: 10.17632/9yvrzxktmr.6
Contributors:
Majdi Sawalha

Description

The Morphologically-Analyzed and Syntactically-Annotated Quran (MASAQ) dataset is a high-quality annotated resource designed to advance Arabic Natural Language Processing (NLP). Covering the entire Quran, MASAQ includes over 131K morphological and 123K syntactic entries, verified by expert linguists using traditional i'rab methodologies. It is available in multiple formats and supports applications ranging from teaching Arabic grammar to improving NLP tools such as parsers and taggers. By enabling precise language analysis, MASAQ supports advances in Arabic NLP and cross-linguistic research. The dataset is released under a Creative Commons licence.

Steps to reproduce

The raw data used in MASAQ is the verified Quran text from Tanzil, which underwent a rigorous three-step verification process: automatic text extraction, rule-based verification, and manual verification against the Medina Mushaf. This Tanzil version, in Unicode format, is widely regarded as free of typos. It includes both the Uthmani and imla'i scripts: the Uthmani script is the most authoritative, while the imla'i script offers a modern representation that is easier to read and analyze. The Tanzil version is based on the 1924 Cairo edition of the Quran, endorsed by Al-Azhar University, which standardized the Ḥafṣ ‘an ‘Āṣim reading and established widely accepted verse numbering and chapter ordering. The text was used in compliance with the Creative Commons Attribution 3.0 License.

In the MASAQ dataset, each row represents a word or segment from a specific verse and chapter (sura) of the Quran. A row consists of 20 columns carrying detailed morphological, syntactic, and semantic information. Separate groups of human experts in morphology and in syntax were recruited to perform the annotation.
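Because each row holds one word or segment tied to a sura and verse, a common first step is to group rows back into verses. The following is a minimal sketch only. It assumes a CSV export, and the column names (sura_no, verse_no, segment, pos_tag) are hypothetical, since the 20 column headers are not listed here. The sample rows are placeholders, not real MASAQ annotations.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names: adjust to match the headers in the actual file.
SURA_COL = "sura_no"
VERSE_COL = "verse_no"


def group_rows_by_verse(csv_file):
    """Group segment rows by (sura, verse), preserving file order."""
    verses = defaultdict(list)
    for row in csv.DictReader(csv_file):
        key = (int(row[SURA_COL]), int(row[VERSE_COL]))
        verses[key].append(row)
    return dict(verses)


# Placeholder rows for illustration only, not real MASAQ annotations.
SAMPLE = """sura_no,verse_no,segment,pos_tag
1,1,seg-a,TAG1
1,1,seg-b,TAG2
1,2,seg-c,TAG1
"""

verses = group_rows_by_verse(io.StringIO(SAMPLE))
for (sura, verse), rows in verses.items():
    # Print the segments of each verse in their original order.
    print(sura, verse, [r["segment"] for r in rows])
```

To run this against a downloaded file, replace the io.StringIO(SAMPLE) argument with a file object opened with encoding="utf-8" and newline="", and update the two column constants to the real header names.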

Institutions

The University of Jordan

Categories

Linguistics, Artificial Intelligence, Computational Linguistics, Natural Language Processing

Funding

University of Jordan

3/2017

Licence