MASAQ: Morphologically and Syntactically-Annotated Quran Dataset

Published: 22 October 2024 | Version 2 | DOI: 10.17632/9yvrzxktmr.2
Contributor:
Majdi Sawalha

Description

The Morphologically-Analyzed and Syntactically-Annotated Quran (MASAQ) dataset is a resource designed to address the scarcity of annotated Quranic Arabic corpora and to facilitate the development of advanced Natural Language Processing (NLP) models. MASAQ provides detailed morphological and syntactic annotation of the entire Quranic text, using a rigorously verified text from Tanzil.net. The dataset includes more than 131K morphological entries and 123K instances of syntactic functions, covering a wide range of grammatical roles and relationships. A team of expert Arabic linguists carried out the annotation using traditional i'rab methodologies to ensure high accuracy and consistency. The dataset is distributed in multiple formats (txt, CSV, xlsx, XML, JSON) to cater to various research needs. It is made available under the Creative Commons Attribution 3.0 License, ensuring compliance with ethical guidelines and respecting the integrity of the Quranic text.
MASAQ has potential applications across several domains. Pedagogically, it can simplify the teaching of Arabic grammar by focusing on fundamental concepts. In NLP, it can enhance tools such as part-of-speech taggers and parsers, which are essential for automated language understanding, and so support more accurate and more efficient Arabic language processing tools. Linguistically, it provides valuable syntactic analysis for research. Additionally, dependency parsers derived from MASAQ can efficiently analyze web content, resolve several types of sentence ambiguity, and contribute to semantic representations.
The dataset also supports efforts like Universal Dependencies, facilitating cross-linguistic research and multilingual NLP tool development. Furthermore, integrating dependency parsing with machine learning classifiers can improve parsing accuracy and efficiency, which is particularly useful for languages with free word order, such as written Arabic. Overall, MASAQ offers a comprehensive resource for advancing both academic and practical applications in Arabic NLP.

Steps to reproduce

The raw Quranic text was sourced from Tanzil.net, which employed NLP techniques to produce an accurate, standardized Unicode representation. Tanzil's text preparation involved three stages: automatic text extraction, followed by cleaning and normalization to create a core text representation; rule-based verification to ensure grammatical and recitational accuracy; and manual verification by experts who cross-referenced the text with the Medina Mushaf. The process included character and diacritic checksum calculations. Since its release in 2008, the Tanzil text has been widely adopted and has gained widespread acclaim for its absence of typos.
The Tanzil version includes both the Uthmani and imla'i scripts, with the Uthmani script being the most authoritative and the imla'i script offering a modern representation for easier understanding and analysis. It is based on the 1924 Cairo edition of the Quran, endorsed by Al-Azhar University, which standardized the Ḥafṣ ‘an ‘Āṣim reading and established widely accepted verse numbering and chapter ordering. The text was used in compliance with the Creative Commons Attribution 3.0 License.
Each row of MASAQ represents a word or segment from a specific verse and chapter (sura) of the Quran and carries detailed morphological, syntactic, and semantic information in 20 columns.
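
The row-per-segment layout described above can be explored with a short script. The following is a minimal Python sketch, not the authors' tooling: it groups rows of the CSV release by (sura, verse) using only the standard library. The column names `Sura_No`, `Verse_No`, and `Segment` are hypothetical placeholders, since the actual header names are not given here, and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names: the real MASAQ headers may differ.
SURA_COL = "Sura_No"
VERSE_COL = "Verse_No"


def group_rows_by_verse(handle, sura_col=SURA_COL, verse_col=VERSE_COL):
    """Group word/segment rows by (sura, verse), preserving file order."""
    reader = csv.DictReader(handle)
    verses = defaultdict(list)
    for row in reader:
        key = (int(row[sura_col]), int(row[verse_col]))
        verses[key].append(row)
    # fieldnames holds the header row, useful for checking the column count
    return dict(verses), reader.fieldnames


# Tiny made-up sample with placeholder values, not real MASAQ data.
sample = io.StringIO(
    "Sura_No,Verse_No,Segment\n"
    "1,1,seg_a\n"
    "1,1,seg_b\n"
    "1,2,seg_c\n"
)
verses, columns = group_rows_by_verse(sample)
for (sura, verse), rows in verses.items():
    print(sura, verse, len(rows))
```

With the real file, the handle would come from something like `open(path, encoding="utf-8", newline="")`, and `len(columns)` could be compared against the 20 columns stated in the description as a quick sanity check.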

Institutions

The University of Jordan

Categories

Linguistics, Artificial Intelligence, Computational Linguistics, Natural Language Processing

Funding

University of Jordan

3/2017

Licence