Quranic
Description
This Quranic dataset addresses the critical need for comprehensive, computationally accessible linguistic resources for Classical Arabic (CA). The underlying premise is that the lack of such resources, particularly a complete machine-readable syntactic layer, hinders progress in CA natural language processing (NLP). This dataset demonstrates the feasibility of constructing such a resource for the entire Holy Quran using computational methods combined with expert validation.

The data (~132,736 tokens) comprises three integrated layers:

- Orthographic: includes the standard Imlaai and Quran-specific Uthmani scripts, Buckwalter and phonetic transliterations, an English translation, and dual (Quranic/sentence-based) indexing.
- Morphological: features fine-grained part-of-speech (POS) tagging; detailed morphosyntactic features (case, mood, aspect, etc.); and lemma and root information based on refined, expert-validated schemas.
- Syntactic: provides the first complete, computationally processable syntactic annotation for the entire Quran, using a novel hybrid constituency-dependency framework.

Data collection involved sourcing foundational text and annotations from public resources (Tanzil, the Quranic Corpus, and the Comprehensive Islamic Library). Custom Python scripts handled orthographic processing, morphological re-annotation, and syntactic seed-data preparation (image-to-text conversion). A deep learning parser (a BiLSTM architecture using custom Word2Vec embeddings derived from classical texts) generated the comprehensive syntactic layer. All layers underwent rigorous manual validation, including expert review and, crucially, cross-referencing of the generated syntax against authoritative I'rab (grammatical analysis) references.

Notable findings embodied by the dataset itself include the successful large-scale application of a hybrid syntactic annotation model to the entire Quran and the effective integration of rich, multi-faceted linguistic information within a unified structure. The data is presented primarily in an extended CoNLL-X tabular format, accompanied by auxiliary files (lexicons and schemas); a reading sketch follows this description.

Interpretation and reuse: this dataset serves as a crucial benchmark for CA NLP. Researchers can use it to train and evaluate parsers, morphological analyzers, POS taggers, and diacritization models. It offers rich empirical data for theoretical linguistics and a foundation for pedagogical tools, digital humanities projects, and other CA language technologies. An associated analytical tool (Noor) aids visualization and exploration. Users should note that the syntactic layer, while extensively validated, awaits further exhaustive manual curation to reach definitive gold-standard status.
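As a rough orientation to the file format, the sketch below reads a blank-line-separated CoNLL-X style file. The ten standard CoNLL-X columns are real; the extra column names shown are hypothetical stand-ins for this dataset's additional layers, so consult the accompanying schema files for the actual layout.

```python
# Minimal sketch of reading an extended CoNLL-X style file.
# Standard CoNLL-X defines ten tab-separated columns; the EXTRA_COLUMNS
# names below are illustrative assumptions, not the dataset's real schema.
from dataclasses import dataclass

CONLLX_COLUMNS = [
    "ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
    "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL",
]
EXTRA_COLUMNS = ["UTHMANI", "BUCKWALTER", "QURAN_INDEX"]  # hypothetical

@dataclass
class Token:
    fields: dict

def read_conllx(path):
    """Yield one sentence (a list of Token) per blank-line-separated block."""
    names = CONLLX_COLUMNS + EXTRA_COLUMNS
    sentence = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line:                  # a blank line ends the sentence
                if sentence:
                    yield sentence
                    sentence = []
                continue
            sentence.append(Token(dict(zip(names, line.split("\t")))))
    if sentence:
        yield sentence
```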
Files
Steps to reproduce
Methods

Data acquisition commenced with publicly available foundational Quranic resources: text scripts and positional data from the Tanzil Project, and initial morphological/syntactic data (including partial parse-tree images) from the Quranic Corpus project. Additional linguistic information, and texts for embedding training, were sourced from the Comprehensive Islamic Library.

Orthographic layer construction employed custom Python scripts for sentence and word tokenization, applied to both the Uthmani and Imlaai scripts. Further Python algorithms generated standard Buckwalter transliterations and English phonetic transliterations for each token (a transliteration sketch is given below). English translations were aligned at the verse level using custom scripts. All generated orthographic data underwent manual verification.

Morphological layer enhancement involved refining the existing POS and feature schemas based on analysis of seven authoritative CA grammatical references and consultation with linguistic experts. A custom Python algorithm was developed and applied to automatically re-annotate the entire corpus according to these enhanced, fine-grained schemas. Subsequent thorough manual validation and correction by annotators ensured adherence to the new schemas.

Syntactic layer generation (the novel contribution) was a four-stage process. It began with programmatic collection (web crawling via Python) of parse-tree images from the Quranic Corpus (~40% coverage); a crawling sketch is given below. Custom Python algorithms, likely involving image-analysis logic, were developed to convert these images into structured, machine-readable syntactic data. A deep learning parser, specifically a BiLSTM architecture, was then developed and trained; crucially, this model used custom Word2Vec word embeddings pre-trained on a large, curated corpus of classical Arabic texts (sourced from the Comprehensive Islamic Library) to capture rich contextual information (see the parser sketch below). This parser generated complete hybrid constituency-dependency annotations for the entire Quran. The final stage was rigorous manual validation by experts, critically including detailed comparison against seven gold-standard reference works on Quranic I'rab (grammatical analysis).

The overall workflow integrated automated data processing (Python scripts and deep learning models) with expert-driven refinement (schema design) and comprehensive manual curation and validation across all layers. No non-standard hardware was required beyond typical computational resources for scripting and model training.
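The transliteration step can be pictured as a character-level table lookup. The mapping below is a partial excerpt of the standard Buckwalter table for illustration; the dataset's own scripts are not reproduced here, so this is only an assumption about their general approach.

```python
# Minimal sketch of character-level Buckwalter transliteration.
# Partial mapping (standard Buckwalter values); extend as needed.
BUCKWALTER = {
    "\u0621": "'",  # hamza
    "\u0627": "A",  # alif
    "\u0628": "b", "\u062A": "t", "\u062B": "v",
    "\u062C": "j", "\u062D": "H", "\u062E": "x",
    "\u062F": "d", "\u0630": "*", "\u0631": "r", "\u0632": "z",
    "\u0633": "s", "\u0634": "$", "\u0635": "S", "\u0636": "D",
    "\u0637": "T", "\u0638": "Z", "\u0639": "E", "\u063A": "g",
    "\u0641": "f", "\u0642": "q", "\u0643": "k", "\u0644": "l",
    "\u0645": "m", "\u0646": "n", "\u0647": "h",
    "\u0648": "w", "\u064A": "y", "\u0629": "p", "\u0649": "Y",
    "\u064E": "a", "\u064F": "u", "\u0650": "i",  # short vowels
    "\u0651": "~", "\u0652": "o",                 # shadda, sukun
}

def to_buckwalter(token: str) -> str:
    """Transliterate one Arabic token; unknown characters pass through."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in token)

# Example: to_buckwalter("كِتَاب") == "kitaAb"
```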
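The image-collection stage amounts to a straightforward crawl loop. The URL template below is purely hypothetical, as the actual endpoints and parameters used for the Quranic Corpus parse-tree pages are not documented here.

```python
# Minimal sketch of programmatic collection of treebank pages.
# BASE is a hypothetical URL template -- verify against the real site.
import pathlib
import requests

BASE = "https://corpus.quran.com/treebank.jsp?chapter={c}&verse={v}"

def crawl(verse_refs, out_dir="trees"):
    """Download one page per (chapter, verse) pair for later image extraction."""
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    session = requests.Session()
    for c, v in verse_refs:
        resp = session.get(BASE.format(c=c, v=v), timeout=30)
        if resp.ok:
            # Saved raw; parse-tree images are extracted in a later step.
            (out / f"{c:03d}_{v:03d}.html").write_bytes(resp.content)
```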
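Finally, the parser pipeline can be sketched as Word2Vec pre-training followed by a BiLSTM that scores head attachments. The dataset's exact architecture, hyperparameters, and label inventory are not given in this record, so every concrete choice below (gensim + PyTorch, dot-product head scoring, dimensions) is an illustrative assumption rather than the published model.

```python
# Minimal sketch: Word2Vec embeddings feeding a BiLSTM head scorer.
import torch
import torch.nn as nn
from gensim.models import Word2Vec

# 1) Pre-train embeddings on tokenized classical Arabic sentences
#    (stand-in corpus shown; the real one came from the Comprehensive
#    Islamic Library).
sentences = [["بسم", "الله", "الرحمن", "الرحيم"]]
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

class BiLSTMHeadScorer(nn.Module):
    """BiLSTM encoder + dot-product head scorer (illustrative only)."""
    def __init__(self, w2v_model, hidden=128):
        super().__init__()
        weights = torch.tensor(w2v_model.wv.vectors)
        self.embed = nn.Embedding.from_pretrained(weights, freeze=False)
        self.lstm = nn.LSTM(weights.shape[1], hidden,
                            bidirectional=True, batch_first=True)
        self.head_mlp = nn.Linear(2 * hidden, hidden)
        self.dep_mlp = nn.Linear(2 * hidden, hidden)

    def forward(self, token_ids):                 # (batch, seq)
        states, _ = self.lstm(self.embed(token_ids))
        heads = self.head_mlp(states)             # (batch, seq, hidden)
        deps = self.dep_mlp(states)
        # score[b, i, j] = plausibility that token j heads token i
        return torch.einsum("bih,bjh->bij", deps, heads)

# 2) Score one sentence; argmax over the last dim picks each token's head.
ids = torch.tensor([[w2v.wv.key_to_index[t] for t in sentences[0]]])
scores = BiLSTMHeadScorer(w2v)(ids)
```

A biaffine or MLP scorer could replace the dot product here; the sketch only shows how pre-trained embeddings and a bidirectional encoder combine into per-token head predictions.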
Institutions
Categories
Funding
Princess Nourah bint Abdulrahman University
PNURSP2025R263