JODA - A Dataset of Jordanian Dialect and Erroneous Modern Arabic Sentences coupled with Proper MSA and Full Diacritics
Description
The Jordanian Dialect Arabic (JODA) dataset is a carefully constructed corpus designed to support advances in Arabic Natural Language Processing (NLP), particularly dialectal language processing and error correction for Modern Standard Arabic (MSA). It consists of 59,135 text sequences drawn from informal Jordanian Arabic and from formal MSA containing linguistic errors. Each input sequence is aligned with two corrected versions: one in non-diacritized MSA and one fully diacritized.

The dataset was compiled from a diverse range of sources, including public user-generated comments on social media platforms (Facebook, Instagram, YouTube, X/Twitter), transcriptions of Jordanian films, and existing Arabic dialect corpora. All entries were preprocessed to remove non-linguistic content and personally identifiable information. Sentences were then segmented into shorter units to make them easier to use in downstream machine learning applications. Manual annotation was performed by expert linguists: Jordanian dialect sentences were translated into proper MSA, while erroneous MSA inputs were edited to conform to proper spelling and grammar conventions.

Each of the 59,135 entries in the JODA dataset is represented as a row with the following fields:
- Source: The origin of the text, whether social media (e.g., Facebook, YouTube, X, Instagram), transcribed Jordanian movies, or existing public Arabic dialect datasets, notably the SDC and DART corpora.
- Text: The original input sentence.
- Type: A binary classification field. Value 0 denotes a sentence in Jordanian dialect, while value 1 denotes an erroneous MSA sequence.
- Corrected Text: The corresponding sentence corrected into proper MSA, without diacritics.
- Diacritized Text: The same corrected sentence in MSA, fully annotated with diacritics.
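The row layout described above can be illustrated with a minimal Python sketch that separates dialect rows from erroneous-MSA rows and checks that a diacritized sentence reduces to its non-diacritized counterpart once the diacritics are removed. Several things here are assumptions rather than facts about the dataset: the column headers are taken from the field names in this description and may not match the spreadsheet headers exactly, the diacritics are assumed to fall in the Unicode range U+064B to U+0652, the two sample rows are made up for illustration and are not taken from JODA, and the helper names are hypothetical. The dataset does not promise that the two corrected columns differ only in diacritics, so any mismatches found this way are candidates for inspection, not confirmed errors.

```python
import re

# Assumption: diacritics in "Diacritized Text" are the common Arabic marks
# from fathatan (U+064B) through sukun (U+0652).
TASHKEEL = re.compile("[\u064B-\u0652]")

# Made-up illustrative rows, NOT taken from the dataset.
# Keys follow the field names in the description; real headers may differ.
SAMPLE_ROWS = [
    {
        "Source": "YouTube",
        "Text": "الولد راح",
        "Type": 0,  # 0 = Jordanian dialect
        "Corrected Text": "ذهب الولد",
        "Diacritized Text": "ذَهَبَ الْوَلَدُ",
    },
    {
        "Source": "Facebook",
        "Text": "ذهب الود",
        "Type": 1,  # 1 = erroneous MSA
        "Corrected Text": "ذهب الولد",
        "Diacritized Text": "ذَهَبَ الْوَلَدُ",
    },
]


def strip_diacritics(text):
    """Remove the diacritic marks matched by TASHKEEL from an Arabic string."""
    return TASHKEEL.sub("", text)


def split_by_type(rows):
    """Return (dialect_rows, erroneous_msa_rows) based on the Type field."""
    dialect = [r for r in rows if int(r["Type"]) == 0]
    erroneous_msa = [r for r in rows if int(r["Type"]) == 1]
    return dialect, erroneous_msa


def find_diacritic_mismatches(rows):
    """Return rows whose diacritized form, once stripped, differs from Corrected Text."""
    return [
        r for r in rows
        if strip_diacritics(r["Diacritized Text"]) != r["Corrected Text"]
    ]


if __name__ == "__main__":
    dialect, erroneous_msa = split_by_type(SAMPLE_ROWS)
    print(len(dialect), "dialect rows;", len(erroneous_msa), "erroneous MSA rows")
    print(len(find_diacritic_mismatches(SAMPLE_ROWS)), "rows need inspection")
```

The functions take plain dictionaries so the sketch stays independent of any spreadsheet library; rows read from the .xlsx files with a tool of your choice can be passed in the same shape.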
The dataset is divided into three files:
- diacritized_train_set.xlsx (54,135 text sequences)
- diacritized_test_set.xlsx (2,500 text sequences)
- diacritized_valid_set.xlsx (2,500 text sequences)

Credit: This dataset is released under the CC BY 4.0 license. Please cite the following publication:
G. Abandah, M. Khaleel, I. Jafar, M. Abdel-Majeed, Y. Hamdan, A. Suyyagh, A. Abdel-Karim, S. AlAwawdeh, “Jordanian Arabic to Modern Standard Arabic Translation Using a Large Model Tuned on a Purpose-Built Dataset and Synthetic Error Injection,” Jordanian Journal of Computers and Information Technology (JJCIT), Accepted for publication, Jun 2025.

Credit for the diacritized version:
R. Otoum, "A Dual-Function Large Language Model for Correcting Arabic Spelling Mistakes and Adding Diacritics: Bridging Jordanian Dialect and Formal Arabic," MSc Thesis, The University of Jordan, Jun 2025.

Kindly also cite the dataset in this repository.
Institutions
- The University of Jordan
Funders
- Ministry of Higher Education, Malaysia