AranjiyyaCorpus: A span-annotated Arabic news dataset of anglicised style, calques, and borrowings across six domains
Description
The Arabic neologism Aranjiyya (عَرَنْجِيَّة) is a portmanteau of ʿArabiyya (Arabic) and Inkliziyya / Faranjiyya (English/foreign), used by contemporary Arabic editors and stylists to describe Arabic prose that retains Arabic vocabulary while importing the syntactic, stylistic, semantic, or lexical structures of English. The phenomenon is pervasive in translated news, press releases, technical writing, and digital media, and is a recurrent target of prescriptive Arabic-style guides. Despite its prominence, Aranjiyya has had almost no presence in computational Arabic resources: existing treebanks and error corpora target orthographic, morphological, or syntactic well-formedness but do not isolate contact-induced patterns whose surface forms are grammatical but whose underlying templates are English. AranjiyyaCorpus was constructed to fill this gap, with three motivating use cases: training a span-level Aranjiyya detector for editors and translation post-editors; producing evaluation data for whether large language models actually generate idiomatic Arabic; and supporting linguistic study of contact-induced change in modern Arabic, with sufficient category granularity to distinguish syntactic, stylistic, and semantic phenomena and to track them across genres.
Files
Institutions
- King Saud University, Riyadh Region, Riyadh