Moroccan Darija Offensive Language Detection Dataset

Published: 23 September 2025 | Version 4 | DOI: 10.17632/2y4m97b7dc.4
Contributors:
Anass Ibrahimi, Asmaa Mourhir

Description

The Moroccan Darija dataset was cleaned by removing duplicate entries and discarding sentences with conflicting annotations. To address class imbalance, the majority (non-offensive) class was undersampled. The dataset was then augmented with samples from the OMCD corpus, which went through the same preprocessing pipeline for consistency: emoji representation, normalization, removal of punctuation and diacritics, removal of social media elements, elongation removal, and duplicate removal. Finally, all entries from the Moroccan Darija dataset were relabeled using Claude 3.5 Sonnet to align with the comprehensive OMCD framework, which covers both explicit and implicit forms of offensiveness such as vulgarity, hate speech, hostile intent, contempt, humiliation, and belittlement (OMCD reference). Sentences whose Claude-generated labels conflicted with the previous annotations were flagged for manual review to break the tie.
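
The cleaning steps described above can be illustrated with a minimal Python sketch. It is not the contributors' actual pipeline. The function names, the 0/1 label encoding (0 = non-offensive, 1 = offensive), the regular expressions, the choice to collapse runs of three or more repeated characters to one, and the choice to undersample down to the minority class size are all assumptions made for illustration. Emoji handling is left out because the description does not specify how emojis are represented.

```python
import random
import re
import unicodedata
from collections import defaultdict

# Assumed patterns; the dataset description does not give the exact rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Arabic short vowels and related marks (U+064B to U+0652), dagger alif, tatweel.
DIACRITICS_RE = re.compile(r"[\u064B-\u0652\u0670\u0640]")
# Three or more repeats of the same character (assumed elongation rule).
ELONGATION_RE = re.compile(r"(.)\1{2,}")


def normalize_text(text):
    """Apply a hypothetical version of the normalization steps to one sentence."""
    text = URL_RE.sub(" ", text)          # social media elements: links
    text = MENTION_RE.sub(" ", text)      # social media elements: @mentions
    text = DIACRITICS_RE.sub("", text)    # diacritics and tatweel
    # Drop every character in a Unicode punctuation category (P*).
    text = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    text = ELONGATION_RE.sub(r"\1", text)  # collapse elongated characters
    return " ".join(text.split())          # collapse whitespace


def clean_and_balance(rows, seed=0):
    """Deduplicate, drop conflicting annotations, then undersample.

    rows is an iterable of (text, label) pairs. Classes larger than the
    smallest class are randomly reduced to its size (an assumption).
    """
    labels_by_text = defaultdict(set)
    for text, label in rows:
        norm = normalize_text(text)
        if norm:
            labels_by_text[norm].add(label)

    # Keying on the normalized text removes duplicates; a text with more
    # than one distinct label has conflicting annotations and is discarded.
    by_label = defaultdict(list)
    for text, labels in labels_by_text.items():
        if len(labels) == 1:
            by_label[next(iter(labels))].append(text)
    if not by_label:
        return []

    target = min(len(texts) for texts in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label, texts in by_label.items():
        kept = texts if len(texts) <= target else rng.sample(texts, target)
        balanced.extend((text, label) for text in kept)
    return balanced


def flag_for_review(previous, relabeled):
    """Return texts whose new label differs from the previous annotation.

    Both arguments are dicts mapping text to label.
    """
    return [
        text
        for text, old_label in previous.items()
        if text in relabeled and relabeled[text] != old_label
    ]
```

As a usage example, calling clean_and_balance on a list of (text, label) pairs returns a balanced list of normalized pairs, and flag_for_review takes the previous and the model-generated label mappings and returns the sentences that would need manual review.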

Institutions

  • Al Akhawayn University in Ifrane

Categories

Natural Language Processing, Arabic Language, Sentiment Analysis

Licence