Moroccan Darija Offensive Language Detection Dataset
Description
The Moroccan Darija dataset was cleaned by removing duplicate entries and discarding sentences with conflicting annotations. To address class imbalance, the majority (non-offensive) class was undersampled. The dataset was then augmented with samples from the OMCD corpus, which went through the same preprocessing pipeline to ensure consistency: emoji representation, normalization, removal of punctuation and diacritics, removal of social media elements, elongation removal, and duplicate removal. Finally, all entries from the Moroccan Darija dataset were relabeled using Claude 3.5 Sonnet to align with the comprehensive OMCD framework, which covers both explicit and implicit forms of offensiveness such as vulgarity, hate speech, hostile intent, contempt, humiliation, and belittlement (OMCD reference). Sentences whose Claude-generated labels conflicted with the previous annotations were flagged for manual review to break the tie.
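The cleaning steps described above can be approximated with the following minimal sketch. It is not the authors' code: the function names, the regular expressions, the choice to collapse elongations to a single character, and the 1:1 undersampling ratio are all assumptions made for illustration. The emoji representation step is left out because the description does not say how emojis were encoded.

```python
import random
import re
import unicodedata
from collections import defaultdict

# Assumed character set: Arabic tashkeel (U+064B-U+0652), superscript alef, tatweel.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")
# Assumed definition of "social media elements": URLs, @mentions, #hashtags.
SOCIAL = re.compile(r"https?://\S+|www\.\S+|[@#]\w+")
# Any character repeated three or more times in a row.
ELONGATION = re.compile(r"(.)\1{2,}")


def normalize(text):
    """Strip social media elements, diacritics, elongation, and punctuation."""
    text = SOCIAL.sub(" ", text)
    text = DIACRITICS.sub("", text)
    text = ELONGATION.sub(r"\1", text)  # assumption: collapse to one character
    # Replace every Unicode punctuation character (category P*) with a space.
    text = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    return " ".join(text.split())


def clean(rows):
    """Drop duplicates and any sentence that carries conflicting labels.

    rows is an iterable of (text, label) pairs.
    """
    labels = defaultdict(set)
    for text, label in rows:
        key = normalize(text)
        if key:
            labels[key].add(label)
    return [(t, next(iter(ls))) for t, ls in labels.items() if len(ls) == 1]


def undersample(rows, majority_label, seed=0):
    """Randomly shrink the majority class to the size of the rest (assumed 1:1)."""
    majority = [r for r in rows if r[1] == majority_label]
    minority = [r for r in rows if r[1] != majority_label]
    rng = random.Random(seed)
    kept = rng.sample(majority, min(len(majority), len(minority)))
    return minority + kept


def flag_for_review(original, relabeled):
    """Return texts whose new label disagrees with the earlier annotation.

    Both arguments are dicts mapping text -> label.
    """
    return [
        t for t, lab in original.items()
        if t in relabeled and relabeled[t] != lab
    ]
```

A typical call sequence under these assumptions would be `undersample(clean(rows), majority_label=0)`, followed by `flag_for_review` on the label maps before and after relabeling, where `0` is a hypothetical encoding of the non-offensive class.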
Files
Institutions
- Al Akhawayn University in Ifrane