Moroccan Darija Offensive Language Detection Dataset

Name: Moroccan Darija Offensive Language Detection Dataset
Creator: Anass Ibrahimi
Published: 2023-09-20T08:18:27.648Z
Keywords: Natural Language Processing, Arabic Language, Sentiment Analysis

Ibrahimi, Anass; Mourhir, Asmaa

doi:10.17632/2y4m97b7dc.1

Published: 20 September 2023 | Version 1 | DOI: 10.17632/2y4m97b7dc.1

Contributors:

Anass Ibrahimi, Asmaa Mourhir

Description

The Moroccan Darija Offensive Language Detection Dataset is a human-labeled collection of Moroccan Darija sentences for offensive language detection. It contains 20,402 sentences, each with a binary label: 0 for non-offensive and 1 for offensive. The sentences were gathered from Twitter and YouTube comments and are written in both Latin and Arabic scripts. Non-offensive sentences account for 62.2% of the data (12,685 sentences) and offensive sentences for 37.8% (7,717 sentences). The dataset addresses the scarcity of labeled data for Moroccan Darija and serves as a resource for natural language processing researchers working on the dialect, particularly on offensive language detection and sentiment analysis.
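The class balance reported above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the counts stated in the description; it does not read the dataset files. The majority-class baseline and the "balanced" class-weight formula are illustrative assumptions about how one might account for the imbalance, not something the dataset authors prescribe.

```python
# Class counts as reported in the dataset description.
# 0 = non-offensive, 1 = offensive
counts = {0: 12685, 1: 7717}

total = sum(counts.values())

# Share of each class in percent, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the majority class.
# Any trained model should be compared against this number.
majority_baseline = max(counts.values()) / total

# "Balanced" class weights: total / (number_of_classes * class_count).
# Assumption for illustration only: one common way to offset class imbalance.
class_weights = {label: total / (len(counts) * n) for label, n in counts.items()}

print("total sentences:", total)
print("class shares (%):", shares)
print("majority baseline accuracy:", round(majority_baseline, 3))
print("balanced class weights:", class_weights)
```

With a roughly 62/38 split, plain accuracy can look better than it is, so per-class metrics such as precision, recall, and F1 on the offensive class are worth reporting alongside it.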

Institutions

Al Akhawayn University in Ifrane
