Moroccan Darija Offensive Language Detection Dataset

Name: Moroccan Darija Offensive Language Detection Dataset
Creator: Anass Ibrahimi
Published: 2024-07-09T08:45:41.273Z
Keywords: Natural Language Processing, Arabic Language, Sentiment Analysis

Ibrahimi, Anass; Mourhir, Asmaa

doi:10.17632/2y4m97b7dc.3

Published: 9 July 2024 | Version 3 | DOI: 10.17632/2y4m97b7dc.3

Contributors:

Anass Ibrahimi, Asmaa Mourhir

Description

The Moroccan Darija Offensive Language Detection Dataset is a human-labeled collection of 20,402 Moroccan Darija sentences, each paired with a binary label: 0 for a non-offensive sentence and 1 for an offensive one. The sentences were gathered from Twitter and YouTube comments and are written in both Latin and Arabic scripts. Non-offensive sentences account for 62.2% of the data (12,685 sentences) and offensive sentences for 37.8% (7,717 sentences). This contribution addresses the scarcity of labeled datasets for Moroccan Darija and provides a resource for natural language processing researchers working on the dialect, particularly on offensive language detection and sentiment analysis.

Note: This dataset contains profanity and offensive sentences. Users should be aware of this before accessing it and should exercise caution when exploring it, especially in educational or sensitive contexts.
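The class balance and the mixed-script property described above can be checked with a few lines of Python. The following is a minimal sketch, not part of the dataset: the helper names `label_shares` and `detect_script` are hypothetical, the script check is a rough heuristic based on Unicode ranges, and the label column is rebuilt from the published counts because the description does not specify the file name or column layout.

```python
from collections import Counter

# Counts as stated in the dataset description.
N_NON_OFFENSIVE = 12_685  # label 0
N_OFFENSIVE = 7_717       # label 1


def label_shares(labels):
    """Return {label: fraction of rows} for an iterable of 0/1 labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def detect_script(sentence):
    """Classify a sentence as 'arabic', 'latin', 'mixed', or 'other'.

    Heuristic (an assumption, not the authors' method): count letters in the
    basic Arabic Unicode block (U+0600..U+06FF) versus ASCII letters.
    Digits used in Latin-script Darija (e.g. 3, 7, 9) are ignored.
    """
    arabic = sum(1 for ch in sentence if "\u0600" <= ch <= "\u06ff" and ch.isalpha())
    latin = sum(1 for ch in sentence if ch.isascii() and ch.isalpha())
    if arabic and latin:
        return "mixed"
    if arabic:
        return "arabic"
    if latin:
        return "latin"
    return "other"


# Rebuild the label column from the published counts to check the percentages.
labels = [0] * N_NON_OFFENSIVE + [1] * N_OFFENSIVE
shares = label_shares(labels)
print(len(labels))                # 20402
print(round(shares[0] * 100, 1))  # 62.2
print(round(shares[1] * 100, 1))  # 37.8
```

With the real files, `labels` would be read from the label column instead, and `detect_script` could be applied to each sentence to estimate how many rows use each script. A roughly 62/38 split is moderately imbalanced, so a stratified train/test split is a reasonable default.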

Institutions

Al Akhawayn University in Ifrane
