Cyberbullying dataset for Kurdish Language

Name: Cyberbullying dataset for Kurdish Language
Creator: Soran Badawi
Published: 2025-08-05T12:39:24.041Z
Keywords: Machine Learning, Kurd, Deep Learning, Cyber Attack

doi:10.17632/ck49jyxcbt.4

Published: 5 August 2025 | Version 4 | DOI: 10.17632/ck49jyxcbt.4

Contributor:

Soran Badawi

Description

Cyberbullying has become increasingly prevalent with the rise of social media and online communication. It takes many forms, including verbal attacks, harassment, and discrimination, and it can have serious consequences for victims, including depression, anxiety, and even suicide.

While much research has addressed cyberbullying in languages such as English, Spanish, and Chinese, little attention has been paid to languages with fewer speakers, such as Kurdish. Kurdish is spoken by millions of people in the Middle East, including in Turkey, Iran, Iraq, and Syria. It is an Indo-European language with several dialects, and it is considered an official language in Iraq and an official regional language in Iran. Despite its widespread use, there has been very little research on cyberbullying in Kurdish, and no dataset previously focused specifically on this issue.

To address this gap, we have created the first cyberbullying dataset for the Kurdish language. The dataset contains three classes: neutral, racism, and sexism. The neutral class contains messages without any form of cyberbullying, while the racism and sexism classes contain messages with discriminatory language based on race or gender, respectively.

The dataset was created using a combination of manual and automated techniques. We collected a large number of Kurdish-language messages through the Twitter API. We then manually labeled each message according to whether it contained cyberbullying and assigned it to one of the three classes. The resulting dataset contains over 30,000 messages, with a roughly equal distribution among the three classes. It is a valuable resource for researchers and practitioners who are interested in studying cyberbullying in the Kurdish language and developing strategies to combat it.
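The stated class balance can be checked after download with a short script. The following is a minimal sketch, not part of the published dataset: it assumes the data is distributed as a CSV file with a header row, and the column name `label` is hypothetical and may need to be adjusted to the actual file.

```python
import csv
from collections import Counter


def load_rows(path, encoding="utf-8"):
    """Read a CSV file with a header row into a list of dicts."""
    with open(path, newline="", encoding=encoding) as handle:
        return list(csv.DictReader(handle))


def label_distribution(rows, label_column="label"):
    """Return {label: (count, share)} for an iterable of dict-like rows.

    `label_column` is an assumed column name, not confirmed by the dataset.
    Labels are lower-cased and stripped so that "Racism " and "racism" match.
    """
    counts = Counter(row[label_column].strip().lower() for row in rows)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}
```

Calling `label_distribution(load_rows(path))` on the downloaded file should then show three labels with shares close to one third each, if the description of a roughly equal distribution holds for the version in use.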
The dataset can be used for a variety of purposes, including training machine learning models to detect cyberbullying in Kurdish, analyzing the language used in cyberbullying messages to identify patterns and trends, and developing interventions to prevent and address cyberbullying in Kurdish-speaking communities.
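As an illustration of the first use case, the sketch below trains a simple three-class baseline. It is an assumption-laden starting point rather than a method described by the dataset creator: it uses multinomial Naive Bayes over character trigrams, chosen here because character n-grams need no Kurdish-specific tokenizer. The class name `NaiveBayesBaseline` and the helper `char_ngrams` are hypothetical names introduced for this example.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Split text into overlapping character n-grams (script-agnostic)."""
    padded = f" {text.strip()} "
    return [padded[i:i + n] for i in range(max(len(padded) - n + 1, 1))]


class NaiveBayesBaseline:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing."""

    def fit(self, texts, labels):
        # Count documents per class (for priors) and n-grams per class.
        self.class_counts = Counter(labels)
        self.ngram_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.ngram_counts[label].update(char_ngrams(text))
        self.vocabulary = {g for c in self.ngram_counts.values() for g in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.ngram_counts[label]
            denominator = sum(counts.values()) + vocab_size
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(doc_count / total_docs)
            for gram in char_ngrams(text):
                score += math.log((counts[gram] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

In practice the model would be fitted on a training split of the messages and their labels (neutral, racism, sexism) and evaluated on a held-out split. A baseline of this kind mainly serves as a reference point against which stronger models can be compared.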
