Bangla Multilabel Cyberbully, Sexual Harassment, Threat and Spam Detection Dataset
Description
Dataset Overview

The Bangla Multilabel Cyberbully, Sexual Harassment, Threat, and Spam Detection Dataset is designed to support the development of machine learning models that detect and classify various types of abusive content in Bangla social media text. It contains comments annotated for multiple types of abuse, which makes it suitable for multilabel classification tasks. It aims to support research and development in natural language processing (NLP) to improve online safety and help moderate harmful content on Bangla-language social media platforms.

Purpose

1. Train and evaluate machine learning models for detecting cyberbullying, sexual harassment, religious hate speech, threats, and spam in Bangla comments.
2. Support research in NLP and machine learning focused on Bangla, a low-resource language.
3. Aid in developing automated moderation systems for social media platforms to ensure safe and respectful communication.

Data Collection

Initially, we collected around 30,000 comments from social media platforms such as Facebook and TikTok. These comments were in Bangla, English, and Banglish (Bangla written using English characters). Since our research focuses on Bangla abusive text detection, we refined the dataset through the following steps:

1. We filtered out all comments written in English to focus on Bangla text.
2. To ensure data quality, we eliminated duplicate entries and rows with missing or null values.
3. We removed any remaining English characters, along with both Bangla and English numerals, so that the analysis is based solely on Bangla text.

After these steps, we obtained a final dataset of 12,557 comments. Each comment was manually labeled for five classes: bully, sexual, religious, threat, and spam. The annotation is multilabel, meaning a comment can belong to more than one class at the same time.

Dataset Columns

1. Gender: Indicates the gender of the person who received the bullying.
2. Profession: Indicates the profession of the person who received the bullying.
3. Comment: Contains the text of the comment in Bangla.
4. Bully: Binary label indicating whether the comment contains bullying content (0 for no, 1 for yes).
5. Sexual: Binary label indicating whether the comment contains sexual harassment content (0 for no, 1 for yes).
6. Religious: Binary label indicating whether the comment contains religious hate speech (0 for no, 1 for yes).
7. Threat: Binary label indicating whether the comment contains threats (0 for no, 1 for yes).
8. Spam: Binary label indicating whether the comment is considered spam (0 for no, 1 for yes).

Applications

1. Training and testing machine learning models for multilabel classification.
2. Research on NLP and cyberbullying detection in low-resource languages like Bangla.
3. Developing automated systems for monitoring and moderating online content on social media platforms to ensure safe and respectful communication.
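
Example: cleaning and label extraction (illustrative sketch)

The refinement steps under Data Collection and the label layout under Dataset Columns can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline actually used to build the dataset. It assumes that rows are dictionaries keyed by the column names listed above (their exact spelling in the released file is an assumption), that an "English" comment can be approximated as one with no Bangla-script character left after cleaning, and that duplicates are compared on the cleaned text. The helper names clean_comment, clean_rows, and label_vector are hypothetical.

```python
import re

# Label columns as named in the description; exact spelling in the
# released file is an assumption.
LABELS = ["Bully", "Sexual", "Religious", "Threat", "Spam"]

# ASCII letters, ASCII digits, and Bangla digits (U+09E6 to U+09EF).
_LATIN_OR_DIGIT = re.compile(r"[A-Za-z0-9\u09E6-\u09EF]+")
# Any character in the Unicode Bengali block.
_BANGLA_CHAR = re.compile(r"[\u0980-\u09FF]")
_SPACES = re.compile(r"\s+")


def clean_comment(text):
    """Remove Latin letters and both digit sets, then collapse whitespace."""
    stripped = _LATIN_OR_DIGIT.sub(" ", text)
    return _SPACES.sub(" ", stripped).strip()


def clean_rows(rows):
    """Approximate the three refinement steps on an iterable of dict rows."""
    seen = set()
    cleaned = []
    for row in rows:
        comment = row.get("Comment")
        # Drop rows with a missing comment or a missing label value.
        if not comment or any(row.get(name) in (None, "") for name in LABELS):
            continue
        # Strip English characters and Bangla/English numerals.
        text = clean_comment(comment)
        # Heuristic stand-in for "written in English": nothing in Bangla
        # script survives the cleaning step.
        if not _BANGLA_CHAR.search(text):
            continue
        # Drop duplicates, compared on the cleaned text.
        if text in seen:
            continue
        seen.add(text)
        cleaned.append({**row, "Comment": text})
    return cleaned


def label_vector(row):
    """Return the five binary labels of one row as a list of ints."""
    return [int(row[name]) for name in LABELS]


# Tiny made-up sample, for illustration only (not taken from the dataset).
sample = [
    {"Comment": "তুমি খুব খারাপ 123", "Bully": "1", "Sexual": "0",
     "Religious": "0", "Threat": "0", "Spam": "0"},
    {"Comment": "তুমি খুব খারাপ ১২", "Bully": "1", "Sexual": "0",
     "Religious": "0", "Threat": "0", "Spam": "0"},
    {"Comment": "this is english", "Bully": "0", "Sexual": "0",
     "Religious": "0", "Threat": "0", "Spam": "1"},
    {"Comment": "তুমি কোথায়", "Bully": "0", "Sexual": "",
     "Religious": "0", "Threat": "0", "Spam": "0"},
]
cleaned = clean_rows(sample)
targets = [label_vector(row) for row in cleaned]
```

Because a comment can carry several labels at once, each entry of targets is a five-element 0/1 vector rather than a single class index, which is the form most multilabel classifiers expect. Note that the Banglish comments mentioned under Data Collection would also be dropped by this heuristic, since they contain no Bangla-script characters.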