ToxLex_bn: A Curated Dataset of Bangla Toxic Language Derived from Facebook Comment

Published: 27 April 2022 | Version 2 | DOI: 10.17632/9pz8ssmc49.2
Contributor:
Mamun Or Rashid

Description

ToxLex, or Lexicon of Toxic Language, is a dataset of aggressive and abusive words used on social media. Specifically, it contains utterances from user-generated comments on Facebook. The texts cover the demographic and thematic distribution of Bangla toxic language on social media. The data were extracted from 8 public Facebook pages, and the dataset is a curated, de-duplicated, and anonymized derivative of the raw comments. It contains 1959 rows and 8 columns; each row represents a toxic bigram with its corresponding features, such as transcriptions, translation, spelling standards, and degree of toxicity. The dataset was annotated by a single human annotator and curated for building classifiers in toxic language detection systems. It can also serve as a wordlist of Bangla cyberbullying, hate speech, and slang. Warning: this dataset contains text content that may be distressing or upsetting.
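
As a quick sanity check after download, the stated shape (1959 rows, 8 columns) can be verified with a short script. The following is only a minimal sketch: it assumes the file has been exported to a UTF-8 CSV with a header row, and the function names and the file name in the usage note are hypothetical, not part of the published dataset.

```python
import csv
from typing import Dict, List

EXPECTED_ROWS = 1959  # row count stated in the dataset description
EXPECTED_COLS = 8     # column count stated in the dataset description


def load_lexicon(path: str, encoding: str = "utf-8") -> List[Dict[str, str]]:
    """Read a CSV export of the lexicon into a list of row dictionaries.

    Assumes the first line of the file is a header row.
    """
    with open(path, newline="", encoding=encoding) as handle:
        return list(csv.DictReader(handle))


def shape_matches(
    rows: List[Dict[str, str]],
    expected_rows: int = EXPECTED_ROWS,
    expected_cols: int = EXPECTED_COLS,
) -> bool:
    """Return True when the row and column counts match the expected shape."""
    if len(rows) != expected_rows:
        return False
    # Every row should expose exactly the expected number of fields.
    return all(len(row) == expected_cols for row in rows)
```

For example, `shape_matches(load_lexicon("toxlex_bn.csv"))` would return `True` only if the export matches the described shape; the file name here is a placeholder for whatever the downloaded file is actually called.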

Institutions

Jahangirnagar University

Categories

Bengali Language, Bullying, User-Generated Content

Licence