Bangla NLP Dataset for Sentiment Analysis, Topic Classification, and Hate Speech Detection
Description
Introduction: The Bangla NLP Dataset for Sentiment Analysis, Topic Classification, and Hate Speech Detection is a manually curated Bangla text dataset designed to support research on low-resource Natural Language Processing. The data has been collected from several well-known Bangla newspapers, including Prothom Alo, Jugantor, Kaler Kantho, and Bangladesh Pratidin, ensuring linguistic diversity and content reliability. Consistent preprocessing and labeling guidelines were applied to facilitate reproducible experimentation across multiple Bangla NLP classification tasks under both high-resource and low-resource learning settings.

Dataset Overview: This dataset consists of three task-specific Bangla NLP datasets, each constructed with balanced class distributions:

- Sentiment Analysis Dataset. Classes: Positive, Negative, Neutral. Samples per class: 1000. Total samples: 3000.
- Topic Classification Dataset. Classes: Bangladesh, International, Sports, Entertainment. Samples per class: 1000. Total samples: 4000.
- Hate Speech Detection Dataset. Classes: Hate, Non-Hate. Samples per class: 1000. Total samples: 2000.

All datasets are sentence-level, manually labeled, and preprocessed using a unified pipeline to ensure consistency across tasks and fair comparative evaluation.

Applications and Motivation: This dataset supports a wide range of Bangla NLP applications, including sentiment analysis, topic classification, and hate speech detection. The primary motivation behind collecting this dataset is to enable few-shot learning research for Bangla, where large-scale labeled data is often unavailable. The balanced and task-diverse structure of the dataset makes it particularly suitable for evaluating data-efficient learning methods, such as few-shot learning and metric-based approaches. It can also be used for benchmarking supervised, few-shot, and low-resource NLP models for the Bangla language.
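The few-shot use case mentioned above is commonly evaluated with N-way K-shot episodes: a small labeled support set and a disjoint query set drawn from the same classes. The following is a minimal sketch of such a sampler, assuming the samples have already been loaded as (text, label) pairs. The function name `sample_episode`, its parameters, and the placeholder texts are hypothetical and are not part of the dataset or its scripts.

```python
import random
from collections import defaultdict


def sample_episode(samples, n_way, k_shot, n_query, seed=None):
    """Draw one N-way K-shot episode from a list of (text, label) pairs.

    Returns (support, query), two disjoint lists of (text, label) pairs.
    """
    rng = random.Random(seed)

    # Group texts by their class label.
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append(text)

    if n_way > len(by_label):
        raise ValueError("n_way exceeds the number of available classes")

    # Pick the classes for this episode (sorted first for reproducibility).
    classes = rng.sample(sorted(by_label), n_way)

    support, query = [], []
    for label in classes:
        pool = by_label[label]
        if len(pool) < k_shot + n_query:
            raise ValueError(f"not enough samples for class {label!r}")
        # Sample without replacement so support and query never overlap.
        picked = rng.sample(pool, k_shot + n_query)
        support += [(t, label) for t in picked[:k_shot]]
        query += [(t, label) for t in picked[k_shot:]]
    return support, query


# Toy stand-in for the sentiment dataset (placeholder texts, not real data).
toy_samples = [
    (f"{label} sentence {i}", label)
    for label in ("Positive", "Negative", "Neutral")
    for i in range(10)
]

support, query = sample_episode(toy_samples, n_way=3, k_shot=2, n_query=3, seed=0)
print(len(support), len(query))  # 6 support items and 9 query items
```

Because every class in the dataset has the same number of samples, a sampler like this can draw episodes without any per-class reweighting.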
Steps to reproduce
First, textual data were collected from several well-established Bangla newspapers, including Prothom Alo, Jugantor, Kaler Kantho, and Bangladesh Pratidin, covering a wide range of sections such as national news, international affairs, sports, entertainment, and social issues. From the collected articles, sentence-level text samples were extracted to enable fine-grained text classification.

Next, a standardized preprocessing pipeline was applied. This included sentence segmentation, removal of unnecessary symbols and special characters, normalization of whitespace, and handling of Bangla-specific punctuation (such as the Bangla full stop). All preprocessing steps were implemented using Python-based scripts to ensure consistency and reproducibility.

After preprocessing, the sentences were manually labeled according to the corresponding NLP task. Separate label sets were maintained for sentiment analysis, topic classification, and hate speech detection. Special care was taken to ensure balanced class distributions, with an equal number of samples collected for each class to avoid class imbalance issues.

Finally, the processed and labeled data were stored in standardized CSV formats, and the same data collection, preprocessing, and labeling protocols were applied consistently across all tasks. This systematic workflow allows other researchers to easily understand, reproduce, and extend the dataset for further Bangla NLP research.
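The preprocessing steps described above (symbol removal, whitespace normalization, and sentence segmentation on the Bangla full stop) could look roughly like the sketch below. This is not the authors' script: the set of characters treated as "unnecessary", the choice of sentence terminators, and the function names `clean_text`, `split_sentences`, and `preprocess` are all assumptions made for illustration.

```python
import re

# Assumption: keep the Bengali Unicode block, the danda (U+0964) and double
# danda (U+0965), zero-width joiners used in Bangla script, whitespace,
# ASCII digits, and a few basic punctuation marks. Everything else is dropped.
DISALLOWED = re.compile(r"[^\u0980-\u09FF\u0964\u0965\u200C\u200D\s0-9?!,.-]")
WHITESPACE = re.compile(r"\s+")

# Assumption: a sentence ends at the Bangla full stop (danda), "?" or "!".
SENTENCE = re.compile(r"[^\u0964?!]+[\u0964?!]?")


def clean_text(text):
    """Replace disallowed symbols with spaces, then collapse whitespace."""
    text = DISALLOWED.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


def split_sentences(text):
    """Split text into sentences, keeping each terminator with its sentence."""
    return [s.strip() for s in SENTENCE.findall(text) if s.strip()]


def preprocess(article):
    """Clean an article and return its sentences as a list of strings."""
    return split_sentences(clean_text(article))


raw = "আমি  ভাত খাই।   তুমি কেমন আছো? ###"
for sentence in preprocess(raw):
    print(sentence)
```

Keeping the terminator attached to each sentence preserves the distinction between statements and questions, which may matter for sentiment labeling; a different pipeline could reasonably strip it instead.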
Institutions
- Daffodil International University