ChattogramSent: A Multilingual Sentiment Dataset for Chattogram, Bengali, and English

Published: 8 May 2026 | Version 2 | DOI: 10.17632/k6hts2ktxw.2
Contributors:

Description

ChattogramSent is a manually curated sentiment analysis dataset for the Chattogram dialect (Chittangga), a major but underrepresented, primarily oral language spoken in southeastern Bangladesh. This corpus marks the first significant effort to create a digital benchmark for this dialect, which traditionally lacks a standardized writing system and high-quality computational resources. Developed entirely by a team of native-speaker researchers, the dataset bridges the gap between oral tradition and modern NLP by providing a cleaned, phonetically transcribed corpus in Bengali script.

Data Composition and Scale

The dataset comprises 7,052 unique samples, each manually annotated for sentiment and verified for linguistic accuracy. Unlike automatically generated datasets, every entry in this corpus has been reviewed by native speakers to ensure cultural and contextual relevance.

1. Sentiment Distribution (Class Balance)

The dataset is divided into three sentiment classes:
- Neutral: 4,287 samples (the primary baseline for objective dialectal speech)
- Negative: 1,600 samples (regional expressions of dissatisfaction or criticism)
- Positive: 1,165 samples (appreciative and affirmative dialectal nuances)

2. Source of Data (Multi-Domain Coverage)

To support robust models, data was collected from three authentic domains:
- Drama (3,292 samples): Scripts from regional Chittagonian dramas, rich in idiomatic expressions and emotional depth.
- Conversation (2,568 samples): Real-world everyday dialogues capturing the natural flow of the dialect.
- Social Media (1,192 samples): Modern digital interactions, showing how the dialect is adapted for social platforms.

Key Highlights for Researchers

- First of its Kind: The first comprehensive benchmark for sentiment analysis in the Chattogram dialect.
- Native-Led Annotation: 100% manual annotation by native-speaker experts, avoiding the errors common in machine-translated or non-native datasets.
- Multi-Domain Diversity: Includes data from entertainment (drama), social media, and interpersonal speech, making it suitable for training versatile NLP models.
- Phonetic Accuracy: Provides a standardized phonetic transcription in Bengali script, which supports training speech-to-text systems and sentiment classifiers.

Potential Use Cases

This dataset serves as a foundational resource for:
- Developing sentiment classifiers for low-resource regional languages.
- Fine-tuning Transformer-based models (such as BERT or RoBERTa) for dialectal understanding.
- Linguistic research into the emotional semantics of the Chattogram dialect.
- Enhancing multilingual AI systems to support regional Bangladeshi languages.
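
The class counts reported in this description are imbalanced (about 61% neutral, 23% negative, and 17% positive), which matters when training a classifier. The following is a minimal sketch, not part of the dataset itself, that derives the class shares and one common set of inverse-frequency class weights from the published counts. The weighting formula, total / (number of classes × class count), is an assumption about how a user might compensate for the imbalance; the dataset authors do not prescribe any weighting scheme.

```python
# Class counts as reported in the dataset description.
counts = {"neutral": 4287, "negative": 1600, "positive": 1165}

# Total number of samples across all three classes.
total = sum(counts.values())

# Share of the corpus that falls into each class.
shares = {label: n / total for label, n in counts.items()}

# Balanced inverse-frequency weights (an assumed heuristic, not the
# authors' method): total / (num_classes * class_count).
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

for label in counts:
    print(f"{label:9s} share={shares[label]:.3f} weight={weights[label]:.3f}")
```

Weights of this kind can be passed to a weighted loss function so that errors on the smaller positive and negative classes count for more than errors on the neutral class.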

Files

Steps to reproduce

1. Data Acquisition and Domain Selection: Systematically collect raw text data from three diverse domains to ensure high linguistic variety: regional drama scripts (3,292 samples), everyday human conversations (2,568 samples), and social media interactions (1,192 samples). Focus data collection on the Chattogram dialect (Chittangga) as spoken in southeastern Bangladesh.

2. Linguistic Processing and Phonetic Transcription: As the Chattogram dialect is primarily oral, phonetically transcribe the collected speech and text into Bengali script. Normalize the text manually to compensate for the lack of a standardized writing system and to ensure consistent script usage.

3. Cross-Lingual Alignment: Align each Chattogram sentence with its semantic equivalent in Standard Bangla and English through a translation-first pipeline driven by native speakers. Verify the semantic fidelity of the translations with researchers who are native speakers of both the dialect and the standard language.

4. Manual Sentiment Annotation: Employ a team of native speakers to manually annotate the sentiment of all 7,052 samples into three categories: Neutral (4,287), Negative (1,600), and Positive (1,165). Ensure label consistency by performing a secondary review of the annotations to reduce cultural or contextual bias.

5. Data Cleaning and Consolidation: Apply Python-based preprocessing (using the Pandas library) to merge the separate files into a unified corpus (CSV). Standardize all labels to lowercase and remove redundant whitespace and formatting inconsistencies so the dataset is ready for machine learning tasks.

6. Benchmarking and Usage: For experimental reproduction, split the cleaned dataset into training, validation, and testing sets (e.g., 70:15:15). Use standard NLP evaluation metrics such as accuracy, F1-score, and precision to benchmark the performance of dialectal sentiment classifiers.
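
The label normalization, the 70:15:15 split, and the F1 evaluation described in the steps above could be sketched as follows. This is an illustrative sketch under several assumptions, not the authors' pipeline: it uses only the Python standard library rather than Pandas, the function names (`normalize_label`, `stratified_split`, `macro_f1`) are hypothetical, the split is stratified by class and the F1-score is macro-averaged (the source specifies neither), the random seed is arbitrary, and the rows are synthetic placeholders built from the published class counts rather than real dataset text.

```python
import random
from collections import defaultdict


def normalize_label(label):
    """Lowercase a label and remove redundant whitespace."""
    return " ".join(label.split()).lower()


def stratified_split(rows, percents=(70, 15, 15), seed=42):
    """Split (text, label) rows into train/val/test, class by class."""
    by_label = defaultdict(list)
    for text, label in rows:
        clean = normalize_label(label)
        by_label[clean].append((text, clean))

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, val, test = [], [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        # Integer arithmetic avoids floating-point rounding surprises.
        n_train = len(items) * percents[0] // 100
        n_val = len(items) * percents[1] // 100
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Synthetic placeholder rows that mirror the published class counts;
# the messy label spellings exercise normalize_label.
reported = {" Neutral": 4287, "NEGATIVE ": 1600, "positive": 1165}
rows = [(f"sample {i}", lab) for lab, n in reported.items() for i in range(n)]

train, val, test = stratified_split(rows)
print(len(train), len(val), len(test))

# Majority-class baseline: always predict "neutral".
y_true = [label for _, label in test]
y_pred = ["neutral"] * len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"baseline accuracy={accuracy:.3f} macro_f1={macro_f1(y_true, y_pred):.3f}")
```

The majority-class baseline is included because, with a neutral-heavy corpus, accuracy alone can look reasonable for a model that never predicts the positive or negative classes; a macro-averaged F1-score exposes that failure.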

Categories

Computational Linguistics, Regional Studies, Natural Language Processing, Text Mining, Sentiment Analysis, Low-Resource LLM

Licence