Chattogram sent: A Multilingual Sentiment Dataset for Chattogram, Bengali, and English

Published: 15 December 2025 | Version 1 | DOI: 10.17632/k6hts2ktxw.1

Description

The Chattogram dialect (Chittangga) is widely spoken in southeastern Bangladesh, predominantly in Chattogram city, Cox’s Bazar, the coastal regions of the Chittagong Hill Tracts, and nearby districts. It is primarily an oral language with no standardized writing system. Despite its large speaker population, the dialect remains underrepresented in computational linguistics because high-quality, manually curated digital resources are scarce.

This dataset is a fully manual multilingual sentiment corpus developed entirely by researchers who are native speakers of the Chattogram dialect. It consists of 4,452 parallel sentences aligned across Chattogram, Standard Bangla, and English. The data were collected from authentic sources, including social media posts, regional drama scripts, and everyday conversations, so the corpus reflects natural, context-rich language use. Because the dialect is oral, all Chattogram sentences were phonetically transcribed into Bengali script.

The dataset follows a translation-first pipeline: each Chattogram sentence was translated into Standard Bangla and then into English by the same native speakers to maintain semantic fidelity and cross-lingual alignment. Sentiment annotation was performed after multilingual alignment, with each sentence categorized as Neutral, Negative, or Positive (Neutral: 1,969; Negative: 1,467; Positive: 1,016).

The dataset represents the first high-quality benchmark for sentiment analysis in the Chattogram dialect, enabling researchers to develop low-resource NLP models, dialectal sentiment classifiers, and cross-lingual transformer-based systems. Its native-driven design ensures linguistic authenticity, cultural accuracy, and contextual relevance. By combining manual transcription, expert multilingual translation, and careful sentiment annotation, the corpus supports both academic research and practical applications in natural language processing, multilingual AI systems, and the digital preservation of oral language traditions.
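The reported class counts imply a moderately imbalanced label distribution. The short Python sketch below works through that arithmetic using only the three counts given in the description. The majority-class baseline is an illustrative reference point added here, not a result reported with the dataset.

```python
# Class counts as reported in the dataset description.
counts = {"Neutral": 1969, "Negative": 1467, "Positive": 1016}

total = sum(counts.values())  # should equal the 4,452 sentences reported
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the most frequent class gives a simple
# accuracy floor against which trained models can be compared.
majority_label = max(counts, key=counts.get)
majority_baseline_accuracy = shares[majority_label]

for label, share in shares.items():
    print(f"{label}: {counts[label]} sentences ({share:.1%})")
print(f"Majority-class baseline ({majority_label}): {majority_baseline_accuracy:.1%}")
```

Because the classes are not balanced, per-class or macro-averaged scores are likely to be more informative than accuracy alone.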

Steps to reproduce

1. Download the CSV file from Mendeley Data.
2. Open the CSV in spreadsheet software or a programming environment.
3. Use the 'Sentiment' column as labels for supervised sentiment analysis tasks.
4. Preprocess text as needed (tokenization, normalization, script handling).
5. Leverage aligned Chattogram–Bengali–English sentences for cross-lingual modeling.
6. Split data for training, validation, and testing (e.g., 70/15/15) or use k-fold cross-validation.
7. Evaluate models with standard metrics (Accuracy, Precision, Recall, F1-score).
8. Cite the dataset in publications as provided.
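Steps 3, 6, and 7 can be sketched in plain Python as follows. This is a minimal illustration, not the authors' pipeline. The only column name taken from the steps above is 'Sentiment'. The function names, the random seed, and the file name in the usage comment are assumptions made for the example, and the names of the text columns should be checked against the CSV header.

```python
import csv
import random
from collections import defaultdict

LABEL_COLUMN = "Sentiment"  # column name given in the steps above


def load_rows(path):
    """Read the CSV into a list of dicts, one per sentence."""
    # utf-8-sig tolerates a byte-order mark if the file has one.
    with open(path, newline="", encoding="utf-8-sig") as f:
        return list(csv.DictReader(f))


def stratified_split(rows, label_key=LABEL_COLUMN, ratios=(0.70, 0.15, 0.15), seed=42):
    """Split rows into train/validation/test while preserving label proportions."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


def classification_report(y_true, y_pred):
    """Return accuracy, macro-F1, and per-class precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    per_class = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_class[label] = {"precision": precision, "recall": recall, "f1": f1}
    macro_f1 = sum(m["f1"] for m in per_class.values()) / len(per_class)
    return {"accuracy": accuracy, "macro_f1": macro_f1, "per_class": per_class}


# Example usage (the file name is hypothetical):
#   rows = load_rows("chattogram_sentiment.csv")
#   train, val, test = stratified_split(rows)
#   gold = [row[LABEL_COLUMN] for row in test]
#   report = classification_report(gold, predictions_from_your_model)
```

Splitting within each label keeps the Neutral/Negative/Positive proportions roughly equal across the three subsets. Because the per-class sizes are rounded down, the final split is only approximately 70/15/15, with the remainder assigned to the test set.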

Institutions

International Islamic University Chittagong

Categories

Computational Linguistics, Regional Studies, Natural Language Processing, Text Mining, Sentiment Analysis, Low-Resource LLM

Licence