Chattogram sent: A Multilingual Sentiment Dataset for Chattogram, Bengali, and English (Version 2)
Description
The Chattogram dialect (Chittangga), widely spoken in southeastern Bangladesh, is primarily an oral language with no standardized writing system. It is predominantly spoken in Chattogram city, Cox’s Bazar, and the coastal regions of the Chittagong Hill Tracts, as well as nearby districts of southeastern Bangladesh. Despite its large speaker population, the dialect remains underrepresented in computational linguistics due to the scarcity of high-quality, manually curated digital resources.

This dataset introduces a fully manual, native-curated multilingual sentiment corpus developed entirely by researchers who are native speakers of the Chattogram dialect. It consists of 4,451 parallel sentences stored in five columns: Standard Bangla, Chattogram dialect, English, Sentiment label, and Source of Data. The 'Source of Data' column provides essential context by categorizing each entry by its origin: social media posts, regional drama scripts, or everyday conversations.

Given the oral nature of the dialect, all Chattogram sentences were phonetically transcribed into Bengali script. The dataset follows a translation-first pipeline: each Chattogram sentence was translated into Standard Bangla and then English by the same native speakers to maintain semantic fidelity and cross-lingual alignment. Sentiment annotation was performed after multilingual alignment, with each sentence labeled Neutral, Negative, or Positive (Neutral: 1,969; Negative: 1,467; Positive: 1,015).

The dataset represents the first high-quality benchmark for sentiment analysis in the Chattogram dialect, enabling researchers to develop low-resource NLP models, dialectal sentiment classifiers, and cross-lingual transformer-based systems. Its native-driven design ensures linguistic authenticity, cultural accuracy, and contextual relevance, providing a valuable resource for the computational study of underrepresented languages.
By combining manual transcription, expert multilingual translation, source-based categorization, and careful sentiment annotation, this corpus supports both academic research and practical applications in natural language processing, multilingual AI systems, and digital preservation of oral language traditions.
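The label counts reported above imply a moderately imbalanced three-class problem. The short sketch below only restates that arithmetic: it checks that the three counts add up to the stated 4,451 sentences and derives the accuracy of a trivial classifier that always predicts the most frequent label. The variable names are illustrative and are not part of the dataset.

```python
# Label counts as reported in the dataset description.
label_counts = {"Neutral": 1969, "Negative": 1467, "Positive": 1015}

total = sum(label_counts.values())
assert total == 4451  # matches the stated number of parallel sentences

# Share of each class as a fraction of all sentences.
label_share = {label: count / total for label, count in label_counts.items()}

# Accuracy of a trivial classifier that always predicts the most frequent label.
majority_label = max(label_counts, key=label_counts.get)
majority_baseline = label_counts[majority_label] / total

for label, share in label_share.items():
    print(f"{label}: {share:.1%}")
print(f"Majority baseline ({majority_label}): {majority_baseline:.1%}")
```

Since Neutral accounts for roughly 44% of the sentences, a model's accuracy is best read against that baseline, and per-class or macro-averaged scores may be more informative than accuracy alone.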
Steps to reproduce
Accessing the Dataset: Download the provided CSV file ('Chattogram_sent_A_Multilingual_Sentiment_Dataset_for_Chattogram(Version 2).csv') from the Mendeley Data repository.
Environment Setup: Open the file in standard spreadsheet software (such as Excel) or in a programming environment (such as Python with pandas, or R), using UTF-8 encoding so that Bengali characters are displayed correctly.
Data Structure Identification: Familiarize yourself with the five-column structure. Bengali, Chattogram, and English hold the multilingual text; Sentiment holds the target labels (Positive, Negative, Neutral); Source of Data (Conversation, Drama, Social Media) supports contextual or domain-specific analysis.
Data Pre-processing: Perform the necessary NLP pre-processing steps, such as tokenization, removing special characters, and handling the phonetic transcription of the Chattogram dialect.
Cross-Lingual Mapping: Use the aligned parallel sentences across the three languages to develop cross-lingual sentiment classifiers or machine translation models.
Model Training: Split the 4,451 entries into training, validation, and testing sets (e.g., a 70/15/15 ratio). The 'Source of Data' column can be used for stratified sampling to ensure representation from all sources. Then train supervised machine learning or deep learning models (such as BERT or LSTM).
Evaluation: Evaluate model performance using standard metrics such as Accuracy, Precision, Recall, and F1-score.
Citation: Ensure proper academic attribution by citing this dataset in any research publication, following the citation format provided on Mendeley Data.
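The loading and splitting steps above can be sketched with the Python standard library alone, so the example runs without pandas or scikit-learn. This is a minimal illustration, not the authors' pipeline. The header names in `EXPECTED_COLUMNS` are an assumption based on the column descriptions and may not match the CSV exactly. The helper names `load_rows` and `stratified_split` are hypothetical. Stratifying on Sentiment as well as Source of Data is an extra choice made here, and the seed is arbitrary.

```python
import csv
import random
from collections import defaultdict

# Assumed header names, inferred from the column descriptions; the actual
# CSV headers may differ in spelling or capitalisation.
EXPECTED_COLUMNS = ["Bengali", "Chattogram", "English", "Sentiment", "Source of Data"]


def load_rows(lines):
    """Parse CSV text (an open file or any iterable of lines) into dicts."""
    reader = csv.DictReader(lines)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def stratified_split(rows, keys=("Sentiment", "Source of Data"),
                     ratios=(0.70, 0.15, 0.15), seed=42):
    """Split rows into train/validation/test, preserving the mix of `keys`."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for key in sorted(groups):  # fixed group order keeps the split reproducible
        members = groups[key]
        rng.shuffle(members)
        n_train = round(len(members) * ratios[0])
        n_val = round(len(members) * ratios[1])
        train.extend(members[:n_train])
        val.extend(members[n_train:n_train + n_val])
        test.extend(members[n_train + n_val:])  # remainder goes to the test set
    return train, val, test


# Example usage (the path is a placeholder for the downloaded file):
# with open("dataset.csv", encoding="utf-8-sig", newline="") as f:
#     rows = load_rows(f)
# train, val, test = stratified_split(rows)
```

The `utf-8-sig` codec reads plain UTF-8 and also strips a byte-order mark if a spreadsheet export added one. Because each group is rounded independently, the overall proportions are only approximately 70/15/15, and very small groups may contribute nothing to the validation or test set.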