ChattoBan: A Benchmark Dataset for Language Identification Between Bengali and Chittagonian Dialects

Published: 19 November 2025 | Version 1 | DOI: 10.17632/mfsg573r9t.1

Description

Chittagonian is one of the most widely spoken native languages in Bangladesh, with an estimated 14 million speakers across the country and abroad. Although Bengali is the national language, Chittagonian differs significantly from it in phonology, vocabulary, and grammar. These differences make automatic language identification an important step for NLP applications such as machine translation and sentiment analysis. To address the scarcity of Chittagonian language resources, we introduce ChattoBan, a benchmark dataset designed for sentence-level identification between Bengali and Chittagonian.

The dataset contains 6,151 annotated sentences, categorized as follows:

  • Chittagonian: 2,650 sentences
  • Bengali: 3,501 sentences

Chittagonian sentences were collected from social media platforms (Facebook, Twitter), Chittagonian news articles, song lyrics, and direct contributions from native speakers. Bengali sentences were sourced from various Bengali newspapers and classical literature to ensure authentic and diverse language representation.

To ensure annotation reliability, two native Chittagonian speakers and one native Bengali speaker independently reviewed and validated all sentence labels. Additionally, preprocessing steps such as duplicate removal, punctuation removal, and filtering of English characters and numbers were applied to enhance data quality while preserving linguistic authenticity.

The ChattoBan dataset has significant implications across multiple NLP and AI domains, including:

  • Language identification for closely related languages
  • Machine translation and code-switching analysis
  • Supervised and semi-supervised learning
  • Sociolinguistic and dialect studies
  • Bangla-centric NLP research and educational applications

The ChattoBan dataset is openly available for academic and research purposes, promoting collaboration and innovation within the Bangla NLP community. By providing a reliable benchmark for Bengali–Chittagonian identification, this dataset aims to support future advancements in low-resource language processing.
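
The preprocessing steps named in the description (duplicate removal, punctuation removal, and filtering of English characters and numbers) could be approximated as in the minimal Python sketch below. It is not the authors' actual pipeline. The function names, the punctuation set (ASCII punctuation plus the Bengali danda and double danda), the restriction to ASCII digits, and the order of the steps are all assumptions made for illustration.

```python
import re
import string

# Assumption: "punctuation" means ASCII punctuation plus the Bengali
# danda (U+0964) and double danda (U+0965). The dataset's real list is not given.
PUNCTUATION = string.punctuation + "\u0964\u0965"
_PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "\u0964\u0965]")

# Assumption: "English characters and numbers" means ASCII letters and ASCII digits.
# Bengali digits are left untouched in this sketch.
_LATIN_DIGIT_RE = re.compile(r"[A-Za-z0-9]")


def clean_sentence(text):
    """Drop ASCII letters/digits and punctuation, then collapse whitespace."""
    text = _LATIN_DIGIT_RE.sub("", text)
    text = _PUNCT_RE.sub(" ", text)
    return " ".join(text.split())


def preprocess(sentences):
    """Clean every sentence and keep the first occurrence of each distinct result."""
    seen = set()
    cleaned = []
    for sentence in sentences:
        result = clean_sentence(sentence)
        # Skip sentences that become empty or duplicate an earlier one.
        if result and result not in seen:
            seen.add(result)
            cleaned.append(result)
    return cleaned
```

Deduplicating after cleaning, rather than before, also merges sentences that differ only in punctuation or embedded Latin text; whether that matches the original procedure is not stated in the description.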

Steps to reproduce

  • Source: Collect Chittagonian sentences from public social media posts (Facebook, Twitter), Chittagonian news portals, song lyrics, and native speakers. Collect Bengali sentences from newspapers, online articles, and literature sources.
  • Collection Method: Use manual extraction or web-scraping tools (e.g., Selenium, BeautifulSoup, Requests) while following platform policies, ethical guidelines, and privacy rules. Native-speaker contributions should be collected with consent.
  • Data Format: Store all sentences in a structured format (e.g., CSV, JSON, XLSX) with fields such as sentence text, language label (Bengali/Chittagonian), and source type.
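
As an illustration of the data format described above, the following Python sketch writes and reads sentence records as CSV using only the standard library. The column names (`sentence`, `label`, `source_type`) and the helper functions are hypothetical and chosen for this example; the actual column names in the published files may differ.

```python
import csv
import io
from collections import Counter

# Hypothetical column names; the published files may use different ones.
FIELDNAMES = ["sentence", "label", "source_type"]
VALID_LABELS = {"Bengali", "Chittagonian"}


def write_records(records, fileobj):
    """Write records (dicts keyed by FIELDNAMES) as CSV, rejecting unknown labels."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDNAMES)
    writer.writeheader()
    for record in records:
        if record["label"] not in VALID_LABELS:
            raise ValueError("unexpected label: " + repr(record["label"]))
        writer.writerow(record)


def read_records(fileobj):
    """Read the CSV back into a list of dicts, one per sentence."""
    return list(csv.DictReader(fileobj))


def label_counts(records):
    """Count sentences per language label."""
    return Counter(record["label"] for record in records)


if __name__ == "__main__":
    # Placeholder sentence text; real entries would hold collected sentences.
    sample = [
        {"sentence": "আমি ভাত খাই", "label": "Bengali", "source_type": "newspaper"},
        {"sentence": "<Chittagonian sentence>", "label": "Chittagonian",
         "source_type": "social_media"},
    ]
    buffer = io.StringIO(newline="")
    write_records(sample, buffer)
    buffer.seek(0)
    print(label_counts(read_records(buffer)))
```

When writing to a real file, opening it with `encoding="utf-8"` and `newline=""` keeps the Bengali script intact and avoids blank rows on some platforms.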

Institutions

  • Daffodil International University

Categories

Natural Language Processing, Machine Learning, Language Identification, Sentence Processing
