ShadhuCholito-BN
Description
**Dataset Description**

This dataset contains parallel Bangla and English sentence pairs collected and processed for language translation and linguistic research. The Bangla corpus includes sentences written in both **Cholito Bangla** (colloquial form) and **Sadhu Bangla** (classical/formal form), extracted from Bangla literary sources. Each sentence is labeled according to its linguistic style and paired with an English translation.

The dataset consists of the following fields:

* **sentence**: The sentence text (Bangla or English, depending on the dataset version).
* **label**: Linguistic style label (`cholito` or `sadhu`).
* **source**: Source category of the text (`Book`).

Data preprocessing included sentence segmentation using Bangla punctuation markers, removal of noisy OCR artifacts, duplicate elimination, and dataset validation. The English version was generated through machine translation and aligned with the original Bangla sentences.

The dataset is intended for research and development in:

* Machine Translation (Bangla–English)
* Natural Language Processing (NLP)
* Text Classification
* Style Transfer
* Bangla Linguistic Analysis
* Large Language Model (LLM) Training and Evaluation

The dataset was randomly shuffled and split into training and testing subsets using a 70:30 ratio while preserving sentence alignment between the Bangla and English versions.
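The segmentation, duplicate elimination, and aligned 70:30 split described above could be sketched roughly as follows. This is an illustrative sketch and not the authors' actual pipeline: the function names, the choice of end markers (the Bangla danda `।` plus `?` and `!`), the exact-match deduplication, and the fixed random seed are all assumptions made for the example.

```python
import random
import re

# Assumption: sentences end with the Bangla danda (।), "?" or "!".
# The description only says "Bangla punctuation markers".
_SENTENCE = re.compile(r"[^।?!]+[।?!]*")


def split_sentences(text):
    """Split raw text into sentences, keeping the end marker on each one."""
    pieces = (piece.strip() for piece in _SENTENCE.findall(text))
    return [piece for piece in pieces if piece]


def deduplicate(sentences):
    """Drop exact duplicates while keeping first-seen order."""
    return list(dict.fromkeys(sentences))


def aligned_split(bangla, english, train_ratio=0.7, seed=42):
    """Shuffle one shared index list so Bangla/English pairs stay aligned.

    Returns ((train_bn, train_en), (test_bn, test_en)).
    The seed value is hypothetical; the description does not state one.
    """
    if len(bangla) != len(english):
        raise ValueError("Bangla and English lists must have the same length")
    indices = list(range(len(bangla)))
    random.Random(seed).shuffle(indices)
    cut = round(len(indices) * train_ratio)
    train_idx, test_idx = indices[:cut], indices[cut:]

    def pick(rows, idx):
        return [rows[i] for i in idx]

    train = (pick(bangla, train_idx), pick(english, train_idx))
    test = (pick(bangla, test_idx), pick(english, test_idx))
    return train, test
```

Shuffling a single index list and applying it to both languages is one simple way to keep row `i` of the Bangla version paired with row `i` of the English version in both subsets.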
Files
Institutions
- Chittagong University of Engineering & Technology, Chittagong