ShadhuCholito-BN
Description
## Dataset Collection, Translation, and Preparation Process

The dataset was created to support Bengali language style classification research, particularly for distinguishing between **Cholito Bangla** (colloquial Bengali) and **Sadhu Bangla** (classical/formal Bengali). Text data were collected from multiple Bengali literary sources, including novels, stories, and digital text-based books written by well-known Bengali authors.

Initially, text was extracted from digital Bengali documents and converted into machine-readable format. Sentence segmentation techniques were applied to divide paragraphs into individual sentences using Bengali punctuation marks and newline patterns. Each sentence was then manually or semi-automatically labeled as either `cholito` or `sadhu` according to its linguistic structure and writing style.

After extraction and labeling, multiple datasets from different sources were merged into a unified dataset structure containing two columns: `sentence` and `type`. Several preprocessing operations were performed to improve dataset quality, including removal of duplicate entries, elimination of empty rows, normalization of text formatting, and correction of inconsistent labels such as capitalization differences (e.g., “Cholito” vs “cholito”).

To create an English version of the dataset, the Bangla sentences were translated into English using the `GoogleTranslator` module from the `deep-translator` library. A batch translation strategy was implemented to improve processing efficiency and reduce API requests by translating multiple sentences simultaneously. A retry mechanism was also incorporated to handle temporary API failures and connection issues during translation. The translated outputs were stored as a parallel English dataset while preserving their original labels.
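The segmentation, label normalization, and batched translation with retries described above could be sketched roughly as follows. This is a minimal illustration, not the authors' actual code: the function names, the choice of sentence terminators (the danda `।`, `?`, `!`, and newlines), the batch size, and the retry delay are all assumptions. The translator is passed in as a callable so the sketch runs without network access.

```python
from __future__ import annotations

import re
import time
from typing import Callable

# Assumed terminators: Bengali danda, "?", "!" followed by whitespace, or newlines.
_SENTENCE_END = re.compile(r"(?<=[।?!])\s+|\n+")


def split_sentences(text: str) -> list[str]:
    """Split a paragraph into sentences on Bengali punctuation and newlines."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]


def normalize_label(label: str) -> str:
    """Map variants such as 'Cholito' or ' SADHU ' to 'cholito' / 'sadhu'."""
    return label.strip().lower()


def translate_in_batches(
    sentences: list[str],
    translate_batch: Callable[[list[str]], list[str]],
    batch_size: int = 50,      # hypothetical value, not taken from the dataset
    max_retries: int = 3,      # hypothetical value
    delay: float = 1.0,        # seconds; grows linearly with each attempt
) -> list[str]:
    """Translate sentences batch by batch, retrying a batch on failure."""
    translated: list[str] = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        for attempt in range(1, max_retries + 1):
            try:
                translated.extend(translate_batch(batch))
                break  # this batch succeeded, move on to the next one
            except Exception:
                if attempt == max_retries:
                    raise  # give up after the last attempt
                time.sleep(delay * attempt)
    return translated
```

With `deep-translator` installed, the callable could plausibly be `GoogleTranslator(source="bn", target="en").translate_batch`. The exact configuration used for this dataset is not stated in the description.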
After completing the translation process, additional cleaning operations were performed to remove missing values, empty strings, translation errors, and duplicate entries from both the Bangla and English datasets. The final datasets were then shuffled randomly to reduce source-wise ordering bias, so that sentences from the same source are not clustered together.

Finally, both the Bangla and English datasets were divided into training and testing subsets using a 70:30 ratio. The training subsets are intended for machine learning model training, while the testing subsets are reserved for evaluation and performance analysis. All datasets are stored in CSV format with UTF-8 encoding to ensure proper multilingual text representation and compatibility across NLP and machine learning frameworks.
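As a rough illustration of the cleaning, shuffling, 70:30 splitting, and UTF-8 CSV export steps described above, the following sketch uses only the Python standard library. The function names, the fixed random seed, and the rule of deduplicating on the sentence text are assumptions made for the example, not details confirmed by the dataset description.

```python
import csv
import random


def clean_shuffle_split(rows, train_fraction=0.7, seed=42):
    """Drop missing/empty/duplicate sentences, shuffle, and split 70:30.

    `rows` is an iterable of (sentence, type) pairs. The seed is a
    hypothetical choice that only makes the shuffle reproducible.
    """
    seen = set()
    cleaned = []
    for sentence, label in rows:
        sentence = (sentence or "").strip()  # treat None as an empty string
        if not sentence or sentence in seen:
            continue
        seen.add(sentence)
        cleaned.append((sentence, label))

    random.Random(seed).shuffle(cleaned)
    cut = round(len(cleaned) * train_fraction)
    return cleaned[:cut], cleaned[cut:]


def write_csv(path, rows):
    """Write rows to a UTF-8 CSV file with `sentence` and `type` columns."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sentence", "type"])
        writer.writerows(rows)
```

A call such as `train, test = clean_shuffle_split(rows)` followed by `write_csv("train.csv", train)` and `write_csv("test.csv", test)` would produce the two subsets. The file names here are placeholders.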
Institutions
- Chittagong University of Engineering & Technology, Chittagong, Chittagong