Bangla Mental Health Dataset V2

Published: 21 April 2026 | Version 2 | DOI: 10.17632/23tcfxkgc2.2
Contributors:
Esfer Sami

Description

This is a synthetically generated Bangla-language mental health dataset of 10,000 structured conversational samples. It is designed to support research in natural language processing (NLP) for low-resource languages such as Bangla, with applications in large language model (LLM) fine-tuning, mental health text classification, and dialogue system development.

Each sample follows an instruction-based format (instruction–input–output), so the dataset can be used directly for supervised fine-tuning (SFT), Alpaca-style training, and parameter-efficient methods such as LoRA and QLoRA. The samples cover approximately 40 mental health-related conditions, including stress, anxiety, overthinking, lack of emotional support, and self-confidence issues, expressed in natural Bangla conversational patterns.

The dataset is fully synthetic. It was generated using controlled text generation pipelines informed by mental health literature, psychological reports, media discussions, and publicly available educational content. No real user data or personally identifiable information (PII) is included.

This dataset is intended strictly for research and educational purposes. It is not suitable for clinical use, diagnosis, or real-world mental health decision-making. The resource aims to facilitate safe and reproducible experimentation in Bangla NLP and conversational AI.
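As an illustration of how an instruction-format sample might be turned into SFT training text, here is a minimal Python sketch. The field names (`instruction`, `input`, `output`), the prompt headers, and the placeholder record are assumptions based on the common Alpaca convention; they are not taken from the released files, so check the actual column names before use.

```python
# Minimal sketch: render one instruction-format record as Alpaca-style text.
# ASSUMPTION: records are dicts with "instruction", "input", "output" keys.
# The "### ..." headers follow a common Alpaca-style layout, not a format
# specified by the dataset itself.


def format_alpaca_prompt(record: dict) -> str:
    """Build the prompt part (everything before the model's response)."""
    instruction = record["instruction"].strip()
    user_input = record.get("input", "").strip()
    if user_input:
        return (
            "### Instruction:\n" + instruction + "\n\n"
            "### Input:\n" + user_input + "\n\n"
            "### Response:\n"
        )
    # Records with an empty input use the shorter two-part layout.
    return "### Instruction:\n" + instruction + "\n\n### Response:\n"


def to_training_text(record: dict) -> str:
    """Concatenate the prompt and the target output for supervised fine-tuning."""
    return format_alpaca_prompt(record) + record["output"].strip()


# Hypothetical placeholder record (not a real sample from the dataset).
sample = {
    "instruction": "Respond supportively in Bangla to the user's message.",
    "input": "<Bangla user message>",
    "output": "<Bangla supportive reply>",
}

print(to_training_text(sample))
```

The resulting string can be tokenized and passed to a standard SFT trainer; LoRA or QLoRA changes how the model weights are updated, not how the text is formatted.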

Steps to reproduce

The dataset was generated using a controlled synthetic data generation pipeline.

1. Mental health topics were identified based on literature, reports, and public discussions.
2. Prompt templates were designed to simulate conversational scenarios in Bangla.
3. Text samples were generated using large language models with controlled instructions.
4. Generated outputs were filtered and refined to ensure linguistic consistency and relevance.
5. Data was structured into instruction–input–output format suitable for supervised fine-tuning (SFT).

No real user data was used at any stage. All content is synthetic and created for research purposes.
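The filtering and structuring steps (4 and 5) could be sketched roughly as follows. This is not the pipeline actually used: the filtering criteria are not described here, so the Bengali-script ratio check, the thresholds, the deduplication rule, and the sample strings are all illustrative assumptions.

```python
import json

# Unicode block for Bengali script (U+0980 to U+09FF).
BENGALI_START, BENGALI_END = 0x0980, 0x09FF


def bengali_ratio(text: str) -> float:
    """Fraction of alphabetic characters that fall in the Bengali block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    in_block = sum(BENGALI_START <= ord(ch) <= BENGALI_END for ch in letters)
    return in_block / len(letters)


def keep(output: str, min_ratio: float = 0.8, min_chars: int = 5) -> bool:
    """ASSUMED filter: output is long enough and mostly Bengali script."""
    output = output.strip()
    return len(output) >= min_chars and bengali_ratio(output) >= min_ratio


def build_records(raw_pairs, instruction: str) -> list:
    """Deduplicate (input, output) pairs, filter them, and structure them."""
    seen = set()
    records = []
    for user_input, output in raw_pairs:
        key = (user_input.strip(), output.strip())
        if key in seen or not keep(output):
            continue
        seen.add(key)
        records.append(
            {"instruction": instruction, "input": key[0], "output": key[1]}
        )
    return records


# Hypothetical generated pairs, for illustration only.
raw_pairs = [
    ("আমি খুব চিন্তিত", "আপনার অনুভূতি স্বাভাবিক।"),
    ("আমি খুব চিন্তিত", "আপনার অনুভূতি স্বাভাবিক।"),  # exact duplicate
    ("hello", "This is English text"),  # fails the script check
]

records = build_records(raw_pairs, "Respond supportively in Bangla.")

# One JSON object per line (JSONL); ensure_ascii=False keeps Bangla readable.
jsonl_lines = [json.dumps(r, ensure_ascii=False) for r in records]
print("\n".join(jsonl_lines))
```

A script-ratio check only catches outputs in the wrong language or script; judging relevance, as step 4 mentions, would need additional rules or manual review.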

Categories

Artificial Intelligence, Mental Health, Natural Language Processing, Behavioral Psychology

Licence