Depression & Mental Health Classification
Description
This dataset is derived from a structured mental health and depression survey and contains 1,998 cleaned responses. It includes 21 demographic, lifestyle, behavioral, and psychological features, with the primary objective of supporting multi-class depression classification tasks. Each record is labeled with one of twelve clinically motivated depression types, making the dataset particularly suitable for supervised learning, explainable AI studies, and mental health analytics.

Target Variable: Depression_Type
The target variable is numerically encoded as follows:

Code | Depression Type (Academic Naming)
0    | No clinically significant depression
1    | Minimal / Mild depression
2    | Moderate depression
3    | Moderately severe depression
4    | Severe depression
5    | Persistent depressive disorder (Dysthymia)
6    | Seasonal affective pattern
7    | Peripartum / Postpartum depression
8    | Bipolar-related depressive episode
9    | Situational / Reactive depression
10   | Psychotic depression
11   | Other specified depressive disorder

These class names are aligned with commonly recognized categories in clinical and academic research and can be used directly in scholarly publications.

Feature Variables and Encoding Scheme
All categorical variables were numerically encoded to support statistical analysis and machine learning models. The encodings follow logical ordinal or nominal mappings, as outlined below:

- Gender: 0 = Male, 1 = Female
- Education_Level: 0 = Primary or below, 1 = Secondary / High school, 2 = Undergraduate, 3 = Postgraduate or higher
- Employment_Status: 0 = Unemployed, 1 = Student, 2 = Employed, 3 = Self-employed, 4 = Other
- Symptoms: encoded numerically to represent different symptom clusters (e.g., sleep disturbance, appetite loss)
- Low_Energy: 0 = No, 1 = Yes, 2 = Occasionally
- Low_SelfEsteem: 0 = No, 1 = Yes, 2 = Occasionally
- Search_Depression_Online: 0 = No, 1 = Yes
- Worsening_Depression: 0 = No, 1 = Yes
- Overeating_Level: 0 = None, 1-4 = Mild, 5-8 = Moderate, 9-12 = Severe (grouped for interpretability)
- Eating_Frequency (per day): 0 = 2 meals or fewer, 1 = 3 meals, 2 = 4-5 meals, 3 = more than 5 meals
- SocialMedia_WhileEating: 0 = Never, 1 = Rarely, 2 = Often, 3 = Always
- Self_Harm: 0 = No history, 1 = History of self-harm
- Mental_Health_Support: 0 = No, 1 = Yes
- Suicide_Attempts: 0 = None, 1 = Once, 2 = Twice, 3 = Three or more

All variables were checked, and no missing values remain after preprocessing.

Data Preprocessing & Normalization
- Removed non-informative identifiers and irrelevant text fields
- Verified zero missing values across all 21 features
- Standardized continuous variables (e.g., Depression_Score, Sleep_Hours) using Z-score normalization
- Split the data into 80% training and 20% testing

Feature Selection
Using the ANOVA F-test, the following features were identified as most discriminative: Education_Level, Employment_Status, Symptoms, Low_Energy, Search_Depression_Online, Worsening_Depression, Eating_Frequency, SocialMedia_WhileEating, Nervous_Level, and Mental_Health_Support.
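The preprocessing and feature-selection steps described above (Z-score standardization, an 80/20 split, and ANOVA F-test ranking) could be sketched with scikit-learn as follows. This is an illustrative sketch only: it uses a small synthetic array in place of the survey data, and the sizes (200 rows, 5 features, top k=3) are arbitrary placeholders rather than the dataset's actual dimensions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the survey data (the real dataset has
# 1,998 rows, 21 features, and 12 target classes coded 0-11).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 12, size=200)

# Z-score normalization: each column is centered and scaled to unit variance
X_scaled = StandardScaler().fit_transform(X)

# 80% training / 20% testing split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42)

# ANOVA F-test: keep the k features most discriminative of the target
selector = SelectKBest(score_func=f_classif, k=3)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)
```

On the real data, `k` would be chosen to retain the ten features listed above, and the scaler/selector would be fit on the training split only, as here, to avoid leakage into the test set.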
Files
Steps to reproduce
Data Preprocessing & Normalization
- Removed non-informative identifiers and irrelevant text fields
- Verified zero missing values across all 21 features
- Standardized continuous variables (e.g., Depression_Score, Sleep_Hours) using Z-score normalization
- Split the data into 80% training and 20% testing

Class Imbalance Handling
The dataset initially showed significant class imbalance (e.g., Situational depression had 627 samples, while some classes had fewer than 30). To address this, SMOTETomek was applied, combining oversampling and undersampling. After balancing, each class contained approximately 600 or more samples, supporting fair and unbiased model learning.

Modeling and Research Usage
The dataset supports a wide range of applications, including:
- Multi-class depression classification
- Exploratory data analysis of mental health patterns
- Feature importance and explainability studies (LIME, SHAP)
- Early detection models for mental health support systems

In our experiments, baseline models such as Logistic Regression, Random Forest, XGBoost, SVM, and MLP achieved accuracies between 88% and 98%. We hope this explanation clarifies the dataset structure and encoding scheme; please feel free to reach out if you need further clarification or assistance.