PHQ-9 Annotated Social Media Posts Dataset for Automated Depression Severity Assessment

Published: 11 February 2026 | Version 1 | DOI: 10.17632/fs2r7zvmj3.1
Contributors:
Md Abdullah Ibne Aziz Abdullah

Description

1. Overview
This dataset contains a structured collection of social media posts annotated using the Patient Health Questionnaire-9 (PHQ-9) framework to support automated depression severity assessment with natural language processing (NLP) techniques. It was developed to facilitate computational modeling of depression symptoms and severity levels from textual expressions shared on publicly accessible platforms. The dataset is designed to support research in computational mental health, affective computing, and machine learning-driven psychological assessment.

2. Data Source and Collection
The textual data were collected from publicly accessible social media platforms. Only openly available posts were included, and no private messages or restricted content were accessed. During collection, all identifying information, such as usernames, profile identifiers, and direct metadata, was removed to ensure anonymity. The resulting dataset consists solely of cleaned textual content prepared for research purposes.

3. Annotation Framework
Each post was annotated according to the nine clinical criteria of the PHQ-9 instrument: reduced interest or pleasure, depressed mood, sleep disturbances, fatigue, appetite changes, feelings of guilt or worthlessness, concentration difficulties, psychomotor changes, and suicidal ideation. For each entry, symptom presence and intensity were evaluated. Individual symptom indicators were recorded, and a cumulative PHQ-9 score was calculated. Based on this total score, each post was assigned one of the standard severity levels: minimal, mild, moderate, moderately severe, or severe depression.

4. Data Structure
The dataset is provided in CSV format and includes structured variables such as a unique anonymized identifier, cleaned textual content, nine symptom-level indicators, total PHQ-9 score, categorical severity label, and numerical encoding where applicable.

5. Preprocessing Procedures
To ensure data quality and consistency, several preprocessing steps were applied: text normalization, removal of URLs and special characters, elimination of duplicate entries, and handling of missing values. No synthetic or artificially generated text has been included; all entries are derived from authentic social media content.

6. Research Applications
The dataset can be used for a variety of research purposes, including depression severity classification, symptom-level prediction, traditional machine learning modeling, deep learning approaches, transformer-based language models, and explainable artificial intelligence in mental health research. It is intended strictly for academic and research use. Researchers are expected to follow ethical guidelines and institutional policies when using the dataset.
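
As an illustration of the scoring and preprocessing steps described above, the following Python sketch shows how nine symptom indicators might be summed into a total score and mapped to a severity level, along with one possible form of URL and special-character removal. It is a minimal sketch under stated assumptions, not the procedure used to build the dataset: it assumes each symptom indicator is scored 0 to 3 as in the standard PHQ-9 instrument, uses the standard PHQ-9 cut-offs (0-4, 5-9, 10-14, 15-19, 20-27), and the function names and regular expressions are hypothetical.

```python
import re

# Standard PHQ-9 severity bands as (upper bound of total score, label).
SEVERITY_BANDS = [
    (4, "minimal"),
    (9, "mild"),
    (14, "moderate"),
    (19, "moderately severe"),
    (27, "severe"),
]

# Assumption: URLs start with http(s):// or www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# Assumption: "special characters" means anything other than word
# characters, whitespace, and basic punctuation.
SPECIAL_PATTERN = re.compile(r"[^\w\s.,!?'-]")


def clean_text(text):
    """Remove URLs and special characters, then collapse whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = SPECIAL_PATTERN.sub(" ", text)
    return " ".join(text.split())


def phq9_total(item_scores):
    """Sum nine symptom scores, assuming each lies in the range 0..3."""
    if len(item_scores) != 9:
        raise ValueError("expected exactly nine symptom scores")
    if any(not 0 <= score <= 3 for score in item_scores):
        raise ValueError("each symptom score must be between 0 and 3")
    return sum(item_scores)


def severity_label(total):
    """Map a total PHQ-9 score (0..27) to its standard severity level."""
    if not 0 <= total <= 27:
        raise ValueError("total PHQ-9 score must be between 0 and 27")
    for upper_bound, label in SEVERITY_BANDS:
        if total <= upper_bound:
            return label


if __name__ == "__main__":
    # Hypothetical example values, not taken from the dataset.
    scores = [1, 2, 0, 3, 1, 0, 2, 0, 0]
    total = phq9_total(scores)
    print(total, severity_label(total))
    print(clean_text("Can't sleep again... see https://example.com/x #tired"))
```

The column names and value ranges in the actual CSV file should be checked before adapting this sketch, since the description does not state how the symptom indicators are encoded.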

Files

Categories

Psychology, Depression, Social Media, Mental Health, Anxiety, Natural Language Processing, Machine Learning, Emotional Stress, Bangladesh

Licence