Sentiment, Emotion analysis for mental health based on text
Description
This dataset contains 160,000 social media text records curated and customized from two distinct sources to facilitate advanced emotion recognition and sentiment analysis tasks. The text entries capture nuanced human emotions and psychological states expressed on digital platforms. Each record is categorized into one of ten specific classes, spanning standard emotions, general sentiments, and critical psychological states like depression. This curated dataset is highly suitable for training machine learning models, deep learning architectures (e.g., BERT, RoBERTa, BiLSTM), and ensemble meta-classifiers.
Steps to reproduce
This dataset was created through a process of data integration, curation, and filtration from two distinct social media textual datasets. To reproduce or replicate this dataset, follow these sequential steps:

1. Data Sourcing & Collection:
- Identify and retrieve two independent raw datasets containing social media text (such as tweets, posts, or comments) labeled with emotions and mental health indicators.

2. Data Merging & Formatting:
- Load both datasets into a data processing environment (e.g., using Python Pandas or R).
- Standardize the column names across both datasets to 'Text' (for the social media posts) and 'Emotion' (for the target labels).
- Concatenate/merge the two datasets into a single unified dataframe.

3. Label Standardization & Curation:
- Analyze the target labels from both original sources.
- Map and harmonize overlapping or synonymous labels into 10 distinct categorical classes: 'love', 'happiness', 'sadness', 'Normal', 'hate', 'anger', 'Depression', 'fun', 'surprise', and 'worry'.
- Filter out any records with ambiguous, corrupted, or irrelevant emotional labels to maintain dataset integrity.

4. Data Cleaning & Deduplication (Raw Text Preservation):
- Remove any duplicate text entries to avoid data leakage during model training.
- Drop rows with missing values (NaN) in either the 'Text' or 'Emotion' columns.
- Keep the social media text in its completely raw format (preserving original spelling, punctuation, and linguistic nuances) without applying heavy preprocessing, tokenization, or stopword removal, making it ideal for deep learning architectures like BERT/RoBERTa.

5. Final Export:
- Reset the dataframe index.
- Export the final curated corpus of 160,000 records into a single comma-separated values file named "Emotion_Sentiment_DataSet.csv".
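
Steps 2 through 5 can be sketched in Python with pandas roughly as follows. This is a minimal illustration, not the authors' actual script: the two source datasets are not named in this record, so the function name build_corpus, the source column names, and the entries in LABEL_MAP are all hypothetical placeholders. The sketch also assumes that, when the same text appears more than once, the first occurrence and its label are the ones kept.

```python
import pandas as pd

# The ten target classes listed in step 3.
TARGET_CLASSES = [
    "love", "happiness", "sadness", "Normal", "hate",
    "anger", "Depression", "fun", "surprise", "worry",
]

# Hypothetical label mapping: the real one depends on the labels used by
# the two source datasets, which this record does not specify.
LABEL_MAP = {"joy": "happiness", "depressed": "Depression", "neutral": "Normal"}


def build_corpus(df_a, df_b, cols_a, cols_b, label_map=LABEL_MAP):
    """Merge two labeled text frames into one curated 'Text'/'Emotion' frame."""
    # Step 2: standardize column names, then concatenate both sources.
    a = df_a.rename(columns=cols_a)[["Text", "Emotion"]]
    b = df_b.rename(columns=cols_b)[["Text", "Emotion"]]
    merged = pd.concat([a, b], ignore_index=True)

    # Step 4 (part): drop rows with a missing text or label.
    merged = merged.dropna(subset=["Text", "Emotion"])

    # Step 3: harmonize synonymous labels, then keep only the ten classes.
    merged = merged.assign(Emotion=merged["Emotion"].replace(label_map))
    merged = merged[merged["Emotion"].isin(TARGET_CLASSES)]

    # Step 4: remove duplicate texts (first occurrence wins, an assumption);
    # the text itself is left raw, with no cleaning or tokenization.
    merged = merged.drop_duplicates(subset="Text")

    # Step 5: reset the index before export.
    return merged.reset_index(drop=True)


if __name__ == "__main__":
    # Tiny made-up frames standing in for the two raw sources.
    raw_a = pd.DataFrame({"content": ["I love this", "so sad today", None],
                          "sentiment": ["love", "sadness", "anger"]})
    raw_b = pd.DataFrame({"post": ["I love this", "feeling empty", "ok"],
                          "status": ["love", "depressed", "unknown"]})
    corpus = build_corpus(
        raw_a, raw_b,
        cols_a={"content": "Text", "sentiment": "Emotion"},
        cols_b={"post": "Text", "status": "Emotion"},
    )
    # Step 5: export without writing the index as a column.
    corpus.to_csv("Emotion_Sentiment_DataSet.csv", index=False)
```

Deduplicating on 'Text' alone, rather than on the ('Text', 'Emotion') pair, means a post can never appear under two different labels, which is the stricter reading of "avoid data leakage"; a replication that wants to keep conflicting labels for inspection would deduplicate on both columns instead.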
Institutions
- National University Bangladesh, Dhaka Division, Dhaka