Depression Indicators in Twitter

Published: 5 March 2025 | Version 1 | DOI: 10.17632/s25h5tzgyf.1
Contributor: Ataíde Gualberto

Description

The dataset was created to identify relevant features for detecting individuals with depression based on their Twitter posts. It consists of 3,758 tweets and 5,902 unique words, structured as a binary matrix in which each row represents a tweet and each column represents a word. Each value indicates the presence (1) or absence (0) of that word in the given tweet. In addition to the textual data, the dataset includes non-textual features, stored in a separate table. In that table, each row represents a tweet and each column corresponds to one attribute: the number of likes, the number of retweets, the number of mentions, the time of publication, and the device used for posting. The posting time was converted to an integer from 0 to 47, where each value represents one 30-minute interval of the day. The device type, in contrast, is stored as raw text containing the name of the device used to post the tweet. The numerical values (likes, retweets, and mentions) were kept as raw counts, preserving their original scale for further analysis. This dataset was used in the study "Characteristics for depression detection using Twitter data" (DOI: 10.59681/2175-4411.v16.iEspecial.2024.1319).
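The 0 to 47 time encoding described above can be illustrated with a short Python sketch. It is an assumption, not the authors' code: the description does not state which slot starts at midnight or which time zone was used, so the function name `half_hour_slot` and the convention that slot 0 covers 00:00 to 00:29 are hypothetical.

```python
from datetime import datetime


def half_hour_slot(ts: datetime) -> int:
    """Map a timestamp to one of 48 half-hour slots.

    Assumed convention (not confirmed by the dataset description):
    slot 0 covers 00:00-00:29 and slot 47 covers 23:30-23:59.
    """
    return ts.hour * 2 + (1 if ts.minute >= 30 else 0)


if __name__ == "__main__":
    # Hypothetical posting times, used only to show the mapping.
    for ts in (datetime(2021, 5, 3, 0, 10),
               datetime(2021, 5, 3, 13, 45),
               datetime(2021, 5, 3, 23, 59)):
        print(ts.time(), "->", half_hour_slot(ts))
```

Under this convention every value falls in the range 0 to 47, matching the range given in the description.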

Steps to reproduce

The dataset was built by collecting tweets from Brazilian users who explicitly stated that they had been diagnosed with depression. First, a Python script, written and run in Visual Studio Code, used the snscrape tool to search Twitter for posts containing words such as “diagnosticado” ("diagnosed") and “depressão” ("depression") published between March 11, 2020, and October 16, 2022. In addition to the tweet text, the script collected the date and time of the post, hashtags, number of likes, number of retweets, and the type of device used. The collected data initially took the form of a set of dictionaries. To make it easier to work with, it was converted into a table using the Pandas library and saved as a CSV file. A team of four researchers then manually reviewed all the tweets, removing false positives (tweets about someone else, jokes, or other irrelevant content) and keeping only posts that clearly indicated a personal diagnosis. Only one tweet per user was kept, to balance the dataset. After filtering, the text was cleaned: all letters were converted to lowercase, and accents, punctuation, special characters, emojis, user mentions, and links were removed. Stemming was also applied to reduce words to their stems. In parallel, the tweets were converted into a binary matrix in which each row represents a tweet and each column indicates whether a given unique word is present. Throughout this process, ethical considerations were observed by anonymizing user IDs and using only publicly available data, in line with Twitter's guidelines.
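The cleaning and binary-matrix steps described above might look roughly like the following Python sketch. It is an illustration under stated assumptions, not the authors' script: the function names `clean_tweet` and `binary_matrix`, the regular expressions, and the sample tweets are all hypothetical. Stemming is left out because the description does not name the stemmer that was used, and the standard library is used in place of Pandas so that the sketch runs without extra dependencies.

```python
import re
import unicodedata


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip links, mentions, accents, and symbols."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # links
    text = re.sub(r"@\w+", " ", text)                   # user mentions
    # Remove accents: decompose characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Drop punctuation, special characters, and emojis.
    # Keeping digits is an assumption; the description does not mention them.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return " ".join(text.split())


def binary_matrix(docs):
    """Return (vocabulary, rows) where rows[i][j] is 1 if word j is in doc i."""
    token_sets = [set(doc.split()) for doc in docs]
    vocab = sorted(set().union(*token_sets)) if token_sets else []
    rows = [[1 if word in tokens else 0 for word in vocab]
            for tokens in token_sets]
    return vocab, rows


if __name__ == "__main__":
    # Invented example tweets, not taken from the dataset.
    raw = [
        "Fui diagnosticado com depressão hoje! @amigo https://t.co/xyz",
        "Depressão não é frescura 😔 #saúde",
    ]
    cleaned = [clean_tweet(t) for t in raw]
    vocab, rows = binary_matrix(cleaned)
    print(vocab)
    for row in rows:
        print(row)
```

A stemmer for Portuguese would be applied to each token between `clean_tweet` and `binary_matrix`; which one the authors used is not stated, so that step is left as a gap here.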

Institutions

Instituto Federal de Educacao Ciencia e Tecnologia de Sergipe, Universidade Federal de Sergipe

Categories

Depression, Pattern Recognition in Bioinformatics

Funding

Coordenação de Aperfeiçoamento de Pessoal de Nível Superior

88887.712271/2022-00
