Secondary data from Brazilian tweets about COVID-19

Published: 29 June 2021 | Version 1 | DOI: 10.17632/kv52cwvskc.1
Fernando Xavier,


These datasets contain secondary data extracted from Brazilian tweets about the COVID-19 pandemic. The data were generated between March 2020 and May 2021. Due to Twitter's restrictions, the raw data could not be shared. In addition, to respect users' privacy and in accordance with the recommendations of the Brazilian Research Ethics Committee, tweet IDs cannot be shared here either. Instead, a set of scripts was applied to the raw data to extract useful information and create these datasets for further research. Files whose names start with the prefix top contain the most cited words per day in three categories: general tweets, vaccine-related tweets, and tweets from verified accounts. Files whose names start with the prefix subject contain the daily count of mentions in the following categories: symptoms, drugs, vaccines, and vaccine brands or manufacturers, along with the count per day. Because the data are associated with the post date, studies can take the temporal aspect into account, for example to compare users' perception of a given subject over time. Note that some issues became more important over time, especially those related to vaccines.
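The daily mention counts described above can be illustrated with a minimal Python sketch. It is not the script used to build these datasets: the keyword lists in `CATEGORIES`, the function name `count_mentions`, and the `(date, text)` input format are all assumptions made for illustration, and the real term lists are not given in this description.

```python
from collections import Counter
import re

# Hypothetical keyword lists (assumption); the dataset's actual terms are not listed here.
CATEGORIES = {
    "symptoms": {"febre", "tosse"},
    "drugs": {"cloroquina", "ivermectina"},
    "vaccines": {"vacina", "coronavac"},
}


def count_mentions(posts):
    """Count category mentions per day.

    posts: iterable of (date_string, text) pairs.
    Returns a dict mapping each date to a Counter of category -> mentions.
    """
    daily = {}
    for date, text in posts:
        # Lowercase and split into word tokens.
        tokens = re.findall(r"\w+", text.lower())
        counts = daily.setdefault(date, Counter())
        for category, terms in CATEGORIES.items():
            counts[category] += sum(1 for token in tokens if token in terms)
    return daily


sample_posts = [
    ("2021-01-17", "Tomei a vacina coronavac hoje"),
    ("2021-01-17", "febre e tosse"),
    ("2021-01-18", "cloroquina"),
]
mentions = count_mentions(sample_posts)
```

Keying the result by post date keeps the temporal dimension, so counts for a category can be compared from one day to the next.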


Steps to reproduce

Data were collected through the Twitter API and stored in a MongoDB database. The text of each post and its account type (verified or not) were then extracted into daily files. The secondary data were generated from these files with bash and Python scripts. The scripts used can be found at [link missing in source]. Because posts were collected in real time, many of them may no longer be available, for reasons such as: 1) account removed by the user; 2) post removed by the user; 3) account suspended by Twitter; 4) account made private by the user.
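As a rough illustration of the word-count step, the sketch below computes the most cited words from one day's post texts. It is an assumption-laden example, not the authors' script: the function name `top_words`, the stopword list, the minimum word length, and the URL and mention filtering are all hypothetical choices.

```python
from collections import Counter
import re

# Illustrative Portuguese stopwords (assumption); a real run would need a fuller list.
STOPWORDS = {"a", "o", "e", "de", "da", "do", "que", "em", "para"}


def top_words(texts, n=3):
    """Return the n most frequent words across one day's post texts."""
    counts = Counter()
    for text in texts:
        # Drop URLs and @mentions before tokenizing.
        cleaned = re.sub(r"https?://\S+|@\w+", " ", text.lower())
        # Keep runs of letters only (accented letters included).
        words = re.findall(r"[^\W\d_]+", cleaned)
        counts.update(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(n)


day_texts = [
    "A vacina chegou https://t.co/x",
    "Vacina para todos @user",
    "todos de máscara",
]
top = top_words(day_texts, n=2)
```

Running the function once per daily file, separately for general, vaccine-related, and verified-account posts, would give one ranked word list per day and category.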


Universidade de Sao Paulo


Epidemiology, Social Media, Vaccine, Twitter, COVID-19