Dataset of tweets in English language about the COVID-19 pandemic for binary sentiment analysis

Published: 13 September 2021 | Version 1 | DOI: 10.17632/6fx22vj6g6.1


This dataset is intended for sentiment analysis of tweets about the COVID-19 pandemic. It comes in 3 versions, comprising 186,000, 132,000, and 82,000 English-language tweets, respectively, all with stopwords removed. In all versions, positive tweets have polarity 1 and negative tweets have polarity 0. All versions were selected, cleaned, and organized from the public dataset available at <>. The datasets are accompanied by embedding matrices generated from the pre-trained Word2Vec shallow neural network available at <>.


Steps to reproduce

The data file contains 3 folders, each with 3 files: one .csv file of positive tweets, one .csv file of negative tweets, and one .npy file holding the embedding matrix. The .csv files are organized into 4 columns: the tokenized tweet with stopwords removed, its polarity, the tweet size, and its vector representation.
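
As an illustration of the layout described above, one version folder could be loaded roughly as follows. This is a minimal sketch, not the authors' code. The column names in COLUMNS, the presence of a header row, and every file and folder name passed to load_version are assumptions, since the record does not state them.

```python
import csv
from pathlib import Path

import numpy as np

# Hypothetical names for the 4 columns described in the record;
# the actual header names in the .csv files may differ.
COLUMNS = ["tokens", "polarity", "size", "vector"]


def read_tweets(csv_path, has_header=True):
    """Read one .csv file into a list of dicts keyed by COLUMNS (by position)."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        if has_header:
            next(reader, None)  # skip the header row (assumed to be present)
        return [dict(zip(COLUMNS, row)) for row in reader]


def load_version(folder, positive_name, negative_name, embedding_name):
    """Load both tweet files and the embedding matrix from one version folder."""
    folder = Path(folder)
    # Positive rows come first, followed by negative rows.
    tweets = read_tweets(folder / positive_name) + read_tweets(folder / negative_name)
    # Polarity is 1 for positive tweets and 0 for negative tweets.
    labels = np.array([int(float(row["polarity"])) for row in tweets])
    embedding = np.load(folder / embedding_name)
    return tweets, labels, embedding
```

A call would look like load_version("some_folder", "positive.csv", "negative.csv", "embedding.npy"), where all four names are placeholders to be replaced with the actual names in the downloaded archive. The "tokens" and "vector" fields are returned as raw strings because their serialization format is not documented here.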


Instituto Federal de Educacao Ciencia e Tecnologia do Espirito Santo


Natural Language Processing, Twitter, Sentiment Analysis, Word Embedding, COVID-19