Dataset of tweets in English language about the COVID-19 pandemic for binary sentiment analysis
This dataset is intended for sentiment analysis of tweets about the COVID-19 pandemic. It comes in 3 versions, comprising 186,000, 132,000, and 82,000 English-language tweets, respectively, all with stopwords removed. In all versions, positive tweets have polarity 1 and negative tweets have polarity 0. All versions were selected, cleaned, and organized from the public dataset available at <https://ieee-dataport.org/open-access/coronavirus-covid-19-tweets-dataset>. Each version is accompanied by an embedding matrix generated from the pre-trained Word2Vec shallow neural network available at <https://data.mendeley.com/datasets/t8bxg423yk/1>.
Steps to reproduce
The data file contains 3 folders, each with 3 files: one .csv file of positive tweets, one .csv file of negative tweets, and one .npy file containing the embedding matrix. Each .csv file has 4 columns: the tokenized tweet with stopwords removed, its polarity, the tweet size, and the tweet's vector representation.
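As an illustration of the layout described above, the following is a minimal Python sketch for loading one version of the dataset. It rests on several assumptions that the description does not confirm: that the 4 columns appear in the order listed, that each .csv file starts with a header row (controlled here by the hypothetical `has_header` parameter), and that the files are UTF-8 encoded. The function names and the dictionary keys are illustrative only, and the file names must be supplied by the user, since the actual names are not given here. The .npy file is read with `numpy.load`, imported lazily so that the .csv helpers need only the standard library.

```python
import csv
from pathlib import Path


def load_tweets(csv_path, has_header=True):
    """Read one .csv file into a list of dicts, taking columns by position.

    Assumed column order (not confirmed by the dataset description):
    tokenized tweet, polarity, tweet size, vector representation.
    """
    rows = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)  # skip the assumed header row
        for record in reader:
            if len(record) < 4:
                continue  # skip blank or malformed lines
            tokens, polarity, size, vector = record[:4]
            rows.append({
                "tokens": tokens,                  # kept as the raw string
                "polarity": int(float(polarity)),  # 1 = positive, 0 = negative
                "size": int(float(size)),
                "vector": vector,                  # kept as the raw string
            })
    return rows


def load_version(folder, positive_name, negative_name, has_header=True):
    """Combine the positive and negative .csv files of one dataset version."""
    folder = Path(folder)
    positive = load_tweets(folder / positive_name, has_header)
    negative = load_tweets(folder / negative_name, has_header)
    return positive + negative


def load_embedding_matrix(npy_path):
    """Load the .npy embedding matrix; requires NumPy to be installed."""
    import numpy as np  # imported lazily so the .csv helpers stay stdlib-only
    return np.load(npy_path)
```

A call such as `load_version("some_folder", "positive.csv", "negative.csv")` would return the tweets of one version as a single list, where the folder and file names are placeholders for the actual ones in the downloaded archive. The vector column is left as a raw string because its serialization format is not specified in the description.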