U2VDow30 : Dow 30 Stocks tweets for proposing User2Vec approach

Published: 4 April 2022 | Version 2 | DOI: 10.17632/dc6gdcz7n9.2


This data set was collected for the paper "User2Vec: stock market prediction using deep learning with a novel representation of social network users". Stock market prediction is an interesting and challenging problem for investors and financial analysts. Recently, recurrent neural networks such as LSTM have performed well in this field. Most current methods use historical market data and, in some cases, the dominant direction of users and news for each day. Some also extract the opinions of social network members about the stocks to improve prediction accuracy. These works usually treat the opinions of different users identically and give them equal weight. However, the value of an opinion depends on how accurate the related user's predictions have been. In this study, the idea is to convert the opinion of each user about each stock into a vector (User2Vec), use these vectors to train a recurrent neural network (RNN), and ultimately model the behavior of the users in the market. The proposed user representation is composed of features extracted from the messages posted in a social network and from the market data. We consider the user's power to predict the future of the stock, based on social network metrics (e.g., the number of followers of the user) and on the accuracy of the user's previous predictions. This way, the amount of training data increases and the model is learned effectively. These data are then used to train a stacked bidirectional LSTM network, which aggregates the input data and provides the final prediction. Empirical studies of the proposed model on the 30 Dow Jones stocks clearly show its superiority over traditional representations. For example, the prediction accuracy is about 93% for the Apple stock, which is much higher than that of the compared models.
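As a rough illustration of the per-user representation described above, the sketch below packs one user's opinion about one stock into a small vector. It is not the paper's actual feature set: the record layout (`PastCall`), the three chosen features, the log scaling of the follower count, and the 0.5 default for users without history are all assumptions made for this example.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PastCall:
    """One earlier opinion by a user about a stock (hypothetical record layout)."""
    predicted_up: bool  # direction implied by the user's message
    actual_up: bool     # direction the stock actually moved afterwards


def past_accuracy(calls: Sequence[PastCall]) -> float:
    """Fraction of earlier calls that matched the market (0.5 if no history; an assumption)."""
    if not calls:
        return 0.5
    hits = sum(1 for call in calls if call.predicted_up == call.actual_up)
    return hits / len(calls)


def user_vector(sentiment: float, followers: int, calls: Sequence[PastCall]) -> List[float]:
    """Pack one user's opinion on one stock into a fixed-length vector."""
    return [
        sentiment,              # polarity of the message, assumed to lie in [-1, 1]
        math.log1p(followers),  # damp the heavy-tailed follower count (assumed scaling)
        past_accuracy(calls),   # track record of this user on this stock
    ]


# Made-up history: three of four earlier calls matched the market.
history = [
    PastCall(True, True),
    PastCall(False, True),
    PastCall(True, True),
    PastCall(False, False),
]
vec = user_vector(sentiment=0.4, followers=999, calls=history)
print(vec)
```

A sequence of such vectors, one per user message, is the kind of input a recurrent model could then aggregate into a single prediction per stock and day.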


Steps to reproduce

Data collection

Below is a description of the raw data, the extracted features, and the methods and tools used to collect each. Each set of features used in our proposed model is collected from a different source. The data collected in each step has many fields, but only the fields that belong to the User2Vec feature set are used.

Twitter is a public platform, and various APIs have been introduced to collect its data. Some tools provide access to older tweets, and others have download restrictions. In this work, we used GetOldTweets3, Tweepy, and TextBlob to collect data and create new features.

GetOldTweets3 is a free Twitter data gathering tool that supports hybrid search and word search and gives access to older tweets. For each tweet it returns id (str), permalink (str), username (str), to (str), text (str), date (DateTime, in UTC), retweets (int), favorites (int), mentions (str), hashtags (str), and geo (str). These fields are useful but lack crucial social information such as the number of followers and followings, so we use Tweepy to extract additional social features for each user. Tweepy collects Twitter data through the Twitter API and authenticates with the OAuth mechanism. The third tool, TextBlob, extracts the sentiment tag of each tweet’s text. Unlike Tweepy, it processes the text locally and needs no authentication.

Data are collected for all stocks during 2018 and 2019. Because the number of tweets per day differs between stocks, the total number of tweets collected over the two years varies from stock to stock.

https://pypi.org/project/GetOldTweets3/
https://pypi.org/project/tweepy/
https://pypi.org/project/textblob/
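The following is a minimal sketch of how the outputs of the three tools might be combined into one record per tweet. It is not the collection script used for this data set. The helper names, the list of kept fields, and the neutral band of 0.05 are assumptions for illustration. The third-party calls it relies on are `TextBlob(text).sentiment.polarity` and Tweepy's `API.get_user(screen_name=...)`, whose result carries `followers_count` and `friends_count`; the demo at the end uses made-up values and performs no network access.

```python
from typing import Any, Dict

# Tweet fields kept in the combined record (hypothetical selection).
KEPT_TWEET_FIELDS = ("username", "text", "date", "retweets", "favorites")


def sentiment_label(polarity: float, neutral_band: float = 0.05) -> str:
    """Map a polarity score in [-1, 1] to a coarse tag (the band is an assumption)."""
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"


def text_polarity(text: str) -> float:
    """Polarity of a tweet's text from TextBlob's default analyzer."""
    from textblob import TextBlob  # imported lazily: third-party dependency
    return TextBlob(text).sentiment.polarity


def user_social_features(api: Any, username: str) -> Dict[str, int]:
    """Follower and following counts via an already authenticated tweepy.API object."""
    user = api.get_user(screen_name=username)
    return {"followers": user.followers_count, "followings": user.friends_count}


def build_row(tweet: Dict[str, Any], polarity: float, social: Dict[str, int]) -> Dict[str, Any]:
    """Merge tweet fields, a sentiment tag, and social counts into one record."""
    row = {key: tweet[key] for key in KEPT_TWEET_FIELDS}
    row["sentiment"] = sentiment_label(polarity)
    row.update(social)
    return row


# Offline demo with made-up values in place of real API responses.
tweet = {
    "id": "1",
    "username": "example_user",
    "text": "$AAPL looks strong today",
    "date": "2018-03-01 14:30:00",
    "retweets": 3,
    "favorites": 10,
    "geo": "",
}
row = build_row(tweet, polarity=0.43, social={"followers": 1200, "followings": 150})
print(row)
```

In a real run, `polarity` would come from `text_polarity(tweet["text"])` and `social` from `user_social_features(api, tweet["username"])`, with `api` built from valid Twitter credentials.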


Amirkabir University of Technology, Department of Computer Engineering and Information Technology


Social Network Analysis, Stock Exchange, Deep Learning, Sentiment Analysis