Data for: An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit

Published: 17 April 2019 | Version 1 | DOI: 10.17632/85njyhj45m.1
Stephan Curiskis, Paul Kennedy, Thomas Osborn, Barry Drake


Topic-labelled online social network (OSN) data sets are useful for evaluating topic modelling and document clustering methods. We provide three data sets with topic labels from two online social networks: Twitter and Reddit. To comply with Twitter’s terms and conditions, we publish only the tweet identifiers along with the topic labels. The Reddit data is supplied with the full text and the topic label. The first Twitter data set was collected from the Twitter API by filtering for the hashtag #Auspol, which is used to tag political discussion tweets in Australia. The second Twitter data set was originally used in the RepLab 2013 competition and contains expert-annotated topics. The Reddit data set consists of 40,000 Reddit parent comments from May 2015 belonging to five subreddit pages, which are used as topic labels.
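
As an illustration of how topic labels can serve as a reference for evaluating a clustering, the sketch below computes normalised mutual information (NMI) between a list of topic labels and a list of cluster assignments. It is a minimal example under stated assumptions: NMI is one common choice of agreement measure and is not claimed here to be the measure used with these data sets, the function name is hypothetical, and the labels in the usage lines are invented placeholders rather than values taken from the published files.

```python
import math
from collections import Counter


def normalized_mutual_information(topic_labels, cluster_ids):
    """NMI between two labelings, normalised by the mean of their entropies.

    Hypothetical helper for illustration; not part of the published data set.
    """
    if len(topic_labels) != len(cluster_ids):
        raise ValueError("labelings must have the same length")
    n = len(topic_labels)
    if n == 0:
        raise ValueError("labelings must not be empty")

    topic_counts = Counter(topic_labels)
    cluster_counts = Counter(cluster_ids)
    joint_counts = Counter(zip(topic_labels, cluster_ids))

    def entropy(counts):
        # Shannon entropy (natural log) of the empirical distribution.
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    # Mutual information: sum over observed (topic, cluster) pairs of
    # p(t, k) * log(p(t, k) / (p(t) * p(k))).
    mutual_info = sum(
        (c / n) * math.log(c * n / (topic_counts[t] * cluster_counts[k]))
        for (t, k), c in joint_counts.items()
    )

    mean_entropy = (entropy(topic_counts) + entropy(cluster_counts)) / 2
    # If both labelings are constant, treat them as being in full agreement.
    return 1.0 if mean_entropy == 0 else mutual_info / mean_entropy


# Placeholder labels for illustration only (not drawn from the data sets).
example_topics = ["topic_a", "topic_a", "topic_b", "topic_b"]
example_clusters = [1, 1, 0, 0]
print(normalized_mutual_information(example_topics, example_clusters))
```

The score does not depend on how clusters are numbered, so a clustering that recovers the topics under different identifiers still scores as full agreement, while a clustering that is statistically independent of the topics scores zero.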



Natural Language Processing, Machine Learning, Clustering, Social Networks