Data for: An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit

Published: 17 Apr 2019 | Version 1 | DOI: 10.17632/85njyhj45m.1

Description of this data

Topic-labelled online social network (OSN) data sets are useful for evaluating topic modelling and document clustering methods. We provide three data sets with topic labels from two online social networks, Twitter and Reddit. To comply with Twitter’s terms and conditions, we publish only the tweet identifiers along with the topic labels. The Reddit data is supplied with the full text and the topic label. The first Twitter data set was collected from the Twitter API by filtering for the hashtag #Auspol, which is used to tag political discussion tweets in Australia. The second Twitter data set was originally used in the RepLab 2013 competition and contains expert-annotated topics. The Reddit data set consists of 40,000 Reddit parent comments from May 2015 belonging to five subreddit pages, which are used as topic labels.
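
One way topic labels like these can be used is to score a clustering against them. The sketch below is illustrative only. It computes normalised mutual information (NMI), a common clustering-agreement measure that is chosen here as an assumption and is not taken from the data set description. The function name, the label values, and the cluster assignments are all hypothetical.

```python
import math
from collections import Counter


def normalized_mutual_information(labels_true, labels_pred):
    """Return NMI between two labelings, normalised by the mean entropy.

    A value near 1.0 means the clusters match the topic labels closely;
    a value near 0.0 means the two labelings are unrelated.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n == 0:
        raise ValueError("label sequences must not be empty")

    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    joint_counts = Counter(zip(labels_true, labels_pred))

    # Mutual information between the two labelings (natural log).
    mutual_info = 0.0
    for (t, p), n_tp in joint_counts.items():
        mutual_info += (n_tp / n) * math.log(
            n * n_tp / (true_counts[t] * pred_counts[p])
        )

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    mean_entropy = (entropy(true_counts) + entropy(pred_counts)) / 2
    if mean_entropy == 0:
        # Both labelings place every document in a single group.
        return 1.0
    return mutual_info / mean_entropy


# Hypothetical example: topic labels (e.g. subreddit names) for six
# documents, and cluster ids produced by some clustering method.
topic_labels = ["topic_a", "topic_a", "topic_a", "topic_b", "topic_b", "topic_b"]
cluster_ids = [0, 0, 1, 1, 1, 1]

print(normalized_mutual_information(topic_labels, cluster_ids))
```

The measure does not depend on how clusters are numbered, so cluster ids do not need to be matched to topic names before scoring.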

Experiment data files

This data is associated with the following publication:

An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit

Published in: Information Processing and Management


Cite this dataset

Curiskis, Stephan; Kennedy, Paul; Osborn, Thomas; Drake, Barry (2019), “Data for: An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit”, Mendeley Data, v1 http://dx.doi.org/10.17632/85njyhj45m.1

Statistics

Views: 185
Downloads: 31

Categories

Natural Language Processing, Machine Learning, Clustering, Social Networks

Licence

CC0 1.0

The files associated with this dataset are licensed under a Public Domain Dedication licence.

What does this mean?

You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.
