Semantically Analyzed Metadata of Tumblr Posts and Bloggers

Published: 16 Apr 2016 | Version 1 | DOI: 10.17632/hd3b6v659v.1

Description of this data

This is the first ever published dataset on Tumblr. It contains Tumblr metadata of posts and bloggers collected via a bootstrapping method. The dataset also contains various features extracted by semantically analyzing the text of each post.

Dataset Description

The dataset contains three files: Tumblr.sql, semtags.txt and a README file. Tumblr.sql creates 8 tables in MySQL: blogger, blogger_desc, Document_Sentiment_Feature, Post_desc, Posts, Semantic_Tagging, Tone and Topic_Classification.

semtags.txt is a lexicon of the tags/codes used for semantic tagging of each post. This list was created by USAS (UCREL Semantic Analysis System). The tables and attributes used in the dataset are listed and described below. Attributes that appear in more than one table are listed only once.

  1. Table- Posts, Post_desc

Post_ID- unique id of each post
Timestamp- Timestamp of when the post was created
gmt- GMT timestamp of each post
blogger- unique id of the author of the post
url- short url to original Tumblr post
tags- tags/keywords associated with the post
num_tags- number of tags in each post
type- type of a post (Text, Quote, Chat..)
notes- number of notes (likes + reblogs) on a post
rebloggedfrom- id of the blogger from whose profile the post was retrieved. Null if the post was originally created by 'blogger'.
title- title of the post
Desc- description or body content of the post.
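
As a small illustration of how the post attributes relate to each other, the sketch below recomputes num_tags from tags and flags reblogged posts. It assumes (this is not stated in the dataset description) that tags is stored as a single comma-separated string and that a NULL rebloggedfrom value arrives in Python as None. The helper names split_tags and is_reblog and the sample row are hypothetical.

```python
from typing import List, Optional


def split_tags(tags: Optional[str]) -> List[str]:
    """Split a raw tags value into a list of clean tag strings.

    Assumption: tags are comma-separated. Adjust the separator if the
    actual column uses a different delimiter.
    """
    if not tags:
        return []
    return [t.strip() for t in tags.split(",") if t.strip()]


def is_reblog(rebloggedfrom: Optional[str]) -> bool:
    """A post counts as a reblog when rebloggedfrom is not NULL (None)."""
    return rebloggedfrom is not None


# Hypothetical row, for illustration only (not taken from the dataset).
row = {"tags": "art, photography ,travel", "rebloggedfrom": None}
tags = split_tags(row["tags"])
num_tags = len(tags)          # should match the stored num_tags value
original = not is_reblog(row["rebloggedfrom"])
```

Comparing the recomputed count with the stored num_tags column is one quick way to check whether the delimiter assumption holds for the real data.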

  2. Table- blogger, blogger_desc

blogger_id- unique id of a blogger
ask- whether the blogger allows questions on their profile
ask_anon- whether the blogger allows anonymous questions from other bloggers
like_count- number of likes on the blogger's page
post_count- number of posts made by the blogger (including re-blogged posts)
title- title of the blogger's page
desc- description of the blogger
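
The blogger attribute of Posts and the blogger_id attribute of blogger both identify a blogger, so the two tables can presumably be joined on them. The sketch below shows such a join. It uses an in-memory SQLite database as a stand-in for MySQL so that it runs without a server, and the column types, the reduced column set and all rows are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here; the real schema comes from
# Tumblr.sql and may use different column types.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE blogger (blogger_id TEXT PRIMARY KEY, post_count INTEGER);
    CREATE TABLE Posts (Post_ID TEXT PRIMARY KEY, blogger TEXT, notes INTEGER);
    """
)

# Hypothetical rows, not taken from the dataset.
conn.executemany("INSERT INTO blogger VALUES (?, ?)", [("b1", 120), ("b2", 15)])
conn.executemany(
    "INSERT INTO Posts VALUES (?, ?, ?)",
    [("p1", "b1", 10), ("p2", "b1", 4), ("p3", "b2", 7)],
)


def notes_per_blogger(connection):
    """Return (blogger_id, collected_posts, total_notes) for each blogger."""
    query = """
        SELECT b.blogger_id, COUNT(p.Post_ID), SUM(p.notes)
        FROM blogger AS b
        JOIN Posts AS p ON p.blogger = b.blogger_id
        GROUP BY b.blogger_id
        ORDER BY b.blogger_id
    """
    return connection.execute(query).fetchall()


summary = notes_per_blogger(conn)
```

Note that the number of collected posts per blogger need not equal post_count, which counts every post the blogger has made.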

  3. Table- Document_Sentiment_Feature

score- sentiment score of a post
label- label of a sentiment based on the score value
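
The description says only that label is derived from score and does not give the rule. Purely as an illustration of what such a mapping can look like, the sketch below assumes a signed score with an optional neutral band around zero. The function name, the label strings and the thresholds are all assumptions and may differ from what the dataset actually uses.

```python
def label_from_score(score: float, neutral_band: float = 0.0) -> str:
    """Map a signed sentiment score to a coarse label.

    Assumption: positive scores mean positive sentiment, negative scores
    mean negative sentiment, and scores within +/- neutral_band of zero
    are treated as neutral. The dataset's real rule is not documented.
    """
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"


# Hypothetical scores, for illustration only.
labels = [label_from_score(s) for s in (0.4, -0.2, 0.0)]
```

Checking this against a few rows of Document_Sentiment_Feature would show quickly whether the stored labels follow a sign-based rule or something else.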

  4. Table- Tone

Emotion- label and confidence score of an emotion tone in a post (joy, fear, sadness..)
Writing- label and confidence score of a writing tone in a post (analytical, confident..)
Social- label and confidence score of a social tone in a post (openness, conscientiousness..)

  5. Table- Topic_Classification

lang_post- language of a post (English, Arabic, German, Italian..)
taxonomy- topics being discussed/mentioned in the Post.
Class- label of a post assigned by our classifier.

  6. Table- Semantic_Tagging

Tagged_Posts- contains the original text of the post annotated with semantic tags (codes available in the semtags.txt file)
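
To work with Tagged_Posts, the text has to be split back into tokens and semantic codes. The sketch below assumes the tagged text is whitespace-separated token_TAG pairs; the description does not confirm the exact encoding, so this should be checked against the README. The function name parse_tagged_post and the sample string are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def parse_tagged_post(tagged: str) -> List[Tuple[str, str]]:
    """Split tagged text into (token, tag) pairs.

    Assumption: each item looks like token_TAG. The split is made on the
    last underscore so tokens that themselves contain underscores survive.
    """
    pairs = []
    for item in tagged.split():
        if "_" not in item:
            continue  # skip items that carry no tag
        token, tag = item.rsplit("_", 1)
        pairs.append((token, tag))
    return pairs


# Made-up sample in the assumed format, not taken from the dataset.
sample = "I_Z8 love_E2+ this_Z8 song_K2"
pairs = parse_tagged_post(sample)
tag_counts = Counter(tag for _, tag in pairs)
```

The resulting codes can then be looked up in semtags.txt to obtain their descriptions.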

Steps to reproduce

mysql -u root -p
(enter your password when prompted)
create database Tumblr_Apr12;
use Tumblr_Apr12;
source Tumblr.sql;
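
After the import, one way to sanity-check the result is to compare the output of SHOW TABLES with the eight table names listed in the dataset description. The sketch below does that comparison on a plain list of names, so it does not need a database connection. The function name missing_tables is hypothetical, and the comparison ignores case because MySQL table-name case sensitivity depends on the platform and server settings.

```python
from typing import Iterable, List

# The eight tables that Tumblr.sql is described as creating.
EXPECTED_TABLES = (
    "blogger",
    "blogger_desc",
    "Document_Sentiment_Feature",
    "Post_desc",
    "Posts",
    "Semantic_Tagging",
    "Tone",
    "Topic_Classification",
)


def missing_tables(found: Iterable[str]) -> List[str]:
    """Return the expected table names that are absent from `found`.

    `found` would typically be the names returned by SHOW TABLES.
    Names are compared case-insensitively.
    """
    found_lower = {name.lower() for name in found}
    return [t for t in EXPECTED_TABLES if t.lower() not in found_lower]


# Hypothetical SHOW TABLES result in which two tables are absent.
found = ["blogger", "blogger_desc", "posts", "post_desc", "tone",
         "topic_classification"]
absent = missing_tables(found)
```

An empty result means all eight tables were created.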

Cite this dataset

Agarwal, Swati; Sureka, Ashish (2016), “Semantically Analyzed Metadata of Tumblr Posts and Bloggers”, Mendeley Data, v1

Categories: Applied Sciences, Semantics, Social Issues, Data Mining, Social Media, Patient Social Context, Intelligent Information Retrieval


CC BY 4.0

The files associated with this dataset are licensed under a Creative Commons Attribution 4.0 International licence.

You can share, copy and modify this dataset so long as you give appropriate credit, provide a link to the CC BY license, and indicate if changes were made, but you may not do so in a way that suggests the rights holder has endorsed you or your use of the dataset. Note that further permission may be required for any content within the dataset that is identified as belonging to a third party.