Semantically Analyzed Metadata of Tumblr Posts and Bloggers

Published: 22-04-2016 | Version 2 | DOI: 10.17632/hd3b6v659v.2
Swati Agarwal,
Ashish Sureka


This dataset is the first ever published dataset on Tumblr. It contains Tumblr metadata of posts and bloggers, collected via a bootstrapping method. The dataset also contains various features extracted by semantically analyzing the textual posts.

Dataset Description

The dataset contains three files: Tumblr.sql, semtags.txt and a README file. Tumblr.sql creates 8 tables in MySQL, named blogger, blogger_desc, Document_Sentiment_Feature, Post_desc, Posts, Semantic_Tagging, Tone and Topic_Classification. semtags.txt is a lexicon of the tags/codes used for semantic tagging of each post. This list was created by USAS (UCREL Semantic Analysis System).

The following is the list and description of all attributes and tables used in the dataset. Attributes that appear in more than one table are listed only once.

1. Table- Posts, Post_desc
   Post_ID- unique id of each post
   Timestamp- timestamp of when the post was created
   gmt- GMT timestamp of each post
   blogger- unique id of the author of the post
   url- short url to the original Tumblr post
   tags- tags/keywords associated with the post
   num_tags- number of tags in each post
   type- type of a post (Text, Quote, Chat..)
   notes- number of notes (likes + reblogs) on a post
   rebloggedfrom- id of the blogger from whose profile the post was retrieved. Null if the post was originally created by 'blogger'.
   title- title of the post
   Desc- description or body content of the post

2. Table- blogger, blogger_desc
   blogger_id- unique id of a blogger
   ask- whether the blogger allows questions on their profile
   ask_anon- whether the blogger allows anonymous questions from other bloggers
   like_count- number of likes on the blogger's page
   post_count- number of posts made by the blogger (including re-blogged posts)
   title- title of the blogger's page
   desc- description of the blogger

3. Table- Document_Sentiment_Feature
   score- sentiment score of a post
   label- label of a sentiment based on the score value

4. Table- Tone
   Emotion- label and confidence score of an emotion tone in a post (joy, fear, sadness...)
   Writing- label and confidence score of a writing tone in a post (analytical, confident..)
   Social- label and confidence score of a social tone in a post (openness, conscientiousness)

5. Table- Topic_Classification
   lang_post- language of a post (English, Arabic, German, Italian..)
   taxonomy- topics being discussed/mentioned in the post
   Class- label of a post assigned by our classifier

6. Table- Semantic_Tagging
   Tagged_Posts- contains the original text post encoded with semantic tags (codes available in the semtags.txt file)
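Because shared attributes are listed only once, the tables are presumably meant to be linked through their common ids. The sketch below illustrates one such join, between Posts and Document_Sentiment_Feature, using Python's built-in sqlite3 module and an in-memory database. It assumes that Document_Sentiment_Feature carries a Post_ID column, which the description does not state explicitly; the column types and the sample rows are invented for illustration and are not taken from the dataset.

```python
import sqlite3

# In-memory stand-in for the MySQL database; the real schema is in Tumblr.sql.
conn = sqlite3.connect(":memory:")

# Assumed, simplified schema: only a few of the documented attributes,
# and Post_ID is assumed to be the key shared by both tables.
conn.executescript(
    """
    CREATE TABLE Posts (
        Post_ID INTEGER PRIMARY KEY,
        blogger TEXT,
        type    TEXT,
        notes   INTEGER
    );
    CREATE TABLE Document_Sentiment_Feature (
        Post_ID INTEGER,
        score   REAL,
        label   TEXT
    );
    """
)

# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO Posts VALUES (?, ?, ?, ?)",
    [(1, "b1", "Text", 12), (2, "b2", "Quote", 3), (3, "b1", "Chat", 0)],
)
conn.executemany(
    "INSERT INTO Document_Sentiment_Feature VALUES (?, ?, ?)",
    [(1, 0.6, "positive"), (2, -0.4, "negative")],
)


def posts_with_sentiment(connection):
    """Return each post that has a sentiment row, joined on Post_ID."""
    query = """
        SELECT p.Post_ID, p.blogger, p.type, p.notes, s.score, s.label
        FROM Posts AS p
        JOIN Document_Sentiment_Feature AS s ON s.Post_ID = p.Post_ID
        ORDER BY p.Post_ID
    """
    return connection.execute(query).fetchall()


rows = posts_with_sentiment(conn)
for row in rows:
    print(row)
```

The inner join drops posts without a sentiment row (post 3 in the sample data); a LEFT JOIN would keep them with empty sentiment fields. Against the real MySQL database, the same SELECT should work if the assumed key holds.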


Steps to reproduce

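1. Start the MySQL client and enter your password when prompted: mysql -u root -p
2. Create a database: create database Tumblr_Apr12;
3. Select the database you just created: use Tumblr_Apr12;
4. Load the dump, giving the path to the downloaded file: source Tumblr.sql;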
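The same load can also be scripted instead of typed into the interactive client. The following is a minimal sketch, assuming the mysql command-line client is on the PATH and that Tumblr.sql is in the current directory. The database name passed in, and the helper names mysql_argv and load_dump, are hypothetical choices made for this example.

```python
import subprocess


def mysql_argv(database, user="root"):
    """Build the mysql client command line for the given database.

    A bare -p makes the client prompt for the password.
    """
    return ["mysql", "-u", user, "-p", database]


def load_dump(database, dump_path="Tumblr.sql", user="root"):
    """Create the database if needed, then feed the dump to the client."""
    # The database name is interpolated as-is, so pass only trusted names.
    create_sql = f"CREATE DATABASE IF NOT EXISTS `{database}`;"
    subprocess.run(["mysql", "-u", user, "-p", "-e", create_sql], check=True)
    # Feeding the file on stdin is the scripted counterpart of
    # "source Tumblr.sql;" in the interactive client.
    with open(dump_path, "rb") as dump:
        subprocess.run(mysql_argv(database, user), stdin=dump, check=True)
```

Calling load_dump("tumblr") would prompt for the password twice, once per client invocation. If the dump file itself contains its own CREATE DATABASE or USE statements, those take effect regardless of the name passed here.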