Synthetic and Real Key-Value Data Sets

Name: Synthetic and Real Key-Value Data Sets
Creator: Hyuk-Yoon Kwon
Published: 2020-02-11T05:33:41.001Z
Keywords: Big Data, Social Network Analysis, Geographic Location, Information, Twitter

Kwon, Hyuk-Yoon

doi:10.17632/kxcb3tnr3t.2


Published: 11 February 2020 | Version 2 | DOI: 10.17632/kxcb3tnr3t.2

Contributor:

Hyuk-Yoon Kwon

Description

We present eight key-value data sets, four synthetic and four real, intended for storage in key-value stores such as Google's LevelDB, Facebook's RocksDB, and Oracle's Berkeley DB. A strength of key-value stores is that they can handle varied data types by storing data of an arbitrary type as the value and a unique ID as the key. In constructing the data sets, we focus on the variety of data types in the real data sets and on the range of sizes (i.e., volume) in the synthetic data sets.

We generate four synthetic data sets of different sizes: (1) KVData1, (2) KVData2, (3) KVData3, and (4) KVData4. The total number of objects ranges from 10K to 10M. For each key-value pair, we generate a unique ID as the key and a random variable-length string as the value.

For the real data sets, we crawled user tweets and related information from Twitter using the Tweepy library (https://www.tweepy.org/). The data sets cover four data types: (1) geo-locations, (2) hashtags, (3) tweet texts, and (4) the number of followers. That is, the data sets are designed to have different data types such as geo-locations, texts, and integers. Table 2 shows the characteristics of the real data sets. We crawled four real data sets: (1) ID-Geo, consisting of the tweet ID and the location information of the tweet, (2) ID-Hashtag, consisting of the tweet ID and the hashtags in the tweet, (3) ID-Tweet, consisting of the tweet ID and the tweet text, and (4) User-Followers, consisting of the user ID and the number of followers of the user.
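The construction of the synthetic data sets described above can be illustrated with a minimal sketch. It is not the generator used to produce KVData1 to KVData4: the function name generate_kv_pairs, the sequential integer keys, the length bounds (8 to 64 characters), the alphanumeric alphabet, and the fixed seed are all assumptions made for illustration, since the description specifies only a unique ID per key and a random variable-length string per value.

```python
import random
import string


def generate_kv_pairs(num_objects, min_len=8, max_len=64, seed=0):
    """Yield (key, value) pairs with unique keys and random string values.

    Hypothetical helper: key scheme, length bounds, alphabet, and seed
    are illustrative assumptions, not the data set's actual parameters.
    """
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    alphabet = string.ascii_letters + string.digits
    for key in range(num_objects):  # sequential integers are unique by construction
        length = rng.randint(min_len, max_len)  # variable length, bounds inclusive
        value = "".join(rng.choices(alphabet, k=length))
        yield key, value


# Smallest size mentioned in the description: 10K objects.
pairs = list(generate_kv_pairs(10_000))
```

Writing the function as a generator keeps memory use flat, so the same code can stream the larger sizes (up to 10M objects) directly into a key-value store instead of building a list first.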
