
Big Data Research

ISSN: 2214-5796


Datasets associated with articles published in Big Data Research

6 results
  • Data for: Joint Contour Net Analysis for Feature Detection in Lattice Quantum Chromodynamics Data
    The provided files contain the data used in this case study. "config.b190k1680mu0700j02s16t08" contains the raw configuration data as (binary) output from the FORTRAN 'cooling' code. "topological charge density.hvol" captures the scalar field in four dimensions by computing the topological charge density at each site on the lattice. "cool0030_sliced.7z" contains each 3D slice of the data across the temporal axis. Readme files are provided for parsing the scalar field in 4D ("data_structure.txt") and 3D ("sliced_data_structure.txt").
    • Dataset
  • Data for: Classification of large DNA methylation datasets for identifying cancer drivers
    Supplementary data S1 contains the extracted features and genes for the different analyzed tumors.
    • Dataset
  • Data for: Towards Sustainable Smart City by Particulate Matter Prediction using Urban Big Data, Excluding Expensive Air Pollution Infrastructures
    It is vital to capture and analyze, from various sources in smart cities, data that are beneficial in urban planning and decision making for governments and individuals. Urban policy makers can find suitable solutions for urban development by using the opportunities and capacities of big data, and by combining heterogeneous data resources in smart cities. This paper presents data related to urban computing with the aim of assessing the knowledge that can be obtained by integrating multiple independent data sources in smart cities. The data covers multiple sources in the city of Aarhus, Denmark, from August 1, 2014 to September 30, 2014. The sources include land use, waterways, water barriers, buildings, roads, amenities, points of interest (POI), weather, traffic, pollution, and parking lot data. The data published in this paper is an extended version of the City Pulse project data, to which additional data sources collected from online sources have been added.
    • Dataset
  • Data for: Selective and recurring re-computation of Big Data analytics tasks: insights from a Genomics case study
    The data includes runtime information on the re-computation of the SVI process. This covers re-computation following changes in the ClinVar and GeneMap databases under the different scenarios presented in the paper: blind re-computation, partial re-computation, partial re-computation with input difference, and scoped partial re-computation with input difference. Interested readers should contact the authors for a more detailed explanation.
    • Dataset
  • Data for: kluster: An Efficient Scalable Procedure for Approximating the Number of Clusters in Unsupervised Learning
    182 simulated datasets (the first set contains small datasets, the second set large datasets) with different cluster compositions, i.e., different numbers of clusters and separation values, generated using the clusterGeneration package in R. Each set consists of 91 datasets in comma-separated values (CSV) format (182 CSV files in total) with 3-15 clusters and separation values from 0.1 to 0.7. Separation values can range over (−0.999, 0.999), where a higher separation value indicates a cluster structure with more separable clusters. The size of the dataset, the number of clusters, and the separation value of the clusters are encoded in the file name size_X_n_Y_sepval_Z.csv, where X is the size of the dataset, Y the number of clusters in the dataset, and Z the separation value of the clusters in the dataset.
    • Dataset
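    The size_X_n_Y_sepval_Z.csv naming schema described above can be parsed mechanically. The following is a minimal Python sketch, assuming file names follow the documented pattern exactly; the example file name is hypothetical.

    ```python
    import re

    def parse_kluster_filename(filename):
        """Extract size, cluster count, and separation value from a
        size_X_n_Y_sepval_Z.csv file name (schema per the dataset description)."""
        match = re.fullmatch(r"size_(\d+)_n_(\d+)_sepval_([0-9.]+)\.csv", filename)
        if match is None:
            raise ValueError(f"unexpected file name: {filename}")
        size, n_clusters, sepval = match.groups()
        return {"size": int(size), "n_clusters": int(n_clusters), "sepval": float(sepval)}

    # Hypothetical example name following the documented schema:
    print(parse_kluster_filename("size_1000_n_5_sepval_0.3.csv"))
    # → {'size': 1000, 'n_clusters': 5, 'sepval': 0.3}
    ```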
  • Replication Package for: Theodolite: Scalability Benchmarking of Distributed Stream Processing Engines in Microservice Architectures
    This repository contains a replication package and experimental results for our study Theodolite: Scalability Benchmarking of Distributed Stream Processing Engines in Microservice Architectures. The following description can also be found in the README.md file.

    Repeating Benchmark Execution

    The following instructions describe how to repeat our scalability experiments. If you plan to conduct your own studies, we suggest using the latest version of Theodolite, which offers significantly enhanced usability.

    The Apache Kafka Streams scalability experiments of our study were executed with Theodolite v0.1.2. To repeat our Kafka Streams experiments:

    1. Clone and install Theodolite v0.1.2 according to the official documentation located in execution.
    2. Copy the file repeat-kstream.sh into Theodolite's execution directory.
    3. Run the repetition file with ./repeat-kstream.sh from within the execution directory.

    Our Apache Flink benchmark implementations are currently being migrated to the latest version of Theodolite. Theodolite's apache-flink branch provides the basis for our Flink scalability experiments. To repeat them:

    1. Clone Theodolite's apache-flink branch and install Theodolite according to the official documentation located in execution (the installation should be identical to the one for Kafka Streams, see above).
    2. Copy the files repeat-flink-without-checkpointing.sh and repeat-flink-with-checkpointing.sh into Theodolite's execution directory.
    3. Switch to the execution directory.
    4. Run the first repetition file with ./repeat-flink-with-checkpointing.sh.
    5. Disable checkpointing by reconfiguring the Kubernetes resources jobmanager-job.yaml and taskmanager-job-deployment.yaml for each benchmark (uc{1,2,3,4}-application), setting the environment variable CHECKPOINTING to "false".
    6. Run the second repetition file with ./repeat-flink-without-checkpointing.sh.

    Please note that the naming of our benchmarks recently changed. While our publication already uses the new naming, the corresponding Theodolite versions still use the old one. Specifically, UC1 in the publication is UC1 in Theodolite, UC2 in the publication is UC3 in Theodolite, UC3 in the publication is UC4 in Theodolite, and UC4 in the publication is UC2 in Theodolite.

    Raw Measurements

    The results of the above benchmark executions can be found in the measurements directory. These are CSV files containing the measured lag trend over time for a certain subexperiment. Theodolite creates a number of additional files that serve debugging and preliminary interpretation; as these files are not required for replication, we have not included them in this package. The CSV files are named according to the schema exp{id}_{uc}_{load}_{inst}_totallag.csv, where {id} represents the experiment ID assigned by Theodolite, {uc} the benchmark name, {load} the generated load, and {inst} the number of evaluated instances. The CSV table experiments.csv provides an overview of the configurations used in each experiment.

    Reproducing the Scalability Analysis

    The following instructions describe how to repeat our scalability analysis, either with our measurements or with your own. Analyzing Theodolite's measurements is done using two Jupyter notebooks. In general, these notebooks should be runnable by any Jupyter server. Python 3.7 or 3.8 is required (e.g., in a virtual environment) as well as some Python libraries, which can be installed via pip install -r requirements.txt. See the Theodolite documentation for additional installation guidance.

    Obtaining a Scalability Graph as a CSV File: The scalability-graph.ipynb notebook combines the measurements (i.e., the totallag.csv files) of one experiment. It produces a CSV file that maps load intensities to the minimum resources required for that load (i.e., the scalability graph). These CSV files are named according to the schema exp{id}_min-suitable-instances.csv, where {id} represents the experiment ID. Additional guidance is provided in the notebook.

    Resulting Scalability Graph CSV Files: The results directory provides the scalability graphs for all our executed experiments.

    Visualization of the Scalability Graph: The scalability-graph-plotter.ipynb notebook creates PDF plots of a scalability graph and allows combining multiple scalability graphs in one plot. It can be adjusted to match the desired visualization.

    Acknowledgments: This research is funded by the German Federal Ministry of Education and Research (BMBF) under grant no. 01IS17084 and is part of the Titan project.
    • Dataset
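    The exp{id}_{uc}_{load}_{inst}_totallag.csv schema and the publication-to-Theodolite benchmark naming map described above can be sketched in a few lines of Python. This is an illustration only: it assumes file names follow the documented pattern exactly, and the example file name and the precise form of the {uc} token are assumptions, not taken from the package.

    ```python
    import re

    # Benchmark naming map stated in the replication notes:
    # name in the publication -> name in the Theodolite versions used.
    PUBLICATION_TO_THEODOLITE = {"UC1": "UC1", "UC2": "UC3", "UC3": "UC4", "UC4": "UC2"}

    def parse_measurement_filename(filename):
        """Split an exp{id}_{uc}_{load}_{inst}_totallag.csv name into its parts."""
        match = re.fullmatch(r"exp(\d+)_([^_]+)_(\d+)_(\d+)_totallag\.csv", filename)
        if match is None:
            raise ValueError(f"unexpected file name: {filename}")
        exp_id, uc, load, inst = match.groups()
        return {"id": int(exp_id), "uc": uc, "load": int(load), "instances": int(inst)}

    # Hypothetical example name following the documented schema:
    print(parse_measurement_filename("exp42_uc1_100000_8_totallag.csv"))
    # → {'id': 42, 'uc': 'uc1', 'load': 100000, 'instances': 8}
    print(PUBLICATION_TO_THEODOLITE["UC2"])  # UC2 in the paper is UC3 in Theodolite
    ```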