Published: 1 Mar 2018 | Version 1 | DOI: 10.17632/xxntkjvtxw.1

Description of this data

uiHRDC (universal indexes for Highly Repetitive Document Collections) is a replication framework licensed under the GNU Lesser General Public License v2.1 (GNU LGPL). It includes all the required elements to reproduce the main experiments of the paper [1], including datasets, query patterns, source code and scripts.

The general structure of the uiHRDC repository includes: i) a directory benchmark, which contains a LaTeX-formatted report and a script that collects the data files produced by all the experiments and generates a PDF report with the most relevant figures; ii) a directory data, which includes the text collections (7z compressed) and the query patterns; iii) directories indexes and self-indexes, which contain the source code for each indexing alternative and scripts that run all the experiments for each technique; and iv) a script doAll.sh that drives the whole process: decompressing the source collections, compiling the sources for each index and running the experiments with it, and finally generating the final report. For each technique, the experiments construct the compressed index of interest (using a builder program) and then perform both locate and extract operations over that index (using the corresponding searcher program); each experiment outputs relevant data to a results-data file.

[1] F. Claude, A. Fariña, M. A. Martínez-Prieto, and G. Navarro. Universal Indexes for Highly Repetitive Document Collections. Information Systems, 61:1–23, 2016.

Steps to reproduce

We provide a Docker environment that allows the reader to:

  1. Reproduce the paper's test framework. We create a Docker image based on Ubuntu 14.04 (ubuntu:trusty) that includes all the libraries and software required to compile and run our indexing alternatives and to build the final report. These include packages installed via apt, such as gcc-multilib, g++-multilib, cmake, libboost-all-dev, p7zip-full, openssh-server, screen, gnuplot-qt, texlive-latex-base, and texlive-fonts-recommended, as well as snappy-1.1.1, which we include as a snappy-1.1.1.tar.gz file. In addition, the contents of the uiHRDC framework are downloaded from our repository and copied into the /home/user/uiHRDC directory of the Docker image.

  2. Connect to an instance of our Docker image using ssh/sftp. We expose port 22 and create a user 'user' with password 'userR1' who can connect via ssh. This allows the reader to connect to the Docker container as to any remote server and to retrieve the final report via sftp. Once connected, the user has sudo privileges and can, for example, become root by entering sudo su.

  3. Run the doAll.sh script to automatically run all our experiments and generate the final report. This script must be run by the root user. Note that we have installed (via apt-get) the screen virtual terminal so that the user can disconnect from the Docker container and still keep the doAll.sh script running.
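The three steps above might look roughly like the following when driven from the host. This is only a sketch: the image tag, the container name, the host port 2222, the location of the Dockerfile (current directory), the screen session name, and the assumption that the image starts its SSH server by default are all hypothetical and not stated in this description. Only the user 'user', the password 'userR1', port 22, sudo su, screen, the /home/user/uiHRDC directory, and doAll.sh come from the text.

```shell
#!/bin/sh
# Sketch of the workflow; every value marked "assumed" is hypothetical.
IMAGE="uihrdc"            # assumed image tag
CONTAINER="uihrdc-run"    # assumed container name
HOST_PORT=2222            # assumed host port mapped to the container's port 22

# Step 1: build the image (assumed: the Dockerfile is in the current directory).
build_image() {
    docker build -t "$IMAGE" .
}

# Step 2: start a container in the background and publish its SSH port.
# Assumes the image launches the SSH server as its default command.
start_container() {
    docker run -d --name "$CONTAINER" -p "$HOST_PORT:22" "$IMAGE"
}

# Step 2 (cont.): log in as 'user' (password 'userR1' per the description).
connect() {
    ssh -p "$HOST_PORT" user@localhost
}

# Step 3, typed inside the container once logged in:
#   sudo su                     # become root, as doAll.sh must run as root
#   cd /home/user/uiHRDC        # framework directory per the description
#   screen -S uihrdc            # open a named screen session (name assumed)
#   ./doAll.sh                  # assumed to sit at the top of that directory
#   (detach with Ctrl-a d; reattach later with: screen -r uihrdc)
```

Calling build_image, start_container, and connect in that order would reproduce steps 1 and 2; the final report could later be fetched with sftp -P 2222 user@localhost, under the same port assumption.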

The minimum hardware requirements to run the doAll.sh script are a machine with at least 32 GB of RAM (plus 16 GB of swap) and around 200 GB of free disk space (in the partition where Docker keeps its files). On our machine, an i7-8700K@3.70GHz CPU (6 cores/12 siblings) with 64 GB of DDR4@2400MHz memory and a 7200 rpm SATA disk, running all the experiments from doAll.sh took around 40 hours.
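Before committing to a run of that length, it may be worth checking the host against the stated minimums. The sketch below is one possible way to do so on Linux; the function names are hypothetical, and /var/lib/docker is only Docker's usual default data directory, so it is an assumption about where Docker keeps its files on a given machine. Memory is rounded up to whole GiB because the kernel reports slightly less than the installed amount.

```shell
#!/bin/sh
# Thresholds taken from the description above.
MIN_RAM_GB=32
MIN_SWAP_GB=16
MIN_DISK_GB=200
# Assumption: Docker's usual default data directory; override via DOCKER_DIR.
DOCKER_DIR="${DOCKER_DIR:-/var/lib/docker}"

# meets NAME HAVE_GB NEED_GB: print a verdict; return non-zero on a shortfall.
meets() {
    have="${2:-0}"
    if [ "$have" -ge "$3" ]; then
        echo "ok: $1 ${have}GB >= ${3}GB"
    else
        echo "LOW: $1 ${have}GB < ${3}GB"
        return 1
    fi
}

# Read a kB field from /proc/meminfo, rounded up to whole GiB (Linux only).
meminfo_gb() {
    awk -v key="$1:" '$1 == key { print int(($2 + 1048575) / 1048576) }' /proc/meminfo
}

# Free space, in whole GiB (rounded down), on the filesystem holding a path.
free_disk_gb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1048576) }'
}

# Run all three checks; return non-zero if any of them falls short.
preflight() {
    status=0
    meets "RAM"  "$(meminfo_gb MemTotal)"        "$MIN_RAM_GB"  || status=1
    meets "swap" "$(meminfo_gb SwapTotal)"       "$MIN_SWAP_GB" || status=1
    meets "disk" "$(free_disk_gb "$DOCKER_DIR")" "$MIN_DISK_GB" || status=1
    return "$status"
}
```

Running preflight prints one line per resource and returns a non-zero status if any check fails, so it can gate the launch of a container.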

Related links

This data is associated with the following peer reviewed publication:

Universal indexes for highly repetitive document collections

Published in: Information Systems

    Cite this dataset

    Fariña, Antonio; Martínez-Prieto, Miguel A.; Claude, Francisco; Navarro, Gonzalo (2018), “uiHRDC”, Mendeley Data, v1 http://dx.doi.org/10.17632/xxntkjvtxw.1


University of Chile, Diego Portales University, University of Valladolid, University of A Coruña


Data Compression

CC BY 4.0

The files associated with this dataset are licensed under a Creative Commons Attribution 4.0 International licence.

What does this mean?

You can share, copy and modify this dataset so long as you give appropriate credit, provide a link to the CC BY license, and indicate if changes were made, but you may not do so in a way that suggests the rights holder has endorsed you or your use of the dataset. Note that further permission may be required for any content within the dataset that is identified as belonging to a third party.