uiHRDC

Published: 1 March 2018 | Version 1 | DOI: 10.17632/xxntkjvtxw.1
Contributors:
Miguel A. Martínez-Prieto

Description

uiHRDC (universal indexes for Highly Repetitive Document Collections) is a replication framework licensed under the GNU Lesser General Public License v2.1 (GNU LGPL). It includes all the elements required to reproduce the main experiments of the paper [1]: datasets, query patterns, source code, and scripts. The uiHRDC repository is organized as follows:

i) a directory benchmark, which contains a LaTeX-formatted report and a script that collects the data files produced by the experiments and generates a PDF report with the most relevant figures;
ii) a directory data, which contains the text collections (7z compressed) and the query patterns;
iii) directories indexes and self-indexes, which contain the source code for each indexing alternative and the scripts that run all the experiments for each technique. Each experiment builds the compressed index of interest (using a builder program), performs both locate and extract operations over that index (using the corresponding searcher program), and writes the relevant data to a results-data file;
iv) a script doAll.sh that drives the whole process: decompressing the source collections, compiling the sources for each index and running its experiments, and finally generating the report.

[1] F. Claude, A. Fariña, M. A. Martínez-Prieto, and G. Navarro. Universal Indexes for Highly Repetitive Document Collections. Information Systems, 61:1–23, 2016.
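To make the builder/searcher terminology concrete, here is a minimal sketch of what locate and extract mean on a tiny collection. It is a toy, uncompressed word index written for illustration only; the function names build_index, locate, and extract are hypothetical and are not the interfaces of the programs in the repository, whose indexes are compressed.

```python
from collections import defaultdict

def build_index(docs):
    """Toy 'builder': map each word to its (doc_id, word_offset) occurrences."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for offset, word in enumerate(text.split()):
            index[word].append((doc_id, offset))
    return dict(index)

def locate(index, word):
    """Toy 'locate': every position where `word` occurs (empty list if absent)."""
    return index.get(word, [])

def extract(docs, doc_id, start, length):
    """Toy 'extract': recover `length` words of document `doc_id` from offset `start`."""
    return " ".join(docs[doc_id].split()[start:start + length])

# Two near-identical documents, standing in for a highly repetitive collection.
docs = ["to be or not to be", "to be or not to be that is the question"]
index = build_index(docs)
print(locate(index, "not"))    # [(0, 3), (1, 3)]
print(extract(docs, 1, 6, 4))  # that is the question
```

In this sketch extract simply reads the stored text, which is the simplest possible choice and says nothing about how the actual techniques store or recover it.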

Steps to reproduce

We provide a Docker environment that allows the reader to:

1. Reproduce the test framework of the paper. We create a Docker image with Ubuntu 14.04 (ubuntu:trusty) that includes all the libraries and software required to compile and run our indexing alternatives and to build the final report. These include packages installed via apt (gcc-multilib, g++-multilib, cmake, libboost-all-dev, p7zip-full, openssh-server, screen, gnuplot-qt, texlive-latex-base, and texlive-fonts-recommended) and snappy-1.1.1, which we include as a snappy-1.1.1.tar.gz file. In addition, the contents of the uiHRDC framework are downloaded from our repository and copied into the /home/user/uiHRDC directory of the Docker image.

2. Connect to an instance of our Docker image using ssh/sftp. We expose port 22 and create a user 'user' with password 'userR1' who can connect via ssh. This lets the reader connect to the Docker container as to any remote server and retrieve the final report via sftp. Once connected, the user has sudo privileges and can, for example, become root by entering sudo su.

3. Run the doAll.sh script to run all our experiments and generate the final report automatically. This script must be run by the root user. We have installed the screen virtual terminal (via apt-get) so that the user can disconnect from the Docker container and still keep doAll.sh running.

Running doAll.sh requires a machine with at least 32 GB of RAM (and 16 GB of swap) and around 200 GB of free disk space (in the partition where Docker keeps its files). On our machine, an i7-8700K@3.70GHz CPU (6 cores/12 siblings) with 64 GB of DDR4@2400MHz memory and a 7200 rpm SATA disk, running all the experiments from doAll.sh took around 40 hours.
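Since a full run takes many hours, it may be worth checking the host against the stated minimums before starting. The following is a minimal sketch of such a check, not part of the uiHRDC scripts. It assumes a Linux host that exposes /proc/meminfo and assumes Docker keeps its files under /var/lib/docker (Docker's default location); adjust DOCKER_DIR if your installation differs. The helper names parse_meminfo and shortfalls are hypothetical.

```python
import os
import shutil

# Minimums quoted in the steps above.
MIN_RAM_GB, MIN_SWAP_GB, MIN_DISK_GB = 32, 16, 200
DOCKER_DIR = "/var/lib/docker"  # assumption: Docker's default data directory on Linux

def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: size in GiB} (the kernel reports kB)."""
    sizes = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            sizes[name.strip()] = int(fields[0]) / 1024 ** 2
    return sizes

def shortfalls(ram_gb, swap_gb, disk_gb):
    """List every requirement that the given figures fail to meet."""
    checks = [
        ("RAM", ram_gb, MIN_RAM_GB),
        ("swap", swap_gb, MIN_SWAP_GB),
        ("free disk", disk_gb, MIN_DISK_GB),
    ]
    return [
        f"{label}: {have:.1f} GB available, {need} GB required"
        for label, have, need in checks
        if have < need
    ]

if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as fh:
        mem = parse_meminfo(fh.read())
    # Fall back to the root partition if the assumed Docker directory is absent.
    path = DOCKER_DIR if os.path.exists(DOCKER_DIR) else "/"
    disk_gb = shutil.disk_usage(path).free / 1024 ** 3
    problems = shortfalls(mem.get("MemTotal", 0), mem.get("SwapTotal", 0), disk_gb)
    print("\n".join(problems) if problems else "Host meets the stated minimums.")
```

Note that the kernel reports slightly less memory than is physically installed, so a machine with nominally 32 GB of RAM may be flagged; treat the output as a hint rather than a hard gate.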

Institutions

Universidad de Chile, Universidad Diego Portales, Universidad de Valladolid, Universidade da Coruña

Categories

Data Compression

Licence