Reproducible experiments on Learned Metric Index – proposition of learned indexing for unstructured data
With this collection of code and configuration files ("LMIF" = Learned Metric Index Framework), outputs ("output-files") and datasets ("datasets"), we set out to explore whether a learned approach to building a metric index is a viable alternative to the traditional way of constructing metric indexes. Specifically, we build the index as a series of interconnected machine learning models. This collection serves as the basis for the reproducibility paper accompanying our parent paper, "Learned metric index—proposition of learned indexing for unstructured data" [1].

1. In "datasets" we make publicly available three individual datasets -- CoPhIR (1 million objects, 282 columns), Profimedia (1 million objects, 4096 columns) and MoCap (~350k objects, 4096 columns) -- together with the "labels" obtained from a template index (M-tree or M-index), the "queries" used to perform the experimental searches, and the "ground-truths" used to evaluate the approximate k-NN performance of the index. Within "test" we include dummy data to ease the integration of any custom dataset a reader may want to plug into our solution (examples in "LMIF/*.ipynb"; a minimal loading sketch is also provided after the reference list below). In CoPhIR [2], each vector is obtained by concatenating five MPEG-7 global visual descriptors extracted from an image downloaded from Flickr. The Profimedia image dataset [3] contains Caffe visual descriptors extracted from photo-stock images by a convolutional neural network. The MoCap (motion capture) descriptors [4] contain sequences of 3D skeleton poses extracted from more than 3 hours of recordings of actors performing more than 70 different motion scenarios. The datasets occupy 43 GB in total upon decompression.

2. "LMIF" contains a user-friendly environment to reproduce the experiments in [1]. LMIF consists of three components:
- an implementation of the Learned Metric Index (distributed under the MIT license),
- a collection of scripts and configuration setups necessary for re-running the experiments in [1], and
- instructions for creating the reproducibility environment (Docker).
For a thorough description of "LMIF", please refer to our reproducibility paper, "Reproducible experiments on Learned Metric Index – proposition of learned indexing for unstructured data".

3. "output-files" contains the reproduced outputs for each experiment, with generated figures and a concise ".html" report (as presented in [1]).

References:
[1] Antol, Matej, et al. "Learned metric index—proposition of learned indexing for unstructured data." Information Systems 100 (2021): 101774.
[2] Batko, Michal, et al. "Building a web-scale image similarity search system." Multimedia Tools and Applications 47.3 (2010): 599-629.
[3] Budikova, Petra, et al. "Evaluation platform for content-based image retrieval systems." International Conference on Theory and Practice of Digital Libraries. Springer, Berlin, Heidelberg, 2011.
[4] Müller, Meinard, et al. "Documentation Mocap Database HDM05." (2007).
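To illustrate the kind of input a custom dataset needs to provide, below is a minimal loading sketch. It assumes the descriptors and the template-index labels are stored as plain CSV files with one object per row; the file names, the CSV format, and the sanity checks are our assumptions for illustration only -- the authoritative integration examples live in the "LMIF/*.ipynb" notebooks.

import pandas as pd

# Hypothetical file names -- the real layout is shown in LMIF/*.ipynb.
descriptors = pd.read_csv("datasets/my-dataset/descriptors.csv", header=None)
labels = pd.read_csv("datasets/my-dataset/labels.csv", header=None)

# Every object needs a partition label from the template index
# (M-tree or M-index) before the learned index can be trained on it.
assert len(descriptors) == len(labels)
# Descriptor vectors must be complete (no missing values).
assert not descriptors.isna().any().any()

print(f"{len(descriptors)} objects, {descriptors.shape[1]} dimensions")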
Steps to reproduce
See the Description above for details regarding "datasets/".

Steps to reproduce "output-files/":

Prerequisites:
- 43 GB of storage space,
- 350 GB of main memory,
- ~74 days of running time,
- Docker.

$ # (1) Download the data from Mendeley to your machine
$ # (2) Extract the compressed input data files
$ unzip datasets.zip
$ # (3) Extract the compressed source code files
$ unzip LMIF.zip
$ # (4) Build the image in the source code directory:
$ cd LMIF
$ docker build -t repro-lmi -f Dockerfile . --network host
$ cd ..
$ # (5) Check the presence of `repro-lmi` in the list of images:
$ docker images
$ # (6) Start the Docker image interactively and map the input and output directories from/to your host machine:
$ docker run -it -v <<host-machine-current-path-full>>/LMIF/outputs:/learned-indexes/outputs -v <<host-machine-current-path-full>>/datasets:/learned-indexes/data repro-lmi /bin/bash
$ # (7) Verify that the 'outputs/' folder does not contain any output directories from a previous run:
$ ls outputs/  # returns only `report-template.html`
$ # (8) Run the experiments and save the log output (a quicker single-setup variant is shown after these steps):
$ python3 run-experiments.py experiment-setups/**/*.yml 2>&1 | tee outputs/experiments-output.log
$ # (9) Generate the report:
$ python3 create-report.py outputs/
$ # (10) Log out of the Docker container:
$ exit
$ # (11) If you do not plan to use the image anymore, remove it from the host machine:
$ docker image rm repro-lmi -f

All of the output files will be stored in the "outputs/" directory (mapped to <<host-machine-current-path-full>>/LMIF/outputs on the host machine). The "report.html" file contains the reproduced results.
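Since "run-experiments.py" takes a list of YAML setup files (step 8 passes a shell glob that expands to all of them), a reasonable way to smoke-test the environment before committing to the ~74-day full run is, presumably, to pass a single setup file; the file name below is a placeholder for any one file under "experiment-setups/":

$ python3 run-experiments.py experiment-setups/<one-setup>.yml 2>&1 | tee outputs/single-experiment.log
$ python3 create-report.py outputs/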