Reproducible experiments on Learned Metric Index – proposition of learned indexing for unstructured data

Published: 2 August 2022 | Version 5 | DOI: 10.17632/8wp73zxr47.5
Contributors:
Terézia Slanináková
Description

With this collection of code and configuration files ("LMIF" = Learned Metric Index Framework), outputs ("output-files"), and datasets ("datasets"), we set out to explore whether a learned approach to building a metric index is a viable alternative to the traditional way of constructing metric indexes. Specifically, we build the index as a series of interconnected machine learning models. This collection serves as the basis for the reproducibility paper accompanying our parent paper, "Learned metric index—proposition of learned indexing for unstructured data" [1].

1. In "datasets" we make publicly available:
- descriptors of 3 individual datasets -- CoPhIR (1 million objects, 282 columns) [2], Profimedia (1 million objects, 4096 columns) [3], and MoCap (~350k objects, 4096 columns) [4],
- "labels" obtained from a template index (M-tree or M-index),
- "queries" used to run the experimental search, and
- "ground-truths" used to evaluate the approximate k-NN performance of the index.
Within "test" we include dummy data to ease the integration of any custom dataset a reader may want to plug into our solution (examples in "LMIF/*.ipynb"). In CoPhIR [2], each vector is obtained by concatenating five MPEG-7 global visual descriptors extracted from an image downloaded from Flickr. The Profimedia image dataset [3] contains Caffe visual descriptors extracted from photo-stock images by a convolutional neural network. The MoCap (motion capture data) descriptors [4] contain sequences of 3D skeleton poses extracted from 3+ hours of recordings of actors performing more than 70 different motion scenarios. The whole data collection occupies 43 GB upon decompression.

2. "LMIF" contains a user-friendly environment to reproduce the experiments in [1]. LMIF consists of three components:
- an implementation of the Learned Metric Index (distributed under the MIT license),
- a collection of scripts and configuration setups necessary for re-running the experiments in [1], and
- instructions for creating the reproducibility environment (Docker).
For a thorough description of "LMIF", please refer to our reproducibility paper, "Reproducible experiments on Learned Metric Index – proposition of learned indexing for unstructured data".

3. "output-files" contains the reproduced outputs for each experiment, with generated figures and a concise ".html" report (as presented in [1]).

References:
[1] Antol, Matej, et al. "Learned metric index—proposition of learned indexing for unstructured data." Information Systems 100 (2021): 101774.
[2] Batko, Michal, et al. "Building a web-scale image similarity search system." Multimedia Tools and Applications 47.3 (2010): 599-629.
[3] Budikova, Petra, et al. "Evaluation platform for content-based image retrieval systems." International Conference on Theory and Practice of Digital Libraries. Springer, Berlin, Heidelberg, 2011.
[4] Müller, Meinard, et al. "Documentation: Mocap Database HDM05." (2007).
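The idea of building the index as "a series of interconnected machine learning models" can be illustrated with a minimal sketch: an inner node holds a classifier trained on the partition labels produced by a template index (M-tree or M-index), and a query descends the tree by following the predicted child. The nearest-centroid classifier below is only a stand-in for the neural networks used in the parent paper, and the single-node, two-bucket shape as well as all names are illustrative assumptions, not the actual LMIF implementation.

```python
import numpy as np

class LearnedIndexNode:
    """One inner node: a classifier routing vectors to child buckets.

    A nearest-centroid model stands in for the neural networks of the
    parent paper; `labels` come from a template index (e.g. M-tree).
    """
    def __init__(self, data, labels):
        self.classes = np.unique(labels)
        # "Training": one centroid per template-index partition.
        self.centroids = np.stack(
            [data[labels == c].mean(axis=0) for c in self.classes])
        # Bucket contents, keyed by partition label.
        self.buckets = {c: data[labels == c] for c in self.classes}

    def route(self, query):
        """Predict which bucket the query most likely belongs to."""
        dists = np.linalg.norm(self.centroids - query, axis=1)
        return self.classes[np.argmin(dists)]

    def search(self, query, k=3):
        """Approximate k-NN: scan only the predicted bucket."""
        bucket = self.buckets[self.route(query)]
        dists = np.linalg.norm(bucket - query, axis=1)
        return bucket[np.argsort(dists)[:k]]

# Toy data: two well-separated clusters labelled 0 and 1.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(10, 1, (50, 4))])
labels = np.array([0] * 50 + [1] * 50)

node = LearnedIndexNode(data, labels)
print(node.route(np.full(4, 10.0)))  # query near the second cluster -> 1
```

Because only the predicted bucket is scanned, search is fast but approximate, which is exactly why the collection ships "ground-truths" to measure how much recall the learned routing sacrifices.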

Files

Steps to reproduce

See Description above for details regarding "datasets/".

Steps to reproduce "output-files/":

Prerequisites:
- 43 GB of storage space,
- 350 GB of main memory,
- ~74 days of running time,
- Docker.

$ # (1) Download the data from Mendeley to your machine
$ # (2) Extract the compressed input data files
$ unzip datasets.zip
$ # (3) Extract the compressed source code files
$ unzip LMIF.zip
$ # (4) Build the image in the source code directory:
$ cd LMIF
$ docker build -t repro-lmi -f Dockerfile . --network host
$ cd ..
$ # (5) Check the presence of `repro-lmi` in the list of images:
$ docker images
$ # (6) Start the Docker image interactively and map the input and output directories from/to your host machine:
$ docker run -it -v <<host-machine-current-path-full>>/LMIF/outputs:/learned-indexes/outputs -v <<host-machine-current-path-full>>/datasets:/learned-indexes/data repro-lmi /bin/bash
$ # (7) Verify that the 'outputs/' folder does not contain any output directories from a previous run:
$ ls outputs/  # returns only `report-template.html`
$ # (8) Run the experiments and save the log output:
$ python3 run-experiments.py experiment-setups/**/*.yml 2>&1 | tee outputs/experiments-output.log
$ # (9) Generate the report:
$ python3 create-report.py outputs/
$ # (10) Log out of the Docker container:
$ exit
$ # (11) If you do not plan to use the image anymore, remove it from the host machine:
$ docker image rm repro-lmi -f

All of the output files will be stored in the "outputs/" directory (mapped to <<host-machine-current-path-full>>/LMIF/outputs on the host machine). The "report.html" file contains the reproduced results.
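The "ground-truths" shipped with the datasets make it possible to sanity-check reproduced search quality by hand with a simple recall@k computation: the fraction of the true k nearest neighbours that the learned index actually returned. The identifier lists below are hypothetical stand-ins for one query's results, not values from the real output files; the authoritative comparison is the generated "report.html".

```python
def recall_at_k(retrieved, ground_truth, k):
    """Fraction of the true k nearest neighbours found by the index."""
    return len(set(retrieved[:k]) & set(ground_truth[:k])) / k

# Hypothetical object identifiers for a single query.
ground_truth = [17, 4, 92, 8, 31]   # exact 5-NN from the ground-truth file
retrieved    = [17, 92, 55, 4, 60]  # 5-NN answered by the learned index
print(recall_at_k(retrieved, ground_truth, k=5))  # -> 0.6
```

Averaging this value over all queries yields the recall figures that the experiments in [1] report.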

Institutions

Masarykova univerzita, Fakulta informatiky (Masaryk University, Faculty of Informatics)

Categories

Machine Learning, Similarity Measure, Information Indexing, Metric Space

License