Reproducible experiments on Learned Metric Index – proposition of learned indexing for unstructured data

Published: 1 July 2022 | Version 3 | DOI: 10.17632/8wp73zxr47.3
Contributors:
Terézia Slanináková

Description

With this collection of code and configuration files (contained in "LMIF", the Learned Metric Index Framework), outputs ("output-files"), and datasets ("datasets"), we set out to explore whether a learned approach to building a metric index is a viable alternative to the traditional way of constructing metric indexes. Specifically, we build the index as a series of interconnected machine learning models. This collection serves as the basis for the reproducibility paper accompanying our parent paper, "Learned metric index—proposition of learned indexing for unstructured data" [1].

1. In "datasets" we make publicly available:
- descriptors of 3 individual datasets: CoPhIR (1 million objects, 282 columns), Profimedia (1 million objects, 4096 columns), and MoCap (~350k objects, 4096 columns),
- "labels" obtained from a template index (M-tree or M-index),
- "queries" used to perform the experimental searches, and
- "ground-truths" used to evaluate the approximate k-NN performance of the index.

Within "test" we include dummy data to make it easier for a reader to integrate a custom dataset into our solution (examples in "LMIF/*.ipynb").

In CoPhIR [2], each vector is obtained by concatenating five MPEG-7 global visual descriptors extracted from an image downloaded from Flickr. The Profimedia image dataset [3] contains Caffe visual descriptors extracted from photo-stock images by a convolutional neural network. The MoCap (motion capture data) [4] descriptors contain sequences of 3D skeleton poses extracted from 3+ hours of recordings capturing actors performing more than 70 different motion scenarios. The decompressed data takes up 43 GB.

2. "LMIF" contains a user-friendly environment to reproduce the experiments in [1]. LMIF consists of three components:
- an implementation of the Learned Metric Index (distributed under the MIT license),
- a collection of scripts and configuration setups necessary for re-running the experiments in [1], and
- instructions for creating the reproducibility environment (Docker).

For a thorough description of "LMIF", please refer to our reproducibility paper, "Reproducible experiments on Learned Metric Index – proposition of learned indexing for unstructured data".

3. "output-files" contains the reproduced outputs for each experiment, with generated figures and a concise ".html" report (as presented in [1]).

[1] Antol, Matej, et al. "Learned metric index—proposition of learned indexing for unstructured data." Information Systems 100 (2021): 101774.
[2] Batko, Michal, et al. "Building a web-scale image similarity search system." Multimedia Tools and Applications 47.3 (2010): 599-629.
[3] Budikova, Petra, et al. "Evaluation platform for content-based image retrieval systems." International Conference on Theory and Practice of Digital Libraries. Springer, Berlin, Heidelberg, 2011.
[4] Müller, Meinard, et al. "Documentation mocap database hdm05." (2007).
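The idea of an index built as a series of interconnected models, trained on labels from a template index and evaluated against k-NN ground truths, can be illustrated with a small sketch. This is not the LMIF implementation: the nearest-centroid "model", the two-level layout, the toy data, and every name below (`CentroidModel`, `build_index`, `search`, `recall`) are assumptions chosen only to keep the example self-contained.

```python
import math
from collections import defaultdict
from itertools import islice


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class CentroidModel:
    """Toy stand-in for a trained classifier (hypothetical, not from LMIF)."""

    def fit(self, vectors, labels):
        groups = defaultdict(list)
        for vec, label in zip(vectors, labels):
            groups[label].append(vec)
        # One centroid per label: the column-wise mean of its vectors.
        self.centroids = {
            label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in groups.items()
        }
        return self

    def rank(self, query):
        """Labels ordered from closest to farthest centroid."""
        return sorted(self.centroids, key=lambda l: dist(query, self.centroids[l]))


def build_index(vectors, labels):
    """labels[i] is a (level-1, level-2) pair, e.g. taken from a template index."""
    root = CentroidModel().fit(vectors, [l[0] for l in labels])
    by_parent, buckets = defaultdict(list), defaultdict(list)
    for i, (vec, label) in enumerate(zip(vectors, labels)):
        by_parent[label[0]].append((vec, label[1]))
        buckets[label].append(i)
    # One child model per level-1 label, trained only on that label's objects.
    children = {
        parent: CentroidModel().fit([v for v, _ in items], [c for _, c in items])
        for parent, items in by_parent.items()
    }
    return root, children, buckets


def ranked_buckets(index, query):
    """Yield bucket keys in the order the models consider most promising."""
    root, children, _ = index
    for parent in root.rank(query):
        for child in children[parent].rank(query):
            yield (parent, child)


def search(index, vectors, query, k, max_buckets):
    """Approximate k-NN: scan only the first `max_buckets` predicted buckets."""
    candidates = []
    for key in islice(ranked_buckets(index, query), max_buckets):
        candidates.extend(index[2][key])
    return sorted(candidates, key=lambda i: dist(query, vectors[i]))[:k]


def brute_force(vectors, query, k):
    """Exact k-NN, used here as the ground truth."""
    return sorted(range(len(vectors)), key=lambda i: dist(query, vectors[i]))[:k]


def recall(found, truth):
    """Fraction of the true neighbours that the approximate search returned."""
    return len(set(found) & set(truth)) / len(truth)


# Toy data: 8 two-dimensional objects with made-up two-level labels.
vectors = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = [(0, 0), (0, 0), (0, 1), (0, 1), (1, 0), (1, 0), (1, 1), (1, 1)]
index = build_index(vectors, labels)
query = (0.45, 0.1)
truth = brute_force(vectors, query, k=2)
for n in (1, 2):
    found = search(index, vectors, query, k=2, max_buckets=n)
    print(n, found, recall(found, truth))
```

On this toy data, scanning a single bucket finds only one of the two true neighbours, while scanning two buckets finds both. The sketch is meant to show that trade-off between the number of visited buckets and recall, not to reproduce any result from [1].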

Steps to reproduce

See Description above for details regarding "datasets/".

Steps to reproduce "output-files/":

Prerequisites:
- 43 GB of storage space,
- 350 GB of main memory,
- ~74 days of running time,
- Docker.

$ # (1) Download the datasets and source code from Mendeley to a folder of your choice on your machine
$ # (2) Extract the compressed input data files
$ unzip datasets/*.zip datasets/
$ # (3) Build the image in the source code directory:
$ cd LMIF
$ docker build -t repro-lmi -f Dockerfile .
$ cd ..
$ # (4) Check the presence of `repro-lmi` in the list of images:
$ docker images
$ # (5) Create an empty `outputs` directory to store the experiment outputs on your host machine:
$ mkdir outputs
$ # (6) Start the Docker image interactively and map the input and output directories from/to your host machine:
$ docker run -it -v <<host-machine-full-path>>/outputs:/learned-indexes/outputs -v <<host-machine-full-path>>/datasets:/learned-indexes/data repro-lmi
$ # (7) Run the experiments, save the log output
$ python3 run-experiments.py experiment-setups/**/*.yml |& tee outputs/experiments-output.log
$ # (8) Generate the report
$ python3 create-report.py outputs/

All of the output files will be stored in the "outputs/" directory (and will be mapped to <<host-machine-full-path>>/outputs). The "report.html" file contains the reproduced results.
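Step (6) requires substituting an absolute host path into both volume mappings. A small helper can assemble that command, as in the minimal sketch below. The function name `docker_run_command` and the assumption that `outputs` and `datasets` sit side by side in one host directory are illustrative; only the image name and the container paths come from the steps above.

```python
import os
import shlex


def docker_run_command(host_dir, image="repro-lmi"):
    """Build the `docker run` call from step (6) for a given host directory.

    Assumes (hypothetically) that `host_dir` contains both `outputs/` and
    `datasets/`; Docker volume mappings need absolute host paths.
    """
    host_dir = os.path.abspath(host_dir)
    return [
        "docker", "run", "-it",
        "-v", f"{os.path.join(host_dir, 'outputs')}:/learned-indexes/outputs",
        "-v", f"{os.path.join(host_dir, 'datasets')}:/learned-indexes/data",
        image,
    ]


if __name__ == "__main__":
    # Print the command for the current directory so it can be copied to a shell.
    print(shlex.join(docker_run_command(".")))
```

The command is returned as a list so that it can be passed directly to `subprocess.run` without shell quoting issues; `shlex.join` is used only for display.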

Institutions

Masarykova univerzita Fakulta informatiky

Categories

Machine Learning, Similarity Measure, Information Indexing, Metric Space

Licence