SISAP 2023 Indexing challenge – Learned Metric Index: Raw data, analyses, figures

Published: 11 October 2023 | Version 1 | DOI: 10.17632/3dp7jfv2vh.1
Contributors:
Terézia Slanináková

Description

==== For complete code, description, data, and steps to reproduce, visit: https://github.com/LearnedMetricIndex/LearnedMetricIndex/tree/paper-sisap23-indexing-challenge ====

This repository contains the data for our submission to the SISAP 2023 Indexing challenge. We used a stripped-down version of the Learned Metric Index (LMI), an index for approximate nearest neighbor search on complex data that uses machine learning and probability-based navigation.

**Getting started**

Follow the instructions in README.md: https://github.com/LearnedMetricIndex/LearnedMetricIndex/tree/paper-sisap23-indexing-challenge

**Contents**

1. result/ - contains the raw .h5 files of each experiment (with varying hyperparameters), 2088 experiments in total
2. res.csv - contains the evaluation of every experiment (one row per experiment) in terms of recall and query time
3. 02-Analyze-results.ipynb - Jupyter notebook used to analyze the results and plot the figures
4. cat.pdf, nobjects.pdf - figures used in the paper

**Related Publications**

> M. Antol, J. Ol'ha, T. Slanináková, V. Dohnal: [Learned Metric Index—Proposition of learned indexing for unstructured data](https://www.sciencedirect.com/science/article/pii/S0306437921000326?casa_token=EvG8iaWkqQUAAAAA:xgfbutrsNGcBXnTN-U4MQ65hgmPE3fAyzwqtijzGC-JRrkO1IYNmcN3A8yMsSOT3CCoHpqVtMA). Information Systems, Elsevier (2021)

> T. Slanináková, M. Antol, J. Ol'ha, V. Kaňa, V. Dohnal: [Data-driven Learned Metric Index: an Unsupervised Approach](https://link.springer.com/chapter/10.1007/978-3-030-89657-7_7). SISAP 2021 - Similarity Search and Applications pp 81-94 (2021)

> J. Ol'ha, T. Slanináková, M. Gendiar, M. Antol, V. Dohnal: [Learned Indexing in Proteins: Extended Work on Substituting Complex Distance Calculations with Embedding and Clustering Techniques](https://arxiv.org/abs/2208.08910), and [Learned Indexing in Proteins: Substituting Complex Distance Calculations with Embedding and Clustering Techniques](https://link.springer.com/chapter/10.1007/978-3-031-17849-8_22). SISAP 2022 - Similarity Search and Applications pp 274-282 (2022)

> T. Slanináková, M. Antol, J. Ol'ha, V. Kaňa, V. Dohnal, S. Ladra, M. A. Martinez-Prieto: [Reproducible experiments with Learned Metric Index Framework](https://www.sciencedirect.com/science/article/pii/S0306437923000911). Information Systems, Volume 118, September 2023, 102255 (2023)

**Mendeley dataset**: https://data.mendeley.com/datasets/8wp73zxr47/12

**Authors**

- Terézia Slanináková, Masaryk University
- David Procházka, Masaryk University
- Jaroslav Oľha, Masaryk University
- Matej Antol, Masaryk University
- Vlastislav Dohnal, Masaryk University
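Since res.csv holds one row per experiment with its recall and query time, a common question is which configuration is fastest above a given recall threshold. The following is a minimal sketch using only the Python standard library, not the analysis done in 02-Analyze-results.ipynb. The column names `experiment`, `recall`, and `querytime`, as well as the sample values, are assumptions made for illustration; check the actual header of res.csv before adapting it.

```python
import csv
import io


def fastest_above_recall(rows, min_recall, recall_col="recall", time_col="querytime"):
    """Return the row with the lowest query time among rows with recall >= min_recall.

    Returns None when no row reaches the threshold.
    Column names are hypothetical defaults, not taken from the real res.csv.
    """
    best = None
    for row in rows:
        if float(row[recall_col]) < min_recall:
            continue  # skip experiments below the recall threshold
        if best is None or float(row[time_col]) < float(best[time_col]):
            best = row
    return best


# Made-up sample standing in for res.csv (values are illustrative, not real results).
SAMPLE = """experiment,recall,querytime
exp-a,0.85,10.0
exp-b,0.92,25.0
exp-c,0.95,40.0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
best = fastest_above_recall(rows, min_recall=0.9)
print(best["experiment"], best["recall"], best["querytime"])
```

To apply the same function to the real file, replace the `io.StringIO(SAMPLE)` source with `open("res.csv", newline="")` and pass the actual column names.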

Steps to reproduce

==== Note that this repository does not contain code. For complete code, description, data, and steps to reproduce, visit: https://github.com/LearnedMetricIndex/LearnedMetricIndex/tree/paper-sisap23-indexing-challenge ====

# Getting started

See examples of how to index and search in a dataset in 01-Introduction.ipynb.

For analysis of our raw results (result/) and reproduction of our figures, see 02-Analyze-results.ipynb.

## Installation of the repository (see https://github.com/LearnedMetricIndex/LearnedMetricIndex/tree/paper-sisap23-indexing-challenge)

### Using conda

```bash
conda create -n env python=3.8
conda activate env
conda install matplotlib pandas scikit-learn jupyterlab
pip install h5py flake8 setuptools tqdm faiss-cpu
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install --editable .
```

## Reproducing the results

```bash
# Run a test run with a small dataset (100K).
python3 search/search.py

# Produce a .h5 file with a single experiment result for the 10M set,
# which will be stored in results/.
# Note that the default script parameters are set to reflect the best-found setup.
# In order to reproduce all the results in result/, set the script parameters
# to reflect the values stored in the experiment name.
python3 search/search.py --size=10M

# Alternatively, start with the complete `results` from this Mendeley repository.

# Produce a res.csv with evaluations of *.h5 files in results/.
python3 eval/eval.py

# Open 02-Analyze-results.ipynb and generate the figures.
jupyter-lab
```
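The evaluation step reports recall for each experiment. For readers unfamiliar with the measure, here is a minimal sketch of the standard recall@k definition for approximate nearest neighbor search. It is not the code in eval/eval.py, and the function name and input layout (one list of neighbor ids per query) are assumptions chosen for illustration.

```python
def recall_at_k(retrieved, ground_truth, k):
    """Mean fraction of the true k nearest neighbors found in the first k retrieved ids.

    `retrieved` and `ground_truth` each hold one list of object ids per query.
    This is a generic definition, not the repository's evaluation script.
    """
    if len(retrieved) != len(ground_truth):
        raise ValueError("one result list per query is required")
    if not ground_truth:
        raise ValueError("at least one query is required")
    total = 0.0
    for found, truth in zip(retrieved, ground_truth):
        # Count how many of the true top-k ids appear in the returned top-k.
        total += len(set(found[:k]) & set(truth[:k])) / k
    return total / len(ground_truth)


# Two toy queries: the first finds 2 of 3 true neighbors, the second finds 1 of 3.
retrieved = [[1, 2, 3], [4, 5, 6]]
ground_truth = [[1, 2, 9], [4, 7, 8]]
print(recall_at_k(retrieved, ground_truth, k=3))
```

Averaging per-query recall, as done here, weights every query equally regardless of how many of its neighbors were found.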

Institutions

Masaryk University, Faculty of Informatics

Categories

Machine Learning, Similarity Measure, Information Indexing, Metric Space

Licence