Finding Compact, Isolated Clusters in Data Lakes

Published: 30 July 2021 | Version 1 | DOI: 10.17632/j9c9pnzn5z.1
Contributor:
Patricia Jimenez

Description

Data lakes for clustering
-------------------------

These are the research materials that accompany the article "On Exploring Data Lakes by Finding Compact, Isolated Clusters", by Patricia Jiménez, Juan C. Roldán, and Rafael Corchuelo. This package includes the following:

- "data-lakes": each subfolder corresponds to a data lake, and each CSV file inside a data lake corresponds to a dataset. The last column of each dataset is called "clazz", but it is set to "0" in all cases. A few of the original datasets had a class, but it was removed to ensure that neither RóMULO nor its competitors use it.
- "results": the results of testing RóMULO and its competitors on the data lakes above. They consist of several "*-results.csv" files that provide effectiveness and efficiency results for each proposal used in the experimentation.
- "system": the Python code required to run and test RóMULO. A "launch.cmd" script launches the experimentation.

The implementations of the competitors can be found elsewhere:

- GSPPCA is available from the authors at https://github.com/pamattei/GSPPCA.
- AffinityPropagation, MeanShift, and OPTICS-XI are available from scikit-learn at https://scikit-learn.org/stable/install.html.
- PQC is available from the authors at https://github.com/racaes/PQC.
- DCC is also available from the authors at https://github.com/shahsohil/DCC.
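The "data-lakes" layout described above (one subfolder per data lake, one CSV file per dataset, a trailing "clazz" placeholder column) can be read with a few lines of Python. The following is only a minimal sketch and not part of the package: the function name `load_data_lakes` is hypothetical, and it assumes comma-separated, UTF-8 files with a header row, none of which is stated in the description.

```python
import csv
from pathlib import Path


def load_data_lakes(root):
    """Return {lake_name: {dataset_name: (header, rows)}}.

    Hypothetical helper. Assumes comma-separated, UTF-8 CSV files
    whose first row is a header (not confirmed by the package).
    """
    lakes = {}
    # Each subfolder of the root is treated as one data lake.
    for lake_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        datasets = {}
        # Each CSV file inside a data lake is treated as one dataset.
        for csv_path in sorted(lake_dir.glob("*.csv")):
            with csv_path.open(newline="", encoding="utf-8") as handle:
                reader = csv.reader(handle)
                header = next(reader, [])
                rows = [row for row in reader if row]
            # Drop the trailing "clazz" column, which is "0" everywhere.
            if header and header[-1] == "clazz":
                header = header[:-1]
                rows = [row[:-1] for row in rows]
            datasets[csv_path.stem] = (header, rows)
        lakes[lake_dir.name] = datasets
    return lakes


if __name__ == "__main__":
    root = Path("data-lakes")  # assumed location of the unpacked folder
    if root.is_dir():
        for lake, datasets in load_data_lakes(root).items():
            print(lake, len(datasets), "datasets")
```

Values are kept as strings because the column types of the datasets are not documented here; convert them as needed once the actual files have been inspected.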

Steps to reproduce

To run the code, follow the instructions provided in the "system" folder.