On Exploring Data Lakes by Finding Compact, Isolated Clusters

Name: On Exploring Data Lakes by Finding Compact, Isolated Clusters
Creator: Rafael Corchuelo
Published: 2021-10-29T12:43:59.817Z
Keywords: Genetic Algorithm, Metaheuristics, Clustering, Data Analysis

Corchuelo, Rafael; Jiménez, Patricia; Roldán, Juan C.

doi:10.17632/y5v2zy356t.1

Published: 29 October 2021 | Version 1 | DOI: 10.17632/y5v2zy356t.1

Contributors:

Rafael Corchuelo, Patricia Jiménez, Juan C. Roldán

Description

These are the research materials that accompany the article "On Exploring Data Lakes by Finding Compact, Isolated Clusters", by Patricia Jiménez, Juan C. Roldán, and Rafael Corchuelo. This package includes the following:

- "system": the Python code required to run and test RóMULO. A "launch.cmd" script launches the experimentation. The implementations of the competitors are available elsewhere:
  - GSPPCA is available from its authors at https://github.com/pamattei/GSPPCA.
  - AffinityPropagation, MeanShift, and OPTICS-XI are available from scikit-learn at https://scikit-learn.org/stable/install.html.
  - PQC is available from its authors at https://github.com/racaes/PQC.
  - DCC is available from its authors at https://github.com/shahsohil/DCC.
- "data-lakes": each subfolder corresponds to a data lake, and each CSV file inside a data-lake subfolder corresponds to a dataset. The data lakes in package "clustering.zip" are intended to evaluate the proposal regarding unsupervised quality coefficients (the class attribute is set to zero in all cases). The data lakes in package "classification.zip" are intended to evaluate the proposal regarding supervised quality coefficients (the class attribute is encoded as an enumerated natural number).
- "results": the results of evaluating RóMULO and the competitors on the data lakes above.
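The folder layout described for "data-lakes" (one subfolder per data lake, one CSV file per dataset) can be traversed with a short script. The following is a minimal sketch, not part of the RóMULO code: the function name `load_data_lakes` is hypothetical, the root path is whatever folder the archives were extracted to, and UTF-8 encoding is assumed. Because the description does not say whether the CSV files carry a header row or where the class attribute sits, rows are returned as raw strings without interpretation.

```python
import csv
from pathlib import Path


def load_data_lakes(root):
    """Map each data-lake subfolder under `root` to its datasets.

    Returns {lake_name: {dataset_name: rows}}, where `rows` is a list of
    lists of strings exactly as read from the CSV file.
    (Hypothetical helper; UTF-8 encoding is an assumption.)
    """
    lakes = {}
    for lake_dir in sorted(Path(root).iterdir()):
        if not lake_dir.is_dir():
            continue  # skip stray files next to the data-lake folders
        datasets = {}
        for csv_path in sorted(lake_dir.glob("*.csv")):
            with csv_path.open(newline="", encoding="utf-8") as handle:
                datasets[csv_path.stem] = list(csv.reader(handle))
        lakes[lake_dir.name] = datasets
    return lakes
```

Calling `load_data_lakes` on the folder extracted from "clustering.zip" or "classification.zip" would yield one dictionary entry per data lake, each holding its datasets keyed by file name without the extension.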
