Data lakes for clustering

Published: 8 February 2021 | Version 1 | DOI: 10.17632/js8df95fzc.1
Contributor:
Patricia Jimenez

Description

This dataset describes the on-line materials that accompany the article "RÓMULO: A Clustering Proposal in the Context of Data Lakes", by Patricia Jiménez, Juan C. Roldán, and Rafael Corchuelo. The materials are organised into the following folders:

- "data-lakes": each subfolder corresponds to a data lake, and each CSV file inside a data lake corresponds to a dataset. The last column of every dataset is called "clazz", but it is set to "0" in all cases. A few of the original datasets had a class, but it was removed to ensure that neither RÓMULO nor its competitors can use it, since they are all unsupervised proposals.
- "results": the results of testing RÓMULO and its competitors on the data lakes above. They consist of several "*-results.csv" files that report effectiveness and efficiency results for each proposal used in the experimentation.
- "system": the Python code required to run and test RÓMULO. There is a "launch.cmd" script that launches the experimentation.

COMPETITORS
-------------------

The implementations of AffinityPropagation, MeanShift, and OPTICS-XI are available in scikit-learn. The implementation of GSPPCA is available from the authors at https://github.com/pamattei/GSPPCA. The implementation of PQC is available from the authors at https://github.com/racaes/PQC.
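
The folder layout described above (one subfolder per data lake, one CSV file per dataset, a trailing placeholder "clazz" column) could be read along the following lines. This is only a minimal sketch and not part of the published "system" code: the function name `load_data_lake` is hypothetical, and it assumes that each CSV file is comma-delimited, UTF-8 encoded, and starts with a header row.

```python
import csv
from pathlib import Path


def load_data_lake(lake_dir):
    """Read every CSV file in one data-lake folder.

    Hypothetical helper, not taken from the dataset's own code.
    Assumes comma-delimited, UTF-8 files whose first row is a header.
    Returns a dict mapping dataset name (file stem) to (header, rows),
    with the trailing placeholder "clazz" column dropped when present.
    """
    datasets = {}
    for csv_path in sorted(Path(lake_dir).glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader, None)
            if header is None:
                continue  # skip files with no content at all
            rows = [row for row in reader if row]  # ignore blank lines
        if header[-1] == "clazz":
            # The column is "0" everywhere, so it carries no information.
            header = header[:-1]
            rows = [row[:-1] for row in rows]
        datasets[csv_path.stem] = (header, rows)
    return datasets
```

Calling `load_data_lake` on one subfolder of "data-lakes" would then yield one entry per dataset; the values are kept as strings, so any numeric conversion is left to the caller.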

Files