Datasets and source code for a pipeline architecture for feature-based unsupervised clustering using multivariate time series from HPC jobs

Published: 14 December 2022 | Version 2 | DOI: 10.17632/hgkv9cpnmn.2
José Fuentes


This repository consists of two compressed files, whose contents are described below.

--- code.tar.gz ---
The source code that implements the pipeline, together with the code and scripts needed to retrieve the time series, create the plots, and run the experiments. More specifically:
+ two Python programs ⇨ Implement the pipeline: the auxiliary and the main pipeline stages, respectively.
+ 'anomaly' and 'config' folders ⇨ Scripts and Python files containing the configuration and some basic functions used to retrieve the information needed to process the data, such as the actual resource time series from OpenTSDB or the job metadata from Slurm.
+ 'functions' folder ⇨ Several folders with the Python programs that implement all the stages of the pipeline, covering both the Machine Learning processing (e.g., extractors, aggregators, models) and the technical side of the pipeline (e.g., pipelines, transformer).
+ a Python program ⇨ Used to create the different plots presented, from the resource time series to the evaluation plots.
+ several bash scripts ⇨ Used to run the experiments with a specific configuration, covering both which transformers are chosen and how they are parametrized, and more technical aspects of how the pipeline is executed.

--- data.tar.gz ---
The actual data and results, organized as follows:
+ jobs ⇨ The resource time series plots of all jobs for all the experiments, with one folder per experiment. Inside each folder the jobs are separated by ID, and each job contains the plots for the different system resources (e.g., User CPU, Cached memory).
+ plots ⇨ The prediction plots for all the experiments, in separate folders, mainly used for evaluation purposes (e.g., scatter plots, heatmaps, Andrews curves, dendrograms). These plots are available for all the predictors resulting from the pipeline execution. In addition, for each predictor it is also possible to visualize the resource time series grouped by cluster. Finally, the projections generated by the dimension reduction models and the outliers detected are also available for each experiment.
+ datasets ⇨ The datasets used for the experiments, which include the lists of job IDs to be processed (CSV files), the results of each stage of the pipeline (e.g., features, predictions), and the output text files generated by several pipeline stages. Among the latter, the evaluation files are worth noting, as they include all the prediction scores.
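As a quick way to check the layout described above before unpacking anything, the following minimal Python sketch lists the top-level entries of each archive. It assumes the two files have been downloaded to the current working directory under the names given in the description; the helper name `top_level_entries` is hypothetical and is not part of the published code.

```python
import tarfile
from pathlib import Path, PurePosixPath


def top_level_entries(archive_path):
    """Return the sorted top-level names stored in a .tar.gz archive."""
    names = set()
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getnames():
            # Ignore a leading "./" or "/" so that only real names are kept.
            parts = [p for p in PurePosixPath(member).parts if p not in (".", "/")]
            if parts:
                names.add(parts[0])
    return sorted(names)


# Archive names come from the description; their location on disk is an assumption.
for archive in ("code.tar.gz", "data.tar.gz"):
    if Path(archive).exists():
        print(archive, "->", top_level_entries(archive))
```

If the archives match the description, the listing for data.tar.gz should show the jobs, plots and datasets folders, possibly nested under a single parent directory.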



Centro de Supercomputacion de Galicia, Universidade da Coruna


High Performance Computing, Unsupervised Learning, Time Series Analysis