SDCOR Synthetic Datasets

Published: 23 August 2021 | Version 5 | DOI: 10.17632/p4tx2k852r.5
Contributor:
Sayyed-Ahmad Naghavi-Nozad

Description

SDCOR: Scalable Density-based Clustering for Local Outlier Detection in Massive-Scale Datasets

Link to arXiv e-print: https://arxiv.org/pdf/2006.07616.pdf
Link to ResearchGate e-print: https://www.researchgate.net/publication/342197681_SDCOR_Scalable_Density-based_Clustering_for_Local_Outlier_Detection_in_Massive-Scale_Datasets

This paper presents a method for local outlier detection in massive-scale datasets, based on a batch-wise density-based clustering approach. SDCOR consists of three major phases:

1) Sampling: a preliminary random sample is drawn to obtain an abstraction of the entire data, named the temporary clustering model, and to gather information about the parameters required by the clustering procedures.
2) Scalable Clustering: the input data are processed in chunks. Each successive chunk gradually updates the temporary clustering model, which becomes the final clustering model once the last chunk has been processed.
3) Scoring: using the final clustering model obtained through the batch-wise clustering, together with the Mahalanobis distance criterion, each object is given an outlying score called SDCOR, which is equal to its local Mahalanobis distance.

Each synthetic dataset in this repository consists of several Gaussian clusters with arbitrary mean vectors, placed far enough apart to prevent overlap among the multidimensional clusters. For each of these artificial datasets, a specific number of outliers is added around every cluster, and the ground-truth outlier labels are provided with the data.

Each artificial dataset is stored as a single binary MAT-file containing an n-by-p data matrix X (where n and p are the cardinality and dimensionality of the input data, respectively) and an n-by-1 vector y of outlier labels. We implemented our code in MATLAB 9; for reproducibility, it is available on our GitHub page (https://github.com/sana33/SDCOR).

Finally, if you are interested in the idea or use this data in your research, please cite our paper as:

@article{naghavi2021sdcor,
  title={SDCOR: Scalable density-based clustering for local outlier detection in massive-scale datasets},
  author={Naghavi Nozad, Sayyed Ahmad and Amir Haeri, Maryam and Folino, Gianluigi},
  journal={Knowledge-Based Systems},
  pages={107256},
  year={2021},
  publisher={Elsevier}
}

Thanks a lot.
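As an illustration of the dataset layout and the scoring idea described above, the following is a minimal Python sketch, not the authors' MATLAB implementation. It builds a small toy dataset of well-separated Gaussian clusters with outliers placed around each one, then scores every point by its smallest Mahalanobis distance to any cluster. Several things here are assumptions made for the example: the function names `make_synthetic` and `local_mahalanobis_scores`, the unit-variance clusters, the way outliers are placed on a shell around each cluster, the label encoding (1 = outlier), and the use of ground-truth inliers to estimate each cluster's mean and covariance in place of the Sampling and Scalable Clustering phases.

```python
import numpy as np


def make_synthetic(n_clusters=3, n_inliers=500, n_outliers=10, p=2,
                   separation=20.0, seed=0):
    """Return X (n-by-p), y (assumed encoding: 1 = outlier) and cluster ids."""
    rng = np.random.default_rng(seed)
    X_parts, y_parts, c_parts = [], [], []
    for k in range(n_clusters):
        # Cluster means are spread along the first axis so clusters do not overlap.
        mean = np.zeros(p)
        mean[0] = k * separation
        inliers = rng.normal(loc=mean, scale=1.0, size=(n_inliers, p))
        # Assumption: outliers sit on a shell 5 to 8 units from the cluster mean.
        directions = rng.normal(size=(n_outliers, p))
        directions /= np.linalg.norm(directions, axis=1, keepdims=True)
        radii = rng.uniform(5.0, 8.0, size=(n_outliers, 1))
        outliers = mean + radii * directions
        X_parts += [inliers, outliers]
        y_parts += [np.zeros(n_inliers, dtype=int), np.ones(n_outliers, dtype=int)]
        c_parts += [np.full(n_inliers + n_outliers, k)]
    return np.vstack(X_parts), np.concatenate(y_parts), np.concatenate(c_parts)


def local_mahalanobis_scores(X, means, covs):
    """Score each row of X by its smallest Mahalanobis distance to any cluster."""
    scores = np.full(X.shape[0], np.inf)
    for mu, cov in zip(means, covs):
        diff = X - mu
        # Solve cov @ z = diff.T rather than inverting the covariance matrix.
        z = np.linalg.solve(cov, diff.T).T
        sq = np.maximum(np.einsum("ij,ij->i", diff, z), 0.0)
        scores = np.minimum(scores, np.sqrt(sq))
    return scores


n_clusters = 3
X, y, cluster_id = make_synthetic(n_clusters=n_clusters)

# Stand-in for the final clustering model: per-cluster statistics of the inliers.
means, covs = [], []
for k in range(n_clusters):
    members = X[(cluster_id == k) & (y == 0)]
    means.append(members.mean(axis=0))
    covs.append(np.cov(members, rowvar=False))

scores = local_mahalanobis_scores(X, means, covs)
print("mean inlier score: ", scores[y == 0].mean())
print("mean outlier score:", scores[y == 1].mean())
```

In this toy setting the outliers receive clearly larger scores than the inliers, because each one lies several standard deviations from its nearest cluster. The label vector is kept one-dimensional here for convenience, whereas the repository files store y as an n-by-1 vector.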

Steps to reproduce

Load each dataset in MATLAB and apply whatever data preprocessing you prefer.
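For readers working outside MATLAB, the following is a minimal Python sketch of the same two steps: loading one file and applying one common preprocessing choice. It assumes SciPy is installed and that the MAT-file stores its variables under the names X and y used in the description above. The helper names `load_dataset` and `standardize` are hypothetical, and z-score standardization is only one possible preprocessing step, not one the authors prescribe. Note that `scipy.io.loadmat` does not read MAT-files saved in the v7.3 (HDF5-based) format; such files would need an HDF5 reader such as h5py instead.

```python
import numpy as np


def load_dataset(path):
    """Load the data matrix X (n-by-p) and the label vector y from a MAT-file."""
    from scipy.io import loadmat  # assumption: SciPy is available

    mat = loadmat(path)
    X = np.asarray(mat["X"], dtype=float)  # assumed variable name: X
    y = np.asarray(mat["y"]).ravel()       # assumed variable name: y; flattened to 1-D
    return X, y


def standardize(X):
    """Scale every column to zero mean and unit variance (z-score)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Leave constant columns at zero instead of dividing by zero.
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std
```

A call such as `X, y = load_dataset("some_dataset.mat")` followed by `Z = standardize(X)` would then yield a matrix ready for further analysis; the filename is a placeholder for whichever file was downloaded from this repository.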

Institutions

Amirkabir University of Technology

Categories

Clustering, Outlier

Licence