A Systematic Review of Unsupervised Defect Prediction Dataset

Published: 17-02-2020 | Version 1 | DOI: 10.17632/h24ctmyx73.1
Ning Li,
Martin Shepperd,
Yuchen GUO


This dataset accompanies our systematic review of unsupervised learning techniques for software defect prediction (SDP). The related paper is "A Systematic Review of Unsupervised Learning Techniques for Software Defect Prediction", Information and Software Technology (accepted February 2020). The review identified 49 studies that satisfied our inclusion criteria, together containing 2456 individual experimental results. To compare prediction performance across these studies in a consistent way, we recomputed the confusion matrices and used the Matthews correlation coefficient (MCC) as our main performance measure.

From each paper we extracted: Title, Year, Journal/conference, 'Predatory' publisher? (Y | N), Count of results reported in paper, Count of inconsistent results reported in paper, Parameter tuning in SDP? (Yes | Default | ?) and SDP references (SDPRefs OrigResults | SDPRefs | SDPNoRefs | OnlyUnSDP).

For each experimental result within a paper we extracted: Prediction method name (e.g., DTJ48), Project name trained on (e.g., PC4), Project name tested on (e.g., PC4), Prediction type (within-project | cross-project), No. of input metrics (count | NA), Dataset family (e.g., NASA), Dataset fault rate (%), Was cross validation used? (Y | N | ?), Was error checking possible? (Y | N), Inconsistent results? (Y | N | ?), Error reason description (text), Learning type (Supervised | Unsupervised), Clustering method? (Y | N | NA), Machine learning family (e.g., Un-NN), Machine learning technique (e.g., KM), Prediction results (including TP, TN, FP, FN, etc.).
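As an illustration of how MCC can be derived from the TP, TN, FP and FN counts recorded for each result, the following is a minimal Python sketch. The function name `mcc` and the choice to return 0.0 when the denominator is zero are assumptions made for this example; they are not taken from the dataset or the related paper.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Hypothetical helper for illustration only.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumption: treat an undefined MCC (an empty row or column in
        # the confusion matrix) as 0.0, a common convention.
        return 0.0
    return (tp * tn - fp * fn) / denom
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), and it uses all four cells of the confusion matrix.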