Data for: Employing Partial Least Squares Regression with Discriminant Analysis for Bug Prediction

Published: 23-07-2020 | Version 1 | DOI: 10.17632/cb22t5225n.1
Robert Rajko, Rudolf Ferenc, Istvan Siket, Peter Hegedus


To create, optimize, and evaluate our statistical model, we used the Public Unified Bug Dataset for Java. It contains the data entries of 5 different public bug datasets (PROMISE, Eclipse Bug Dataset, Bug Prediction Dataset, Bugcatchers Bug Dataset, and GitHub Bug Dataset) in a unified manner. The dataset contains 47,618 Java classes altogether, of which 8,780 contain at least one bug, while 38,838 are bug-free. The total number of bugs recorded in the dataset is 17,365, which means that each bugged Java class contains 1.98 bugs on average (with a standard deviation of 2.39).

Unfortunately, the PLS-DA implementation in PLS_Toolbox was too slow for our purposes because of the large number of administrative calculations it performs. Therefore, we developed and used a much faster PLS-DA script independent of PLS_Toolbox. According to the literature, there is no obvious way to choose the fastest and most accurate PLS algorithm. Thus, we had to find the right balance between speed and accuracy, and chose the bidiag2stab method for our implementation. Tuning the model parameters and finding the best possible classification required many model training runs, so a very fast PLS core implementation was essential. With our PLS-DA Matlab script, we generated a classification with the data split into 80% training, 10% validation, and 10% test sets.
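
As a rough illustration of the 80%/10%/10% split described above, the following Python sketch partitions the indices of the 47,618 classes into training, validation, and test sets. The original work used a Matlab script that is not reproduced here; the function name `split_indices`, the fixed seed, and the plain random (unstratified) shuffle are assumptions made for this example only, not details taken from the dataset description.

```python
import random


def split_indices(n_samples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle the indices 0..n_samples-1 and cut them into three disjoint parts.

    Hypothetical helper: the seed and the unstratified shuffle are assumptions,
    not the authors' documented procedure.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle

    n_train = int(n_samples * train_frac)  # size of the training part
    n_val = int(n_samples * val_frac)      # size of the validation part

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]       # remainder goes to the test set
    return train, val, test


# 47,618 is the number of Java classes reported for the dataset.
train_idx, val_idx, test_idx = split_indices(47618)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because only about 18% of the classes are bugged (8,780 of 47,618), a stratified split that preserves this ratio in each part may be preferable in practice; whether the authors stratified is not stated.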