RTU-HVAC Real-Time Operating Data from Unit in Field
This dataset was collected and published to support the development of novel machine learning methods that classify faults in data that are (1) collected in real time, (2) unlabeled, and (3) possibly imbalanced. The hypothesis was that HVAC faults could be classified under these conditions with an accuracy above 90%. All of the developed methods can handle the seven considered fault categories (UCC1, OCC1, UCC2, OCC2, CA, EA, NF); however, only five classes are identified and analyzed because the datasets contain no instances of the UCC1 and OCC2 faults. The supervised ML baseline (Method 1) achieved a high average accuracy (93.5%); however, its accuracy on the minority class (NF) was low (80.6%) because of the data imbalance.
Method 2, a combination of SVM and a novel unsupervised ML technique that uses k-NN labeling, was then developed. This method is very promising: it achieves a high average accuracy (94.9%) even with few labeled data points, and it can predict multiple faults in the same data point. It also shows encouraging results on imbalanced datasets without requiring additional techniques that generate new data points to balance the classes.
Method 3, a combination of SVM, clustering, and unsupervised learning with k-NN labeling, was also developed. This method is limited to scenarios where only one fault at a time is present in the dataset; however, it is a powerful approach when labeled data points are limited. Its highest average accuracy was achieved using 50 nearest neighbors in the k-NN labeling. Interestingly, all OCC1 and UCC2 testing data points were correctly predicted, while a few CA, EA, and NF data points were misclassified. Although the imbalanced-dataset challenge can be handled with other techniques, the main drawback of this method is that it cannot handle multiple faults in the same data point.
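The k-NN labeling idea behind Methods 2 and 3 can be illustrated with a minimal sketch. The description above does not spell out the exact procedure, so this assumes the simplest variant: each unlabeled point receives the majority label of its k nearest labeled points (Euclidean distance). The function name `knn_label`, the two-feature toy points, and k = 3 are hypothetical and are not taken from the dataset or the paper.

```python
import math
from collections import Counter


def knn_label(labeled, unlabeled, k=3):
    """Give each unlabeled point the majority label of its k nearest labeled points."""
    predicted = []
    for point in unlabeled:
        # Rank the labeled points by Euclidean distance to the query point.
        nearest = sorted(labeled, key=lambda item: math.dist(item[0], point))[:k]
        votes = Counter(label for _, label in nearest)
        predicted.append(votes.most_common(1)[0][0])
    return predicted


# Toy, made-up feature vectors (assumption: NOT values from the dataset).
labeled = [
    ((0.0, 0.0), "NF"),
    ((0.1, 0.2), "NF"),
    ((0.2, 0.1), "NF"),
    ((5.0, 5.0), "CA"),
    ((5.1, 4.9), "CA"),
    ((4.8, 5.2), "CA"),
]
unlabeled = [(0.15, 0.15), (4.9, 5.0)]

pseudo_labels = knn_label(labeled, unlabeled, k=3)
print(pseudo_labels)
```

In a pipeline of this kind, the pseudo-labeled points would presumably be added to the small labeled set before training a classifier such as an SVM; that step is omitted here.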
Finally, an ensemble method was developed to select between Methods 1 and 2 for each fault type. Rather than comparing the overall accuracy of each method, it compares the accuracy of each individual classifier (one classifier per fault or class). This makes it possible to choose, for each classifier, between SVM alone and the combination of SVM with unsupervised k-NN labeling, in order to achieve better predictions and a higher overall average accuracy.
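The per-classifier selection can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `select_per_classifier` is hypothetical, and the accuracies are placeholder numbers chosen for readability, not results from the study.

```python
def select_per_classifier(accuracy_by_method):
    """For each fault class, pick the method whose classifier is most accurate."""
    fault_classes = next(iter(accuracy_by_method.values())).keys()
    return {
        fault: max(accuracy_by_method, key=lambda m: accuracy_by_method[m][fault])
        for fault in fault_classes
    }


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Placeholder accuracies for illustration only (assumption: NOT study results).
accuracy_by_method = {
    "method_1": {"OCC1": 0.90, "UCC2": 0.80, "CA": 0.85, "EA": 0.95, "NF": 0.70},
    "method_2": {"OCC1": 0.85, "UCC2": 0.90, "CA": 0.80, "EA": 0.90, "NF": 0.85},
}

selection = select_per_classifier(accuracy_by_method)
ensemble_accuracy = {f: accuracy_by_method[m][f] for f, m in selection.items()}

print(selection)
print("ensemble average:", round(mean(ensemble_accuracy.values()), 3))
for method, scores in accuracy_by_method.items():
    print(method, "average:", round(mean(scores.values()), 3))
```

Because the best classifier is chosen independently for each class, the average over the selected accuracies can never fall below the average of either method taken alone.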
Steps to reproduce
The detailed methodology for collecting and storing the data is outlined in the paper "Bringing Automated Fault Detection and Diagnostics into the Mainstream" (https://doi.org/10.1115/1.4047958). The sensors and data logger recorded each input variable continuously at 1-minute intervals on a rooftop unit (RTU) operating during a summer period at an industrial facility in Connecticut. Two datasets are used in this study, provided on two separate sheets of the Excel file. The first dataset contains a total of 4,284 data points (three days' worth of data); however, after excluding the data points where the RTU was off, only 3,336 data points, covering both faulty and fault-free operation with a total of 30 features, were used. The second dataset has the same features and contains 2,873 data points (two days' worth of data), of which only 2,099 are used after omitting the system-off data points. The second dataset differs from the first in that only one fault is assumed to occur per data point. Both datasets come from the same RTU, were collected with the same sensors, and have the same features.
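The system-off filtering step can be sketched as below. How the off state is recorded in the Excel file is not stated here, so the boolean column `rtu_on`, the column `supply_air_temp_f`, and the function name `drop_system_off` are hypothetical. The last lines compute the share of points retained in each dataset from the counts reported above.

```python
def drop_system_off(rows, status_key="rtu_on"):
    """Keep only the rows recorded while the unit was running."""
    return [row for row in rows if row[status_key]]


# Made-up 1-minute rows; the column names are assumptions, not the real headers.
rows = [
    {"minute": 0, "rtu_on": True, "supply_air_temp_f": 55.2},
    {"minute": 1, "rtu_on": False, "supply_air_temp_f": 71.0},
    {"minute": 2, "rtu_on": True, "supply_air_temp_f": 54.8},
    {"minute": 3, "rtu_on": False, "supply_air_temp_f": 72.3},
]
kept = drop_system_off(rows)
print(len(kept), "of", len(rows), "toy rows kept")

# Share of data points retained, using the counts reported in the text.
retained_first = 3336 / 4284
retained_second = 2099 / 2873
print(f"first dataset: {retained_first:.1%}, second dataset: {retained_second:.1%}")
```

With the reported counts, roughly 78% of the first dataset and 73% of the second remain after the system-off points are removed.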