Dataset: A generalisable data-driven filtering methodology for Energy Performance Certification databases

Published: 15 March 2024 | Version 1 | DOI: 10.17632/hxpmt994js.1
Kumar Raushan


The dataset supports the research conducted in "Defining a data-driven standardized generalizable methodology to validate a large EPC dataset, a case study in Ireland" by Raushan et al. (2024). It consists of a filtered Energy Performance Certificate (EPC) database for residential buildings in Ireland, obtained through rigorous data validation methods to eliminate erroneous entries and outliers. EPCs contain crucial details regarding building energy efficiency and characteristics. The original EPC database for Ireland is publicly accessible but contains over 1 million unfiltered entries with inconsistent and erroneous values, potentially biasing analysis. This processed dataset enhances the quality and reliability of EPC data for applications in building stock modeling and research. The data is openly available in .CSV format, accompanied by the methodology employed for processing the raw database, documented in comprehensive Python scripts. Supplementary notes and metadata provide insights into the filtering process, experimental design, and details of 213 variables categorized into informational, thermophysical, geometric, and system attributes. By making this standardized, data-driven filtered EPC dataset accessible, the research empowers stakeholders, both novice and expert, to utilize this higher quality input for understanding and analyzing the Irish housing stock.


Steps to reproduce

1. Load: This step imports the required Python packages and loads the unfiltered EPC dataset into a dataframe. The Sustainable Energy Authority of Ireland (SEAI) hosts the national (unfiltered) EPC dataset, which is freely available and accessible online.
1.1. Download the unfiltered EPC dataset in tab-delimited (.txt) format from SEAI using the link provided.
1.2. Note the folder location where the unfiltered file is saved.
1.3. Using Visual Studio (VS), or an alternative IDE of choice, update the folder location in execution order 2 of Python_Processing_Script.ipynb with the saved location from step 1.2. In this study, VS is used to execute a Jupyter Notebook (.ipynb) with a Python kernel.
1.4. Execute Python_Processing_Script.ipynb using the “Run All” command in Visual Studio.
1.5. The necessary Python libraries are loaded automatically, including “pandas”, “numpy”, “pyplot”, “statistics”, “sklearn”, “plotly”, and “mpl_toolkits”.
2. Check: This section of the Python_Processing_Script.ipynb script adds an essential HEX UID column to the dataset and processes the informational data to check for consistency errors. This step does not remove any erroneous data; instead, it creates a copy of each affected column and updates the entries to be consistent with the remainder of the dataset, e.g. the “County Name” column is duplicated and then updated to provide a consistent naming format across all counties.
2.1. A unique HEX ID (HEX UID) is added to each entry in the dataset.
2.2. The Python_Processing_Script.ipynb script checks informational data relating to dwelling location (ensuring locations exist), date of assessment (not in the future, and not before the establishment of the EPC programme), year of construction (not in the future), energy rating, and CO2 rating (negative CO2 ratings identified).
2.3. EPC variable names are updated to correspond with the European Building Stock Observatory standard naming convention.
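The Load/Check stage can be sketched in pandas as follows. This is a minimal illustration only: the sample dataframe, column names, and the index-based HEX UID scheme are hypothetical stand-ins, not the exact logic of Python_Processing_Script.ipynb, and the real dataset would be read from the SEAI tab-delimited file (e.g. with pd.read_csv(path, sep="\t")).

```python
import pandas as pd

# Hypothetical sample standing in for the tab-delimited SEAI export;
# the real column names and values in the EPC dataset may differ.
raw = pd.DataFrame({
    "CountyName": ["Co. Dublin", "DUBLIN", "Co. Cork"],
    "DateOfAssessment": ["2015-06-01", "2031-01-01", "2012-03-15"],
    "Year_of_Construction": [1995, 2005, 2030],
})

# Step 2.1: add a unique HEX UID per entry (index-based sketch).
raw["HEX_UID"] = raw.index.map(lambda i: format(i, "08X"))

# Step 2.2: flag (not remove) inconsistent informational data,
# e.g. assessment dates or construction years in the future.
today = pd.Timestamp("2024-03-15")
raw["assessment_ok"] = pd.to_datetime(raw["DateOfAssessment"]) <= today
raw["construction_ok"] = raw["Year_of_Construction"] <= today.year

# Step 2: duplicate-and-normalise the county column rather than
# overwrite it, keeping the original values intact.
raw["CountyName_clean"] = (
    raw["CountyName"].str.upper().str.replace("CO. ", "", regex=False)
)
```

Note that the check stage only adds flag and cleaned-copy columns; no rows are dropped at this point, matching the description above.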
3. Identify:
3.1. Using the statistical Python packages listed above, the distribution characteristics of each variable are analysed and a filter rule applied to identify outliers.
3.2. An appropriate data-driven method identifies data that is either erroneous or an outlier within the dataset, setting upper and lower bounds that exclude this erroneous/outlier data.
3.3. Where bimodal data is detected, the data is segmented, e.g. Floor Area is subdivided by building type and by pre-/post-thermal-regulation year of construction.
4. Apply:
4.1. Outlier limits are applied to all relevant variables (see Section 3, Table 2 of the associated paper).
4.2. All data identified as outlier or erroneous is saved to a separate file (outlier_erroneous_data.csv) for user review. This file is saved to the original save location identified in step 1.2.
4.3. All filtered data is output in CSV format to the original save location identified in step 1.2, titled “filtered.csv”.
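The Identify/Apply stages can be sketched as below. This is a generic illustration, not the paper's actual filter rules: the toy data, the Tukey-style 1.5×IQR bounds, and the BuildingType segmentation are assumptions standing in for the distribution-specific rules documented in the study.

```python
import pandas as pd

# Toy floor-area data segmented by building type, standing in for the
# bimodal segmentation described in step 3.3 (values are invented).
df = pd.DataFrame({
    "BuildingType": ["House"] * 6 + ["Apartment"] * 6,
    "FloorArea": [80, 95, 100, 110, 120, 900,   # 900 = erroneous entry
                  45, 50, 55, 60, 65, 2],       # 2   = erroneous entry
})

def iqr_bounds(s, k=1.5):
    """Tukey-style bounds: a generic stand-in for the paper's filter rules."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Step 3: derive upper/lower bounds per segment, not globally.
masks = []
for _, group in df.groupby("BuildingType"):
    lo, hi = iqr_bounds(group["FloorArea"])
    masks.append(group["FloorArea"].between(lo, hi))
keep = pd.concat(masks).sort_index()

# Step 4: split the data into the filtered output and the
# outlier/erroneous records kept aside for user review.
filtered = df[keep]     # would be written to "filtered.csv"
outliers = df[~keep]    # would be written to "outlier_erroneous_data.csv"
```

Segmenting before computing bounds matters: a single global threshold over a bimodal variable (houses and apartments pooled) would misclassify valid entries at the edges of each mode.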


Technological University Dublin College of Engineering and Built Environment


Energy Efficiency Certificate