Synthetic dataset on eco-innovation for handling missing data

Published: 19 September 2025 | Version 1 | DOI: 10.17632/v88pwnjz79.1

Description

This dataset article describes the curation and preprocessing of the 2024 Eco-Innovation Index (EII) dataset, published by the European Commission. The raw dataset (in .xlsx format) was filtered to focus on the 2024 report, and missing values in the "Water Productivity" indicator were addressed via two imputation methods: (1) EU27 mean substitution and (2) cluster-based mean imputation using K-means, an unsupervised machine learning algorithm.
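The two imputation strategies named above can be illustrated with a minimal sketch. It uses only the Python standard library and a small hand-written k-means; the description does not state which software, number of clusters, or initialisation was used, so `kmeans_labels`, `k=2`, the `seed`, and all numeric values below are illustrative assumptions rather than the authors' procedure. The fallback to the overall mean when a cluster has no observed value is likewise an assumption.

```python
import random
from statistics import mean


def mean_impute(values):
    """Fill None entries with the mean of all observed entries."""
    fill = mean([v for v in values if v is not None])
    return [fill if v is None else v for v in values]


def kmeans_labels(points, k, iters=100, seed=0):
    """Basic Lloyd's k-means (illustrative); returns one label per point."""
    centroids = random.Random(seed).sample(points, k)
    labels = []
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            updated.append([mean(col) for col in zip(*members)]
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels


def cluster_mean_impute(values, labels):
    """Fill None entries with the mean of observed values in the same cluster."""
    overall = mean([v for v in values if v is not None])
    filled = []
    for v, lab in zip(values, labels):
        if v is None:
            peers = [u for u, l in zip(values, labels)
                     if l == lab and u is not None]
            # Assumption: fall back to the overall mean for an empty cluster.
            v = mean(peers) if peers else overall
        filled.append(v)
    return filled


# Toy data (made up): six "countries", two other indicators, one target column.
features = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
wp = [1.0, None, 3.0, 10.0, 12.0, None]

labels = kmeans_labels(features, k=2)           # clustering without the target column
dataset_a_wp = mean_impute(wp)                  # approach (1): overall mean
dataset_c_wp = cluster_mean_impute(wp, labels)  # approach (2): cluster mean
```

Clustering is run on the other indicators only, so the column with missing values never influences the cluster assignment. In practice a library implementation such as scikit-learn's `KMeans` would normally replace the hand-written routine.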

Steps to reproduce

The Eco-Innovation Index 2024 dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0), which means it can be used provided that appropriate credit is given and any changes are indicated.

After the Eco-Innovation Index 2024 dataset (European Commission, 2025) was obtained, it was preprocessed to make it suitable for use by applying three filters:
- Year: Retained only 2024 data, removing previous years' results (2014-2023) to focus the analysis scope.
- Indicator: Selected only the 12 core EII indicators.
- Country: Removed EU27 aggregate values, which represent combined European Union results for specific indicators, themes or the composite EII score.

After the filters were applied, a new Excel worksheet was created and the filtered data were copied into it, preserving the filtered state without keeping active Excel filters. The columns iso2, year, indicator_code, perf, and type were then removed, retaining only country, value, and indicator. Next, the indicator column was pivoted into 12 separate columns, transforming the dataset into a 28x13 matrix (27 EU Member States plus a header row, by one country-name column plus 12 indicator columns). This is the Initial dataset.

Missing values were identified in the Water Productivity (WP) indicator for five countries (Austria, Finland, Ireland, Italy, Portugal). Three approaches were implemented, resulting in datasets A, B and C:
- Dataset A: missing cells filled with the average of the Water Productivity indicator.
- Dataset B: the dataset without the Water Productivity indicator column, processed through k-means clustering.
- Dataset C: missing cells filled with the Water Productivity average of the cluster to which each country belongs; generated from the clustering of dataset B.

Datasets A and C are suitable for later comparison and study of missing data imputation.
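For readers who prefer a scripted alternative to the spreadsheet steps, the filtering and pivoting can be sketched as follows. This is an illustrative sketch using only the Python standard library, not the procedure actually used. The column names come from the description above. The function name `filter_and_pivot`, the label `"EU27"` for the aggregate rows, the indicator name `"Indicator X"`, and all values are hypothetical.

```python
def filter_and_pivot(rows, year, core_indicators, aggregate="EU27"):
    """Keep one year, the core indicators and individual countries,
    then pivot long rows into {country: {indicator: value}}."""
    wide = {}
    for r in rows:
        if r["year"] != year:
            continue  # Year filter
        if r["indicator"] not in core_indicators:
            continue  # Indicator filter
        if r["country"] == aggregate:
            continue  # Country filter: drop the aggregate rows
        # Start each country with every indicator set to None (missing).
        row = wide.setdefault(r["country"], dict.fromkeys(core_indicators))
        # Only country, indicator and value are carried over.
        row[r["indicator"]] = r["value"]
    return wide


# Toy long-format rows (made up); the real file has more columns and rows.
rows = [
    {"country": "Austria", "year": 2024, "indicator": "Indicator X", "value": 1.5},
    {"country": "Austria", "year": 2023, "indicator": "Indicator X", "value": 1.4},
    {"country": "Belgium", "year": 2024, "indicator": "Indicator X", "value": 2.0},
    {"country": "Belgium", "year": 2024, "indicator": "Water Productivity", "value": 3.0},
    {"country": "EU27", "year": 2024, "indicator": "Water Productivity", "value": 2.5},
]
core = ["Indicator X", "Water Productivity"]

initial = filter_and_pivot(rows, 2024, core)
missing_wp = [c for c, vals in initial.items() if vals["Water Productivity"] is None]
```

Here `initial` plays the role of the Initial dataset, and `missing_wp` lists the countries whose Water Productivity cell is empty and would need imputation.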

Institutions

Instituto Nacional de Tecnologia, Universidade Federal Fluminense

Categories

Sustainability, Innovation, Missing Data, k-means Clustering, Green Innovation

Licence