Machine learning for corrosion database
This database was originally created for the scientific article "Reviewing Machine Learning of corrosion prediction: a data-oriented perspective".

L.B. Coelho¹, D. Zhang², Y.V. Ingelgem¹, D. Steckelmacher³, A. Nowé³, H.A. Terryn¹

¹ Department of Materials and Chemistry, Research Group Electrochemical and Surface Engineering, Vrije Universiteit Brussel, Brussels, Belgium
² Beijing Advanced Innovation Center for Materials Genome Engineering, National Materials Corrosion and Protection Data Center, Institute for Advanced Materials and Technology, University of Science and Technology Beijing, Beijing, China
³ VUB Artificial Intelligence Lab, Vrije Universiteit Brussel, Brussels, Belgium

Several metrics can be used to evaluate the prediction accuracy of regression models. However, only papers reporting relative metrics (MAPE, R²) were included in this database. As far as possible, we included descriptors of all major steps of the ML procedure, including data collection (“Data acquisition”), data cleaning and feature engineering (“Feature reduction”), model validation (“Train-Test split”*), etc.

*The total dataset is typically split into a training set and a testing set (unseen data) to evaluate the performance of the model.

In some studies, only the training or only the testing performance was reported; in those cases, “?” marks were added in the respective evaluation metric field(s). For studies employing cross-validation (“CV”) on the dataset, the “Average R²” was sometimes considered. For a detailed description of the basic ML procedures, the reader may refer to the References section of the Review article.
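The two relative metrics named above, MAPE and R², can be illustrated with a minimal sketch in plain Python. The function names `mape` and `r2` and the sample values are assumptions made for this example; they are not taken from the database or its notebooks. The sketch uses the standard textbook definitions, which individual papers in the database may apply with small variations.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent.

    Undefined when any true value is zero.
    """
    errors = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(errors) / len(errors)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up values for illustration only (not taken from any reference).
y_true = [1.0, 2.0, 4.0]
y_pred = [1.1, 1.8, 4.0]

print(f"MAPE = {mape(y_true, y_pred):.2f} %")
print(f"R2   = {r2(y_true, y_pred):.4f}")

# Expressing the same data in another unit (here multiplied by 1000)
# changes neither metric, up to floating-point rounding.
scaled_true = [1000.0 * t for t in y_true]
scaled_pred = [1000.0 * p for p in y_pred]
print(f"MAPE (rescaled) = {mape(scaled_true, scaled_pred):.2f} %")
print(f"R2 (rescaled)   = {r2(scaled_true, scaled_pred):.4f}")
```

Because both metrics are unchanged when the target is expressed in different units, they allow a rough comparison between studies that predict different corrosion quantities, which an absolute metric such as RMSE would not.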
Steps to reproduce
The “Machine learning for corrosion database”, comprising the references [1-19], was built as a Pandas DataFrame (Python language). The database and associated code are available online on GitHub (Jupyter Notebook files).

The target property of corrosion prediction varies between studies and is usually described by many descriptors. For example, the corrosion potential “E (mV)” might be an input (“Selected features”) and/or an output (“Targets”). The “Feature importance” column was filled in only when formal feature importance methods were employed. In the “Type of ML model” column, the “Ensemble” type does not refer to combinations of models with feature engineering (for example, “PCA-RF” was considered a “Tree-based” type of model, not an “Ensemble”). For a summarized description of the ML models discussed here, the reader should refer to the “Models description” feature.

To plot Fig. 7 of the Review article, the following considerations were applied to the “Targets” attribute: when the corrosion rate “CR” was given in “µA/cm²”, it was considered as “I” (current); “material loss” represents “weight loss” or “volume loss”; “depth” stands for “defect growth” or “corrosion depth”; and “crack growth rate” was considered as “CR”.

The last columns of the DataFrame condense the knowledge gathered from the references in the following form: “Premises/challenges”, “Achievements”, “Contributions”, and “Critical points”.
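The grouping rules applied to the “Targets” attribute for Fig. 7 can be sketched as a small Python function. This is an illustration only, not the code from the GitHub notebooks: the function name `normalize_target`, the `SYNONYMS` table, and the assumption that each label is stored as “name (unit)” (as in “E (mV)”) are all hypothetical, and the actual strings in the database may differ.

```python
MICRO = "\u00b5"      # micro sign
GREEK_MU = "\u03bc"   # Greek small letter mu, often used interchangeably
CURRENT_DENSITY_UNIT = MICRO + "A/cm\u00b2"

# Hypothetical raw labels mapped to the categories described in the text.
SYNONYMS = {
    "weight loss": "material loss",
    "volume loss": "material loss",
    "defect growth": "depth",
    "corrosion depth": "depth",
    "crack growth rate": "CR",
}


def normalize_target(label):
    """Map a raw "Targets" label such as "CR (mm/year)" to its plot category."""
    # Split "name (unit)" into its parts; the unit is optional.
    name, _, rest = label.partition("(")
    name = name.strip()
    unit = rest.strip().rstrip(")").strip().replace(GREEK_MU, MICRO)

    # A corrosion rate reported as a current density is treated as current "I".
    if name == "CR" and unit == CURRENT_DENSITY_UNIT:
        return "I"

    # Merge synonymous targets; unknown names are returned unchanged.
    return SYNONYMS.get(name.lower(), name)


# Made-up labels for illustration only.
examples = [
    "CR (" + CURRENT_DENSITY_UNIT + ")",
    "CR (mm/year)",
    "weight loss",
    "corrosion depth (mm)",
    "crack growth rate",
    "E (mV)",
]
for raw in examples:
    print(raw, "->", normalize_target(raw))
```

With the database loaded as a Pandas DataFrame, a function of this kind could be applied to each entry of the “Targets” column before counting categories, provided the column stores one label per entry; if an entry lists several targets, it would first need to be split.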