Dataset and code for "Priority PFAS and control areas in the Taihu Basin based on genotoxicity risk and industrial source identification"

Name: Dataset and code for "Priority PFAS and control areas in the Taihu Basin based on genotoxicity risk and industrial source identification"
Creator: 王 新怡
Published: 2026-05-26T11:44:19.767Z
Keywords: Machine Learning Algorithm


doi:10.17632/42g93cwx7b.1


Published: 26 May 2026 | Version 1 | DOI: 10.17632/42g93cwx7b.1

Contributor:

王新怡

Description

This dataset contains the raw data used in the study and the key code used for the machine-learning analysis.

The raw data files include: the compiled dataset used for machine-learning model development and analysis; industrial land area and perimeter data for the three sub-regions of the Taihu Basin; PFAS toxicity data on fish transcriptomic disruption, exported from the EPA database and used to construct the species sensitivity curve for PFAS-induced fish transcriptomic disruption; and a vector file containing the spatial coordinates, area, and other attributes of all industrial zones, which can be imported into ArcGIS for visualization and spatial analysis.

The code files include: machine-learning scripts for three tree-based models; SHAP interpretation scripts for the three tree-based models, based on PermutationExplainer; SHAP analysis scripts for the XGBoost model, based on TreeExplainer; and a script that calculates 10-fold cross-validation R² to compare the stability of the three models.
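For readers unfamiliar with the stability comparison mentioned in the description, the following is a minimal sketch of how a 10-fold cross-validation R² can be computed. It is not the script shipped with this dataset. The data are synthetic, and the helper names (`r2_score`, `fit_stump`, `fit_mean`, `kfold_r2`) are hypothetical. A depth-1 regression tree (a stump) stands in for the tree-based models so that the example needs only the Python standard library; in practice a fitted XGBoost or other tree-ensemble model would take its place.

```python
import random
import statistics


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_mean(X, y):
    """Baseline: always predict the training mean."""
    mean_y = statistics.fmean(y)
    return lambda row: mean_y


def fit_stump(X, y):
    """Depth-1 regression tree: pick the single split with the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= thr]
            right = [yi for row, yi in zip(X, y) if row[j] > thr]
            ml, mr = statistics.fmean(left), statistics.fmean(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, ml, mr)
    if best is None:  # no feature has two distinct values
        return fit_mean(X, y)
    _, j, thr, ml, mr = best
    return lambda row: ml if row[j] <= thr else mr


def kfold_r2(X, y, fit, k=10, seed=0):
    """Return one held-out R² per fold for the model produced by `fit`."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint test folds
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = [model(X[i]) for i in test_idx]
        scores.append(r2_score([y[i] for i in test_idx], preds))
    return scores


# Synthetic data (an assumption for illustration): a noisy step in feature 0.
rng = random.Random(42)
X = [[rng.uniform(0, 1), rng.uniform(0, 1)] for _ in range(200)]
y = [(3.0 if row[0] > 0.5 else 0.0) + rng.gauss(0, 0.3) for row in X]

# Same folds (same seed) for every model, so the scores are comparable.
results = {name: kfold_r2(X, y, fit, k=10)
           for name, fit in [("stump", fit_stump), ("mean baseline", fit_mean)]}

for name, scores in results.items():
    print(f"{name}: mean R2 = {statistics.fmean(scores):.3f}, "
          f"sd = {statistics.stdev(scores):.3f}")
```

Reporting the per-fold spread alongside the mean is what makes this a stability comparison: a model with a similar mean R² but a smaller standard deviation across folds is less sensitive to which samples it was trained on.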
