lightbgm_shapley code

Published: 19 January 2026 | Version 1 | DOI: 10.17632/bxmdt72wp2.1
Contributor:
Dongling Ma

Description

In this study, we employed a LightGBM-based machine learning framework to analyze the response of soil erosion to extreme precipitation. The workflow consisted of several key steps:
1. Data preparation and splitting: The dataset was read from a CSV file, with the last column defined as the target variable and the remaining columns as predictors. Data were randomly split into training and testing sets (80%–20%) using a fixed random seed to ensure reproducibility.
2. Hyperparameter optimization: To improve model performance, the hyperparameters of the LightGBM model (including num_leaves, learning_rate, feature_fraction, min_child_weight, subsample, and colsample_bytree) were optimized using Hyperopt with the tree-structured Parzen estimator (TPE) algorithm. Five-fold cross-validation was applied on the training set, and the mean RMSE was used as the optimization objective. A total of 1000 evaluations were performed to identify the best combination of hyperparameters.
3. Model training and evaluation: Using the optimized parameters, the LightGBM model was trained under five-fold cross-validation on the entire dataset to assess predictive performance. The model was evaluated using the RMSE, MAE, and R² metrics. Finally, the model was retrained on the full dataset to obtain a final predictive model, and its accuracy was verified on the held-out test set.
4. SHAP-based interpretation: To interpret the contribution of each predictor to soil erosion, we employed the SHAP (SHapley Additive exPlanations) framework. Both global (summary, bar, and beeswarm plots) and local (dependence, waterfall, and force plots) explanations were generated to reveal the relative importance and nonlinear interactions of the driving factors. The mean absolute SHAP values of each predictor were calculated and visualized to quantify their overall contributions.
5. Visualization and reproducibility: All SHAP-based plots and the feature importance table were saved for further analysis.
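The splitting and evaluation logic in steps 1 and 3 can be illustrated with a minimal sketch that uses only the Python standard library. It is not the published code, which presumably relies on LightGBM and related library utilities. The function names, the seed value of 42, and the way folds are formed are assumptions made for illustration only.

```python
import math
import random


def train_test_split_indices(n_rows, test_fraction=0.2, seed=42):
    """Shuffle row indices with a fixed seed, then split into (train, test).

    The seed value is an assumption; the description only states that a
    fixed seed was used.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists for k roughly equal folds."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R² for paired observations and predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Example: an 80%-20% split of 100 rows, then five folds on the training part.
train_idx, test_idx = train_test_split_indices(100)
folds = list(k_fold_indices(train_idx, k=5))
```

In a cross-validated objective such as the one in step 2, the RMSE from `regression_metrics` would be computed once per fold and averaged across the five folds.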
This framework allows flexible adaptation to datasets of different spatial and temporal scales, ensuring robustness and reproducibility.
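The feature importance table mentioned in step 4 rests on a simple aggregation: for each predictor, average the absolute SHAP values over all samples, then rank the predictors. The sketch below shows that calculation in plain Python. The function name, the feature names, and the toy SHAP matrix are hypothetical and are not results from this dataset. In practice the matrix would typically come from the `shap` library, for example from `shap.TreeExplainer(model).shap_values(X)`.

```python
def mean_abs_shap(shap_values, feature_names):
    """Rank features by mean absolute SHAP value, largest first.

    shap_values: list of rows, one per sample, one column per feature.
    """
    n_samples = len(shap_values)
    importance = {
        name: sum(abs(row[j]) for row in shap_values) / n_samples
        for j, name in enumerate(feature_names)
    }
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


# Hypothetical predictors and toy SHAP values, for illustration only.
feature_names = ["extreme_precip", "slope", "vegetation_cover"]
toy_shap = [
    [0.5, -0.25, 0.0],
    [-1.5, 0.25, 1.0],
]

ranking = mean_abs_shap(toy_shap, feature_names)
for name, value in ranking:
    print(f"{name}: {value:.2f}")
```

Taking absolute values before averaging matters because positive and negative contributions would otherwise cancel, which would hide predictors that push predictions strongly in both directions.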

Categories

Code Breaking