Raw dataset and R scripts for: Unravelling spatial drivers of topsoil total carbon variability in tropical paddy soils of Sri Lanka

Published: 20 December 2023 | Version 3 | DOI: 10.17632/dnw2v82r8y.3


This dataset comprises the raw data, the raster files of the environmental covariates used for modelling, and the R script describing the flow of analyses for the research article entitled: Unravelling the spatial drivers of topsoil total carbon concentration variability in paddy-growing soils in tropical agro-ecosystems of Sri Lanka. The study aimed to identify the spatial drivers of, and to estimate, total carbon (TC) concentration in topsoil (0-15 cm) across paddy-growing regions in tropical climates, using Sri Lanka as a case study.
Two distinct sampling strategies were used to collect soil samples for model calibration and validation. For model calibration, 888 locations were sampled using a conditioned Latin Hypercube sampling approach. A further 99 sites were selected using a design-based stratified random strategy for independent evaluation of the developed models. Total carbon concentration (%) was analysed by automated dry combustion using a 2400 Series II CHN Elemental Analyser.
Geospatial modelling of TC concentration was carried out with two distinct random forest models using a range of environmental covariates. The covariates used in the analyses are: mean annual rainfall (Rainfal_N), annual average mean temperature (Temp_N), annual average minimum temperature (Temp_Min_N), annual average maximum temperature (Temp_Max_N), vapour pressure deficit (VPD_N), MODIS enhanced vegetation index (Modis_N), SAGA wetness index (SAGA_WI_N), slope angle (Slope_d_N) and elevation (DEM_N). All environmental covariates were resampled to a spatial resolution of 100 m prior to spatial analysis. Furthermore, we applied a novel area of applicability (AOA) calculation to identify and quantify regions where the current prediction is less reliable.
In addition to the AOA analysis, the uncertainty of the TC prediction (%) was expressed as a 90% prediction interval. The influence of increasing the number of calibration sites on model prediction quality and reliability was assessed using a user-defined sequence of calibration set sizes (n=200, n=300, n=400, n=500, n=600, n=700, n=800, n=888). For more information on the study area, sampling design, analytical data generation, modelling, and interpretation of the data, please refer to the research article mentioned above.
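As a rough illustration of the two ideas described above (a 90% prediction interval and a sequence of calibration set sizes), the following minimal sketch may help. It is written in Python for brevity, whereas the deposited scripts are in R, and it is not the authors' implementation. The function names `prediction_interval_90` and `subsample_sites` are hypothetical, the per-tree predictions are synthetic, and taking the 5th and 95th percentiles of per-tree predictions is an assumed simplification (the study may, for example, have used a quantile-based forest instead).

```python
import random
import statistics

# Calibration set sizes listed in the description: n = 200 ... 800, plus all 888 sites.
CALIBRATION_SIZES = list(range(200, 801, 100)) + [888]


def prediction_interval_90(tree_predictions):
    """Return (lower, upper) bounds of an empirical 90% interval.

    tree_predictions: per-tree TC predictions (%) for a single location.
    Assumption: the interval is taken as the 5th to 95th percentile range.
    """
    # n=20 yields cut points at 5%, 10%, ..., 95%; keep the first and last.
    cuts = statistics.quantiles(tree_predictions, n=20, method="inclusive")
    return cuts[0], cuts[-1]


def subsample_sites(site_ids, n, seed=0):
    """Draw n calibration sites without replacement (hypothetical helper)."""
    rng = random.Random(seed)
    return rng.sample(site_ids, n)


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic per-tree predictions for one pixel; not values from the dataset.
    trees = [rng.gauss(1.5, 0.3) for _ in range(500)]
    lower, upper = prediction_interval_90(trees)
    print(f"90% interval width: {upper - lower:.2f} % TC")

    sites = list(range(888))  # placeholder IDs for the 888 calibration sites
    for n in CALIBRATION_SIZES:
        subset = subsample_sites(sites, n)
        print(f"n={n}: {len(subset)} sites drawn for refitting")
```

In a workflow of this kind, a model would be refitted on each subset and evaluated against the 99 independent validation sites, so that changes in prediction quality and interval width can be compared across calibration set sizes.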



University of Sydney, Commonwealth Scientific and Industrial Research Organisation, National Institute of Fundamental Studies


Machine Learning, Spatial Modeling, Soil Carbon, Sri Lanka, Paddy Soil, Digital Soil Mapping


National Research Council Sri Lanka

NRC 17-011