Spatial Clusters of Childhood Cancer: Benchmarking Data. As published in Schündeln et al. 2021 Cancer Epidemiology & Data in Brief

Published: 04-01-2021 | Version 5 | DOI: 10.17632/3hrg9tpsx9.5
Michael Schündeln


Incidence of newly diagnosed childhood cancer (140/1,000,000 children under 15 years) and nephroblastoma (7/1,000,000) was simulated. Clusters of defined size (1-50) were randomly assembled at the district level in Germany. Each cluster was simulated with ten different relative risk (RR) levels (1 to 100). For each combination, 2000 iterations were run. The simulated data were then analysed with three local clustering tests: the Besag-Newell (BN) method, the spatial scan statistic (SSS) and the Bayesian Besag-York-Mollié approach with Integrated Nested Laplace Approximation (BYM). See references for published manuscripts.

RAW DATA: The simulated raw data are reported in the Rdata files "AllMalignancies.Rdata" and "NephroblastomaSimulation.Rdata". Each file contains 6 lists for the different cluster sizes ("Cluster Size X"). Each of these lists holds 2000 simulations of clusters at 10 different risk levels ("RR Y Cluster") and the corresponding simulated cases for each scenario ("RR Y SimCases"). In addition, each file contains the population of children under 15 years for each district ("District Population") and the expected cases per district for the respective entity, all cancers or nephroblastoma ("Expected Cases"). An adjacency matrix for the 402 German districts is provided as a separate Rdata file.
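The relationship between district population, expected cases and simulated cases described above can be illustrated with a minimal Python sketch. It is not the authors' R code. The incidence rates are taken from the description, but the Poisson sampling model, the function names and the example populations are assumptions made for illustration only.

```python
import math
import random

# Incidence rates from the dataset description (per child under 15 years).
INCIDENCE_ALL_CANCER = 140 / 1_000_000
INCIDENCE_NEPHROBLASTOMA = 7 / 1_000_000


def expected_cases(populations, incidence):
    """Expected cases per district: population under 15 times incidence."""
    return [pop * incidence for pop in populations]


def poisson_draw(mean, rng):
    """Draw one Poisson variate (assumed sampling model, not from the source).

    Uses Knuth's multiplication method on chunks of the mean, relying on
    the fact that a sum of independent Poisson draws is again Poisson.
    """
    total = 0
    remaining = float(mean)
    while remaining > 0:
        chunk = min(remaining, 30.0)
        threshold = math.exp(-chunk)
        count, product = 0, rng.random()
        while product > threshold:
            count += 1
            product *= rng.random()
        total += count
        remaining -= chunk
    return total


def simulate_cases(expected, cluster, relative_risk, rng):
    """Simulate case counts, multiplying the expectation by the relative
    risk for districts whose index is in `cluster`."""
    return [
        poisson_draw(e * (relative_risk if i in cluster else 1.0), rng)
        for i, e in enumerate(expected)
    ]


if __name__ == "__main__":
    rng = random.Random(1)
    populations = [30_000, 45_000, 12_000, 80_000]  # hypothetical districts
    expected = expected_cases(populations, INCIDENCE_ALL_CANCER)
    print(simulate_cases(expected, cluster={1, 2}, relative_risk=5, rng=rng))
```

In the published data set the cluster membership and the simulated counts are already stored per scenario, so a sketch like this is only useful for understanding how the stored quantities relate to each other.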
The code and the GADM shape files needed to reproduce the original simulation and published study are available at:

ANALYZED DATA: The operating characteristics of each of the cluster detection methods and scenarios in this study are reported according to the quality criteria detailed below ("Analyzed Data.xlsx").

Minimum Power (MP): Proportion of simulations detecting at least one district of the true cluster
Exact Power (EP): Proportion of simulations detecting the true cluster without false positives
Sensitivity (sens): Proportion of correctly detected districts in the true cluster
Specificity (spec): Percentage of normal-risk districts correctly classified as normal-risk districts
Positive predictive value (PPV): Proportion of districts in the detected cluster belonging to the true cluster
Negative predictive value (NPV): Proportion of districts not labeled as a risk cluster that are not part of the true cluster
Correct classification (CC): Percentage of correctly classified districts of all districts
Correct proportion (CP): Correctly labeled districts of all detected potential high-risk (HR) districts
Positive diagnostic likelihood (PDL): The probability of HR districts being detected divided by the probability of non-HR districts being detected
Negative diagnostic likelihood (NDL): The probability of HR districts not being detected divided by the probability of non-HR districts not being detected
False positive rate (FPR): Incorrectly labeled HR districts of all detected HR districts
False negative rate (FNR): Incorrectly labeled normal-risk districts of all detected normal-risk districts
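The quality criteria above can be made concrete with a short Python sketch. It is only an illustration of the definitions as worded here, not the code used to produce "Analyzed Data.xlsx". Districts are represented by hypothetical integer indices, all function names are assumptions, and EP is read as "the detected set equals the true cluster exactly". CP is not computed separately.

```python
def _ratio(numerator, denominator):
    """Safe division that returns NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def district_metrics(true_cluster, detected, n_districts):
    """District-level criteria for a single simulation run."""
    true_cluster, detected = set(true_cluster), set(detected)
    tp = len(true_cluster & detected)   # cluster districts that were detected
    fp = len(detected - true_cluster)   # normal-risk districts flagged as HR
    fn = len(true_cluster - detected)   # cluster districts that were missed
    tn = n_districts - tp - fp - fn     # normal-risk districts left unflagged
    sens = _ratio(tp, tp + fn)
    spec = _ratio(tn, tn + fp)
    return {
        "sens": sens,
        "spec": spec,
        "PPV": _ratio(tp, tp + fp),
        "NPV": _ratio(tn, tn + fn),
        "CC": _ratio(tp + tn, n_districts),
        "PDL": _ratio(sens, 1 - spec),
        "NDL": _ratio(1 - sens, spec),
        # As worded above: shares of the detected HR / normal-risk districts.
        "FPR": _ratio(fp, tp + fp),
        "FNR": _ratio(fn, tn + fn),
    }


def power_summary(true_cluster, detections):
    """Minimum and exact power over a list of detected sets, one per run."""
    true_cluster = set(true_cluster)
    runs = len(detections)
    minimum = sum(1 for d in detections if true_cluster & set(d))
    exact = sum(1 for d in detections if set(d) == true_cluster)
    return {"MP": _ratio(minimum, runs), "EP": _ratio(exact, runs)}
```

For example, `district_metrics({0, 1, 2}, {1, 2, 3}, n_districts=402)` would score one run in which two of three cluster districts were found and one normal-risk district was flagged.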


Steps to reproduce

Download source code at:

The three files from the repository are needed to reproduce the data and run the simulation:
1. R code: "Code Spacial Cluster Childhood Cancer"
2. Population data: "Cluster.xlsx"
3. GADM files for Germany, needed for the spatial polygon maps

The simulation as saved in the repository (for test runs) runs in about 1 minute.

Resulting files:
1. Results from the cluster detection tests ("xxxxRESULTS.xlsx")
2. Raw data from the simulation ("xxxxSIMULATEDDATA.Rdata")

The parameters of the simulation (e.g. simulation runs, RR levels, cluster size) can be modified as needed in lines 40-63 of the code.