Into the research multiverse: How decisions about data can affect the results of biomedical studies that integrate community-level data

Published: 1 October 2024 | Version 1 | DOI: 10.17632/95954hvxvj.1
Contributors:
Lily Cook, Titus Schleyer

Description

Integrating electronic health record data with environmental data has the potential to enrich biomedical research with new insights into the relationship between health and environment. However, the data preparation process carries implications that have not been fully explored. The objectives of this study were to (a) determine whether and how different data preparation decisions in the same integrated dataset affected the results of the analyses and (b) identify which decisions introduced the most variability. For this study, we repurposed a dataset from a prior study that examined the association between poor air quality days caused by wildfire smoke and pulmonary exacerbations in people with cystic fibrosis. The clinical dataset was created by querying the Cystic Fibrosis Foundation Patient Registry and pulling the data of patients treated at Oregon Health & Science University’s Cystic Fibrosis Care Center and Doernbecher Children’s Hospital from 2010 to 2019 (inclusive). Community-level data about fine particulate matter (PM2.5) was obtained from the EPA’s Air Quality System DataMart. We developed an algorithm that ran the same dataset through a variety of plausible decisions in preparing the data and generated the same statistical output for each analysis. We compared point estimate odds ratios, confidence intervals, and p-values and evaluated how data preparation approaches affected the characteristics of resulting patient cohorts. A total of 135 data preparation pathways generated 93 unique odds ratios, of which 26 appeared more than once in the results. The resulting odds ratios ranged from 0.83 to 2.93, with a mean of 1.31 (SD ±0.37). More than half (50.37%) of the results had a p-value ≤0.05. Different data preparation decisions removed up to 87.23% of patients and 93.51% of patient days. The percentage of patient days contributed by patients living in urban areas varied between 67.54% and 98.73%.

Files

Steps to reproduce

The clinical dataset used in this study cannot be publicly shared because it contains protected health information. However, we are able to share the community-level data used in the analysis and the code we used to process our data, and to provide a data dictionary of the clinical data so that others can use their own datasets to reproduce our methods. These files are intended to be used together with a patient cohort to generate an analysis.

We developed an algorithm in Python, available here as Data_Prep_Python_Code.ipynb, that prepared the same dataset in a variety of different ways. The algorithm was created by closely examining the data preparation process and splitting it into four integral steps: (1) decisions about validating participant addresses, (2) decisions about how to handle missing addresses, (3) decisions about classifying exposures, and (4) decisions about lag times. Once the algorithm had selected an option for each of the four steps, it generated the result using Fisher's exact test. The selection continued until the algorithm had generated a result for every possible pathway through the data preparation process, each pathway being a different combination of the options available at each step (Figure 1).

We then compared the results by grouping the results from each step together, taking the mean of the odds ratios that resulted from the transformation conducted at that step. Finally, we characterized the effect size of each decision by calculating the standardized mean difference using Hedges' g [33–35]:

g = (μ_1 − μ_0) / s_p,

where the pooled sample standard deviation is

s_p = sqrt(((n_0 − 1)·s_0² + (n_1 − 1)·s_1²) / (n_0 + n_1 − 2)),    (1)

and μ_0, s_0², and n_0 are the sample mean, sample variance, and sample size of the control group, while μ_1, s_1², and n_1 are the same quantities for the treatment group.
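The Hedges' g formula above can be sketched in a few lines of Python. This is a minimal illustration, not the study's code; the sample values at the bottom are made up for demonstration.

```python
import math

def hedges_g(treatment, control):
    """Standardized mean difference (Hedges' g) between two samples:
    g = (mean_treatment - mean_control) / pooled SD, where the pooled SD
    weights each group's unbiased variance by its degrees of freedom."""
    n1, n0 = len(treatment), len(control)
    mu1 = sum(treatment) / n1
    mu0 = sum(control) / n0
    # Unbiased sample variances (ddof = 1)
    s2_1 = sum((x - mu1) ** 2 for x in treatment) / (n1 - 1)
    s2_0 = sum((x - mu0) ** 2 for x in control) / (n0 - 1)
    # Pooled sample standard deviation
    s_p = math.sqrt(((n0 - 1) * s2_0 + (n1 - 1) * s2_1) / (n0 + n1 - 2))
    return (mu1 - mu0) / s_p

# Illustrative only: odds ratios grouped by one decision step ("treatment")
# versus the pooled results ("control")
step_ors = [1.45, 1.52, 1.38, 1.60]
all_ors = [1.31, 1.25, 1.40, 1.28, 1.35, 1.33]
print(hedges_g(step_ors, all_ors))
```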
The grouped odds ratios for each step were considered the “treatment” group, while the pooled mean odds ratio for all results was used as the “control.” All data cleaning, validation, statistical analyses, and figures were produced in Python 3.7 and documented in a Python notebook.
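The pathway-enumeration process described above can be sketched as follows. The step names, option lists, and 2x2 counts are hypothetical placeholders (the actual decision options are defined in Data_Prep_Python_Code.ipynb), and Fisher's exact test is implemented with the standard library so the sketch is self-contained.

```python
import itertools
import math

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]],
    summing the hypergeometric probabilities of every table that is no more
    probable than the observed one (margins held fixed)."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d
    def p_table(x):
        return math.comb(r1, x) * math.comb(r2, c1 - x) / math.comb(n, c1)
    p_obs = p_table(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs * (1 + 1e-9))

# Hypothetical options for the four decision steps (names illustrative only)
steps = {
    "address_validation": ["strict", "lenient"],
    "missing_addresses": ["drop", "impute"],
    "exposure_classification": ["aqi_threshold", "pm25_cutoff"],
    "lag_time_days": [0, 1, 3],
}

results = []
for pathway in itertools.product(*steps.values()):
    # In the real analysis each pathway re-prepares the dataset before
    # tabulation; here we just use placeholder counts for a 2x2 table
    # (exposed/unexposed x exacerbation/no exacerbation).
    a, b, c, d = 12, 88, 7, 93
    odds_ratio = (a * d) / (b * c)
    p = fisher_exact_two_sided(a, b, c, d)
    results.append((dict(zip(steps, pathway)), odds_ratio, p))

print(len(results))  # 2 * 2 * 2 * 3 = 24 pathways
```

Each entry in `results` pairs one full combination of decisions with its odds ratio and p-value, which is what allows the odds ratios to be grouped by step afterward.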

Categories

Algorithms, Data Quality, Environmental Informatics, Clinical Research Informatics, Public Health Informatics

Licence