Supplementary Materials for Optimizing Real-Time Phenotyping in Critical Care Using Machine Learning on Electronic Health Records
Description
This dataset accompanies the study "Optimizing Real-Time Phenotyping in Critical Care Using Machine Learning on Electronic Health Records," which hypothesizes that a patient's latent disease state can be continuously and accurately estimated from real-time biomedical signals without requiring full ICU trajectories. It supports replication and evaluation of our predictive framework, which dynamically models phenotype probabilities as data accumulates. All elements are reported in line with the TRIPOD statement to ensure transparency and reproducibility.
The training and test data are derived from the MIMIC-IV database and consist of vectorized representations of multivariate, irregularly sampled biomedical time series and associated phenotype labels. These were generated through a structured pipeline that includes cohort selection, event aggregation using fixed-length time bins, and feature engineering to represent both value trends and missingness. Supplementary Tables S.1 to S.6 describe the variables used in this transformation, their sources within the EHR, aggregation methods, and descriptive statistics for both static (e.g., demographics, admission data) and dynamic (e.g., vital signs, lab results, ventilator settings) features across the train and test sets.
Table S.7 summarizes the model’s real-time phenotyping performance using multiple evaluation perspectives. The results reveal strong generalization and early predictive value: in the (ls) setting, the model achieved good diagnostic performance (AUROC ≥ 0.8) for 69% of phenotypes and excellent performance (AUROC ≥ 0.9) for 30%. In the real-time (fs) setting, which uses only the earliest recorded physiological data, the model still achieved good performance for 40% of phenotypes and excellent performance for 5%, demonstrating the feasibility of early, actionable phenotyping.
The intermediate (td) evaluation shows that predictive quality improves consistently as more data becomes available, supporting the framework’s ability to track dynamic disease progression in real time.
To interpret and use the data:
- Each patient stay is represented as a multivariate time series with associated phenotype labels.
- Time series are aligned in fixed time intervals (e.g., 2 hours), where each variable is aggregated using statistical functions (e.g., mean, last, sum).
- The phenotype labels correspond to ICD-9-CM diagnostic categories assigned at discharge but are used here as latent variables to be estimated continuously.
This dataset enables reproducibility of the results and further research in developing machine learning models for early, interpretable, and actionable phenotyping in critical care.
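As an illustration of the fixed-interval alignment described above, the following is a minimal sketch, not the pipeline used in the study. It assumes events arrive as (hours since admission, variable, value) tuples and uses a 2-hour bin width; the variable names, the mapping of variables to aggregation functions, and the helper `bin_events` are hypothetical and chosen only for the example. Each bin yields an aggregated value per variable plus a 0/1 indicator that records whether the variable was observed, which is one simple way to represent missingness.

```python
from collections import defaultdict
from statistics import mean

BIN_HOURS = 2.0  # assumed bin width, following the "e.g., 2 hours" example

# Hypothetical variables and aggregation choices, for illustration only.
AGGREGATORS = {
    "heart_rate": mean,               # mean of the values in the bin
    "gcs_total": lambda xs: xs[-1],   # last observed value in the bin
    "urine_output": sum,              # total over the bin
}


def bin_events(events, n_bins, bin_hours=BIN_HOURS):
    """Aggregate (hours_since_admission, variable, value) events into fixed bins.

    Returns (values, mask): values[b][var] is the aggregated value or None,
    and mask[b][var] is 1 if the variable was observed in bin b, else 0.
    """
    buckets = defaultdict(list)
    # Sort by time so that "last" really is the latest observation in a bin.
    for t, var, value in sorted(events, key=lambda e: e[0]):
        b = int(t // bin_hours)
        if 0 <= b < n_bins and var in AGGREGATORS:
            buckets[(b, var)].append(value)

    values, mask = [], []
    for b in range(n_bins):
        row_values, row_mask = {}, {}
        for var, aggregate in AGGREGATORS.items():
            observed = buckets.get((b, var))
            row_values[var] = aggregate(observed) if observed else None
            row_mask[var] = 1 if observed else 0
        values.append(row_values)
        mask.append(row_mask)
    return values, mask


# Toy events for a single stay (made-up numbers, not taken from the dataset).
events = [
    (0.5, "heart_rate", 80.0),
    (1.5, "heart_rate", 90.0),
    (1.0, "urine_output", 50.0),
    (1.8, "urine_output", 30.0),
    (3.0, "gcs_total", 14.0),
    (3.5, "gcs_total", 15.0),
]
values, mask = bin_events(events, n_bins=3)
```

In this sketch, bins with no observation keep `None` as the value and 0 in the mask, so a downstream model can tell "not measured" apart from a measured value. How the released vectors encode missingness and value trends is specified in Supplementary Tables S.1 to S.6 and may differ from this example.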