Biased sampling confounds machine learning prediction of antimicrobial resistance
Description
Antimicrobial resistance (AMR) poses a growing threat to human health. Increasingly, genome sequencing is being applied for surveillance of bacterial pathogens, producing a wealth of data to train machine learning (ML) applications to predict AMR and identify resistance determinants. However, bacterial populations are highly structured and sampling is biased towards human disease isolates, meaning samples and derived features are not independent. This is rarely considered in applications of ML to AMR. Here, we demonstrate the confounding effects of sample structure by collecting over 24,000 whole genome sequences and AMR phenotypes from five diverse pathogens and constructing realistic pathological training data where resistance is confounded with phylogeny. We show resulting ML models perform poorly, and increasing the training sample size fails to rescue performance. A comprehensive analysis of 6,740 models identifies species- and drug-specific effects on model accuracy. We provide concrete recommendations for evaluating future ML approaches to AMR.
Steps to reproduce
This dataset stores the machine learning results and the (large) files required to train the models. To run the scripts for schemes A and B and for individual clade interpretation provided on GitHub (https://github.com/BarquistLab/AMR_prediction.git), use the "gene_presence_absence.Rtab" and "SPECIES.vcf" files as data input and "meta_checkm.csv" as metadata input, together with one of the "train_test_split_ANTIOBIOTIC.csv" files in the "train_test_split_tables" folder; specify the antibiotic using -A. To interpret the model performance, run: python performance_interpretation.py TS2.tsv
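To check the inputs before running the repository scripts, it can help to see how a presence/absence table and a train/test split table fit together. The following is a minimal sketch, not the authors' pipeline. It assumes the .Rtab file is tab-separated with genes in rows and samples in columns (the layout commonly written by pan-genome tools), and that the split table has a sample identifier column and a column marking each sample as "train" or "test". The column names `sample` and `set`, the helper names, and the toy data are all hypothetical; check the actual headers in the downloaded files.

```python
import csv
import io


def read_rtab(handle):
    """Parse a presence/absence table into {sample: {gene: 0 or 1}}.

    Assumed layout: tab-separated, first column holds gene names,
    remaining columns are samples.
    """
    reader = csv.reader(handle, delimiter="\t")
    samples = next(reader)[1:]  # header row minus the gene-name column
    features = {sample: {} for sample in samples}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene = row[0]
        for sample, value in zip(samples, row[1:]):
            features[sample][gene] = int(value)
    return features


def read_split(handle, id_col="sample", set_col="set"):
    """Read a split table into {sample: "train" or "test"}.

    The column names are assumptions, not taken from the dataset.
    """
    return {row[id_col]: row[set_col] for row in csv.DictReader(handle)}


def partition(features, split):
    """Divide the feature table according to the split assignment."""
    train = {s: f for s, f in features.items() if split.get(s) == "train"}
    test = {s: f for s, f in features.items() if split.get(s) == "test"}
    return train, test


# Toy stand-ins for "gene_presence_absence.Rtab" and one split table.
rtab_text = "Gene\tiso1\tiso2\tiso3\ngeneA\t1\t0\t1\ngeneB\t0\t1\t1\n"
split_text = "sample,set\niso1,train\niso2,test\niso3,train\n"

features = read_rtab(io.StringIO(rtab_text))
split = read_split(io.StringIO(split_text))
train, test = partition(features, split)
print(len(train), "training samples;", len(test), "test samples")
```

With real files, replace the `io.StringIO` objects with handles from `open(path, newline="")`. Samples absent from the split table are dropped from both partitions, which makes mismatched identifiers easy to spot by comparing counts.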