Biased sampling confounds machine learning prediction of antimicrobial resistance

Name: Biased sampling confounds machine learning prediction of antimicrobial resistance
Creator: Yanying Yu
Published: 2025-01-06T18:17:32.614Z
Keywords: Population Structure, Machine Learning, Antimicrobial Resistance, Antibiotic Resistance

Yu, Yanying; Wheeler, Nicole; Barquist, Lars

doi:10.17632/zs2mbjv7dn.1

Published: 6 January 2025 | Version 1 | DOI: 10.17632/zs2mbjv7dn.1

Contributors:

Yanying Yu, Nicole Wheeler, Lars Barquist

Description

Antimicrobial resistance (AMR) poses a growing threat to human health. Increasingly, genome sequencing is being applied for surveillance of bacterial pathogens, producing a wealth of data to train machine learning (ML) applications to predict AMR and identify resistance determinants. However, bacterial populations are highly structured and sampling is biased towards human disease isolates, meaning samples and derived features are not independent. This is rarely considered in applications of ML to AMR. Here, we demonstrate the confounding effects of sample structure by collecting over 24,000 whole genome sequences and AMR phenotypes from five diverse pathogens and constructing realistic pathological training data where resistance is confounded with phylogeny. We show resulting ML models perform poorly, and increasing the training sample size fails to rescue performance. A comprehensive analysis of 6,740 models identifies species- and drug-specific effects on model accuracy. We provide concrete recommendations for evaluating future ML approaches to AMR.

Steps to reproduce

This dataset stores the machine learning results and the (large) files required to train the models. To run the scripts for schemes A and B and for individual clade interpretation, provided on GitHub (https://github.com/BarquistLab/AMR_prediction.git), use the "gene_presence_absence.Rtab" and "SPECIES.vcf" files as data input, "meta_checkm.csv" as metadata input, and one of the "train_test_split_ANTIBIOTIC.csv" tables in the "train_test_split_tables" folder, and specify the antibiotic using -A. To interpret model performance, run: python performance_interpretation.py TS2.tsv
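As a rough illustration of how the gene presence/absence input might be inspected before running the repository scripts, the sketch below reads a tab-separated .Rtab table into per-sample feature vectors and subsets them by sample ID. It assumes the file follows the common Roary-style layout (one row per gene, one column per sample, gene name in the first column); this is not confirmed by the description above. The function names load_rtab and select, the isolate IDs, and the in-memory demo table are hypothetical. In practice the train and test IDs would come from one of the train/test split tables, whose column layout is not described here, so it is not parsed in this sketch. The scripts in the GitHub repository may handle these files differently.

```python
import csv
import io


def load_rtab(handle):
    """Read a Roary-style .Rtab table (assumed layout: genes in rows, samples in columns)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # first column is assumed to hold the gene name
    genes = []
    features = {sample: [] for sample in samples}
    for row in reader:
        if not row:
            continue  # skip blank lines
        genes.append(row[0])
        for sample, value in zip(samples, row[1:]):
            features[sample].append(int(value))
    return genes, features


def select(features, sample_ids):
    """Keep only the requested samples, in the order given by sample_ids."""
    return {s: features[s] for s in sample_ids if s in features}


# Hypothetical in-memory stand-in for gene_presence_absence.Rtab.
demo = "Gene\tiso1\tiso2\tiso3\ngeneA\t1\t0\t1\ngeneB\t0\t0\t1\n"
genes, features = load_rtab(io.StringIO(demo))

# Hypothetical ID lists; real ones would be taken from a train/test split table.
train = select(features, ["iso1", "iso2"])
test = select(features, ["iso3"])
```

With a real file, the same function could be called as load_rtab(open("gene_presence_absence.Rtab", newline="")), and the resulting vectors compared against the sample IDs in the metadata table to check that every isolate in the split has features.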
