Data for: Variant effect predictions capture some aspects of deep mutational scanning experiments

Published: 14-02-2020 | Version 1 | DOI: 10.17632/2rwrkp7mfk.1
Jonas Reeb


Primary analysis files for the bioRxiv manuscript with id 2019/859603, used to evaluate how common variant effect prediction methods capture the effects determined by deep mutational scanning experiments.

'data' contains the deep mutational scanning data in a parsed format. See the manuscript for the original data sources. These were parsed, followed by manual sequence mapping (resulting in the mapped_seqs.txt files), and then processed further to produce the .npz files. 'prediction_data' contains predictions from SIFT, PolyPhen-2, SNAP2 and Envision, parsed into .npz files. Additional folders hold dummy methods and files written while executing the scripts below. 'analysis' will contain most of the output files.

See below for sample calls to reproduce e.g. Figure 1 from the paper. The scripts are written in Python 3 and require, among others, numpy, pandas, scipy, sklearn, rpy2, svgutils and matplotlib.

For all scripts, the --normalization-scheme flag describes how the experimental scores are processed to fit on the same scale of values. The schemes used for the final manuscript are 'wt0_del_scaled' for deleterious effect variants and 'wt0_ben_scaled' for beneficial effect variants. The --binarization-scheme flag describes how scores are binarized to neutral/effect. Possible values are the schemes outlined in the manuscript: 'syn90', 'syn95' and 'syn99'.
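Since both the experimental data and the predictions are stored as .npz files, it can help to inspect an archive before running the scripts. The following is a minimal sketch using numpy only. The array names 'scores' and 'positions' and the helper summarize_npz are hypothetical and chosen for the demonstration; the actual array names inside the dataset's files are not documented here and should be read from the archive itself.

```python
import io

import numpy as np


def summarize_npz(source):
    """Return {array_name: (shape, dtype)} for every array in an .npz archive.

    `source` may be a file path or a binary file-like object.
    """
    # allow_pickle=False is the safe default; archives that store object
    # arrays would raise an error here and need explicit opt-in.
    with np.load(source, allow_pickle=False) as archive:
        return {
            name: (archive[name].shape, str(archive[name].dtype))
            for name in archive.files
        }


# Demonstration on an in-memory archive with HYPOTHETICAL array names.
buffer = io.BytesIO()
np.savez(buffer, scores=np.zeros((3, 20)), positions=np.arange(3))
buffer.seek(0)

summary = summarize_npz(buffer)
for name, (shape, dtype) in summary.items():
    print(name, shape, dtype)
```

To inspect a real file, pass its path instead of the in-memory buffer, for example a path to one of the .npz files under 'data'.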


Steps to reproduce

To create the marginal plots (e.g. Figure 1) and Figure 2:

F:\girepos\mendeley_data\prediction_data\ -p F:\girepos\dms-variant-analysis\data\ -pp F:\girepos\mendeley_data\prediction_data\ -fd main -f F:\girepos\mendeley_data\data\FixedDatasets.txt -n wt0_del_scaled -po F:\girepos\mendeley_data\analysis\ -v -pl -st

Use a different normalization scheme (-n wt0_ben_scaled) to create all data based on beneficial effect variants (Figure S4a-e). Running the script will also create plots for every single DMS dataset in the directories in 'prediction_data' (Figures S3 and S5).

To create ROC plots and calculate AUCs (e.g. Figure 4):

F:\girepos\mendeley_data\prediction_data\ -pp F:\girepos\mendeley_data\prediction_data\ -p F:\girepos\mendeley_data\data\ -po F:\girepos\mendeley_data\analysis\binary\ -f F:\girepos\mendeley_data\data\FixedDatasets.txt -fd main_binary -n syn95_del -pl -st -v

Both scripts can be run with the -fbs flag for faster bootstrapping (for example, for quick test runs); however, results obtained from bootstrapping will then not be meaningful.

To analyze the agreement between deep mutational scanning experiments on the same protein, as well as the predictions for them (Figure 3):

F:\girepos\mendeley_data\data\ -p F:\girepos\mendeley_data\data\ -po F:\girepos\mendeley_data\analysis\ -n wt0_del_scaled -pl -v
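To illustrate the kind of quantity the ROC step reports, the following is a minimal sketch of an AUC with a percentile bootstrap interval, using numpy only. It is not the authors' implementation: the function names, the number of resamples, the 95% percentile interval and the synthetic data are all assumptions made for this example, and the resampling details used by the actual scripts are not described here.

```python
import numpy as np


def roc_auc(labels, scores):
    """AUC as the probability that a random effect variant scores higher
    than a random neutral variant (ties count as 0.5)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("both classes are required to compute an AUC")
    # Compare every effect score against every neutral score.
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def bootstrap_auc(labels, scores, n_resamples=1000, seed=0):
    """Return (auc, low, high) with an ASSUMED 95% percentile interval."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    point = roc_auc(labels, scores)  # also validates that both classes exist
    rng = np.random.default_rng(seed)
    aucs = []
    while len(aucs) < n_resamples:
        # Resample variants with replacement.
        idx = rng.integers(0, labels.size, size=labels.size)
        if labels[idx].all() or not labels[idx].any():
            continue  # skip resamples that lost one of the two classes
        aucs.append(roc_auc(labels[idx], scores[idx]))
    low, high = np.percentile(aucs, [2.5, 97.5])
    return point, float(low), float(high)


# SYNTHETIC data: binary neutral/effect labels and noisy prediction scores.
rng = np.random.default_rng(42)
labels = rng.random(200) < 0.5
scores = labels + rng.normal(0.0, 1.0, size=200)

auc, low, high = bootstrap_auc(labels, scores, n_resamples=200)
print(f"AUC = {auc:.3f} (bootstrap interval {low:.3f} to {high:.3f})")
```

In this sketch a small n_resamples runs quickly but gives an unstable interval, which is consistent with the caveat above that results from quick test runs should not be interpreted.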