Data for: Variant effect predictions capture some aspects of deep mutational scanning experiments

Published: 14 Feb 2020 | Version 1 | DOI: 10.17632/2rwrkp7mfk.1
Contributor(s):

Description of this data

Primary analysis files for the bioRxiv manuscript with id 2019/859603 (https://www.biorxiv.org/content/10.1101/859603v1), which evaluates how well common variant effect prediction methods capture the effects determined by deep mutational scanning experiments.

'data' contains the deep mutational scanning data in a parsed format. See the manuscript for the original data sources. These were processed with parseRawDatasets.py, followed by manual sequence mapping (resulting in the mapped_seqs.txt files), and then processed with parseScores.py to produce the .npz files.
'prediction_data' contains predictions from SIFT, PolyPhen-2, SNAP2 and Envision, parsed into .npz files. Additional folders hold dummy methods and files written while executing the scripts below.
'analysis' will contain most of the output files.
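
The parsed .npz files can be opened with numpy. The array names stored inside them are not documented in this description, so the following is only a minimal sketch that lists whatever an archive contains. The demo file and its array names ("scores", "positions") are invented stand-ins, not the layout of the real files.

```python
import os
import tempfile

import numpy as np


def summarize_npz(path):
    """Return {array_name: (shape, dtype)} for every array stored in an .npz file."""
    summary = {}
    # np.load on an .npz archive returns a lazy, dict-like NpzFile.
    with np.load(path, allow_pickle=False) as archive:
        for name in archive.files:
            array = archive[name]
            summary[name] = (array.shape, str(array.dtype))
    return summary


# Stand-in archive so the sketch runs without the dataset; the array names
# below are hypothetical and chosen for illustration only.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_scores.npz")
np.savez(demo_path, scores=np.zeros((3, 20)), positions=np.arange(3))

for name, (shape, dtype) in summarize_npz(demo_path).items():
    print(name, shape, dtype)
```

To inspect a real file, pass its path to summarize_npz instead of demo_path. If an archive stores object arrays, np.load needs allow_pickle=True, which should only be used on files from a trusted source.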

See below for sample calls to reproduce e.g. Figure 1 from the paper. The scripts are written in Python3 and require, among others, numpy, pandas, scipy, sklearn, rpy2, svgutils and matplotlib.

For all scripts the --normalization-scheme flag describes how the experimental scores are processed to fit on the same scale of values. The scheme used for the final manuscript is 'wt0_del_scaled' for deleterious effect variants and 'wt0_ben_scaled' for beneficial effect variants.
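
The exact definition of the schemes is given in the manuscript and the scripts, not here. As a rough illustration of what such a scheme could look like, the sketch below assumes that 'wt0' means the wild-type score is shifted to 0, that 'del'/'ben' selects the effect direction, and that 'scaled' maps the strongest effect to 1. All three readings, the function name, and the convention that lower raw scores are more deleterious are assumptions.

```python
import numpy as np


def wt0_scaled_sketch(scores, wt_score, direction="del"):
    """Illustrative only: shift wild type to 0, keep one direction, scale to [0, 1].

    Assumes lower raw scores are more deleterious; the real scripts may differ.
    """
    shifted = np.asarray(scores, dtype=float) - wt_score
    if direction == "del":
        # Magnitude of the effect below wild type; other variants become 0.
        effect = np.clip(-shifted, 0.0, None)
    elif direction == "ben":
        # Magnitude of the effect above wild type; other variants become 0.
        effect = np.clip(shifted, 0.0, None)
    else:
        raise ValueError("direction must be 'del' or 'ben'")
    if effect.size == 0 or effect.max() == 0:
        return effect
    # Divide by the strongest effect so values fall between 0 and 1.
    return effect / effect.max()


raw = [1.0, 0.5, 0.0, 1.5]  # made-up scores with wild type at 1.0
print(wt0_scaled_sketch(raw, wt_score=1.0, direction="del"))
print(wt0_scaled_sketch(raw, wt_score=1.0, direction="ben"))
```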
For compareBinaryDMSToPredictions.py the --binarization-scheme flag describes how scores are binarized to neutral/effect. Possible values are the schemes outlined in the manuscript 'syn90', 'syn95' and 'syn99'.
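
The manuscript defines the binarization schemes. The sketch below shows one plausible reading, assuming that 'synNN' treats the central NN% of scores observed for synonymous variants as the neutral range and calls everything outside it an effect. That reading and both function names are assumptions, not the scripts' implementation.

```python
import numpy as np


def syn_interval(synonymous_scores, coverage=95.0):
    """Return the (lower, upper) bounds of the central `coverage` percent of scores."""
    tail = (100.0 - coverage) / 2.0
    values = np.asarray(synonymous_scores, dtype=float)
    lower, upper = np.percentile(values, [tail, 100.0 - tail])
    return float(lower), float(upper)


def binarize(scores, lower, upper):
    """True where a score falls outside the neutral interval (effect), else False."""
    scores = np.asarray(scores, dtype=float)
    return (scores < lower) | (scores > upper)


# Made-up synonymous scores spread evenly around 0.
synonymous = np.linspace(-1.0, 1.0, 201)
low, high = syn_interval(synonymous, coverage=95.0)
print(low, high)
print(binarize([-2.0, 0.0, 0.9, 3.0], low, high))
```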

Experiment data files

  • analysis
    • binary
  • data
    • Adkar2012_CcdB
    • Araya2012_YAP1
    • Brenan2016_MAPK1
    • Findlay2018_BRCA1
    • Firnberg2014_TEM1
    • Heredia2018_CCR5
    • Heredia2018_CXCR4
    • Hietpas2011_HSP90
    • Hietpas2013_HSP90
    • Jiang2013_HSP90
    • Kitzman2014_Gal4
    • Klesmith2015_LGK
    • Majithia2016_PPARG
    • Matreyek2018_PTEN
    • Matreyek2018_TPMT
    • RockahShmuel2015_MTH3
    • Romero2015_Bgl3
    • Roscoe2013_Ubiquitin
    • Roscoe2014_Ubiquitin
    • Sarkisyan2016_GFP
    • Starita2013_UBE4B
    • Starita2015_BRCA1
    • Stiffler2015_TEM1
    • Traxlmayr2012_IgG1_CH3
  • datasets
    • annotated
  • prediction_data
    • Adkar2012_CcdB
      • pph
      • pphbin
      • psiblast_raw
      • sift
      • snap2
    • Araya2012_YAP1
      • envision_db
      • pph
      • pphbin
      • psiblast_raw
      • sift
      • snap2
    • Brenan2016_MAPK1
      • envision_db
      • pph
      • pphbin
      • psiblast_raw
      • sift
      • snap2
    • Findlay2018_BRCA1
      • envision_db
      • pph
      • pphbin
      • psiblast_raw
      • sift
      • snap2
    • Firnberg2014_TEM1
      • psiblast_raw
      • snap2
    • Heredia2018_CCR5
      • envision_db
      • pph
      • pphbin
      • psiblast_raw
      • sift
      • snap2
    • Heredia2018_CXCR4
      • envision_db
      • pph
      • pphbin
      • psiblast_raw
      • sift
      • snap2
    • Hietpas2011_HSP90
      • pph
      • pphbin
      • psiblast_raw
      • sift
      • snap2
    • Hietpas2013_HSP90
      • pph
      • pphbin
      • psiblast_raw
      • sift
      • snap2
    • Jiang2013_HSP90
      • pph
      • pphbin
      • psiblast_raw
      • sift
      • snap2
    • Kitzman2014_Gal4
      • envision_db
      • pph
      • pphbin
      • psiblast_raw
      • sift
      • snap2
    • Klesmith2015_LGK
      • pph
      • pphbin
      • psiblast_raw
      • sift
      • snap2
    • Majithia2016_PPARG
      • envision_db
      • pph
      • pphbin
      • psiblast_raw
      • sift
      • snap2
    • Matreyek2018_PTEN
      • envision_db
      • pph
      • pphbin
      • psiblast_raw
      • sift
      • snap2
    • Matreyek2018_TPMT
      • envision_db
      • pph
      • pphbin
      • psiblast_raw
      • sift
      • snap2
    • RockahShmuel2015_MTH3
      • pph
      • pphbin
      • psiblast_raw
      • sift
      • snap2
    • Romero2015_Bgl3
      • pph
      • pphbin
      • psiblast_raw
      • sift
      • snap2
    • Roscoe2013_Ubiquitin
      • psiblast_raw
      • snap2
    • Roscoe2014_Ubiquitin
      • psiblast_raw
      • snap2
    • Sarkisyan2016_GFP
      • pph
      • pphbin
      • psiblast_raw
      • sift
      • snap2
    • Starita2013_UBE4B
      • envision_db
      • pph
      • pphbin
      • psiblast_raw
      • sift
      • snap2
    • Starita2015_BRCA1
      • pph
      • pphbin
      • psiblast_raw
      • sift
      • snap2
    • Stiffler2015_TEM1
      • pph
      • pphbin
      • psiblast_raw
      • sift
      • snap2
    • Traxlmayr2012_IgG1_CH3
      • pph
      • pphbin
      • psiblast_raw
      • sift
      • snap2

Steps to reproduce

To create the marginal plots (e.g. Figure 1) and Figure 2:
F:\girepos\mendeley_data\prediction_data\compareDMSToPredictions.py -p F:\girepos\dms-variant-analysis\data\ -pp F:\girepos\mendeley_data\prediction_data\ -fd main -f F:\girepos\mendeley_data\data\FixedDatasets.txt -n wt0_del_scaled -po F:\girepos\mendeley_data\analysis\ -v -pl -st

Use a different normalization scheme (-n wt0_ben_scaled) to create all data based on beneficial effect variants (Figure S4a-e). Running the script will also create plots for every single DMS dataset in the directories in 'prediction_data' (Figures S3 and S5).
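
The sample call uses absolute paths from the author's machine. A small helper like the one below could rebuild the same argument list for a local copy and for either normalization scheme. The flags are copied from the call above; the function names, the root layout, and the choice to point -p at the dataset's own data folder (the sample call points it at a separate dms-variant-analysis checkout) are assumptions.

```python
import os
import sys
from pathlib import Path


def as_dir(path):
    """Return the path as a string with a trailing separator, as in the sample call."""
    return os.path.join(str(path), "")


def build_compare_command(root, scheme="wt0_del_scaled"):
    """Build the argument list for compareDMSToPredictions.py under a local root."""
    root = Path(root)
    return [
        sys.executable,
        str(root / "prediction_data" / "compareDMSToPredictions.py"),
        "-p", as_dir(root / "data"),  # assumption: data folder of this dataset
        "-pp", as_dir(root / "prediction_data"),
        "-fd", "main",
        "-f", str(root / "data" / "FixedDatasets.txt"),
        "-n", scheme,
        "-po", as_dir(root / "analysis"),
        "-v", "-pl", "-st",
    ]


# Hypothetical local root; replace with the folder the dataset was unpacked to.
for scheme in ("wt0_del_scaled", "wt0_ben_scaled"):
    print(" ".join(build_compare_command("mendeley_data", scheme)))
```

The returned list can be passed to subprocess.run(command, check=True) once the paths point at a real copy of the dataset.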

To create ROC plots and calculate AUCs (e.g. Figure 4):
F:\girepos\mendeley_data\prediction_data\compareBinaryDMSToPredictions.py -pp F:\girepos\mendeley_data\prediction_data\ -p F:\girepos\mendeley_data\data\ -po F:\girepos\mendeley_data\analysis\binary\ -f F:\girepos\mendeley_data\data\FixedDatasets.txt -fd main_binary -n syn95_del -pl -st -v
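
For readers unfamiliar with the metric, the area under the ROC curve equals the probability that a randomly chosen effect variant receives a higher prediction score than a randomly chosen neutral variant, with ties counted as one half. The sketch below computes it directly from that definition on made-up labels and scores. It is not taken from the script, which may use a library routine such as scikit-learn's roc_auc_score instead.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (effect, neutral) pairs ranked correctly; ties count 0.5."""
    positives = [s for label, s in zip(labels, scores) if label]
    negatives = [s for label, s in zip(labels, scores) if not label]
    if not positives or not negatives:
        raise ValueError("need at least one effect and one neutral variant")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 = effect, 0 = neutral after binarization.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.4, 0.2]
print(roc_auc(labels, scores))  # 8 of 9 pairs are ordered correctly
```

The pairwise loop is quadratic and only meant to make the definition explicit; a rank-based implementation scales better for full datasets.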

Both scripts can be run with the -fbs flag for faster bootstrapping (for example, for quick test runs); however, results obtained from bootstrapping in that mode are not meaningful.
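
This description does not say what -fbs changes internally. Assuming it lowers the number of bootstrap resamples, the sketch below illustrates why such results should not be reported: a percentile confidence interval built from a handful of resamples depends heavily on chance. The function name, the input values, and the resample counts are invented for illustration.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples, seed=0, coverage=0.95):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    tail = (1.0 - coverage) / 2.0
    lower = means[int(tail * (n_resamples - 1))]
    upper = means[int((1.0 - tail) * (n_resamples - 1))]
    return lower, upper


values = [0.1, 0.4, 0.35, 0.8, 0.6, 0.2]  # made-up per-dataset metric values
for n_resamples in (10, 1000):
    for seed in (0, 1, 2):
        print(n_resamples, seed, bootstrap_mean_ci(values, n_resamples, seed))
```

With only 10 resamples the interval bounds typically move noticeably between seeds, while with 1000 they tend to stabilize.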

To analyze the agreement between deep mutational scanning experiments on the same protein as well as predictions of those (Figure 3):
F:\girepos\mendeley_data\data\analyzeExperimentalAgreement.py -p F:\girepos\mendeley_data\data\ -po F:\girepos\mendeley_data\analysis\ -n wt0_del_scaled -pl -v
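
How agreement is measured is defined in the manuscript and the script. As a rough sketch of the general idea, two experiments on the same protein can be compared on the variants they share. The example below uses a Pearson correlation, which is an assumption, as are the function name, the variant labels, and the scores.

```python
import numpy as np


def experimental_agreement(scores_a, scores_b):
    """Pearson correlation over variants present in both experiments.

    Each argument maps a variant label to its (normalized) score.
    Returns (correlation, number_of_shared_variants).
    """
    shared = sorted(set(scores_a) & set(scores_b))
    if len(shared) < 2:
        raise ValueError("need at least two shared variants")
    a = np.array([scores_a[v] for v in shared], dtype=float)
    b = np.array([scores_b[v] for v in shared], dtype=float)
    return float(np.corrcoef(a, b)[0, 1]), len(shared)


# Made-up scores for two hypothetical experiments on the same protein.
experiment_a = {"A10G": 0.1, "A10V": 0.5, "L25P": 0.9, "K30E": 0.3}
experiment_b = {"A10G": 0.2, "A10V": 1.0, "L25P": 1.8, "M1V": 0.4}

r, n_shared = experimental_agreement(experiment_a, experiment_b)
print(f"r = {r:.3f} over {n_shared} shared variants")
```

Restricting the comparison to shared variants matters because experiments on the same protein rarely cover exactly the same substitutions.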

    Cite this dataset

    R, Jonas (2020), “Data for: Variant effect predictions capture some aspects of deep mutational scanning experiments”, Mendeley Data, v1 http://dx.doi.org/10.17632/2rwrkp7mfk.1

Categories

Genetic Variation

Licence

CC BY 4.0

The files associated with this dataset are licensed under a Creative Commons Attribution 4.0 International licence.

What does this mean?
You can share, copy and modify this dataset so long as you give appropriate credit, provide a link to the CC BY license, and indicate if changes were made, but you may not do so in a way that suggests the rights holder has endorsed you or your use of the dataset. Note that further permission may be required for any content within the dataset that is identified as belonging to a third party.