A Consensus of In-silico Sequence-based Modeling Techniques for Compound-Viral Protein Activity Prediction for SARS-COV-2

Published: 03-11-2020 | Version 3 | DOI: 10.17632/8rrwnbcgmx.3
Raghvendra Mall


Here we provide the datasets used for training and testing the end-to-end supervised deep learning models, as well as the datasets in which compounds and proteins are encoded as vector representations and passed to supervised state-of-the-art machine learning models (XGBoost, RF, SVM). We also provide the full list of viral proteins with their sequences used for the protein autoencoder, along with the list of SMILES representations of compounds used for the compound autoencoder. Furthermore, we provide pickle files of the data obtained from NCBI assays and of the compound-viral protein interactions downloaded through ChEMBL, together with the compound-viral protein interactions that remain after filtering from both NCBI and ChEMBL. Also included are the list of compounds tested against the three main proteases of coronavirus and the three main proteases of SARS-COV-2 as a fasta file. Finally, we provide all the test files associated with SARS-COV-2 viral proteins, for both the end-to-end deep learning models and the vector representation based supervised machine learning models.
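
Since the description mentions pickle files and a fasta file, the following is a minimal Python sketch of how such files might be read using only the standard library. It is an illustration under stated assumptions, not the authors' loading code: the helper names `parse_fasta` and `load_pickle` are hypothetical, no real file names from the dataset are used, and the internal layout of the pickle files is not documented here. If the pickles contain objects from third-party libraries (for example pandas), those libraries would need to be installed before loading. Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.

```python
import io
import pickle
from typing import Any, Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} dictionary.

    Hypothetical helper for illustration; multi-line sequences are joined.
    """
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = ""
        elif header is not None:
            records[header] += line  # append sequence fragment
    return records


def load_pickle(path: str) -> Any:
    """Load a pickle file from a trusted source (hypothetical helper)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# In-memory demonstration with made-up records (not data from the dataset).
example = io.StringIO(">protease_A\nMKV\nLLA\n>protease_B\nGGS\n")
proteins = parse_fasta(example)
for name, sequence in proteins.items():
    print(name, len(sequence))
```

To read an actual file, one could pass an open file handle to `parse_fasta`, for example `with open(path) as fh: proteins = parse_fasta(fh)`, where `path` points to the downloaded fasta file.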


Steps to reproduce

Follow the README of the repository https://github.com/raghvendra5688/Drug-Repurposing/tree/master/data for details on reproducing the datasets.