QSAR datasets - Meta-QSAR

Published: 30-10-2020 | Version 1 | DOI: 10.17632/spwgrcnjdg.1
Contributor:
Ivan Olier

Description

We extracted 2,219 protein targets from ChEMBL, each with a varying number of drug-like chemical compounds, ranging from 30 to about 6,000. Each target yields one dataset with as many examples as it has compounds. The datasets were originally used in Olier et al., "Meta-QSAR: a large-scale application of meta-learning to drug design and discovery", Machine Learning, 2018, 107 (1), 285-311. Chemical compounds were described using a standard fingerprint representation (the most commonly used representation in QSAR learning), in which the presence or absence of a particular molecular substructure in a molecule (e.g. methyl group, benzene ring) is indicated by a Boolean variable. Specifically, we used RDKit to calculate the 1024-bit FCFP4 fingerprint, which is one of the extended-connectivity fingerprints for molecular characterisation (Rogers and Hahn, 2010). Each dataset therefore consists of 1,024 binary input variables, one for each fingerprint bit, and one floating-point output variable representing the activity of the compound against the target. Activity is measured as IC50, the concentration of the compound required to inhibit the target's activity by 50%. The response values were normalised by taking the negative logarithm of the IC50 concentration (pXC50), so that more potent compounds have larger values.
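The normalisation of the response variable can be illustrated with a minimal sketch. It assumes a base-10 logarithm and a concentration expressed in nanomolar and converted to mol/L, which is the usual convention for pIC50-style values but is not stated in this record. The function name `pxc50_from_nanomolar` is hypothetical and is not part of the dataset or of any library.

```python
import math


def pxc50_from_nanomolar(ic50_nm: float) -> float:
    """Convert an IC50 given in nanomolar to a pXC50-style value.

    Assumption (not stated in the dataset record): pXC50 = -log10(IC50 in mol/L).
    Since 1 nM = 1e-9 mol/L, this equals 9 - log10(IC50 in nM).
    """
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration")
    return 9.0 - math.log10(ic50_nm)


if __name__ == "__main__":
    # A lower IC50 (more potent compound) gives a higher pXC50.
    for ic50 in (1.0, 100.0, 10000.0):
        print(f"IC50 = {ic50:g} nM -> pXC50 = {pxc50_from_nanomolar(ic50):.2f}")
```

Under these assumptions, a tenfold drop in IC50 raises the value by exactly one unit, which is why the log-transformed response is better suited as a regression target than the raw concentration.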
