Dataset for QSAR Modeling of DPP-4 Inhibitors

Published: 2 November 2019 | Version 2 | DOI: 10.17632/4sw5hr2yz7.2


The data used for QSAR modeling of DPP-4 inhibitors with machine learning methods comprise raw data from ChEMBL, initially processed data, and the data partitions used for modeling, which are divided into three parts: training data, validation data (internal validation), and test data (external validation). This dataset also contains an explanation of the methods and tools used for modeling, and the pIC50 values predicted by the best model (for the external validation data).
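For readers unfamiliar with the pIC50 scale, the following is a minimal sketch of the standard conversion, pIC50 = -log10(IC50 in mol/L), which for values in nanomolar becomes 9 - log10(IC50 in nM). The function name and the assumption that IC50 values are expressed in nM are illustrative and are not taken from the dataset files.

```python
import math


def pic50_from_nm(ic50_nm: float) -> float:
    """Convert an IC50 given in nanomolar to pIC50.

    pIC50 = -log10(IC50 in mol/L) = 9 - log10(IC50 in nM).
    The nM unit is an assumption; check the unit column of the raw data.
    """
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive number")
    return 9.0 - math.log10(ic50_nm)


# A 1000 nM (1 micromolar) compound has pIC50 = 6.
example_pic50 = pic50_from_nm(1000.0)
# The 50 nM activity cutoff corresponds to a pIC50 of about 7.30.
cutoff_pic50 = pic50_from_nm(50.0)
```

Higher pIC50 means higher potency, so compounds with IC50 below 50 nM are those with pIC50 above roughly 7.30.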


Steps to reproduce

This research used the following hardware: an Intel Pentium Core i7 3.07 GHz 64-bit PC with 24 GB RAM. The software used in this study was KNIME version 3.7.1. QSAR modeling began with preprocessing, after which the features were calculated. Feature selection was then performed to obtain the best features for modeling. Feature reduction with Random Forest produced better features than the other features considered in this modeling. The data were partitioned into 80% for cross-validation and 20% for testing. Based on the cross-validation and external validation results, the best model was SVR, compared with the Deep Learning, Random Forest, XGBoost tree ensemble, and Multiple Linear Regression models. In addition, we searched the DPP-4 database for potential fragments that occur in many compounds with IC50 activity below 50 nM and in only a few compounds with activity above 50 nM. The resulting fragments can serve as markers of potential compounds in screening results and can support the development of new compounds.
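The original workflow was built in KNIME, so the following is not the authors' implementation. It is a minimal standard-library sketch of two of the steps described above: an 80%/20% partition and a count of how often each fragment appears in compounds below versus at or above the 50 nM cutoff. The record layout, the compound IDs, the fragment labels, and both function names are hypothetical, and the toy values are made up purely for illustration.

```python
import random
from collections import Counter


def split_80_20(records, seed=0):
    """Shuffle a copy of the records and return (modeling_set, test_set)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(round(0.8 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def fragment_counts(records, threshold_nm=50.0):
    """Map each fragment to (count in potent compounds, count in weak compounds).

    Potent means IC50 < threshold_nm; a compound exactly at the threshold
    is counted as weak (an assumption of this sketch).
    """
    potent, weak = Counter(), Counter()
    for rec in records:
        bucket = potent if rec["ic50_nm"] < threshold_nm else weak
        bucket.update(set(rec["fragments"]))  # count each fragment once per compound
    return {frag: (potent[frag], weak[frag]) for frag in set(potent) | set(weak)}


# Toy, made-up records (not taken from the dataset).
records = [
    {"id": "cpd1", "ic50_nm": 5.0, "fragments": {"fragA", "fragB"}},
    {"id": "cpd2", "ic50_nm": 12.0, "fragments": {"fragA"}},
    {"id": "cpd3", "ic50_nm": 30.0, "fragments": {"fragA", "fragC"}},
    {"id": "cpd4", "ic50_nm": 400.0, "fragments": {"fragB"}},
    {"id": "cpd5", "ic50_nm": 900.0, "fragments": {"fragB", "fragC"}},
]

modeling_set, test_set = split_80_20(records, seed=42)
counts = fragment_counts(records)
```

In this toy example, fragA is found in three potent compounds and no weak ones, which is the kind of pattern the fragment search looks for. In practice the fragments would come from a cheminformatics toolkit rather than from hand-written labels.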


Universitas Indonesia Fakultas Farmasi


Machine Learning, Support Vector Machine, Quantitative Structure-Activity Relationship, Deep Learning