Supporting Information: Integrated Machine Learning, QSAR, SHAP, and Molecular Dynamics Pipeline for Peptide Activity Analysis

Published: 7 April 2026 | Version 1 | DOI: 10.17632/3yhgz8bdsv.1
Contributor: Zhibin Yang

Description

This repository contains the comprehensive computational workflow, source code, input datasets, and output results for the identification, evaluation, and modeling of active peptides. The analytical pipeline integrates multi-criteria decision-making (MCDM), Quantitative Structure-Activity Relationship (QSAR) machine learning modeling, SHAP-based interpretability, and Molecular Dynamics (MD) simulation trajectory analysis.

Steps to reproduce

The repository is systematically organized into 8 sequential folders. With the exception of the eighth, each folder is self-contained with Code, Input data, and Output data.

1. Visualization of peptides identification and functional annotation/: R scripts for preprocessing raw peptide mass spectrometry data and generating distribution density plots and composition donut charts.
2. MCDM Analysis/: Scripts for Multi-Criteria Decision Making using TOPSIS and Grey Relational Analysis, coupled with Entropy Weight Methods to rank peptide candidates.
3. Exploratory Data Analysis/: Data consolidation, correlation matrices, sequence conservation logos, and positional amino acid enrichment analysis (Fisher's exact test & Kruskal-Wallis).
4. Feature Engineering/: Python scripts for parsing PLIP (Protein-Ligand Interaction Profiler) XML reports and R pipelines to construct unified mechanistic and comprehensive feature pools (including sequence motifs via FIMO).
5. QSAR Modeling/: Machine learning pipelines using Random Forest (via caret and ranger) to predict Luciferase and IL-6 mRNA inhibition. Includes Recursive Feature Elimination (RFE), hyperparameter tuning, Y-randomization, and Applicability Domain (Williams Plot) evaluation. The QSAR models were developed as internally validated, hypothesis-generating tools for SAR interpretation, rather than as general-purpose predictive models.
6. SHAP Analysis/: Model interpretability analysis using SHAP values (via fastshap and shapviz) to extract global feature importance and generate local waterfall dependency plots.
7. Statistical Visualization of MD/: Post-processing and statistical visualization of Molecular Dynamics (MD) trajectories (GROMACS/MMPBSA formats), including RMSD, RMSF, Rg, SASA, H-bond occupancy, and dynamic hydrophobic distance calculations.
8. PDB files of initial docked conformation and representative MD simulation snapshot/

The pipeline requires R (v4.4.3 or higher) and Python (v3.9 or higher).
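As an illustration of the ranking idea behind folder 2 (MCDM Analysis), the following is a minimal sketch of TOPSIS with entropy-derived weights. It is not the repository's own script: the function names, the two criteria, and the three peptide rows are hypothetical, and it assumes all criterion values are non-negative and that at least one criterion varies across candidates.

```python
import math

def entropy_weights(matrix):
    """Entropy Weight Method: criteria whose values vary more get larger weights.

    Assumes non-negative values, positive column sums, and at least two rows.
    """
    n, m = len(matrix), len(matrix[0])
    k = 1.0 / math.log(n)
    divergences = []
    for j in range(m):
        column = [row[j] for row in matrix]
        total = sum(column)
        entropy = 0.0
        for value in column:
            p = value / total
            if p > 0:  # 0 * log(0) is treated as 0
                entropy -= k * p * math.log(p)
        divergences.append(1.0 - entropy)
    total_divergence = sum(divergences)  # assumed non-zero (not all columns constant)
    return [d / total_divergence for d in divergences]

def topsis(matrix, weights, benefit):
    """Return TOPSIS closeness scores in [0, 1]; higher means closer to the ideal.

    benefit[j] is True if larger values of criterion j are better, False if smaller.
    """
    m = len(weights)
    # Vector-normalize each column, then apply the weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(m)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(m)] for row in matrix]
    ideal = [max(r[j] for r in weighted) if benefit[j] else min(r[j] for r in weighted)
             for j in range(m)]
    anti_ideal = [min(r[j] for r in weighted) if benefit[j] else max(r[j] for r in weighted)
                  for j in range(m)]
    scores = []
    for row in weighted:
        d_pos = math.dist(row, ideal)       # distance to the ideal solution
        d_neg = math.dist(row, anti_ideal)  # distance to the anti-ideal solution
        scores.append(d_neg / (d_pos + d_neg))
    return scores

# Hypothetical data: columns are (inhibition %, toxicity score).
names = ["P1", "P2", "P3"]
data = [[80.0, 0.2], [60.0, 0.5], [40.0, 0.9]]
benefit = [True, False]  # higher inhibition is better, lower toxicity is better

weights = entropy_weights(data)
scores = topsis(data, weights, benefit)
ranking = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy example P1 is best on both criteria, so it coincides with the ideal solution and ranks first. Real candidate tables would typically have more criteria, and the direction of each criterion (benefit or cost) has to be set by the analyst.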
To fully reproduce the analysis:

1. Clone / Download this repository to your local machine.
2. Path Configuration: The original R/Python scripts may contain specific local path configurations (e.g., setwd("...")). Please adjust the working directories in the scripts to match your local folder structure.
3. Execution Order: It is highly recommended to run the folders in numerical order (1 to 7), as subsequent steps (like feature engineering and QSAR) depend on the consolidated CSV files generated in earlier steps. Folder 8 is used for PyMOL visualization.
4. Hardware Requirements: The Random Forest tuning and Y-randomization steps utilize parallel processing (doParallel). A multi-core CPU is recommended.

For detailed information, please refer to the README file.
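Before running anything, it can help to confirm that the downloaded folders are all present and to list them in the recommended order. The sketch below is a hypothetical helper, not part of the repository; it assumes only that each top-level folder name begins with its number followed by a period, as in the listing above.

```python
import re
from pathlib import Path

def numbered_folders(root):
    """Return sub-folders named like '3. Exploratory Data Analysis', sorted by number."""
    found = []
    for entry in Path(root).iterdir():
        match = re.match(r"(\d+)\.", entry.name)
        if entry.is_dir() and match:
            found.append((int(match.group(1)), entry))
    # Sort numerically so that a folder numbered 10 would come after 2, not before it.
    return [path for _, path in sorted(found, key=lambda item: item[0])]

if __name__ == "__main__":
    # Run from the repository root; prints the folders in execution order.
    for folder in numbered_folders("."):
        print(folder.name)
```

Building file paths from the repository root in this way, instead of a hard-coded working directory, is one option for the path adjustment described in step 2.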

Categories

Machine Learning, Quantitative Structure-Activity Relationship, Molecular Dynamics, Venom Peptide, Omics, Virtual Screening
