Composite Dataset of Input and Output Files from Complex Similarity Network Analysis of Secreted Cysteine-Rich peptides/proteins Without Annotation (SCRs-WA)
Description
This dataset contains a composite collection of bioactive peptide sequences and Complex Similarity Network (CSN) analysis outputs, designed to explore the functional relationships of 1,872 Secreted Cysteine-Rich peptides/proteins Without Annotation (SCRs-WA). The dataset integrates eight peptide classes, including antimicrobial peptides (AMPs), defensins, venoms/toxins, and non-AMP controls, to establish a reference chemical space for functional inference. It includes both input sequence data (FASTA format) and CSN-derived output files, which facilitate the visualization and clustering of peptide sequences based on structural and functional similarities:

1- FileSM1: FileSM1_12449_All_8_datasets.fasta
Content: A FASTA file containing 12,449 peptide sequences across eight datasets:
(i) Low-toxicity antimicrobial peptides (AMPs)
(ii) Defensins
(iii) Animal venoms and toxins
(iv) Cytotoxic peptides
(v) Haemolytic peptides
(vi) Non-AMPs (negative controls)
(vii) Cnidarian toxin candidates from S. savaglia
(viii) Secreted Cysteine-Rich ORFs Without Annotation (mSCRs-WA)
Usage:
- Serves as the primary input dataset for complex similarity network (CSN) analysis.
- Enables homology searches, functional annotation, and comparative analyses.

Output Files from CSN Analysis

2- FileSM2: FileSM2_HSPN_Topology_GraphML.zip
Content: A compressed ZIP file containing GraphML representations of the Half-Space Proximal Network (HSPN):
HSPN_clusters_projection.graphml: Clustered projection of peptide connectivity based on similarity metrics.
HSPN_peptide_classes_projection.graphml: Projection of peptide classes (AMPs, toxins, defensins, etc.), highlighting their network positioning.
Visualization: Can be opened in Gephi v0.10 or any GraphML-compatible tool. Nodes represent peptide sequences, edges indicate functional similarity, and clusters reflect shared bioactivity profiles.
Usage:
- Facilitates visual exploration of sequence relationships.
- Enables functional annotation transfer by identifying clusters with known bioactive peptides.

3- FileSM3: FileSM3_Clusters_Composition_Analysis.xlsx
Content: A spreadsheet detailing cluster composition in the HSPN analysis, including:
Cluster ID and size
Distribution of peptides across eight datasets
Functional annotation insights for each cluster
Usage:
- Helps identify key functional groups within the CSN framework.
- Provides quantitative insights into peptide distribution and classification.

4- FileSM4: FileSM4_HSPN_Connections_Analysis.xlsx
Content: A spreadsheet detailing functional connections between peptides, including:
Pairwise similarity scores
Network centrality measures (e.g., harmonic centrality, degree centrality)
Annotations of linked sequences
Usage:
- Supports similarity-based functional inference.
- Helps track peptide relationships and connectivity patterns within the network.
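For readers who want to inspect the GraphML files outside Gephi, the following is a minimal sketch of how degree and harmonic centrality could be recomputed using only the Python standard library. It assumes the files use the standard GraphML XML namespace, and it treats edges as undirected and unweighted, which may differ from the settings used to produce FileSM4. The function names `adjacency_from_graphml` and `harmonic_centrality`, and the four-node toy graph, are hypothetical and are not part of the dataset.

```python
import xml.etree.ElementTree as ET
from collections import deque

# Standard GraphML namespace (assumed to be the one used in the exported files).
GRAPHML_NS = "{http://graphml.graphdrawing.org/xmlns}"

def adjacency_from_graphml(root):
    """Build an undirected adjacency dict {node_id: set(neighbour_ids)}."""
    adjacency = {}
    for node in root.iter(GRAPHML_NS + "node"):
        adjacency.setdefault(node.get("id"), set())
    for edge in root.iter(GRAPHML_NS + "edge"):
        src, dst = edge.get("source"), edge.get("target")
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)
    return adjacency

def harmonic_centrality(adjacency, start):
    """Sum of 1/d(start, v) over reachable nodes v, via unweighted BFS."""
    dist = {start: 0}
    queue = deque([start])
    total = 0.0
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                total += 1.0 / dist[v]
                queue.append(v)
    return total

# Hypothetical toy graph: a chain a-b-c plus an isolated node d.
TOY_GRAPHML = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph edgedefault="undirected">
    <node id="a"/><node id="b"/><node id="c"/><node id="d"/>
    <edge source="a" target="b"/>
    <edge source="b" target="c"/>
  </graph>
</graphml>"""

adjacency = adjacency_from_graphml(ET.fromstring(TOY_GRAPHML))
degree = {node: len(neighbours) for node, neighbours in adjacency.items()}
harmonic = {node: harmonic_centrality(adjacency, node) for node in adjacency}
```

For one of the real files, the root element could be obtained with `ET.parse("HSPN_clusters_projection.graphml").getroot()` after unzipping FileSM2. Node and edge attributes such as cluster labels or similarity weights are stored in `data` elements whose keys are not documented here, so they would need to be checked in the file before use.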
Files
Steps to reproduce
A curated dataset of 12,449 representative peptides was assembled to investigate the mature peptides derived from a subset of 1,872 Secreted Cysteine-Rich ORFs Without Annotation (SCR-WA), along with 248 cnidarian toxin candidates from Savalia savaglia. The dataset comprises eight subsets. Subsets (i) to (vi) are well-characterized peptide classes: (i) low-toxicity antimicrobial peptides (AMPs), (ii) defensins, (iii) animal venoms and toxins, (iv) cytotoxic peptides, (v) haemolytic peptides, and (vi) non-AMPs (negative controls). The remaining subsets, (vii) and (viii), consist of the putative cnidarian toxins and the SCR-WA mature peptides/proteins currently under examination.
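As a first check when reproducing the analysis, the input FASTA file can be read and its record count compared with the 12,449 sequences stated above. The following is a minimal sketch using only the Python standard library. The helper names `read_fasta` and `cysteine_fraction` and the two toy records are hypothetical, and no assumption is made about how the real headers encode subset membership.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may wrap over several lines
    if header is not None:
        yield header, "".join(chunks)

def cysteine_fraction(sequence):
    """Fraction of residues that are cysteine (C); 0.0 for an empty sequence."""
    seq = sequence.upper()
    return seq.count("C") / len(seq) if seq else 0.0

# Hypothetical toy records, not taken from the dataset.
toy_lines = """>pep1 hypothetical
ACCKC
GC
>pep2 hypothetical
MKLV
""".splitlines()

records = list(read_fasta(toy_lines))
fractions = {header: cysteine_fraction(seq) for header, seq in records}
```

With the real file, passing an open handle for `FileSM1_12449_All_8_datasets.fasta` to `read_fasta` and counting the yielded records should give 12,449 if the file matches this description. The cysteine fraction is included only as a simple descriptor relevant to cysteine-rich peptides. It is not the similarity metric used in the CSN analysis.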