Data for: Local Search for Constrained Graph Clustering in Biological Networks

Published: 24-09-2020 | Version 2 | DOI: 10.17632/572nyx5rbs.2
Contributor:
Duy Hoang Tran

Description

This work addresses a constrained graph clustering problem with applications to Protein-Protein Interaction (PPI) networks in cancer research. The method takes a subgraph of the PPI network as input and facilitates the identification of pathways in PPI networks, an important aid in identifying cancer driver genes from gene prioritization lists.

The interaction graph G_i = (V, E_i) is an unweighted graph, where V is a set of genes and E_i contains the interactions between pairs of genes. Two weighted graphs express which pairs of genes must or cannot be assigned to the same pathway: the cannot-link graph G_c = (V, E_c) and the must-link graph G_m = (V, E_m). Each edge in E_c carries the penalty score for a pair of genes that should not be in the same pathway, while each edge in E_m carries the penalty for a pair that must belong to the same pathway. The goal is to partition V into k groups such that each group induces a connected subgraph of the interaction graph and the total penalty of violated cannot-link and must-link constraints is minimized.

There are three sets of instances: a small-instance set (n <= 300), a medium-instance set (n <= 2000), and a large-instance set (n > 10000). The small-instance set includes 54 real problem instances extracted from the HINT+HI2012 PPI network, covering six graph sizes: 50, 75, 100, 150, 200, and 300 nodes. The medium-instance set contains 60 synthetic problem instances with n in {500, 750, 1000, 1250, 1500, 2000}. Each interaction network of size n is formed from the human physical protein-protein interaction network [1] by taking n random connected nodes and transferring all incident edges. The large-instance set includes 102 synthetic problem instances built on original graphs from a collection of species-specific protein-protein association networks [2], with sizes from 10,000 to 40,000 nodes.

[1] R. Rossi, N. Ahmed, The Network Data Repository with Interactive Graph Analytics and Visualization, in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[2] D. Szklarczyk, J. H. Morris, H. Cook, M. Kuhn, S. Wyder, M. Simonovic, A. Santos, N. T. Doncheva, A. Roth, P. Bork, et al., The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible, Nucleic Acids Research (2016).
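The objective and the connectivity requirement stated in the description can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not reflect the file format of this dataset. The function names, the dictionary-based representation of E_c and E_m, and the toy graph are all assumptions made for illustration. It also assumes a cannot-link penalty is incurred when both genes share a group, and a must-link penalty when they are separated.

```python
from collections import defaultdict, deque


def violation_penalty(cluster_of, cannot_link, must_link):
    """Sum the weights of violated constraints (hypothetical helper).

    cluster_of:  dict mapping each gene to its group label
    cannot_link: dict mapping (u, v) pairs to the E_c penalty
    must_link:   dict mapping (u, v) pairs to the E_m penalty
    """
    total = 0.0
    for (u, v), weight in cannot_link.items():
        if cluster_of[u] == cluster_of[v]:  # pair placed together: violation
            total += weight
    for (u, v), weight in must_link.items():
        if cluster_of[u] != cluster_of[v]:  # pair separated: violation
            total += weight
    return total


def clusters_are_connected(cluster_of, interaction_edges):
    """Check that every group induces a connected subgraph of G_i."""
    # Keep only interaction edges whose endpoints share a group.
    adjacency = defaultdict(set)
    for u, v in interaction_edges:
        if cluster_of[u] == cluster_of[v]:
            adjacency[u].add(v)
            adjacency[v].add(u)

    members = defaultdict(set)
    for gene, label in cluster_of.items():
        members[label].add(gene)

    # Breadth-first search inside each group must reach all of its members.
    for genes in members.values():
        start = next(iter(genes))
        seen = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for neighbour in adjacency[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        if seen != genes:
            return False
    return True


# Toy instance (invented for illustration): a path a - b - c - d.
interaction_edges = [("a", "b"), ("b", "c"), ("c", "d")]
cannot_link = {("a", "b"): 2.0}
must_link = {("a", "d"): 3.0}
clustering = {"a": 0, "b": 0, "c": 1, "d": 1}  # k = 2 groups

print(clusters_are_connected(clustering, interaction_edges))
print(violation_penalty(clustering, cannot_link, must_link))
```

In this toy instance both groups are connected, and both constraints are violated, so the penalty is the sum of the two weights. A local search would compare candidate partitions by this penalty and discard any partition that fails the connectivity check.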

Files