Synthetic disjoint and overlapping dynamic attributed networks with ground-truth information
Description
1. Synthetic network 1: We use the synthetic benchmark DANCer to create 12 datasets with ten timestamps each. The datasets are attributed graphs with undirected edges that can change over time, where nodes are grouped into densely connected sets that are relatively homogeneous with respect to their attributes. The networks start with 256 nodes, 5 communities, and 1000 edges and grow over time; across the built networks, the maximum values reached are 1868 nodes, ten communities, and 11905 edges.

2. Synthetic network 2: Graphs were built with 200 nodes and 20 snapshots. At initialization, the nodes are divided into two groups of 100 nodes each. Then, at a randomly selected time step, 40 % of the nodes are chosen to migrate to a new community in Dataset 1, and 80 % in Dataset 2. Given the memberships, the edges are drawn according to a stochastic block model, where nodes within the same community are connected with a probability of 0.3 and nodes in different communities are connected with a probability of 0.1.

3. Synthetic network 3: We generated ten timestamps with growing and shrinking communities. We designed three datasets with three groups of 80, 90, and 100 nodes. A degree-corrected stochastic block model generates the network structure. Three types of attributed networks were proposed, with different original link probabilities, to assess strongly assortative structures (Dataset 1), weakly assortative structures (Dataset 2), and disassortative structures (Dataset 3).

4. Overlapping synthetic networks: Using Greene's benchmark, we create eleven overlapping datasets with ten snapshots each. Each network has 250 nodes, with overlapping nodes ranging from 0 to 50 and overlapping memberships ranging from 2 to 25. Specific kinds of changes in the communities of a dynamic network were simulated:
- Birth and death: There are two birth events and two death events per time step. Dataset 1 has no overlapping nodes; Dataset 2 has five overlapping nodes and two overlapping memberships; Dataset 3 has ten overlapping nodes and two overlapping memberships; Dataset 4 has 25 overlapping nodes and two overlapping memberships; Dataset 5 has 50 overlapping nodes and two overlapping memberships; Dataset 6 has five overlapping nodes and five overlapping memberships; Dataset 7 has five overlapping nodes and ten overlapping memberships; Dataset 8 has five overlapping nodes and 25 overlapping memberships.
- Merge and split: There are two merge events and two split events per time step. Dataset 9 has five overlapping nodes and two overlapping memberships.
- Expansion and contraction: There are two expansion events and two contraction events per time step, with a rate of 0.1. Dataset 10 has five overlapping nodes and two overlapping memberships.
- Node switching between communities: Each node switches its community membership with a probability of 0.2. Dataset 11 has five overlapping nodes and two overlapping memberships.
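
To make the construction of Synthetic network 2 more concrete, here is a minimal Python sketch of a two-group stochastic block model with a single migration event. It is not the code used to build the datasets. The function names, the uniform choice of migrating nodes, and the choice to place all migrating nodes in one new third community are assumptions made for illustration only.

```python
import random
from itertools import combinations


def sbm_edges(labels, p_in=0.3, p_out=0.1, rng=random):
    """Draw undirected edges: p_in inside a community, p_out across communities."""
    edges = []
    for u, v in combinations(range(len(labels)), 2):
        p = p_in if labels[u] == labels[v] else p_out
        if rng.random() < p:
            edges.append((u, v))
    return edges


def simulate(n=200, steps=20, migrate_frac=0.4, seed=0):
    """Return the change time and a list of (labels, edges) snapshots."""
    rng = random.Random(seed)
    # Two initial groups of n/2 nodes each (100 and 100 for n = 200).
    labels = [0] * (n // 2) + [1] * (n - n // 2)
    # Randomly selected time step at which the migration happens.
    t_change = rng.randrange(1, steps)
    snapshots = []
    for t in range(steps):
        if t == t_change:
            # Assumption: movers are drawn uniformly and all join community 2.
            movers = rng.sample(range(n), round(migrate_frac * n))
            for u in movers:
                labels[u] = 2
        snapshots.append((list(labels), sbm_edges(labels, rng=rng)))
    return t_change, snapshots


# Dataset 1 uses 40 % migrating nodes; Dataset 2 would use migrate_frac=0.8.
t_change, snapshots = simulate(migrate_frac=0.4)
```

Each snapshot is drawn independently from the current memberships in this sketch; whether the original generator resamples all edges at every step is not stated in the description.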
Steps to reproduce
The datasets were used in the paper "Detecting disjoint and overlapping communities in temporal node-attributed networks". Synthetic network 1 was built with the synthetic benchmark proposed by Largeron et al. (2017), doi:10.1007/s10115-017-1028-2, with the code available at https://perso.univ-st-etienne.fr/largeron/DANC_Generator/. Synthetic network 2 was based on Sheikholeslami and Giannakis (2018), doi:10.1109/TSP.2018.2871383. Synthetic network 3 was based on Tang et al. (2020), doi:10.1007/s00180-019-00909-8, with the addition of time steps. Overlapping synthetic networks were based on Greene et al. (2010), doi:10.1109/ASONAM.2010.17, with the code available at http://mlg.ucd.ie/dynamic/.

About attributes: Attributes for each dataset were added as follows. For a network of $c$ communities, each community was assumed to have a strong correlation with $h$ binary attributes and a weak correlation with $h \cdot (c-1)$ binary attributes. We use $h = 1$ and $h = 3$ for our tests. The probability of a strong correlation (pin) was varied from 0.5 to 1, while the probability of a weak correlation (pout) was set to 0.05 or 0.1. For example, a folder named pin06pout005 indicates that the probability of having a strong correlation is 0.6 and the probability of having a weak correlation is 0.05.

About files: Each dataset contains ten folders, each with a different seed. Inside each folder:
- .gml files represent the topology for each time step.
- The files starting with the letters GT* and OGT* are the ground truths for the disjoint and overlapping benchmarks, respectively.
- The file starting with K* contains the number of communities for each time step for the disjoint benchmarks.
- The file ending in *Index.txt contains the node numbers for each time step, which are needed to track the creation and deletion of nodes and which correspond to the attribute values.
- The attributes are in .csv format.
- For overlapping synthetic networks, files with the extensions .comm, .edges, .stats, and .timeline are in the formats generated by the benchmark.

For example, Synthetic network 1/Synthetic network 1 Dataset 1/Data122NS3/ refers to seed 3 of the first dataset of synthetic network 1, where:
- Data122NS3Index.txt contains the indices of nodes for all timestamps of seed 3.
- Data122NS3t4.gml contains the topology for time step 4 of seed 3.
- GTData122NS3t4 contains the ground truth for time step 4 of seed 3.
- KData122NS3.txt contains the number of communities for all timestamps of seed 3. The maximum number of communities in this case is 9, for timestamps 9 and 10.
- pin06pout005/Data122NS3H1Attrt4.csv contains the attributes for time step 4 of seed 3, where nodes inside a community have one binary attribute (h=1) with a probability of 0.6 of having the same value as other nodes of the same community. The nodes inside a community have another eight binary attributes (1 * (9-1)), each with a probability of 0.05 of having the same value as nodes of other communities.
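
The attribute scheme and folder naming described above can be illustrated with a short Python sketch. It is not the authors' generator. The helper names are hypothetical, the digit-to-probability rule is inferred only from the single documented example pin06pout005, and the sampler assumes that "correlation" means an attribute takes the value 1 with probability pin for its own community and pout for the others.

```python
import random
import re


def parse_attr_folder(name):
    """Turn a folder name such as 'pin06pout005' into (pin, pout)."""
    match = re.fullmatch(r"pin(\d+)pout(\d+)", name)
    if match is None:
        raise ValueError(f"unexpected folder name: {name!r}")

    def to_prob(digits):
        # Assumed convention: first digit is the integer part, the rest decimals.
        if len(digits) == 1:
            return float(digits)
        return float(digits[0] + "." + digits[1:])

    return to_prob(match.group(1)), to_prob(match.group(2))


def attribute_counts(c, h):
    """Number of strongly and weakly correlated attributes per community."""
    return h, h * (c - 1)


def sample_attributes(community, c, h, pin, pout, rng):
    """Hypothetical sampler: attribute j is tied to community j // h."""
    row = []
    for j in range(h * c):
        p = pin if j // h == community else pout
        row.append(int(rng.random() < p))
    return row


pin, pout = parse_attr_folder("pin06pout005")
strong, weak = attribute_counts(c=9, h=1)  # 1 strong and 8 weak attributes
row = sample_attributes(0, c=9, h=1, pin=pin, pout=pout, rng=random.Random(3))
```

With c = 9 and h = 1 this reproduces the count in the example above: one strongly correlated attribute and eight weakly correlated ones, nine binary attributes in total per node.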