Graph datasets for clustering

Published: 20 June 2024 | Version 2 | DOI: 10.17632/fzjyprkh3h.2
Xianbin Lu


The CORA dataset is a citation network of scientific papers drawn from seven distinct categories. It comprises 2708 papers, each represented as a node in the network, and 5429 citation links, each a directed edge from one paper to another that indicates a citation relationship. Each paper is described by a 1433-dimensional feature vector whose values are 0 or 1, indicating the absence or presence of specific words from a predefined dictionary.

CITE is a citation network dataset consisting of papers from six distinct research categories: Agents, Artificial Intelligence (AI), Databases (DB), Information Retrieval (IR), Machine Learning (ML), and Human-Computer Interaction (HCI). The dataset comprises 3327 academic papers and 4732 citation links between them. Each paper is represented by a 3703-dimensional word vector indicating the absence or presence of specific words from a predefined dictionary.

The DBLP dataset is derived from the DBLP computer science bibliography and represents a co-authorship network. Each node corresponds to an author, and an edge between two nodes indicates that the corresponding authors have co-authored at least one paper. It contains 4058 nodes and 3528 edges, with each author represented by a 334-dimensional feature vector that describes their research areas.

The ACM dataset is a paper network derived from the ACM database. It contains 3025 papers in three categories: database, wireless communication, and data mining. Each paper is represented by a 1870-dimensional vector based on the research area of the article. Two papers are connected by an edge if they are written by the same author.
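
The sizes quoted above can serve as a sanity check after loading any of the four graphs. The following is a minimal sketch in Python, not code shipped with the dataset. The names `DatasetStats`, `STATS`, `build_adjacency`, and `count_undirected_edges` are hypothetical, the file format is not described here and so no loader is shown, and symmetrizing the edges is an assumption that is common for clustering rather than something this description prescribes. Values the description does not state are left as `None`.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class DatasetStats:
    nodes: int
    edges: Optional[int]    # None where the description gives no edge count
    feature_dim: int
    classes: Optional[int]  # None where the description gives no class count


# Figures copied from the dataset description above.
STATS = {
    "CORA": DatasetStats(nodes=2708, edges=5429, feature_dim=1433, classes=7),
    "CITE": DatasetStats(nodes=3327, edges=4732, feature_dim=3703, classes=6),
    "DBLP": DatasetStats(nodes=4058, edges=3528, feature_dim=334, classes=None),
    "ACM": DatasetStats(nodes=3025, edges=None, feature_dim=1870, classes=3),
}


def build_adjacency(num_nodes: int, edges: Iterable[Tuple[int, int]]) -> List[Set[int]]:
    """Build an undirected adjacency list, dropping self-loops and duplicates.

    Assumption: edge direction is discarded, which is a common choice for
    clustering but is not required by the dataset description.
    """
    adj: List[Set[int]] = [set() for _ in range(num_nodes)]
    for u, v in edges:
        if u == v:
            continue
        adj[u].add(v)
        adj[v].add(u)
    return adj


def count_undirected_edges(adj: List[Set[int]]) -> int:
    """Each undirected edge appears in two neighbor sets, so halve the total."""
    return sum(len(neighbors) for neighbors in adj) // 2


# Toy graph (not real data): 4 nodes, one duplicated citation in reverse.
toy_edges = [(0, 1), (1, 2), (2, 3), (1, 0)]
toy_adj = build_adjacency(4, toy_edges)
toy_edge_count = count_undirected_edges(toy_adj)
```

After symmetrizing and deduplicating, the edge count of a loaded graph may come out lower than the raw link count quoted above, so a mismatch in that one figure does not by itself indicate a loading error. Node counts and feature dimensions should match exactly.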