Antology-Graph

Published: 28 April 2026 | Version 1 | DOI: 10.17632/mx25zmdxg2.1
Contributor:
raphael salles vitor de souza

Description

This dataset provides a directed graph of semantic proximity between intellectual concepts (philosophers, scientists, artists, educational institutions, religions, ideologies, and fields of knowledge), built from the English (lang=en) mappingbased-objects dump of DBpedia. The graph was produced as part of the project for the course MC859 - Project in Theoretical Computer Science (University of Campinas, 2025), which applies centrality algorithms (PageRank and Personalized PageRank) and community detection methods to open ontologies.

Starting from the 22,791,171 semantic triples in the original dump, a subnetwork was extracted based on 11 thematic predicates selected for their relevance to the analysis of intellectual prestige flow: dbo:influencedBy, dbo:influenced, dbo:doctoralAdvisor, dbo:doctoralStudent, dbo:academicAdvisor, dbo:almaMater, dbo:knownFor, dbo:notableWork, dbo:field, dbo:religion, and dbo:ideology. Triples using the inverse relations (influencedBy, doctoralAdvisor, academicAdvisor) were reoriented, with subject and object swapped, so that all edges point from the influencer to the influenced entity, aligning the graph with the canonical interpretation of PageRank as a measure of prestige.

After reorientation and deduplication, the final graph contains 326,270 vertices and 426,617 directed edges, with an average degree of 2.615 (in-degree plus out-degree, that is, 2 × 426,617 / 326,270). The graph is highly fragmented in terms of strongly connected components (approximately 325,000 SCCs, of which 99.97% are singletons), but structurally cohesive when weakly connected components are considered, a typical characteristic of directed ontologies that encode chronological relations of influence. The degree distribution follows a power law (scale-free network), with prominent hubs at educational institutions (Harvard, Cambridge, Yale), religions (Christianity, Islam, Catholicism), and disciplines (Painting, Philosophy, Law).
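The contrast between strongly and weakly connected components can be illustrated on a toy graph. The following is a minimal sketch, not the analysis code used for the dataset: the node labels A to E and both function names are hypothetical, and Kosaraju's algorithm is used here only as one standard way to find SCCs. A chain of one-directional influence yields singleton SCCs, a mutual pair yields the occasional larger SCC, and everything still falls into one weakly connected component.

```python
from collections import defaultdict


def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm, written iteratively to avoid recursion limits."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)

    # Pass 1: record nodes in order of DFS completion on the forward graph.
    visited, order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(fwd[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:
                # All neighbours explored: the node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing finish order.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, todo = [], [start]
        while todo:
            node = todo.pop()
            component.append(node)
            for nxt in rev[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    todo.append(nxt)
        components.append(component)
    return components


def weakly_connected_components(nodes, edges):
    """Connected components of the graph with edge directions ignored."""
    undirected = defaultdict(set)
    for u, v in edges:
        undirected[u].add(v)
        undirected[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        component, todo = [], [start]
        while todo:
            node = todo.pop()
            component.append(node)
            for nxt in undirected[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
        components.append(component)
    return components


# Hypothetical toy graph: a one-way chain plus one mutual pair (C <-> E).
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("B", "C"), ("A", "D"), ("C", "E"), ("E", "C")]

sccs = strongly_connected_components(nodes, edges)
wccs = weakly_connected_components(nodes, edges)
singletons = sum(1 for c in sccs if len(c) == 1)
print(len(sccs), singletons, len(wccs))
```

On graphs of the size reported here, a library implementation such as those in NetworkX or igraph would normally be preferred over hand-written traversal.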

Files

Steps to reproduce

Step 1. Download the source dump from DBpedia Databus. The file mappingbased-objects_lang=en.ttl.bz2 (approximately 183 MB compressed, containing 22,791,171 triples) is available at https://databus.dbpedia.org/dbpedia/mappings/mappingbased-objects, version 2022.12.01.

Step 2. Filter the dump to the 11 thematic predicates. Create a text file named preds.txt containing the following patterns, one per line:

    ontology/influencedBy>
    ontology/influenced>
    ontology/doctoralAdvisor>
    ontology/doctoralStudent>
    ontology/academicAdvisor>
    ontology/almaMater>
    ontology/knownFor>
    ontology/notableWork>
    ontology/field>
    ontology/religion>
    ontology/ideology>

The closing angle bracket at the end of each line is essential to avoid spurious substring matches. Then run:

    bzcat mappingbased-objects_lang=en.ttl.bz2 | grep -F -f preds.txt > rede_intelectual.nt

This produces a file with 432,975 triples.

Step 3. Reorient inverse predicates so that every edge points from the influencer to the influenced entity, aligning the graph with the canonical interpretation of PageRank as a prestige flow. Run the following awk command:

    awk 'BEGIN {
      rename["<http://dbpedia.org/ontology/influencedBy>"] = "<http://dbpedia.org/ontology/influenced>"
      rename["<http://dbpedia.org/ontology/doctoralAdvisor>"] = "<http://dbpedia.org/ontology/doctoralStudent>"
      rename["<http://dbpedia.org/ontology/academicAdvisor>"] = "<http://dbpedia.org/ontology/academicStudent>"
    }
    { if ($2 in rename) print $3, rename[$2], $1, "."; else print $0 }' rede_intelectual.nt > rede_intelectual_oriented.nt

The keys of the rename table include the enclosing angle brackets because awk splits each N-Triples line on whitespace, so the predicate field $2 carries its delimiters. The script swaps the subject and object positions for the three inverse predicates and renames them to their semantic complements. Triples for the other eight predicates pass through unchanged.

Step 4. Deduplicate redundant triples. DBpedia frequently encodes mutual relations in both directions, for example A influencedBy B and B influenced A. After the reorientation step, both forms collapse to the same canonical triple, and sort -u removes the duplicates:

    sort -u rede_intelectual_oriented.nt > rede_intelectual_final.nt

The resulting file contains 426,617 triples, with approximately 1.5 percent of the input removed as redundant.

Step 5. Convert the deduplicated triples to GEXF and GraphML formats, suitable for use in Gephi, Neo4j, igraph, or NetworkX. The script nt_to_gexf.py, provided with this dataset, maps each unique URI to an integer node identifier and writes the corresponding XML serialization:

    python nt_to_gexf.py rede_intelectual_final.nt rede_intelectual.gexf

The final graph contains 326,270 vertices and 426,617 directed edges, with an average degree of 2.615.
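For readers who prefer to follow Steps 3 and 4 in Python, the reorientation and deduplication logic can be sketched as below. This is an illustrative sketch under stated assumptions, not the nt_to_gexf.py script shipped with the dataset: the names INVERSE, reorient, summarize, and average_degree are hypothetical, the example.org URIs are made-up placeholders, and each input line is assumed to be a single whitespace-separated N-Triples statement with a URI object. The last lines cross-check the reported average degree and the share of triples removed by deduplication.

```python
DBO = "http://dbpedia.org/ontology/"
EX = "http://example.org/"  # hypothetical namespace, for illustration only

# Inverse predicates and their complements, as listed in Step 3.
INVERSE = {
    f"<{DBO}influencedBy>": f"<{DBO}influenced>",
    f"<{DBO}doctoralAdvisor>": f"<{DBO}doctoralStudent>",
    f"<{DBO}academicAdvisor>": f"<{DBO}academicStudent>",
}


def reorient(line):
    """Return (subject, predicate, object), swapped for inverse predicates."""
    subj, pred, obj = line.split()[:3]
    if pred in INVERSE:
        return obj, INVERSE[pred], subj
    return subj, pred, obj


def summarize(lines):
    """Deduplicate reoriented triples and collect the vertex set."""
    edges = {reorient(line) for line in lines if line.strip()}
    vertices = {t[0] for t in edges} | {t[2] for t in edges}
    return vertices, edges


def average_degree(num_vertices, num_edges):
    # Each directed edge adds one out-degree and one in-degree.
    return 2 * num_edges / num_vertices


# Toy input: the first two lines state the same relation in both directions.
toy = [
    f"<{EX}B> <{DBO}influencedBy> <{EX}A> .",
    f"<{EX}A> <{DBO}influenced> <{EX}B> .",
    f"<{EX}B> <{DBO}almaMater> <{EX}U> .",
]
vertices, edges = summarize(toy)
print(len(vertices), len(edges))

# Cross-check of the figures reported in the steps above.
reported_avg = round(average_degree(326_270, 426_617), 3)
removed_pct = round(100 * (432_975 - 426_617) / 432_975, 2)
print(reported_avg, removed_pct)
```

Holding the triples in a Python set mirrors what sort -u does on disk; for the full 432,975-line file this fits comfortably in memory, although the shell pipeline remains the procedure documented for this dataset.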

Institutions

Categories

Ontology, Graph Theory, Knowledge Representation, Network Analysis

Licence