Burkholderia Genomic RDF Graph

Published: 28 February 2025 | Version 1 | DOI: 10.17632/pt6xn9mgdf.1
Contributor:
Reynold Osuna González

Description

An RDF graph representing 200 genomes of microorganisms belonging to the genus Burkholderia.

Steps to reproduce

The source files are obtained in GenBank Flat File (gbff) format from the National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov/), which is part of the National Institutes of Health. A Python script copies every file with a .gbff extension from its individual folder into a working directory and, at the same time, renames each genomic file to the GenBank identifier of the genome (the name of its individual folder). Another Python script is then executed to clean the files by removing unnecessary line breaks, giving a uniform format that facilitates data extraction.

Selection of Features of Interest: Based on a review of the information contained in the gbff files, specific DNA subsections (loci, plural of locus) were identified, including genes, coding sequences (CDS), and annotated RNA fragments. The selected features include sequence identifiers (locus_tag), gene and CDS positions, amino acid sequences (translation), and functional annotations and additional metadata. For each .gbff file containing a genome, a Python script removes all unnecessary line breaks so that each line of the output file holds the information that characterizes a single subsection of the input file.

Extraction of Information: The script extracts the information identifying the LOCUS and its metadata, and creates five lists that hold the characteristics of each locus described in the original document, one each for genes, CDS, tRNA, rRNA, and ncRNA. For each of these feature types, the script identifies the fields present in the file (not all files share the same fields), extracts their values, and stores them in the corresponding list. The result is five lists that share their first columns, which store the data describing the organism and the LOCUS section, followed by columns that vary depending on the source file.
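The extraction step described above can be illustrated with a minimal Python sketch. It is not the author's script: the function name extract_features is hypothetical, and the code assumes the standard GenBank feature table layout (feature key in columns 6-20, location and qualifiers from column 22). Wrapped qualifier values are joined back into one value, which mirrors the removal of unnecessary line breaks.

```python
import re

# Feature types kept by the workflow described above.
FEATURE_TYPES = ("gene", "CDS", "tRNA", "rRNA", "ncRNA")
# A qualifier line looks like /name="value" or a bare flag such as /pseudo.
QUALIFIER = re.compile(r'^/(\w+)(?:=(.*))?$')


def extract_features(gbff_text):
    """Return {feature_type: [dict, ...]} for the feature types of interest.

    Hypothetical helper: assumes the fixed-column GenBank feature table.
    """
    tables = {name: [] for name in FEATURE_TYPES}
    in_features = False
    current = None    # dict of the feature being read, None if it is skipped
    last_key = None   # qualifier that a continuation line extends
    for line in gbff_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if line.startswith(("ORIGIN", "CONTIG", "//")):
            in_features = False
            current = None
            continue
        if not in_features:
            continue
        key, body = line[5:20].strip(), line[21:].strip()
        if key:  # text in the key columns starts a new feature
            current, last_key = None, None
            if key in tables:
                current = {"location": body}
                tables[key].append(current)
            continue
        if current is None:
            continue  # qualifier of a feature type that is not kept
        match = QUALIFIER.match(body)
        if match:
            last_key = match.group(1)
            current[last_key] = (match.group(2) or "").strip('"')
        elif last_key is None:
            current["location"] += body  # location wrapped over several lines
        else:
            # Sequences are rejoined without spaces, free text with one space.
            sep = "" if last_key == "translation" else " "
            current[last_key] += sep + body.strip('"')
    return tables
```

Each dictionary in the returned lists corresponds to one locus, with one key per field found in the file, so records from different files may carry different keys. A continuation line that itself begins with "/" would be misread as a new qualifier; the sketch does not handle that case.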
These lists are then converted into comma-separated values (CSV) files, which serve as an intermediate product of Routine 3. Finally, the routine examines the CSV columns to unify the data extracted for each type of locus from the various microorganisms into a single feature file. The final product of this routine is one file per type of locus, incorporating the information from all the organisms used.

Graph Construction from the Generated Tables: The graph is built in Python with the RDFlib library. The process involves two main routines.

Generating Individual RDF Graphs: Using the CSV files generated in the previous step, an individual RDF graph in Turtle format is created for each processed genome.

Merging RDF Graphs: All graphs in the directory are merged into a single RDF graph. This routine can be used to create a new graph from scratch or to update an existing one with new genomes. The final output of this process is a file in Turtle (TTL) format, one of the RDF serialization syntaxes.
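The column unification that produces one CSV file per type of locus can be sketched as follows. This is an illustration under stated assumptions, not the routine used for the dataset: the function name unify_csv and the file names in the usage note are hypothetical, and the sketch assumes well-formed CSV files with a header row. It uses only the Python standard library.

```python
import csv


def unify_csv(paths, out_path, restval=""):
    """Merge CSV files whose columns differ into a single CSV file.

    The output header is the union of all input headers, in order of first
    appearance. Fields missing from a given input are filled with `restval`.
    Returns the unified header. Hypothetical helper for illustration.
    """
    header = []
    rows = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            for name in reader.fieldnames or []:
                if name not in header:
                    header.append(name)
            # All rows are held in memory; acceptable for a sketch.
            rows.extend(reader)
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=header, restval=restval)
        writer.writeheader()
        writer.writerows(rows)
    return header
```

For example, unify_csv(["genome_a_CDS.csv", "genome_b_CDS.csv"], "CDS_all.csv") would write one table containing the CDS records of both genomes, leaving a cell empty wherever a genome lacks a given field.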

Institutions

Benemerita Universidad Autonoma de Puebla

Categories

Biotechnology, Genome, Knowledge Graph

Licence