SURE-Pipe Benchmarking datasets

Published: 28 May 2026 | Version 1 | DOI: 10.17632/tgwdzc5kxg.1
Contributor: Infant Thomas

Description

The performance of SURE-Pipe was evaluated using simulated and real-world comparative genomics datasets. Simulated genomes were generated using a custom in-house simulator based on the Stan framework, modelling coalescent divergence, point mutations, and structural rearrangements between target and neighbouring clades. Each simulation contained five target and five neighbouring genomes with >97% ANI, representing closely related genomes with subtle sequence variation. Defined shared and unique regions (200–5000 bp) were embedded within target genomes to establish the benchmarking ground truth. Shared regions were present in both target and neighbouring genomes, while unique regions were present only in the target group. Additional structural variations, including inversions and duplications, were introduced to mimic realistic genomic rearrangements. Simulations were generated across genome sizes ranging from 100 kb to 75 Mb.
For accuracy assessment, 150 independent simulations using 4 Mb genomes were performed, and the outputs of SURE-Pipe v1.1 were compared with those of KEC v1.1 and FUR v4.3 using nucleotide-level sensitivity, specificity, precision, accuracy, and Matthews correlation coefficient (MCC).
Computational scalability was further assessed through genome-size and genome-number scaling experiments. Genome-size benchmarking used datasets of five target and five neighbouring genomes ranging from 0.1 Mb to 640 Mb. Genome-number benchmarking fixed the genome size at 5 Mb while increasing the target:neighbour datasets from 5T:20N up to 160T:640N genomes. Runtime and peak memory usage were recorded using /usr/bin/time on a Pop!_OS 22.04 LTS workstation equipped with an Intel® Xeon® E-2124G CPU (4 cores, 3.40 GHz) and 16 GB RAM.
The pairwise genome comparison mode was validated using six genome pairs spanning diverse taxonomic groups, genome sizes (10.7 kb–69.5 Mb), and GC contents (27–76%), including viral, bacterial, and fungal genomes.
Additionally, groupwise genome comparison was performed across 24 closely related Bacillus species to identify species-specific genomic regions. Four genomes per species (>97% ANI) were selected, while neighbouring datasets included reference genomes from 42 Bacillus species. This analysis enabled the identification of conserved intraspecies regions, species-specific unique regions, and regions shared among neighbouring taxa.
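The nucleotide-level metrics named in the accuracy assessment can be illustrated with a minimal sketch. It is not the evaluation script used to produce this dataset: the function names, the half-open (start, end) region format, and the convention of returning 0.0 when a denominator is zero are all assumptions made for illustration. Each genome position is labelled as unique or not in both the ground truth and the tool output, and the standard confusion-matrix formulas are then applied.

```python
import math


def mask_from_regions(regions, genome_length):
    """Per-nucleotide boolean mask for half-open (start, end) regions (assumed format)."""
    mask = [False] * genome_length
    for start, end in regions:
        for i in range(max(start, 0), min(end, genome_length)):
            mask[i] = True
    return mask


def nucleotide_metrics(truth_regions, predicted_regions, genome_length):
    """Compare predicted unique regions with ground truth, one nucleotide at a time."""
    truth = mask_from_regions(truth_regions, genome_length)
    pred = mask_from_regions(predicted_regions, genome_length)

    tp = sum(1 for t, p in zip(truth, pred) if t and p)
    fp = sum(1 for t, p in zip(truth, pred) if p and not t)
    fn = sum(1 for t, p in zip(truth, pred) if t and not p)
    tn = genome_length - tp - fp - fn

    def ratio(numerator, denominator):
        # Assumed convention: report 0.0 when the denominator is zero.
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, genome_length),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Toy example: a 100 bp genome, one true unique region and one shifted prediction.
example = nucleotide_metrics([(10, 30)], [(20, 40)], 100)
print(example)
```

In the toy example, half of the true region is recovered and an equally long stretch is wrongly reported, so sensitivity and precision are both 0.5 while accuracy stays at 0.8 because most positions are true negatives. This imbalance is the usual reason for reporting MCC alongside accuracy.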

Categories

Comparative Genomics, Genome