Core genome MLST scheme for the Bacillus cereus group

Published: 9 November 2022 | Version 1 | DOI: 10.17632/yd7n6xygvb.1

Description

This entry contains raw data sets from chewBBACA and Panaroo that were used to create and validate a core genome MLST (cgMLST) scheme for the Bacillus cereus group of bacteria.

1. Creation of a cgMLST scheme from 173 closed B. cereus group genomes using chewBBACA (file S1). The latest assemblies of the 2458 B. cereus group genomes available as of October 2021 were downloaded from the NCBI GenBank FTP site. Of these, 173 were complete genomes (i.e., ungapped, fully closed chromosome and plasmid sequences). A cgMLST scheme was defined on this closed genome set using the chewBBACA 2.8.5 pipeline.

2. Core genome from 173 closed B. cereus group genomes using Panaroo (file S2). For comparison, the core genome of the same closed genome set was also determined with the pan-genome software Panaroo.

3. Application of the cgMLST scheme (from 1) to 2458 B. cereus group genomes (file S3). The cgMLST scheme created from the 173 closed genomes using chewBBACA (1) was applied to the full set of 2458 available B. cereus group genomes. Allelic profiles for the loci in the scheme were determined using the "AlleleCall" operation in chewBBACA.

Steps to reproduce

1. Three chewBBACA operations were used: "CreateSchema", to create a gene-by-gene scheme from the set of complete genomes; "AlleleCall", to determine the allelic profiles based on that scheme; and "ExtractCgMLST", to extract the core loci, i.e. the loci present in a predefined proportion of the genomes. All three were run with default parameters (including a BLAST score ratio (BSR) cut-off of 0.6 and a 20% gene size difference), with two exceptions: the minimum CDS length in "CreateSchema" was set to 90 bases (option "--l 90"), and in "ExtractCgMLST" loci were selected as part of the core genome if they were present in 99% of the complete genomes ("--t 0.99"). In the extraction stage, paralogous loci detected by "AlleleCall" and the two genome assemblies with the highest numbers of missing loci (GCA_002243685.1 and GCA_000724585.1) were excluded (options "--r" and "--g"), as recommended in chewBBACA.

chewBBACA identifies CDSs with the Prodigal gene prediction tool, which requires a training phase to learn the coding properties of the input organism. For comparability and consistency, Prodigal was trained on the genome sequence of the Bacillus anthracis Ames Ancestor strain, which was used as the reference for creating the B. anthracis cgMLST scheme. Prodigal 2.6.3 was run with the options "-p single" and "-c".

Because B. cytotoxicus has a considerably smaller genome, and thus gene content, than other members of the B. cereus group (~1 Mbp smaller; ~1000 fewer genes), and the scheme was designed for the whole B. cereus group, a set of 76 genes that were missing only in B. cytotoxicus was removed from the scheme. Nine loci that were duplicated in more than 5% of all available genomes and five loci that did not overlap with the GenBank annotation were also removed. The final cgMLST scheme contains 1568 genes.

2. The 173 closed B. cereus group genomes were annotated using Prokka 1.14.6 (with options "--addgenes --usegenus --genus Bacillus --kingdom Bacteria --gcode 11 --evalue 1e-09 --coverage 50 --mincontiglen 200") to provide consistent annotation files (in GFF3 format, including gene sequences). Using these annotations, Panaroo 1.2.10 was run in strict mode ("--clean-mode strict") with a core threshold of 99% ("-a core --core_threshold 0.99"), MAFFT 7.407 as the alignment program ("--aligner mafft"), and default values for all other options.

3. "AlleleCall" was run with default parameters (including a BSR cut-off of 0.6 and a 20% gene size difference). In addition, "TestGenomeQuality" was used to check the quality of the genome assemblies by counting the number of core loci that can be recovered in 99% of the genomes as genomes with progressively more missing loci are added to the analysis. The options were set to 12 iterations, a maximum of 1050 missing loci per genome, and a step of 10 missing loci ("-n 12 -t 1050 -s 10").
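
The core-locus selection described in step 1 (a 99% presence threshold, applied after excluding paralogous loci and low-quality genomes) can be illustrated with a short Python sketch. This is not chewBBACA's own code. The function names, the in-memory profile layout (genome -> locus -> call), and the rule that a call counts as "present" only when it is an allele number, optionally prefixed with "INF-", are assumptions made for illustration. The use of ">=" for the threshold comparison is likewise an assumption.

```python
from typing import Dict, Iterable, List

# Hypothetical layout: genome ID -> {locus ID -> allele call as text}.
Profiles = Dict[str, Dict[str, str]]


def is_present(call: str) -> bool:
    """Return True if a call looks like an allele number.

    Assumption: numeric calls (optionally prefixed with 'INF-') mean the
    locus was found; any other code (e.g. 'LNF', 'NIPH') means it is missing.
    """
    call = call.strip()
    if call.startswith("INF-"):
        call = call[len("INF-"):]
    return call.isdigit()


def select_core_loci(
    profiles: Profiles,
    threshold: float = 0.99,
    excluded_genomes: Iterable[str] = (),
    excluded_loci: Iterable[str] = (),
) -> List[str]:
    """Return the sorted loci present in at least `threshold` of the kept genomes."""
    skip_genomes = set(excluded_genomes)
    skip_loci = set(excluded_loci)

    # Drop the genomes excluded up front (e.g. assemblies with many missing loci).
    kept = {g: calls for g, calls in profiles.items() if g not in skip_genomes}
    if not kept:
        raise ValueError("no genomes left after exclusion")

    # Candidate loci: everything seen in the kept genomes, minus excluded loci
    # (e.g. those flagged as paralogous).
    loci = sorted({l for calls in kept.values() for l in calls} - skip_loci)

    core = []
    for locus in loci:
        # A locus absent from a genome's profile is treated as missing.
        n_present = sum(is_present(calls.get(locus, "LNF")) for calls in kept.values())
        if n_present / len(kept) >= threshold:
            core.append(locus)
    return core


if __name__ == "__main__":
    toy = {
        "g1": {"locA": "1", "locB": "INF-2", "locC": "LNF"},
        "g2": {"locA": "1", "locB": "3", "locC": "4"},
        "g3": {"locA": "LNF", "locB": "1", "locC": "NIPH"},
    }
    print(select_core_loci(toy, threshold=0.99, excluded_genomes=["g3"]))
```

With the two assemblies from step 1 excluded, 171 of the 173 complete genomes remain, so under these assumptions a locus would need to be present in at least 170 of them to pass the 0.99 threshold.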

Institutions

Universitetet i Oslo Det Matematisk-naturvitenskapelige Fakultet, Universite de Bordeaux

Categories

Genome, Genome Variation, Genome Sequencing

Licence