This Centrifuge database was built from the 16S rRNA sequences of type strains held by the National Center for Biotechnology Information (NCBI). It can be used for taxonomic classification of bacterial groups.
Steps to reproduce
Download Centrifuge from the source package.

The NCBI 16S RefSeq database contains sequences from multiple BioProjects (33175[BioProject] OR 33317[BioProject]). The collection was created by first extensively updating NCBI taxonomic resources to include the most up-to-date lists of published bacterial/archaeal names and associated type materials.

Create and download the FASTA file, with GI numbers shown in the headers, using the query 33175[BioProject] OR 33317[BioProject].

Building an index with any database requires the user to create a sequence ID to taxonomy ID map, which can be generated from a GI taxid dump:

    # Get mapping file
    wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz
    gunzip -c gi_taxid_nucl.dmp.gz | sed 's/^/gi|/' > gi_taxid_nucl.map

    # Build index using 16 cores
    centrifuge-build -p 16 --conversion-table gi_taxid_nucl.map \
        --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp \
        sequence.fasta NCBI_202207

Custom database

To build a custom database, you need to provide the following four files to centrifuge-build:

--conversion-table: tab-separated file mapping sequence IDs to taxonomy IDs. Sequence IDs are the header up to the first space or second pipe (|).

--taxonomy-tree: \t|\t-separated file mapping taxonomy IDs to their parents and rank, up to the root of the tree. When using NCBI taxonomy IDs, this will be the nodes.dmp from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz.

--name-table: '|'-separated file mapping taxonomy IDs to a name. A further column (typically column 4) must specify "scientific name". When using NCBI taxonomy IDs, names.dmp is the appropriate file.

reference sequences: FASTA file of the sequences to index. The ID of each sequence is the header up to the first space or second pipe (|).

We used the NCBI 16S RefSeq FASTA file released in July 2022 to build the database.
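The sequence ID rule described above (the header up to the first space or second pipe) determines whether a FASTA record can be matched against the conversion table. As a minimal sketch, the two shell helpers below extract sequence IDs under that rule and list any IDs that have no entry in the map. The function names fasta_seq_ids and missing_from_map are hypothetical and are not part of Centrifuge; the sketch assumes the map is tab-separated with the sequence ID in the first column, as produced by the sed command above.

```shell
# Hypothetical helper (not part of Centrifuge): print the sequence ID of every
# FASTA record read from stdin, i.e. the header up to the first whitespace or
# the second pipe, whichever comes first.
fasta_seq_ids() {
    awk '/^>/ {
        id = substr($0, 2)            # drop the leading ">"
        sub(/[ \t].*/, "", id)        # cut at the first space or tab
        n = split(id, parts, "|")     # then cut at the second pipe, if any
        if (n >= 2) id = parts[1] "|" parts[2]
        print id
    }'
}

# Hypothetical helper: list FASTA sequence IDs that are absent from the first
# column of a tab-separated conversion table.
# Usage: missing_from_map sequence.fasta gi_taxid_nucl.map
missing_from_map() {
    _ids_tmp=$(mktemp) || return 1
    fasta_seq_ids < "$1" | sort -u > "$_ids_tmp"
    # Keep only the (small) FASTA ID set in memory and stream the map once,
    # since the full GI map can be very large.
    awk -F '\t' -v idfile="$_ids_tmp" '
        BEGIN { while ((getline id < idfile) > 0) want[id] = 1 }
        ($1 in want) { delete want[$1] }
        END { for (id in want) print id }
    ' "$2" | sort
    rm -f "$_ids_tmp"
}
```

Running missing_from_map sequence.fasta gi_taxid_nucl.map before centrifuge-build prints one ID per line for every sequence that lacks a taxonomy ID. Empty output suggests that the map covers all sequences in the FASTA file.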