PHYTOPK28-D1D2: A curated database of 28S rRNA gene D1-D2 domains from eukaryotic organisms dedicated to metabarcoding analyses of marine phytoplankton samples
The PHYTOPK28-D1D2 database comprises accession numbers, taxonomic classifications and 28S rDNA (D1-D2 domains) sequences that are available in public DNA databases. The sequences are listed in FASTA format, and each is identified by its accession number and hierarchical taxonomy information. The database was built for the taxonomic annotation of DNA metabarcodes generated from water samples collected in six French Mediterranean lagoons once a month between May and September/October 2012 and fractionated into three size ranges (0.7-5 µm, 5-20 µm and 20-100 µm). This metabarcode dataset was deposited in the European Nucleotide Archive under accession number PRJEB18757.
The database was started from an initial dataset retrieved on April 19, 2013 from the ribosomal DNA database SILVA. Further sequences were added through extensive BLAST searches of the NCBI/GenBank nucleotide database, targeting the main taxonomic divisions among eukaryotic algal and plankton lineages, marine or freshwater, and excluding environmental sequences. This first version of the database, PHYTOPK28-D1D2_v1, was assembled by the end of June 2015. When it was used to compute the annotation of the metabarcode library, it contained 8,753 reference sequences, including more than 3,600 from algal/phytoplanktonic lineages (Chlorophyta, Cryptophyta, Dinophyceae, Haptophyceae, Stramenopiles, Rhodophyta, Euglenozoa, Rhizaria, Glaucocystophyceae) and ~700 from microzooplankton (including ciliates, rotifers and copepods).
The PHYTOPK28-D1D2 database is not claimed to be exhaustive with respect to its purpose. It may contain overlooked identification errors, either inherited from undetected errors in the original public database depositions or resulting from missed literature reporting taxonomic changes. It may also lack data that had only recently been released when it was used in June 2015.
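As an illustration of how FASTA records carrying an accession number and a hierarchical taxonomy could be read, here is a minimal Python sketch. The header layout used here (`>ACCESSION level1;level2;...`) is an assumption made for the example, not the documented layout of the PHYTOPK28-D1D2 file, and the record in `EXAMPLE` is invented; the separators would need to be adapted to the actual headers.

```python
from typing import Iterator, List, Tuple

Record = Tuple[str, List[str], str]  # (accession, taxonomy levels, sequence)

# Invented record for illustration only; not taken from the database.
EXAMPLE = """>XX000001 Eukaryota;Alveolata;Dinophyceae;Genus_a;Genus_a species_b
ACGTACGTAC
GTACGT
"""


def _make_record(header: str, chunks: List[str]) -> Record:
    # ASSUMPTION: the accession is the first whitespace-delimited token and
    # the taxonomy levels that follow are separated by semicolons.
    accession, _, lineage = header.partition(" ")
    levels = [level for level in lineage.split(";") if level]
    return accession, levels, "".join(chunks).upper()


def parse_fasta(text: str) -> Iterator[Record]:
    """Yield one (accession, taxonomy levels, sequence) tuple per record."""
    header = None
    chunks: List[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield _make_record(header, chunks)
            header = line[1:]
            chunks = []
        else:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)
```

Calling `list(parse_fasta(EXAMPLE))` returns a single tuple whose sequence is the two wrapped lines joined together.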
We intend to further enrich the database by adding new (mostly recently released) sequence accessions and to make new database versions available from time to time. Anyone interested in receiving a recently updated database can contact the first author (DG). Reports of errors, omissions or recently released sequences are also welcome and would help this updating effort. The database could also be enriched with more information on the reference sequences, for example by linking the accession numbers to GenBank database information, by adding and linking to the article reference associated with the sequence submission (information that is not always updated in the public DNA databases) and, possibly, to the subsequent literature references leading to changes in the taxonomic name or classification of the organisms.
Steps to reproduce
An initial database was retrieved from the ribosomal DNA database SILVA on April 19, 2013 by searching for sequences containing both flanking PCR primers (D1R 5’-ACCCGCTGAATTTAAGCATA-3’ and D2C 5’-CCTTGGTCCGTGTTTCAAGA-3’; Scholin et al., 1994, J Phycol 30:999-1011) that were used to generate the 28S rDNA D1-D2 amplicons. It contained nearly 5,000 sequences (several hundred redundant and non-pertinent sequences from Zea mays and Sorghum bicolor were removed), including only ~400 sequences related to algal/phytoplankton lineages and ~60 related to microzooplankton. These low numbers were likely due to the requirement that both primers be present.
Further sequences were added through extensive BLAST searches of the NCBI/GenBank nucleotide database, targeting the main taxonomic divisions among eukaryotic, marine or freshwater organisms and focusing mostly on phytoplankton, macroalgae (which have microbial planktonic sexual stages) and microzooplankton in the targeted size range (up to ~100 µm). BLAST searches were also conducted using the most abundant barcode sequences as queries, and the searches were subsequently widened to the corresponding targeted taxonomic divisions. BLAST-hit reference sequences were first selected on the basis of a 100% query cover over the length between the two PCR primers, but shorter sequences (mostly no shorter than 90% of that length) were accepted for non-redundant taxa. Redundant identical sequences were retained but limited to a few occurrences.
The NCBI/GenBank taxonomic identifications of the reference sequences were updated when necessary, mostly by changing species, genus or family names or by correcting misidentified organisms, to the best of our knowledge of the recent literature. The taxonomic database AlgaeBase (Guiry and Guiry, http://www.algaebase.org) was also consulted.
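The primer-based retrieval and the query-cover criterion described above can be sketched in Python as follows. This is a simplified illustration, not the procedure actually run against SILVA or GenBank: it assumes exact, mismatch-free primer matches on the forward strand, and the function names and the `taxon_already_represented` flag are hypothetical. The primer sequences are those given in the text; D2C is a reverse primer, so its reverse complement is what appears on the forward strand.

```python
from typing import Optional

# Primer sequences as given in the text (Scholin et al., 1994).
D1R = "ACCCGCTGAATTTAAGCATA"  # forward primer
D2C = "CCTTGGTCCGTGTTTCAAGA"  # reverse primer

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def region_between_primers(seq: str, fwd: str = D1R, rev: str = D2C) -> Optional[str]:
    """Return the region flanked by both primers, or None if either is absent.

    ASSUMPTION: exact string matching on the forward strand only.
    """
    seq = seq.upper()
    fwd_pos = seq.find(fwd)
    if fwd_pos == -1:
        return None
    inner_start = fwd_pos + len(fwd)
    # The reverse primer anneals to the opposite strand, so search for its
    # reverse complement downstream of the forward primer.
    rev_pos = seq.find(reverse_complement(rev), inner_start)
    if rev_pos == -1:
        return None
    return seq[inner_start:rev_pos]


def accept_hit(query_cover: float, taxon_already_represented: bool,
               min_cover: float = 0.90) -> bool:
    """Apply one reading of the selection rule described in the text.

    Full-length hits (cover of 1.0) are accepted; shorter hits are accepted
    only for taxa not yet represented, down to `min_cover`. The text says
    "mostly" 90%, so the threshold is indicative rather than strict.
    """
    if query_cover >= 1.0:
        return True
    return (not taxon_already_represented) and query_cover >= min_cover
```

Requiring both primers is strict: a reference sequence that is truncated just inside either primer returns `None`, which is consistent with the low number of sequences recovered in the initial SILVA retrieval.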
Some identifications were set to what seemed to be the most common usage in the literature: for example, the family Pfiesteriaceae (genera Pfiesteria, Pseudopfiesteria, Stoeckeria, Tyrannodinium, Luciella, Chimonodinium, Aduncodinium) was separated from Thoracosphaeraceae. Some information was adapted so that barcodes only distantly related to known reference sequences could be assigned to the closest node (i.e., the last common ancestor). For example, a superclass level named “CS clade” was created to cover the sister classes Chrysophyceae and Synurophyceae, given the relatively low number of available reference sequences among the described species in both classes. A species complex level was set for known clades containing several morphological species with identical or overlapping rDNA sequences (e.g., the Dinophysis acuminata/sacculus/ovum clade). Based on BLAST results and phylogenetic analyses, the accession numbers AB117928 (Tilopteris mertensii, Phaeophyceae) and L38642 (Gymnodinium catenatum, Dinophyceae) were reassigned to the genus Spumella (Chrysophyceae) with the assignment “Spumella sp.|probable contamination of declared organism’s name”.
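The role of an intermediate level such as “CS clade” in a last-common-ancestor assignment can be illustrated with a short Python sketch. It assumes that each reference lineage is an ordered list of levels from root to tip, and the lineages in `EXAMPLE_LINEAGES` are a hypothetical layout built for the example, not the rank structure actually used in the database. The annotation software used for the metabarcode library is not described here, so this is only a generic illustration of the idea.

```python
from typing import List, Sequence

# Hypothetical lineages (root first) for two reference hits of one barcode.
EXAMPLE_LINEAGES = [
    ["Eukaryota", "Stramenopiles", "CS clade", "Chrysophyceae", "Spumella"],
    ["Eukaryota", "Stramenopiles", "CS clade", "Synurophyceae", "Mallomonas"],
]


def last_common_ancestor(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Return the longest root-first prefix shared by all lineages."""
    if not lineages:
        return []
    shared: List[str] = []
    # zip stops at the shortest lineage, so unequal depths are handled.
    for levels in zip(*lineages):
        if all(level == levels[0] for level in levels):
            shared.append(levels[0])
        else:
            break
    return shared
```

With the example lineages, the shared prefix ends at “CS clade”. Without that added level, the same barcode would only be assigned as far as Stramenopiles, which is the motivation given above for creating it.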