Addressing the pervasive scarcity of structural annotation in eukaryotic algae
Description
[Description]
Gene sets predicted with Braker-EP and Braker-ES, together with the genome sequences used in this study:
(i) unannotated_135.tar.gz: 135 eukaryotic algal genome assemblies without structural annotations
(ii) qualified_45.tar.gz: 45 high-quality eukaryotic algal genome assemblies used for benchmarking
(iii) published_genome_sequences: 210 publicly available genome sequences (nucleotide fasta) used in this study

[Instruction]
(i) Decompress the compressed files:
    tar -xvzf unannotated_135.tar.gz
    tar -xvzf qualified_45.tar.gz
(ii) Find the files you want:
    gff3-formatted structural annotation = GENOME_IDENTIFIER.BRAKER_METHOD_USED.gff
    coding sequence nucleotide fasta = GENOME_IDENTIFIER.BRAKER_METHOD_USED.cds.fna
    protein sequence fasta = GENOME_IDENTIFIER.BRAKER_METHOD_USED.faa

[Disclosure]
This data was cleared for public release by the Los Alamos National Laboratory (LA-UR-21-30120).
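The file naming scheme given in the instructions can be parsed programmatically when working through many genomes. The following is a minimal Python sketch, not part of the dataset: the function name `parse_prediction_name` and the example file names are hypothetical, and it assumes the BRAKER method label itself contains no dot, so the name is split from the right and genome identifiers containing dots are kept whole.

```python
from pathlib import PurePath

# Suffixes taken from the naming scheme in the description.
# ".cds.fna" is listed first so it is matched as one compound suffix.
SUFFIX_TO_KIND = {
    ".cds.fna": "cds_nucleotide_fasta",
    ".faa": "protein_fasta",
    ".gff": "gff3_annotation",
}


def parse_prediction_name(path):
    """Split GENOME_IDENTIFIER.BRAKER_METHOD_USED.<suffix> into its parts.

    Returns (genome_identifier, braker_method, file_kind).
    Assumption: the method label contains no dot.
    """
    name = PurePath(path).name
    for suffix, kind in SUFFIX_TO_KIND.items():
        if name.endswith(suffix):
            stem = name[: -len(suffix)]
            # Split from the right so dots inside the genome identifier survive.
            genome_id, sep, method = stem.rpartition(".")
            if not sep or not genome_id or not method:
                raise ValueError(f"unexpected file name: {name}")
            return genome_id, method, kind
    raise ValueError(f"unrecognised suffix: {name}")


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for example in ("some_dir/GENOME_A.1.METHOD_X.cds.fna", "GENOME_B.METHOD_Y.gff"):
        print(parse_prediction_name(example))
```

Grouping the returned tuples by genome identifier gives a quick check that each genome has its annotation, coding sequence, and protein files for a given method.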
Steps to reproduce
(i) Braker-ES: We performed Braker2 gene prediction with the ab initio gene-finding algorithm GeneMark-ES (--esmode). We lowered the minimum contig size to 10 Kb (--min_contig=10000) and enabled soft-masked region detection for soft-masked genomes (--softmasking).

(ii) Braker-EP: We performed Braker2 gene prediction with extrinsic evidence from protein sequences of any evolutionary distance, using GeneMark-EP/EP+, ProtHint, Spaln2, and DIAMOND. To generate the extrinsic protein hint files, we first parsed the clade-specific orthologous gene sets of the following clades from the OrthoDB v10.0 database: Eukaryota, Chlorophyta, Chlorophyceae, Rhodophyta, and Bacillariophyta. For each clade, we collected all protein sequences in the OrthoDB v10.0 orthologous groups (OGs) that are found in 80% or more of the species of that clade. We then performed Braker2 gene prediction using the protein hints (--epmode, --prot_seq) with the same parameters as Braker-ES (--min_contig=10000, --softmasking). For each genome, we used the protein hint sequences of the most specific clade matching its taxonomic classification.
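Two of the steps above can be sketched in Python: selecting OGs present in at least 80% of a clade's species, and assembling the Braker2 command line for each mode. This is an illustrative sketch under stated assumptions, not the authors' pipeline. The function names and the input mapping (OG id to species list) are hypothetical and do not reflect the OrthoDB file format. The executable name `braker.pl` and the `--genome` option are not named in the steps above and are assumed from standard Braker2 usage; the remaining flags are those listed in the steps.

```python
def core_orthogroups(og_to_species, clade_species, min_fraction=0.8):
    """Return ids of OGs found in at least min_fraction of the clade's species.

    og_to_species: hypothetical mapping {og_id: iterable of species ids}.
    clade_species: iterable of all species ids belonging to the clade.
    """
    clade = set(clade_species)
    if not clade:
        raise ValueError("clade_species is empty")
    return sorted(
        og
        for og, species in og_to_species.items()
        # Only species that belong to the clade count towards the fraction.
        if len(set(species) & clade) / len(clade) >= min_fraction
    )


def braker_args(genome, proteins=None, softmasked=True):
    """Build an argument list for Braker-ES (no proteins) or Braker-EP."""
    # "braker.pl" and "--genome" are assumptions; see the lead-in.
    args = ["braker.pl", f"--genome={genome}", "--min_contig=10000"]
    if softmasked:
        args.append("--softmasking")
    if proteins is None:
        args.append("--esmode")
    else:
        args += ["--epmode", f"--prot_seq={proteins}"]
    return args


if __name__ == "__main__":
    # Toy data: five species in the clade, three candidate OGs.
    ogs = {"OG1": "abcd", "OG2": "abc", "OG3": "abcdex"}
    print(core_orthogroups(ogs, "abcde"))
    print(" ".join(braker_args("genome.fna")))
    print(" ".join(braker_args("genome.fna", proteins="clade_proteins.faa")))
```

The argument list can be passed to `subprocess.run` on a system where Braker2 is installed. Building it as a list rather than a single string avoids shell quoting problems with file paths.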
Funding
Los Alamos National Laboratory
20200562ECR