Addressing the pervasive scarcity of structural annotation in eukaryotic algae

Published: 24 July 2023| Version 2 | DOI: 10.17632/b32nw6rrfh.2
Contributors:
Taehyung Kwon

Description

[Description] This dataset contains gene sets predicted using Braker-EP and Braker-ES, together with the genome sequences used in the study:
(i) unannotated_135.tar.gz: 135 eukaryotic algal genome assemblies without structural annotations
(ii) qualified_45.tar.gz: 45 high-quality eukaryotic algal genome assemblies used for benchmarking
(iii) published_genome_sequences: 210 publicly available genome sequences (nucleotide FASTA) used in this study

[Instruction]
(i) Decompress the compressed files:
    tar -xvzf unannotated_135.tar.gz
    tar -xvzf qualified_45.tar.gz
(ii) Find the files you want:
    GFF3-formatted structural annotation = GENOME_IDENTIFIER.BRAKER_METHOD_USED.gff
    coding sequence nucleotide FASTA = GENOME_IDENTIFIER.BRAKER_METHOD_USED.cds.fna
    protein sequence FASTA = GENOME_IDENTIFIER.BRAKER_METHOD_USED.faa

[Disclosure] This data was cleared for public release by the Los Alamos National Laboratory (LA-UR-21-30120).
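
The decompress-and-locate steps above can also be scripted. The following is a minimal Python sketch, not part of the dataset itself: the function names extract_archive and find_outputs are hypothetical, and the exact strings used for GENOME_IDENTIFIER and BRAKER_METHOD_USED inside the archives are not stated here, so they are passed in as parameters rather than assumed.

```python
import tarfile
from pathlib import Path

# File suffixes as listed in the description above.
SUFFIXES = {
    "annotation": ".gff",   # GFF3-formatted structural annotation
    "cds": ".cds.fna",      # coding sequence nucleotide FASTA
    "protein": ".faa",      # protein sequence FASTA
}


def extract_archive(archive, dest):
    """Unpack a .tar.gz archive into dest (like `tar -xzf archive -C dest`)."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    return dest


def find_outputs(root, genome_id, method):
    """Return {kind: path} for files named GENOME_IDENTIFIER.BRAKER_METHOD_USED<suffix>.

    genome_id and method are supplied by the caller; this sketch does not
    assume how they are spelled inside the archives. The search is recursive
    because the directory layout of the archives is also not assumed.
    """
    found = {}
    for kind, suffix in SUFFIXES.items():
        name = f"{genome_id}.{method}{suffix}"
        matches = sorted(Path(root).rglob(name))
        if matches:
            found[kind] = matches[0]
    return found
```

A call such as extract_archive("unannotated_135.tar.gz", "unannotated_135") followed by find_outputs("unannotated_135", genome_id, method) would then return the paths that exist for one genome and one prediction method. Kinds with no matching file are simply left out of the result.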

Files

Steps to reproduce

(i) Braker-ES: We ran Braker2 gene prediction with the ab initio gene-finding algorithm GeneMark-ES (--esmode). We lowered the minimum contig size to 10 kb (--min_contig=10000) and enabled soft-masked region detection for soft-masked genomes (--softmasking).

(ii) Braker-EP: We ran Braker2 gene prediction with extrinsic evidence from protein sequences of any evolutionary distance, using GeneMark-EP/EP+, ProtHint, Spaln2, and DIAMOND. To generate the extrinsic protein hint files, we first extracted clade-specific orthologous gene sets from the OrthoDB v10.0 database for the following clades: Eukaryota, Chlorophyta, Chlorophyceae, Rhodophyta, and Bacillariophyta. For each clade, we collected all protein sequences in OrthoDB v10.0 orthologous groups (OGs) that are found in 80% or more of the species of the clade. We then ran Braker2 gene prediction with these protein hints (--epmode, --prot_seq) and the same parameters as Braker-ES (--min_contig=10000, --softmasking). For each genome, we used the protein hint sequences of the most specific clade matching its taxonomic classification.
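
The selection logic in these steps can be illustrated with a short Python sketch. It is an illustration under stated assumptions, not the scripts used in the study: the function names, the in-memory input format (a mapping from OG identifier to the species it contains, and a root-to-tip lineage list) and the use of braker.pl with --genome are assumptions. Only the flags --esmode, --epmode, --prot_seq, --min_contig and --softmasking come from the text above.

```python
# Clades for which protein hint sets were built, as listed above.
HINT_CLADES = {"Eukaryota", "Chlorophyta", "Chlorophyceae",
               "Rhodophyta", "Bacillariophyta"}


def select_ogs(og_species, clade_species, min_percent=80):
    """Keep OGs present in at least min_percent of the clade's species.

    og_species: {og_id: iterable of species ids} (hypothetical input format).
    Integer arithmetic avoids floating-point issues at the exact threshold.
    """
    clade = set(clade_species)
    if not clade:
        return []
    kept = []
    for og_id, species in og_species.items():
        present = clade & set(species)
        if len(present) * 100 >= min_percent * len(clade):
            kept.append(og_id)
    return sorted(kept)


def most_specific_clade(lineage):
    """Pick the deepest hint clade in a lineage ordered from root to tip."""
    for taxon in reversed(lineage):
        if taxon in HINT_CLADES:
            return taxon
    return None


def braker_command(genome, mode, proteins=None, softmasked=False,
                   min_contig=10000):
    """Assemble an argument list for Braker-ES ("ES") or Braker-EP ("EP")."""
    cmd = ["braker.pl", f"--genome={genome}", f"--min_contig={min_contig}"]
    if mode == "ES":
        cmd.append("--esmode")
    elif mode == "EP":
        if proteins is None:
            raise ValueError("EP mode needs a protein FASTA file")
        cmd += ["--epmode", f"--prot_seq={proteins}"]
    else:
        raise ValueError(f"unknown mode: {mode}")
    if softmasked:
        # Only for genomes that are actually soft-masked.
        cmd.append("--softmasking")
    return cmd
```

The argument list returned by braker_command can be passed to subprocess.run on a machine where Braker2 and its dependencies are installed. Building the list separately keeps the ES and EP runs on identical shared parameters, as the steps above require.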

Institutions

Los Alamos National Laboratory Bioscience Division

Categories

Genome Annotation, Eukaryotic Genetics

Funding

Los Alamos National Laboratory

20200562ECR

Licence