Global characterization of eukaryotic algal biosynthetic gene clusters using domain architectures

Published: 8 December 2023 | Version 1 | DOI: 10.17632/n9dgpr3t7d.1
Contributors:
Taehyung Kwon, Blake Hovde

Description

[antiSMASH output (GenBank format)]
antiSMASH output GenBank files of the eukaryotic algal biosynthetic gene clusters (BGCs) detected in Kwon et al. (2023). The candidate BGC identifiers are available in the attached table (Table S3. candidate BGC summary.xlsx).

[BGC clusters]
clustered candidates & clustered references: clusters including either only candidate BGCs or only reference BGCs
hit candidates: clusters including "hit" candidate BGCs and reference BGCs

[Disclosure]
This data was cleared for public release by the Los Alamos National Laboratory (LA-UR-22-20439).

Steps to reproduce

1) Detection of candidate and reference biosynthetic gene clusters

For all collected genome assemblies, we selected the longest isoform of each gene in the structural annotations using agat_sp_keep_longest_isoform.pl and merged overlapping loci using agat_convert_sp_gxf2gxf.pl (--merge_loci) from the AGAT v0.8.0 package [1]. For each genome sequence and its corresponding annotation data, we ran antiSMASH with the “--taxon=fungi”, ClusterCompare “--cc-mibig”, and ClusterBlast “--cb-knownclusters” parameters enabled. After excluding BGCs located on contigs shorter than 10 kb, we classified candidate BGCs into BGC classes and subclasses using the “product” labels predicted by antiSMASH. In addition, we retrieved reference BGCs from the MIBiG database v3.1 [2].

2) Pair-wise biosynthetic domain architecture similarities

For modular BGCs (NRPS and modular PKS), we vectorized the biosynthetic domains of each BGC into a biosynthetic domain architecture (BDA). For each BGC class, we performed pair-wise BDA alignments between all pairs of candidates and references. The alignment scoring matrix was built from the similarities between every pair of biosynthetic domain profile HMMs, computed with Profile Comparer v1.5.6 [3]. Using this scoring matrix, we ran MAFFT v7.471 [4] (--globalpair, --allowshift, --op=0, --gop=0, and --ep=0). To estimate pair-wise BDA similarity, we calculated the uncorrected p-distance for each pair-wise alignment.

3) Clustering based on biosynthetic domain architecture similarity

Using the pair-wise BDA similarity matrix, we clustered all BGCs with similarities of 0.8 or higher. We greedily grouped candidates and references into a cluster whenever one member had a BDA similarity of 0.8 or higher with any other member of that cluster. Accordingly, candidate BGCs were classified into three groups: “orphan”, “clustered”, and “hit”. Hit candidates are candidate BGCs whose BDAs are similar to those of reference BGCs (BDA similarity ≥ 0.8). Clustered candidates are candidate BGCs whose BDAs are not similar to the BDA of any reference but are similar to the BDAs of other candidates. Orphan candidates are candidate BGCs whose BDAs are not similar to those of any other candidate or reference.

1. Dainat, J., Hereñú, D. & Pucholt, P. AGAT: Another Gff Analysis Toolkit to handle annotations in any GTF/GFF format. Zenodo (2020).
2. Terlouw, B.R. et al. MIBiG 3.0: a community-driven effort to annotate experimentally validated biosynthetic gene clusters. Nucleic Acids Research 51, D603-D610 (2023).
3. Madera, M. Profile Comparer: a program for scoring and aligning profile hidden Markov models. Bioinformatics 24, 2630-2631 (2008).
4. Katoh, K. & Standley, D.M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution 30, 772-780 (2013).
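The similarity estimate in step 2 and the grouping in step 3 can be illustrated with a minimal Python sketch. It is not the code used to produce this dataset, and it rests on several assumptions that the record does not confirm: BDA similarity is taken as 1 minus the uncorrected p-distance; an aligned BDA is represented as a list of domain labels with "-" for gaps; a gap opposite a domain counts as a mismatch; the greedy grouping is read as single-linkage clustering; and a candidate is labelled "hit" when its cluster contains at least one reference. The function names (bda_similarity, cluster_bgcs, classify_candidates) and the example identifiers are hypothetical.

```python
from itertools import combinations


def bda_similarity(aligned_a, aligned_b):
    """Assumed similarity: 1 - uncorrected p-distance of two aligned BDAs.

    Columns where both sequences have a gap are skipped; a gap opposite
    a domain counts as a mismatch (an assumption, not from the source).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned BDAs must have equal length")
    compared = mismatched = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x != y:
            mismatched += 1
    return 1.0 - mismatched / compared if compared else 0.0


def cluster_bgcs(bgc_ids, similarity, threshold=0.8):
    """Single-linkage grouping: BGCs joined by any pair >= threshold."""
    parent = {b: b for b in bgc_ids}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(bgc_ids, 2):
        # Missing pairs are treated as dissimilar (similarity 0).
        s = similarity.get((a, b), similarity.get((b, a), 0.0))
        if s >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for b in bgc_ids:
        clusters.setdefault(find(b), set()).add(b)
    return list(clusters.values())


def classify_candidates(clusters, references):
    """Label each candidate BGC as 'hit', 'clustered', or 'orphan'."""
    labels = {}
    for members in clusters:
        candidates = members - references
        if members & references:
            label = "hit"        # cluster contains a reference BGC
        elif len(candidates) > 1:
            label = "clustered"  # candidates only, more than one
        else:
            label = "orphan"     # a single candidate on its own
        for c in candidates:
            labels[c] = label
    return labels


# Hypothetical example: four candidates and one reference.
ids = ["cand1", "cand2", "cand3", "cand4", "ref1"]
sim = {("cand1", "ref1"): 0.9, ("cand2", "cand3"): 0.85, ("cand1", "cand2"): 0.3}
labels = classify_candidates(cluster_bgcs(ids, sim), references={"ref1"})
print(sorted(labels.items()))
```

Under the single-linkage reading, a candidate can land in a cluster with a reference without being directly similar to it. If "hit" is meant to require direct similarity to a reference, the label would instead be assigned by checking each candidate-reference pair against the 0.8 threshold.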

Institutions

Los Alamos National Laboratory Bioscience Division

Categories

Genomics, Biosynthesis Pathway

Funding

Los Alamos National Laboratory

20200562ECR

Licence