Latent Taxonomic Signatures

Published: 29 June 2021 | Version 1 | DOI: 10.17632/8tv3dc26vg.1


The folder "Supplementary material" contains all supplementary data referenced in the manuscript "Latent Taxonomic Signatures: alignment free approach reveals semantic properties of species proteomes". The files are described below in the order in which the manuscript references them:

Supplementary Figure 1.docx – Figure describing the LSA language model scheme
Supplementary Table 1.xlsx – Excel sheet containing information on the taxa included in the LSA species model
Supplementary Figure 2.docx – Figure displaying the protein tokenization scheme, cosine similarity, and taxonomy assignment employed in the voting scenario method
Supplementary Table 2.docx – Table containing download links for FASTA files with the “train” and “test” protein sequence sets used in this study
Supplementary Table 3.docx – Table displaying the percentage of the initial taxa query space, as defined by available taxonomy lineage data, used in the SBH and VSM method-benchmarking tests
Supplementary Table 4.xlsx – Excel sheets containing both the relaxed orphan sequence dataset and the NCBI Clusters dataset used in this study
Supplementary Dataset – Zip archive containing FASTA-formatted sequences comprising the “stringent” orphan dataset from randomly selected species (the species NCBI taxId is in the file name)
Supplementary Figure 3.docx – Figure displaying a schematic overview of protein family taxonomic deconstruction and the resulting species vector “intra-class” and “inter-class” comparison
Supplementary Table 5.xlsx – Excel sheet containing sequence information from the protein family taxonomy-based groups used in the “selfish” vs “altruistic” mode of evolution experiment



Sveuciliste u Zagrebu Prehrambeno-Biotehnoloski Fakultet


Bioinformatics, Systems Biology, Taxonomy, Concerted Evolution