Latent Taxonomic Signatures
Description
This repository contains all the datasets used in the manuscript "Venoms, viruses and orphan proteins - deciphering species proteomes through semantics using Latent Taxonomic Signatures". The files are described below in the order in which they are referenced and used in the manuscript.

"concat_cluster_phylo.fasta" - a FASTA file containing 13 highly conserved mitochondrial protein sequences from 96 different squamates, 36 of them belonging to the proposed Toxicofera clade, with 36 caecilian species as an outgroup. For each species, the 13 sequences are concatenated into a single sequence construct, always in the same order.

"included_organisms.csv" - a comma-separated file containing taxonomic information on all 147,058 taxa used to construct the main LSA model described in the manuscript.

"selected_organisms.csv" - a comma-separated file containing taxonomic information on all 58,343 taxa used in the taxonomic benchmarking tests. Only taxa represented by proteomes sufficiently large to exclude 500 test query proteins were used.

"query_proteins.zip" - a .zip archive containing FASTA-formatted protein sequences from 3,217 organisms representing all 4 kingdoms used in the LSA vs BLAST taxonomic benchmarking comparison. The sequences are separated according to the 4 kingdoms: archaea (txId 2157), bacteria (txId 2), viruses (txId 10239) and eukaryota (txId 2759).

"BLAST_vs_LSA_organisms.csv" - a comma-separated file containing taxonomic information on all 3,217 taxa used in the BLAST vs LSA taxonomic benchmarking comparison.

"restricted.zip" - a .zip archive containing the orphan and Clusters protein sequences of all 1,653 taxa in FASTA format. Each sequence has a taxonomic identifier inserted in its header so that the sequences can easily be grouped taxonomically.

"orphan_organisms.csv" - a comma-separated file containing taxonomic information on all 1,653 taxa used to extract taxonomically restricted (orphan) protein sequences.

"Drosophila_melanogaster_hypothetical.fasta" - a FASTA file containing 1,334 Drosophila melanogaster protein sequences, all annotated as "hypothetical" and therefore considered taxonomically restricted on the basis of sequence alignment to established protein families and to proteins of other species.

"venomous_non_venomous.zip" - a .zip archive containing all 176 venomous (44) and non-venomous (3 x 44 = 132) animal proteome sets, together with 44 additional bacterial proteomes used as an outgroup, in the form of multi-FASTA files. NCBI taxonomic identifiers have been inserted in all FASTA sequence headers.

"venomous_vs_non_venomous.xlsx" - an Excel file containing taxonomic information on all taxa included in the venomous vs non-venomous animal LTS comparison.
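Several of the archives above ("restricted.zip", "venomous_non_venomous.zip") carry taxonomic identifiers in their FASTA headers so that sequences can be grouped by taxon. The following is a minimal Python sketch of such grouping, not the procedure used in the manuscript. The exact header layout is not documented here, so the regular expression TAXID_PATTERN (matching something like "txid=7227") and the helper names read_fasta and group_by_taxid are assumptions made for illustration; adapt the pattern to the headers actually found in the files.

```python
import re
from collections import defaultdict

# ASSUMPTION: the taxonomic identifier appears in the header as "txid"
# followed by optional separators and digits (e.g. "txid=7227").
# The real header layout is not specified here; adjust as needed.
TAXID_PATTERN = re.compile(r"txid[\s=:|]*(\d+)", re.IGNORECASE)


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def group_by_taxid(lines, pattern=TAXID_PATTERN):
    """Group FASTA records by the identifier captured by `pattern`.

    Records whose header does not match are collected under "unknown".
    """
    groups = defaultdict(list)
    for header, sequence in read_fasta(lines):
        match = pattern.search(header)
        key = match.group(1) if match else "unknown"
        groups[key].append((header, sequence))
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical records, for demonstration only.
    example = [
        ">seq1 txid=7227 hypothetical protein",
        "MKT",
        "AAL",
        ">seq2 txid=2 some protein",
        "MSS",
    ]
    for taxid, records in group_by_taxid(example).items():
        print(taxid, len(records))
```

To apply this to a file extracted from one of the archives, pass an open file handle, for example group_by_taxid(open(path)), where path is a placeholder for the extracted multi-FASTA file.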