Empirical analysis of eukaryotic ER signal peptides

Published: 5 August 2021 | Version 1 | DOI: 10.17632/p65tkrr89v.1

Description

Content:
- An Excel file containing the sequences of all experimentally verified eukaryotic signal peptides available on UniProtKB in May 2021 (as well as all human signal peptide sequences of 60 or more amino acids and selected cases from the literature), together with the probabilities for their respective n-, h-, and c-regions.
- Signal peptide sequences were extracted automatically and subjected to region probability prediction by the hidden Markov model (HMM) algorithm in SignalP 3.0. Additionally, all human entries were manually curated, whereby adjacent hydrophobic stretches were added to the h-region.
- The data contain the following information for 1,492 entries: UniProtKB identifier, review status, protein name, organism, taxonomic lineage, full protein sequence, experimental signal peptide sequence, experimental signal peptide length, predicted n-region sequence and length, predicted h-region sequence and length, predicted c-region sequence and length, experimental cleavage site, predicted cleavage site, difference between predicted and experimental cleavage site, prediction probability, author comment. The file also contains the raw data pulled from UniProtKB and the raw output from SignalP 3.0.
- The data have been separated into subsets with different cleavage site predictions and probabilities as well as different evolutionary lineages, each in its own tab of the Excel file. The tabs are: Summary; Cleavage site identical to experimental; Cleavage site different from experimental; Low probability; Humans (identical cleavage site); Humans (manually curated); Very long SPs; Vertebrates (identical cleavage site); Protostomes (identical cleavage site); Plants (identical cleavage site); Fungi (identical cleavage site). For each subset, the average and median length, standard deviation, and minimal/maximal length of each region are reported.

Implications of the data:
- The data show that eukaryotic h-regions are 11 amino acids long by both mean and median, with a maximum length of 14 amino acids for high-probability predictions (>0.5). This is considerably shorter than the hydrophobic segments of non-cleaved transmembrane helices.
- Further, the data show that the bulk of length variation in eukaryotic signal peptides stems from the h-region.
- For a substantial subset (19%) of the data, the predicted cleavage site differs from the reported experimental cleavage site, for unclear reasons.
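The per-subset statistics and the 19% figure can be checked with a few lines of Python. The following is a minimal sketch, not the authors' workflow: the function name `summarize_lengths` and the toy length list are hypothetical, the use of the sample standard deviation is an assumption (the record does not say which estimator was used), and the subset sizes are those listed under "Steps to reproduce".

```python
from statistics import mean, median, stdev


def summarize_lengths(lengths):
    """Return the statistics reported per region and subset in the Excel file."""
    return {
        "mean": mean(lengths),
        "median": median(lengths),
        "sd": stdev(lengths),  # sample standard deviation (assumption)
        "min": min(lengths),
        "max": max(lengths),
    }


# Hypothetical h-region lengths for illustration only; NOT taken from the dataset.
toy_h_lengths = [9, 10, 11, 11, 12, 13]
summary = summarize_lengths(toy_h_lengths)

# Subset sizes as stated under "Steps to reproduce".
subset_sizes = {
    "identical": 921,        # probability >0.5, predicted == experimental site
    "different": 277,        # probability >0.5, predicted != experimental site
    "low_probability": 289,  # probability <0.5
    "failed": 5,             # SP prediction failed completely
}
total = sum(subset_sizes.values())
share_different = subset_sizes["different"] / total

print(summary)
print(f"{total} entries, {share_different:.1%} with differing cleavage sites")
```

With real data, `toy_h_lengths` would be replaced by the h-region length column of the relevant tab.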

Files

Steps to reproduce

Experimentally verified SP sequences were extracted from UniProtKB (The UniProt Consortium, 2021) using the following query: annotation:(type:positional signal length:[22 TO 200] evidence:experimental) taxonomy:(2759). Entries that did not contain SP sequences were removed manually, resulting in a set of 1,492 SPs. Additionally, computationally predicted human SPs of 60 or more amino acids were extracted using the query annotation:(type:positional signal length:[60 TO 200]) AND organism:"Homo sapiens (Human) [9606]", resulting in six additional unique sequences. Lastly, a set of nine well-documented long SPs was extracted manually.

The sequence 'KFEKLKFEKL' was appended to each extracted SP, and all sequences were submitted to SignalP 3.0, the latest version of SignalP that offers SP region predictions using hidden Markov models (HMM) (Bendtsen et al., 2004). Each residue was assigned to the region with the highest HMM probability (minimum 0.2). The data were split into the following subsets: (i) sequences with cleavage site probabilities >0.5 and identical experimental and predicted cleavage sites (n=921); (ii) sequences with cleavage site probabilities >0.5 and differing experimental and predicted cleavage sites (n=277); (iii) sequences with cleavage site probabilities <0.5 (n=289); and (iv) sequences for which SP prediction failed completely (n=5). Set (i) was used to generate Fig. 6 B-C in.

To obtain a measure of the maximal possible h-region length, all 412 human entries were also manually adjusted to include stretches that were originally assigned to the neighboring regions and that consist of the hydrophobic amino acids Val, Leu, Ile, Phe, Met, or the less hydrophobic residues Trp, Ala, Tyr, Pro, or Gly. Additionally, one interruption by one polar, non-charged amino acid (Ser, Asn, Gln, His) was allowed on each side if followed by at least two hydrophobic residues. Pro-Pro, Pro-X-Pro, and Pro-Gly segments were considered disruptive for TM helices and thus not allowed, and only the first residue of such motifs was added to the h-region. Lastly, a minimal c-region length of 4 residues was assumed (as is suggested by the structure).
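The automated steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: all function names are hypothetical, the per-residue probabilities are assumed to be already parsed from the SignalP 3.0 output into dictionaries with keys "n", "h", and "c", the 0.2 minimum is treated as inclusive, and a cleavage site probability of exactly 0.5 (which the description leaves unspecified) is placed in the low-probability subset. The motif check only mirrors the Pro-Pro / Pro-X-Pro / Pro-Gly rule; the actual curation was done manually.

```python
import re

SUFFIX = "KFEKLKFEKL"  # sequence appended to each SP before submission
MIN_REGION_PROB = 0.2  # minimum HMM probability for a region assignment


def with_suffix(sp_sequence):
    """Append the fixed suffix to an extracted signal peptide sequence."""
    return sp_sequence + SUFFIX


def assign_regions(per_residue_probs):
    """Assign each residue to the region with the highest probability.

    per_residue_probs: list of dicts such as {"n": 0.9, "h": 0.1, "c": 0.0}
    (an assumed, already parsed representation of the HMM output).
    Residues whose best probability is below the minimum get None.
    """
    labels = []
    for probs in per_residue_probs:
        best = max(probs, key=probs.get)
        labels.append(best if probs[best] >= MIN_REGION_PROB else None)
    return labels


def classify(cs_probability, predicted_site, experimental_site):
    """Sort an entry into one of the four subsets (i) to (iv)."""
    if cs_probability is None or predicted_site is None:
        return "failed"  # (iv) prediction failed completely
    if cs_probability > 0.5:
        if predicted_site == experimental_site:
            return "identical"  # (i)
        return "different"  # (ii)
    return "low_probability"  # (iii); exactly 0.5 lands here by assumption


# Pro-Pro, Pro-Gly, or Pro-X-Pro, regarded as disruptive for TM helices.
DISRUPTIVE = re.compile(r"P(?:P|G|.P)")


def first_disruptive_motif(segment):
    """Return the index of the first disruptive motif, or None if absent."""
    match = DISRUPTIVE.search(segment)
    return match.start() if match else None
```

For example, `classify(0.9, 22, 25)` would place an entry in subset (ii), and `first_disruptive_motif("LLVPGAL")` points at the Pro of the Pro-Gly pair, which is the last residue that could still be added to the h-region.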

Categories

Endoplasmic Reticulum, Signaling Peptide, Alpha Helix, Secretory Pathway, Empirical Likelihood

Licence