Software implementing prediction models for 12 poly(A) signal variants in human, and feature vectors for regions covering [-300, poly(A) hexamer, +300] for these 12 signal variants.

Published: 15 November 2018 | Version 1 | DOI: 10.17632/c495bkk9vf.1
Contributors:
Fahad Albalawi, Abderrazak Chahid, Xingang Guo, Somayah Albaradei, Arturo Magaña Mora, Boris Jankovic, Mahmut Uludag, Christophe Van Neste, Magbubah Essack, Taous Meriem Laleg-Kirati, Vladimir B. Bajic

Description

We used the human genome hg38 from the GENCODE folder at the EBI FTP server (ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/GRCh38.primary_assembly.genome.fa.gz).

1) Positive set (PAS sequences). Using the GENCODE poly(A) annotation (ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.polyAs.gff3.gz), we selected the poly(A) signal annotations. With the bedtools slop option, we extended each annotated poly(A) hexamer by 300 bp upstream and 300 bp downstream, and with the bedtools getfasta option we extracted the resulting 606 bp FASTA sequences. After eliminating duplicates, we obtained 37,516 presumed true functional poly(A) signal (PAS) sequences. Sequences from this set are denoted as positive.

2) Negative set (pseudo-PAS sequences). For the negative set, we used bedtools complement to find regions lying outside the 1,000 bp upstream and downstream of each positive poly(A) hexamer. The Homer tool was then used to find matches for the 12 most frequent human poly(A) variants. Because the number of matches was very large, we sampled 37,516 pseudo-PAS sequences, drawing from each chromosome in proportion to its length and to the expected frequency of the poly(A) variants. For each PAS hexamer, we selected the same number of pseudo-PAS sequences as in the positive set.

3) Training and testing sets. From each of the positive and negative datasets, we randomly selected 20% of the sequences for the independent test set, which thus consisted of 15,020 sequences. The remaining data formed the training set of 60,012 sequences. Both sets are balanced between true PAS and pseudo-PAS sequences.

4) Processed sequences. These sequences are processed by the Matlab code we provide and converted to feature vectors of length 205.
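The window-extraction step for the positive set (take 300 bp on each side of a 6 bp hexamer, giving 606 bp, and keep only unique sequences) can be sketched as follows. This is a minimal, self-contained illustration of the logic only; the actual pipeline used bedtools slop/getfasta on hg38, and the toy genome and hexamer positions below are invented for the example:

```python
import random

# Minimal sketch of the positive-set extraction: for each annotated
# poly(A) hexamer start, take 300 bp upstream + 6 bp hexamer + 300 bp
# downstream (606 bp total) and de-duplicate. The real pipeline used
# bedtools slop/getfasta on hg38; the genome and positions here are toy data.
FLANK = 300
HEXAMER_LEN = 6

def extract_pas_windows(genome: str, hexamer_starts: list) -> list:
    """Return unique 606 bp windows centred on each hexamer."""
    seen = set()
    windows = []
    for start in hexamer_starts:
        lo = start - FLANK
        hi = start + HEXAMER_LEN + FLANK
        if lo < 0 or hi > len(genome):
            continue  # skip hexamers too close to a chromosome end
        window = genome[lo:hi]
        if window not in seen:  # eliminate duplicates, as in the dataset
            seen.add(window)
            windows.append(window)
    return windows

# Toy example: a 1000 bp "chromosome" with the canonical AATAAA signal at 400.
random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(1000))
genome = genome[:400] + "AATAAA" + genome[406:]
# Position 400 is listed twice (duplicate), 900 is too close to the edge.
windows = extract_pas_windows(genome, [400, 400, 900])
print(len(windows), len(windows[0]))  # 1 606
```

The de-duplication mirrors the step that reduced the annotated hexamers to 37,516 unique positive sequences.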
To the end of each feature vector, the class label ('1' for true PAS, '0' for pseudo-PAS) is appended. These processed sequences are provided here.

5) Prediction models. In addition, we provide software encoding the final models for prediction of each of the 12 PAS variants. For each PAS variant, there is a Linear Regression model and a Deep Neural Network (DNN) model.
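The labeling and train/test split described above (append a '1'/'0' class label to each 205-dimensional feature vector, then hold out 20% of each class so both splits stay balanced) can be sketched like this. The feature values are random placeholders, since the actual features come from the provided Matlab code, and the dataset sizes here are scaled down:

```python
import random

random.seed(42)

N_FEATURES = 205  # feature-vector length produced by the Matlab code

def make_labeled(vectors, label):
    """Append the class label ('1' for true PAS, '0' for pseudo-PAS)."""
    return [v + [label] for v in vectors]

def split_class(rows, test_frac=0.2):
    """Randomly hold out test_frac of one class for the independent test set."""
    rows = rows[:]
    random.shuffle(rows)
    n_test = int(len(rows) * test_frac)
    return rows[n_test:], rows[:n_test]

# Placeholder feature vectors standing in for the processed sequences
# (the real sets hold 37,516 sequences per class).
pos = [[random.random() for _ in range(N_FEATURES)] for _ in range(100)]
neg = [[random.random() for _ in range(N_FEATURES)] for _ in range(100)]

# Split each class separately so train and test remain balanced.
pos_train, pos_test = split_class(make_labeled(pos, 1))
neg_train, neg_test = split_class(make_labeled(neg, 0))
train = pos_train + neg_train
test = pos_test + neg_test
print(len(train), len(test))  # 160 40
```

Splitting each class separately is what keeps both the training and test sets balanced between true PAS and pseudo-PAS sequences, as in the dataset.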

Steps to reproduce

Please go to the GitHub link (https://github.com/EMANG-KAUST/PolyA_Prediction_LRM_DNN) and follow the instructions described in the "ReadMe.md" file.

Institutions

King Abdullah University of Science and Technology

Categories

DNA, Polyadenylation, Genome Annotation, Genome Sequencing
