NCodR: A multi-class SVM classification to distinguish between non-coding RNAs in Viridiplantae

Published: 4 January 2021 | Version 1 | DOI: 10.17632/87k9rssdm4.1
Contributors:
NITHIN C,
Jolly Basak,
Ranjit Prasad Bahadur

Description

Non-coding RNAs (ncRNAs) are major players in the regulation of gene expression. However, the identification and classification of ncRNAs are major bottlenecks in understanding their functional roles. This study analyses seven classes of ncRNAs in plants using sequence and secondary structure-based RNA folding measures. Support vector machines with a radial basis function kernel show the highest accuracy in discriminating ncRNAs, and the classifier is implemented as a web server, NCodR. This study will provide a reliable platform for the genome-wide prediction and classification of ncRNAs in plants and enrich our understanding of plant ncRNAs, which may be further used for crop improvement using genome-editing technology. The code is available at https://gitlab.com/sunandanmukherjee/ncodr, while the web server is available at http://www.csb.iitkgp.ac.in/applications/NCodR/index.
A dataset of ncRNAs in Viridiplantae was curated to quantify the various sequence and RNA folding measures, which were then used to classify the ncRNAs. The dataset was curated from RNACentral, which consolidates data from several databases, by filtering on ncRNA class and species name. During curation, sequences containing degenerate base letters or ambiguous letters such as ‘N’ were removed. Duplicate sequences were also removed so that each sequence in the dataset is unique. A total of 526,552 curated sequences are available in the file ncRNA_seqences.fa.tar.gz. The features calculated as described in the Materials and Methods section of the manuscript are available in the file ncRNAs_features.tar.gz.
Even though the overall dataset of ncRNAs used to train the classifier is taxonomically diverse, the data for lncRNAs come from only two major groups: monocots and eudicots (Table 2).
The limited availability of data from diverse taxonomic groups for the lncRNA class may limit the classifier's predictive ability for lncRNAs. To test this, we curated an additional dataset of lncRNAs from four species that were not included in the original training dataset. The lncRNA sequences of M. pusilla and S. moellendorffii were downloaded from the GREENC database, while those of G. sulphuraria and P. patens were obtained from CANTATAdb. The sequences and prediction results are available in the file Test_Cases_lncRNAs.tar.gz.
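The curation filters described above (dropping sequences with degenerate or ambiguous letters, then removing duplicates) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `parse_fasta` and `curate`, the allowed alphabet (A, C, G, U, and T), the case-insensitive comparison, and the choice to keep the first copy of each duplicated sequence are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

# Assumption: only these unambiguous letters are accepted. The dataset
# description does not spell out the exact alphabet that was used.
ALLOWED = set("ACGUT")


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def curate(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Drop empty or ambiguous sequences and keep one copy of each sequence."""
    seen = set()
    kept = []
    for header, seq in records:
        seq = seq.upper()
        # Reject empty sequences and any sequence with a letter outside ALLOWED
        # (for example 'N' or other degenerate codes).
        if not seq or set(seq) - ALLOWED:
            continue
        # Assumption: the first occurrence of a repeated sequence is retained.
        if seq in seen:
            continue
        seen.add(seq)
        kept.append((header, seq))
    return kept


example = """>seq1
ACGU
ACGU
>seq2
ACGNNU
>seq3
acguacgu
>seq4
GGCC
""".splitlines()

# seq2 contains 'N' and seq3 duplicates seq1, so only seq1 and seq4 remain.
print(curate(parse_fasta(example)))
```

In this sketch T and U are treated as distinct letters, so a DNA-alphabet copy of an RNA sequence would not be detected as a duplicate; whether the original curation normalised the two is not stated in the description.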

Files

Categories

Plant (Plant Biology), RNA, Classification (Machine Learning)

License