NCodR: A multi-class SVM classification to distinguish between non-coding RNAs in Viridiplantae

Published: 27 April 2022 | Version 2 | DOI: 10.17632/87k9rssdm4.2
Contributors:
,
, Jolly Basak,

Description

Non-coding RNAs (ncRNAs) are major players in the regulation of gene expression. However, the identification and classification of ncRNAs are major bottlenecks in understanding their functional roles. This study analyses seven classes of ncRNAs in plants using sequence and secondary structure-based RNA folding measures. Support vector machines with a radial basis function kernel show the highest accuracy in discriminating ncRNAs, and the classifier is implemented as a web server, NCodR. This study will provide a reliable platform for the genome-wide prediction and classification of ncRNAs in plants and enrich our understanding of plant ncRNAs, which may be further used for crop improvement using genome-editing technology. The code is available at https://gitlab.com/sunandanmukherjee/ncodr and the web server is available at http://www.csb.iitkgp.ac.in/applications/NCodR/index.

A dataset of ncRNAs in Viridiplantae was curated to quantify the various sequence and RNA folding measures, which were then used to classify ncRNAs. The dataset was curated from RNACentral, which consolidates data from several databases, by applying filters on the ncRNA class and species names. During curation, sequences containing degenerate base letters or ambiguous letters such as ‘N’ were removed. Repeated sequences were also removed so that every sequence in the dataset is unique. A total of 526,552 curated sequences is available in the file ncRNA_seqences.fa.tar.gz. A dataset of 17,026 mRNAs, curated for use as the ‘others’ category in training the classifier, is available in the file mRNA_sequences.fa.tar.gz. The mRNAs, from 271 different species, were downloaded from the PlantGDB database and clustered at a 50% identity cut-off using the CD-HIT program.
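The two curation filters described above (dropping sequences with degenerate or ambiguous letters, and removing repeats) can be illustrated with a minimal Python sketch. This is not the authors' pipeline. The function names `read_fasta` and `curate`, the conversion of T to U before checking, and the choice to keep the first copy of a repeated sequence are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

# Assumption: only the four unambiguous RNA bases are accepted, and
# DNA-style T is converted to U before the check.
VALID_BASES = frozenset("ACGU")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def curate(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Drop sequences with non-ACGU letters and keep one copy of each sequence."""
    seen = set()
    kept = []
    for header, seq in records:
        seq = seq.upper().replace("T", "U")
        # Reject empty sequences and any letter outside A, C, G, U
        # (for example N, R, Y and other degenerate codes).
        if not seq or not set(seq) <= VALID_BASES:
            continue
        # Assumption: the first occurrence of a repeated sequence is kept.
        if seq in seen:
            continue
        seen.add(seq)
        kept.append((header, seq))
    return kept


# Toy input, made up for illustration only.
example = """>seq1
ACGUACGU
>seq2
ACGNACGU
>seq3
acgtacgt
>seq4
GGCRAU
>seq5
GGCAUU
""".splitlines()

curated = curate(read_fasta(example))
for header, seq in curated:
    print(f">{header}\n{seq}")
```

In this toy input, seq2 and seq4 are dropped for containing ‘N’ and ‘R’, and seq3 is dropped as a repeat of seq1 once case and T/U are normalised, leaving seq1 and seq5.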
The features calculated as described in the Materials and Methods section of the manuscript for the ncRNA, mRNA and PLncDB lncRNA sequences are available as ncRNAs_features.tar.gz, mRNA_features.tar.gz and lncRNAs_PLncDB_features.tar.gz, respectively. Even though the overall dataset of ncRNAs used to train the classifier is diverse in terms of the taxa included, the data for lncRNAs come from only two major groups: monocots and eudicots (Table 2). The limited availability of data from diverse taxonomic groups for the lncRNA class may limit the classifier's predictive ability for lncRNAs. To test this, we curated an additional dataset of lncRNAs from four species that were not included in the original training dataset. lncRNA sequences of M. pusilla and S. moellendorffii were downloaded from the GREENC database, while those of G. sulphuraria and P. patens were downloaded from CANTATAdb. The sequences and prediction results are available in the file Test_Cases_lncRNAs.tar.gz.
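One way to summarise such a held-out test is the fraction of known lncRNAs per species that the classifier labels as lncRNA. The sketch below assumes predictions are available as (species, predicted label) pairs. The layout of the files inside Test_Cases_lncRNAs.tar.gz is not described here, so the input format, the function name `lncrna_recovery` and the label strings are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def lncrna_recovery(
    predictions: Iterable[Tuple[str, str]], target: str = "lncRNA"
) -> Dict[str, float]:
    """Return, per species, the fraction of sequences predicted as `target`.

    Every input sequence is assumed to be a known lncRNA, so this fraction
    is the per-species recall for the lncRNA class.
    """
    totals: Counter = Counter()
    hits: Counter = Counter()
    for species, label in predictions:
        totals[species] += 1
        if label == target:
            hits[species] += 1
    return {species: hits[species] / totals[species] for species in totals}


# Placeholder predictions, invented for illustration; these are not results
# from the dataset. The label "others" mirrors the mRNA category named above.
toy_predictions = [
    ("species_A", "lncRNA"),
    ("species_A", "others"),
    ("species_B", "lncRNA"),
    ("species_B", "lncRNA"),
]

recovery = lncrna_recovery(toy_predictions)
for species, fraction in sorted(recovery.items()):
    print(f"{species}: {fraction:.2f}")
```

Reporting the fraction per species rather than pooled keeps a species with many sequences from masking poor recovery in a species with few.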

Files

Categories

Plant (Plant Biology), RNA, Classification (Machine Learning)

Licence