The ICAnnoLncRNA Pipeline for a Long-Non-coding-RNA Search and Annotation in Transcriptomic Data: an Application to Maize

Published: 24 April 2023 | Version 3 | DOI: 10.17632/fnk8pmp2yz.3
Artem Pronozin, Dmitry Afonnikov


Long non-coding RNAs (lncRNAs) are RNA molecules longer than 200 nucleotides that do not encode proteins. Experimental studies have shown the diversity and importance of lncRNA functions in plants. To expand knowledge about lncRNAs in other species, computational pipelines that run standardised data-processing steps without user intervention up to the final result have recently been actively developed. These advances broaden the options for lncRNA identification and analysis. In the present work, we propose the ICAnnoLncRNA pipeline for automatic identification, classification and annotation of plant lncRNAs in transcriptomic sequences assembled from high-throughput RNA sequencing (RNA-seq) data. We analysed sequences from 15 maize transcriptome libraries from different plant tissues/organs.


Steps to reproduce

Input data.
15_lib_trinity.fasta - input data.
new_vs_old_id.tsv - new and old IDs of the transcripts.

LncRNA filtering.
Noncoding.fasta - transcripts predicted as lncRNAs by lncFinder, FASTA format.
filter_alignm.bed - alignments of the lncRNA transcripts to the reference genome, excluding transcripts with long introns, BED format.
gffcmp.filter_alignm.bed.tmap - tab-delimited file listing the most closely matching reference transcript for each query transcript (gffcompare). It gives the genomic position (classification) of candidate lncRNA genes relative to protein-coding genes; genes regarded as candidate lncRNAs belong to three types ("i", "x" and "u").
lncRNA_before_loci.bed - candidate lncRNA sequences before merging into loci, BED format.
lncRNA_loci.bed - loci of the candidate lncRNAs, BED format.
gff.filter_alignm.bed.tmap - tab-delimited file listing the most closely matching reference transcript for each query transcript (gffcompare). It gives the genomic position (classification) of lncRNA genes relative to the loci of the candidate lncRNAs; genes regarded as novel lncRNAs belong to type "=".
lncRNA_after_loci.bed - candidate lncRNA sequences that match the lncRNA loci completely (gffcompare class "="), BED format.
Lnc_aling_with_TE.tsv - candidate lncRNA sequences that map to transposable elements (TEs), TSV format.
new_lncrna.fasta - identified lncRNA transcripts, FASTA format (final result).
new_LncRNA_loci.bed - loci of the identified lncRNA transcripts, BED format (final result).

LncRNA annotation.
blast.outfmt6 - Blastn results; contains homologs among the known lncRNA sequences of the LncAPDB library.
LncAPDB.fasta - lncRNA sequences of the LncAPDB library, FASTA format.
index_and_newindex.fasta - indices from the PNRD, CANTATAdb, GREENC, PlncDB and EVLncRNA databases matched to the new indices of the LncAPDB library.
LncAPDB_vs_blast.csv - known lncRNAs that were predicted.
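
The selection by gffcompare class code described for the .tmap files ("i", "x" and "u" for candidate lncRNAs, "=" for matches to lncRNA loci) can be illustrated with a minimal Python sketch. This is not the pipeline's own code. It assumes the .tmap file has a header row with the columns class_code and qry_id, as in standard gffcompare output; the function name and the toy transcript IDs are hypothetical.

```python
import csv
import io

# Class codes named in the description above for candidate lncRNAs.
CANDIDATE_CODES = {"i", "x", "u"}


def select_by_class_code(tmap_handle, wanted_codes):
    """Return query transcript IDs whose class code is in wanted_codes.

    Assumes a tab-delimited table with a header row containing the
    columns "class_code" and "qry_id".
    """
    reader = csv.DictReader(tmap_handle, delimiter="\t")
    return [row["qry_id"] for row in reader if row["class_code"] in wanted_codes]


# Toy table with hypothetical IDs; only the columns used above matter here.
example = io.StringIO(
    "ref_gene_id\tref_id\tclass_code\tqry_gene_id\tqry_id\n"
    "-\t-\tu\tG1\tT1\n"
    "geneA\ttxA\t=\tG2\tT2\n"
    "geneB\ttxB\tx\tG3\tT3\n"
    "geneC\ttxC\tj\tG4\tT4\n"
)
print(select_by_class_code(example, CANDIDATE_CODES))  # ['T1', 'T3']
```

To apply the same idea to one of the files listed above, open it with open("gffcmp.filter_alignm.bed.tmap", newline="") and pass the handle in place of the toy table. Passing {"="} as the second argument corresponds to the later step that keeps complete matches to lncRNA loci.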


Institut citologii i genetiki SO RAN


Transcriptome, Pipeline, Database, Long Noncoding RNA