Identification of Stable Genes in Arabidopsis thaliana for Expression Analysis Experiment Normalization: A Bioinformatic Pipeline

Published: 30 October 2023 | Version 1 | DOI: 10.17632/2nk727nkx7.1
Jason Suescum H


We have developed a bioinformatic pipeline to identify the most stable genes in a collection of curated transcriptomic experiments for the model plant Arabidopsis thaliana. The pipeline consists of two parts. The first part is a mathematical algorithm that filters genes by the numerical conditions a proper reference gene should meet: minimal expression, limited variation, and replicability among experiments. The second part is a biological filter that uses the biological processes associated with known reference genes as landmarks to select candidate genes, thus avoiding genes with uncharacterized functions and genes whose expression is volatile because the underlying biological processes shift.


Steps to reproduce

The methodology begins with the selection of data from the Plant Public RNA-seq Database (ARS), a repository of approximately 28,000 RNA-sequencing experiments related to Arabidopsis. From this dataset, 8,029 curated data points are chosen to develop the bioinformatic pipeline. These data are categorized into three groups: tissue-specific experiments, biotic stress experiments, and abiotic stress experiments. Within these groups, priority is given to experiments that involve specific stressors or treatments and include corresponding control replicates. Non-coding elements are removed, leaving 27,655 protein-coding genes.

The mathematical algorithm consists of a series of three filters. The first filter retains genes with expression equal to or greater than 1.0 FPKM in at least 75% of all libraries for each condition; genes that do not meet this criterion are discarded. The second filter computes the median expression in FPKM for each gene across all libraries associated with a particular condition or tissue, including control replicates; genes whose median expression is below 20.0 FPKM are eliminated. The third filter introduces a proportion index for Most Stable Genes (MSG), calculated from the mean and standard deviation of the FPKM values. This index is subject to a parameter (γ) that sets the maximum allowed variation.

The biological filter uses ShinyGO v0.77 for functional enrichment analysis. For each category of treatments under examination, the analysis focuses on molecular function terms and enriched KEGG metabolic pathways. Terms are considered significantly enriched when their adjusted p-value from the hypergeometric test is below 0.05.
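The three mathematical filters described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the function names, the nested-dictionary input layout, and the toy FPKM values are hypothetical. The description does not give the formula for the MSG proportion index, so the sketch substitutes the coefficient of variation (standard deviation divided by mean) compared against γ as a stand-in.

```python
from statistics import mean, median, pstdev


def passes_expression_filters(fpkm_by_condition, min_fpkm=1.0,
                              min_fraction=0.75, min_median=20.0):
    """Apply filters 1 and 2 to one gene.

    fpkm_by_condition maps a condition (or tissue) name to the list of
    FPKM values of that gene, one value per library, controls included.
    """
    for values in fpkm_by_condition.values():
        # Filter 1: expressed at >= 1.0 FPKM in at least 75% of libraries.
        expressed = sum(1 for v in values if v >= min_fpkm)
        if expressed / len(values) < min_fraction:
            return False
        # Filter 2: median expression of at least 20.0 FPKM.
        if median(values) < min_median:
            return False
    return True


def variation_index(values):
    """ASSUMPTION: coefficient of variation used in place of the MSG index,
    whose exact formula is not given in the description."""
    m = mean(values)
    return pstdev(values) / m if m > 0 else float("inf")


def select_stable_genes(expression, gamma):
    """Return the genes that pass all three filters, in input order."""
    stable = []
    for gene, by_condition in expression.items():
        if not passes_expression_filters(by_condition):
            continue
        all_values = [v for vals in by_condition.values() for v in vals]
        # Filter 3: variation must not exceed the gamma threshold.
        if variation_index(all_values) <= gamma:
            stable.append(gene)
    return stable


# Toy data with hypothetical gene names and FPKM values.
expression = {
    "geneA": {"root": [25.0, 26.0, 24.0, 25.0], "leaf": [24.0, 25.0, 26.0, 25.0]},
    "geneB": {"root": [0.5, 0.2, 30.0, 40.0], "leaf": [25.0, 25.0, 25.0, 25.0]},
    "geneC": {"root": [5.0, 6.0, 5.5, 6.5], "leaf": [5.0, 6.0, 5.0, 6.0]},
    "geneD": {"root": [20.0, 100.0, 25.0, 200.0], "leaf": [30.0, 150.0, 22.0, 90.0]},
}

print(select_stable_genes(expression, gamma=0.2))
```

In this toy input, geneB fails the first filter (only half of its root libraries reach 1.0 FPKM), geneC fails the median threshold, and geneD passes both expression filters but varies too much for γ = 0.2, so only geneA remains. Raising γ loosens the third filter and admits more variable genes.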
A list of genes previously recognized as housekeeping genes in plant species is compiled, and their Gene Ontology (GO) categories are assigned according to established parameters. These genes are then compared with those from the functional enrichment analysis. Genes with GO ID category p-values below 0.05 that are shared by both groups are retained. Duplicated genes and GO terms are then removed, and the results are summarized. The methodology thus identifies and validates reference genes in Arabidopsis by integrating data preparation, mathematical filters, and biological validation.
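The statistical logic behind the biological filter can be illustrated with a short Python sketch. It is not a reimplementation of ShinyGO: it only shows an upper-tail hypergeometric enrichment test, a multiple-testing adjustment, and the intersection with housekeeping-associated terms. The function names, term labels, and counts are hypothetical, and the Benjamini-Hochberg procedure is an assumption, since the description does not state which adjustment method is applied.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """Upper-tail hypergeometric p-value P(X >= k).

    N: genes in the background, K: background genes annotated to the term,
    n: genes in the candidate list, k: candidate genes annotated to the term.
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def benjamini_hochberg(pvalues):
    """ASSUMPTION: Benjamini-Hochberg adjustment; the description only says
    'adjusted p-value' without naming the method."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def shared_significant_terms(adjusted_by_term, housekeeping_terms, alpha=0.05):
    """Terms enriched below alpha that also occur among the housekeeping terms.
    Building sets removes duplicated terms as a side effect."""
    significant = {t for t, p in adjusted_by_term.items() if p < alpha}
    return sorted(significant & set(housekeeping_terms))


# Hypothetical counts: (annotated in background, annotated in candidate list).
background_size, candidate_size = 1000, 50
term_counts = {"TERM_A": (40, 10), "TERM_B": (100, 5), "TERM_C": (30, 8)}

terms = list(term_counts)
raw = [hypergeom_pvalue(term_counts[t][1], candidate_size,
                        term_counts[t][0], background_size) for t in terms]
adjusted_by_term = dict(zip(terms, benjamini_hochberg(raw)))

# Hypothetical terms associated with known housekeeping genes.
housekeeping_terms = ["TERM_A", "TERM_B", "TERM_A"]
shared = shared_significant_terms(adjusted_by_term, housekeeping_terms)
print(shared)
```

Here TERM_A and TERM_C are strongly over-represented in the candidate list, while TERM_B appears about as often as chance predicts. Only TERM_A is both significant and associated with housekeeping genes, so it is the one retained.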


Pontificia Universidad Javeriana - Cali


Bioinformatics, RNA Sequencing, Pipeline