List of potential pleiotropy mutations in SARS-CoV-2 Evolution

Published: 3 May 2024 | Version 1 | DOI: 10.17632/psshk9m8cb.1
Jui-Hung Tai,


By analyzing a large set of SARS-CoV-2 sequences (~2 million) collected from early 2020 to mid-2021, we found that mutations reaching high frequency within hosts are sometimes detrimental during between-host transmission. This highlights potentially opposing selection pressures within versus between hosts. We also identified a group of nonsynonymous changes whose frequencies are significantly higher than the neutral expectation, yet which have never experienced clonal expansion (10⁻³ to 10⁻²). These mutations are likely maintained by pleiotropy, a condition in which mutations increase some components of fitness at a cost to others.


Steps to reproduce

Data collection and preprocessing are as previously described (Ruan, et al. 2022). In brief, we downloaded 1,929,395 SARS-CoV-2 genomes from the GISAID database as of July 5, 2021 and aligned them to the Wuhan-Hu-1 reference sequence (EPI_ISL_402125) using MAFFT (--auto --keeplength; Katoh and Standley 2013). We used snp-sites (-v; Page, et al. 2016) to identify single nucleotide polymorphisms (SNPs) and bcftools (merge --force-samples -O v) to merge the VCF files. We identified 65,673 SNPs in coding regions. Because our dataset includes most of the possible mutations at each site, mutation counts in each category mainly reflect the nucleotide composition of the virus genome and do not directly reflect mutation prevalence; the frequency of each nucleotide change was therefore used as a proxy for mutation prevalence across types. To define sub-high frequency mutations, we first calculated the average mutation frequency of four-fold degenerate sites (7.4×10⁻⁴). With 4,236 four-fold degenerate sites and a standard deviation of 1.26×10⁻², the 95% confidence interval for the four-fold degenerate site frequency is 3.6×10⁻⁴ to 1.1×10⁻³. Mutations with a frequency above 10⁻³ are considered sub-high frequency for convenience. Because the sample size varied across collection months, we weighted the mutation frequency for each month by its respective sample size. The table is sorted by coefficient of variation.
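
The threshold arithmetic in the steps above can be checked with a short Python sketch. It assumes the confidence interval is the usual normal approximation (mean ± 1.96 × SD/√n), which reproduces the quoted interval but is not stated explicitly in the description. The function names, the example monthly counts, and the choice of an unweighted population standard deviation for the coefficient of variation are illustrative assumptions, not the authors' code.

```python
import math
import statistics


def mean_ci95(mean, sd, n, z=1.96):
    """Normal-approximation 95% CI for a mean: mean +/- z * sd / sqrt(n)."""
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


def weighted_frequency(monthly_counts, monthly_samples):
    """Mean of monthly mutation frequencies, weighted by monthly sample size."""
    freqs = [c / n for c, n in zip(monthly_counts, monthly_samples)]
    total = sum(monthly_samples)
    return sum(f * n for f, n in zip(freqs, monthly_samples)) / total


def coefficient_of_variation(monthly_counts, monthly_samples):
    """SD / mean of the monthly frequencies (unweighted; an assumption)."""
    freqs = [c / n for c, n in zip(monthly_counts, monthly_samples)]
    return statistics.pstdev(freqs) / statistics.mean(freqs)


def is_sub_high(frequency, threshold=1e-3):
    """Apply the 10^-3 cutoff used to call a mutation sub-high frequency."""
    return frequency > threshold


# Values quoted in the description for four-fold degenerate sites.
low, high = mean_ci95(mean=7.4e-4, sd=1.26e-2, n=4236)
print(f"95% CI: {low:.1e} to {high:.1e}")  # about 3.6e-04 to 1.1e-03

# Hypothetical monthly data for one mutation (counts and genomes sampled).
counts = [3, 40, 25]
samples = [1_000, 20_000, 10_000]
freq = weighted_frequency(counts, samples)
print(freq, is_sub_high(freq), coefficient_of_variation(counts, samples))
```

Weighting each monthly frequency by its sample size is algebraically the same as dividing the total mutation count by the total number of genomes, so months with few sequences cannot dominate the estimate.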


National Taiwan University College of Medicine


Data Analysis, Severe Acute Respiratory Syndrome Coronavirus 2