SARS-CoV-2 GISAID isolates (2020-05-24) genotyping VCF by mutation

Published: 5 June 2020 | Version 1 | DOI: 10.17632/jv87xwj7fv.1
Doğa Eskier,


VCF file containing filtered mutated sites in SARS-CoV-2 genomes obtained from GISAID EpiCoV, with each individual mutation listed in a separate row. The columns correspond to viral genome accession ID, nucleotide position in the genome, mutation ID (left blank in all rows), reference nucleotide, identified mutation, quality, filter, and information columns (all left blank), format (GT in all rows), a column corresponding to the reference genome (all 0, referring to the reference nucleotide column), and one column per isolate genome. Each row identifies the nucleotide at the position given in the POS column and records, for each isolate, whether it is non-mutant (0) or carries the mutant indicated in the identified mutation column (1). The file is tab delimited, with 17345 rows including the header row, and 19665 columns.

The file was generated to test whether two of the most common mutations in the SARS-CoV-2 genome, 14408 C > T and 23403 A > G, significantly affect the mutation density of the virus over time, and whether they affect the synonymous and nonsynonymous mutation densities differently. We found that the densities of nonsynonymous and synonymous mutations show significant differences over the early and late periods between WT (wildtype for both nucleotides of interest) and MT (mutant for both nucleotides of interest) samples, with nonsynonymous mutations in particular showing a higher increase in density in the late period in MT samples.

These results were obtained by identifying the earliest co-occurrence of the two mutations in the two countries with the highest number of mutations, separating the isolates from these countries that were sequenced after the earliest co-occurrence date into two time groups, early and late, selecting those that fit one of the two categories, WT or MT, and sorting all known mutations into synonymous and nonsynonymous categories. The relationships between these categories, along with the density of synonymous and nonsynonymous SNVs across the genome, per gene locus, and in the RdRp coding region, were analysed across time.
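As an illustration of the column layout and the WT/MT definitions above, the following is a minimal Python sketch, not the code used to produce the results. It assumes the nine fixed VCF columns come first, followed by the reference genome column and the isolate columns, and it relies on column position rather than on header names, which are not given in this description. The function names, the toy rows, the isolate names (iso_A and so on), and the accession string in the first column are hypothetical placeholders.

```python
import csv
import io

# Assumption: nine fixed columns (accession, position, ID, reference,
# mutation, quality, filter, info, format) precede the genotype columns.
FIXED_COLUMNS = 9


def read_genotypes(handle):
    """Return (column_names, calls) from a tab-delimited genotype table.

    calls maps (position, reference, mutation) to a list of 0/1 values,
    one per genotype column (reference genome first, then isolates).
    """
    reader = csv.reader(handle, delimiter="\t")
    names = None
    calls = {}
    for row in reader:
        # Skip blank lines and any "##" meta lines, should the file have them.
        if not row or row[0].startswith("##"):
            continue
        if names is None:
            names = row[FIXED_COLUMNS:]  # first remaining row holds the names
            continue
        key = (int(row[1]), row[3], row[4])
        calls[key] = [int(g) for g in row[FIXED_COLUMNS:]]
    return names, calls


def classify(names, calls, sites=((14408, "C", "T"), (23403, "A", "G"))):
    """Label each column WT (0 at both sites), MT (1 at both), else None."""
    first, second = (calls[site] for site in sites)
    labels = {}
    for name, a, b in zip(names, first, second):
        if a == 0 and b == 0:
            labels[name] = "WT"
        elif a == 1 and b == 1:
            labels[name] = "MT"
        else:
            labels[name] = None  # mutant at only one of the two sites
    return labels


# Toy input with made-up isolates; the real file has 19665 columns.
toy = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tREF_GENOME\tiso_A\tiso_B\tiso_C\n"
    "placeholder_accession\t14408\t\tC\tT\t\t\t\tGT\t0\t0\t1\t1\n"
    "placeholder_accession\t23403\t\tA\tG\t\t\t\tGT\t0\t0\t1\t0\n"
)
names, calls = read_genotypes(io.StringIO(toy))
labels = classify(names, calls)
print(labels)
```

To try this on the actual file, pass an open file handle in place of the `io.StringIO` object. Isolates labelled `None` carry only one of the two mutations and fall outside both the WT and MT groups as defined above.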



Virology, Genomics, Computational Genomics, Biostatistics, Computational Biology