SARS-CoV-2 GISAID isolates (2020-06-17) genotyping VCF

Published: 25-07-2020| Version 1 | DOI: 10.17632/63t5c7xb4c.1
Doğa Eskier,
Aslı Suner,
Yavuz Oktay,
Gokhan Karakulah


VCF file containing filtered mutated sites in SARS-CoV-2 genomes obtained from GISAID EpiCoV, separated by individual mutations. The columns correspond to viral genome accession ID, nucleotide position in the genome, mutation ID (left blank in all rows), reference nucleotide, identified mutation, quality, filter, and information columns (all left blank), format (GT in all rows), column corresponding to reference genome (all 0, referring to reference nucleotide column), and columns corresponding to isolate genomes, with each row identifying the nucleotide in the POS column, and whether it is non-mutant (0), or the mutant indicated in the identified mutation column (1). The file is tab delimited, with 22546 rows including the names, and 30690 columns. The file was generated to test the hypothesis whether the five most common mutations in the SARS-CoV-2 genome replication complex proteins, nsps 7, 8, 12, and 14, significantly affect the mutation density of the virus over time and whether these affect the synonymous and nonsynonymous mutation densities differently. We discovered that mutations in nsp14, an exonuclease with error correcting capabilities, are most likely to be correlated with increased mutational load across the genome compared to wildtype SARS-CoV-2. These results were obtained by identifying the frequency of mutations across all isolates in genomic regions of interest, analyzing which of the twenty mutations (five per nsp) have a statistically meaningful relationship with the mutation density in the M and E genes (chosen due to being under little selective pressure), and identifying the synonymous and nonsynonymous genomic SNV density for isolates with any of the statistically meaningful mutations, as well as isolates with none of the identified mutations.