SARS-CoV-2 GISAID UK-US isolates (2020-09-07) genotyping VCF

Published: 16-11-2020 | Version 1 | DOI: 10.17632/5dfj2hhnng.1
Necla Koçhan,
Doğa Eskier,
Aslı Suner,
Gökhan Karakülah,
Yavuz Oktay


VCF files containing filtered mutated sites in SARS-CoV-2 genomes obtained from GISAID EpiCoV and submitted from the UK and the US, with one row per individual mutation. The columns correspond to viral genome accession ID, nucleotide position in the genome (POS), mutation ID (left blank in all rows), reference nucleotide, identified mutation, quality, filter, and information columns (all left blank), format (GT in all rows), a column corresponding to the reference genome (all 0, referring to the reference nucleotide column), and one column per isolate genome. Each row identifies a site by its position in the POS column and records, for every isolate, whether it is non-mutant (0) or carries the mutant indicated in the identified mutation column (1). The files are tab delimited: the UK file has 12696 rows, including the header row of column names, and 18135 columns, and the US file has 15588 rows, including the header row, and 16277 columns. The files were generated to test whether the different SARS-CoV-2 genes or protein coding regions are under different positive or negative selection in 14408C>T / 23403A>G double mutants compared with double wildtype isolates, using mutation rate models, and whether regional distributions affect the mutation rates. Our findings show that the RdRp coding region and the S gene show the highest amount of selection across viral generations, and that the synonymous and nonsynonymous mutation rates for individual genes can differ between countries.
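As an illustration of the column layout described above, the following is a minimal Python sketch that counts, for each site, how many isolates carry the mutant allele. It assumes nine fixed columns followed by the reference genome column and then the isolate columns, and that the first non-metadata row holds the column names. The function name `count_mutants`, the header labels, the placeholder accession `REF_ACCESSION`, and the isolate names in the sample are hypothetical and are not taken from the dataset.

```python
import csv
import io

# Assumed layout: accession, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT
FIXED_COLUMNS = 9


def count_mutants(handle):
    """Yield (position, ref, alt, mutant_count, isolate_count) for each site."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Skip any "##" metadata lines (if present) and take the row of column names.
    header = next(row for row in reader if row and not row[0].startswith("##"))
    # The column right after the fixed ones is the reference genome (all 0);
    # isolate genomes start one column later.
    isolates = header[FIXED_COLUMNS + 1:]
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        position, ref, alt = int(row[1]), row[3], row[4]
        calls = row[FIXED_COLUMNS + 1:]
        mutant_count = sum(call == "1" for call in calls)
        yield position, ref, alt, mutant_count, len(isolates)


# Tiny made-up sample in the described layout (not real data).
sample = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tREFGENOME\tISO_A\tISO_B\tISO_C\n"
    "REF_ACCESSION\t14408\t\tC\tT\t\t\t\tGT\t0\t1\t1\t0\n"
    "REF_ACCESSION\t23403\t\tA\tG\t\t\t\tGT\t0\t1\t0\t0\n"
)

results = list(count_mutants(io.StringIO(sample)))
for position, ref, alt, mutant_count, isolate_count in results:
    print(f"{position}{ref}>{alt}: {mutant_count}/{isolate_count} isolates mutant")
```

To run the same sketch on one of the actual files, an open file handle (for example `open(path, newline="")`, with `path` pointing to the downloaded UK or US file) can be passed in place of the in-memory sample. Rows are processed one at a time, which keeps memory use modest despite the large number of isolate columns.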