SARS-CoV-2 GISAID isolates (2020-05-05) genotyping VCF

Published: 19-05-2020 | Version 1 | DOI: 10.17632/x4t94w9njt.1
Doğa Eskier,
Gökhan Karakülah,
Aslı Suner,
Yavuz Oktay


VCF file containing filtered mutated sites in SARS-CoV-2 genomes obtained from GISAID EpiCoV. The columns correspond to, in order: viral genome accession ID; nucleotide position in the genome (POS); mutation ID (left blank in all rows); reference nucleotide (REF); potential mutations, numbered in order of appearance (ALT); quality, filter, and information columns (all left blank); format (GT in all rows); a column corresponding to the reference genome (0 in all rows, referring to the reference nucleotide column); and one column per isolate genome. Each row identifies a nucleotide position, given in the POS column, and each isolate column states whether that isolate is non-mutant at the position (0) or, if mutant, which mutation it carries (an integer equal to or greater than 1, corresponding to the mutation's order of appearance in the ALT column). The file is tab-delimited, with 5892 rows including the column names, and 11911 columns.

The file was generated to test whether mutations in the RdRp protein of SARS-CoV-2 significantly affect the mutation rate of the virus, by examining their correlation with the mutation load of the membrane or envelope proteins. The results indicate that the most common mutation, 14408C>T, increases the mutation rate, while the other common mutations can lower it. These results were obtained by counting, for each mutated nucleotide, the isolates with a standard nucleotide call disagreeing with the reference sequence, and separating them into categorical variables to be analysed along with isolate date and location.
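The column layout described above can be read with a short script. The following is a minimal Python sketch, not the authors' pipeline: it assumes the first non-"##" row holds the column names, that the nine fixed columns are followed by a single reference-genome column, and that every remaining column is an isolate. The function name `count_mutant_isolates`, the sample column names, the accession `REF_ACC`, and the second sample row (position 100) are hypothetical and exist only to illustrate the counting step.

```python
import csv
import io

# Fixed columns as described: accession, POS, ID, REF, ALT,
# QUAL, FILTER, INFO, FORMAT (assumed to be in this order).
FIXED_COLUMNS = 9


def count_mutant_isolates(handle):
    """Return {position: number of isolates whose call is >= 1}."""
    reader = csv.reader(handle, delimiter="\t")
    # Assumption: isolates start after the single reference-genome column.
    first_isolate = FIXED_COLUMNS + 1
    counts = {}
    header_seen = False
    for row in reader:
        if not row or row[0].startswith("##"):
            continue  # skip blank lines and any VCF meta lines
        if not header_seen:
            header_seen = True  # first remaining row holds the column names
            continue
        position = int(row[1])
        calls = row[first_isolate:]
        # 0 means the reference nucleotide; 1, 2, ... index the ALT entries.
        # Non-numeric calls (e.g. "." for missing) are ignored.
        counts[position] = sum(
            1 for call in calls if call.isdigit() and int(call) >= 1
        )
    return counts


# Tiny made-up example with three isolates; not real data from the file.
sample = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tREFERENCE\tiso_A\tiso_B\tiso_C\n"
    "REF_ACC\t14408\t\tC\tT\t\t\t\tGT\t0\t1\t0\t1\n"
    "REF_ACC\t100\t\tA\tG,T\t\t\t\tGT\t0\t2\t.\t0\n"
)
counts = count_mutant_isolates(io.StringIO(sample))
print(counts)
```

To run it on the actual file, the `io.StringIO(sample)` argument would be replaced by an open file handle, for example `open(path, newline="")`, where `path` points to the downloaded file. The per-site counts correspond only to the first step mentioned in the description (counting isolates that disagree with the reference); the grouping into categorical variables and the analysis by date and location are not covered here.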