Variation dataset of papaya
The VCF file contains population variation data for papaya, comprising filtered SNPs and InDels across the whole papaya genome.
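As a quick check on a file of this kind, the SNP and InDel records can be tallied from the REF and ALT columns. The sketch below is illustrative only and is not part of the original workflow. The function names are hypothetical, and the rule used (all alleles one base long means SNP, any length difference means InDel) is a simplifying assumption.

```python
from collections import Counter
from typing import Iterable


def classify_variant(ref: str, alt: str) -> str:
    """Classify one VCF record from its REF and ALT columns (simplified rule)."""
    alts = [a for a in alt.split(",") if a not in (".", "*")]
    # Symbolic alleles such as <DEL> are not interpreted in this sketch.
    if not alts or any(a.startswith("<") for a in alts):
        return "other"
    if len(ref) == 1 and all(len(a) == 1 for a in alts):
        return "SNP"
    if any(len(a) != len(ref) for a in alts):
        return "InDel"
    return "other"  # e.g. multi-nucleotide substitutions of equal length


def count_variant_types(lines: Iterable[str]) -> Counter:
    """Count SNP / InDel / other records in an iterable of VCF text lines."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        counts[classify_variant(fields[3], fields[4])] += 1
    return counts
```

To run it on a plain-text VCF, pass an open file handle, for example `count_variant_types(open("variants.vcf"))`, where the file name is a placeholder.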
Steps to reproduce
Across the 86 papaya accessions, 4.3 billion 100 bp or 250 bp paired-end reads (734 Gb) were sequenced on an Illumina HiSeq 2500. Duplicated and low-quality reads were filtered out using Picard tools (picard.sourceforge.net; v2.0.1) and Trimmomatic [1] (a Java program, v0.38), respectively. Filtered reads were aligned to the reference genome with BWA-MEM (v0.7.17-r1188), and the SAM output was converted to BAM format with the SAMtools suite [2].
Variants were called with the Genome Analysis Toolkit (GATK v3.5, https://software.broadinstitute.org/gatk/), following its recommended workflow for variant discovery. BAM files were locally realigned with IndelRealigner to eliminate erroneous mismatches around small insertions and deletions. SNPs and InDels were then called with HaplotypeCaller using default parameters, treating the accessions as putative diploids.
The resulting raw VCF was filtered to remove variants with quality scores below 100, with a minimum allele frequency of >2 and a max missing data value of 0.9. The final variant list contains 1,535,099 high-quality single-nucleotide polymorphisms (SNPs) and 110,477 small insertions and deletions (less than 8 bp). The identified SNPs were annotated as variants in intergenic regions, coding sequences, or introns using SnpEff [3] (v5.0).
Methods-only references
1. Bolger, A.M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114-2120 (2014).
2. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754-1760 (2009).
3. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics 81, 559-575 (2007).
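The hard-filtering step described in the steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' actual command: the tool used for filtering is not named, so the sketch reads the ">2" threshold as a minor allele count greater than 2 and the 0.9 value as a minimum genotype call rate of 0.9. The function and parameter names are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def minor_count_and_call_rate(sample_fields: List[str]) -> Tuple[int, float]:
    """Return (minor allele count, fraction of samples with a called genotype).

    Assumes GT is the first colon-separated subfield of each sample column.
    """
    counts = {}
    called = 0
    for field in sample_fields:
        alleles = field.split(":", 1)[0].replace("|", "/").split("/")
        if "." in alleles:
            continue  # missing or partially missing genotype
        called += 1
        for allele in alleles:
            counts[allele] = counts.get(allele, 0) + 1
    call_rate = called / len(sample_fields) if sample_fields else 0.0
    # Minor allele count here = all observed alleles minus the most common one.
    minor = sum(counts.values()) - max(counts.values()) if counts else 0
    return minor, call_rate


def passes_filters(line: str, min_qual: float = 100.0,
                   minor_count_above: int = 2,
                   min_call_rate: float = 0.9) -> bool:
    """Apply the three thresholds (as interpreted above) to one VCF record."""
    fields = line.rstrip("\n").split("\t")
    qual = fields[5]
    if qual == "." or float(qual) < min_qual:
        return False
    minor, call_rate = minor_count_and_call_rate(fields[9:])
    return call_rate >= min_call_rate and minor > minor_count_above


def filter_vcf(lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines unchanged, plus records that pass all filters."""
    for line in lines:
        if line.startswith("#") or passes_filters(line):
            yield line
```

If the original thresholds were meant differently (for example, as an allele frequency rather than a count), only the comparison in `passes_filters` would need to change.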