Contrasting population genetics of co-endemic cattle- and buffalo-derived Theileria annulata
Description
The Illumina Mi-Seq post-run processing uses the barcoded indices to split all sequences by sample and generate FASTQ files. These were analysed using Mothur v1.39.5 software (Schloss et al., 2009) with modifications to the standard operating procedures for Illumina Mi-Seq (Kozich et al., 2013) in the Command Prompt pipeline, as described by Rehman et al. (2020) and Sargison et al. (2019). Briefly, the two sets of raw paired-end reads for each T. annulata parasite population were combined using the ‘make.contigs’ command, which requires ‘stability.files’ as an input. The ‘make.contigs’ command extracts sequence and quality score data from the FASTQ files, creates the reverse complement of the reverse read and joins the read pairs into contigs. It aligns the pairs of sequence reads and compares the alignments to identify any positions where the two reads disagree. Sequences with ambiguous bases were then removed using the ‘screen.seqs’ command. The resulting dataset was aligned with a T. annulata cytochrome b reference sequence library prepared from the positive controls (Supplementary Data S1) using the ‘align.seqs’ command. To confirm that the filtered sequences overlapped the same region of the reference sequence library, the ‘screen.seqs’ command was run again to show the sequences ending at the 517 bp position of cytochrome b. Once all sequence reads were classified as T. annulata, a count list of the consensus sequences of each population was created using the ‘unique.seqs’ command. The ‘pre.cluster’ command was then used to find sequences with up to two differences and to merge them into groups based on their abundance. Chimeras were identified and removed using the ‘chimera.vsearch’ command.
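The screening, dereplication and merging steps described above can be illustrated with a minimal Python sketch. This is not Mothur's implementation: the function names are hypothetical, the sequences are assumed to be already aligned to equal length, and differences are counted as simple per-position mismatches.

```python
from collections import Counter


def screen_ambiguous(seqs):
    """Keep only sequences made of A, C, G, T or alignment gap characters."""
    allowed = set("ACGT-.")
    return [s for s in seqs if set(s.upper()) <= allowed]


def dereplicate(seqs):
    """Collapse identical sequences into a {sequence: read count} table."""
    return Counter(s.upper() for s in seqs)


def mismatches(a, b):
    """Count positions at which two equal-length aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def pre_cluster(counts, max_diffs=2):
    """Merge each sequence into the first more abundant sequence that lies
    within max_diffs mismatches (assumed rule, most abundant first)."""
    # Process sequences from most to least abundant; ties broken by sequence.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    merged = {}
    for seq, n in ordered:
        for rep in merged:
            if len(rep) == len(seq) and mismatches(rep, seq) <= max_diffs:
                merged[rep] += n  # absorb the rarer sequence's reads
                break
        else:
            merged[seq] = n  # no close representative: start a new group
    return merged


# Toy reads (invented for illustration only).
reads = ["ACGTACGT"] * 5 + ["ACGTACGA"] * 2 + ["ACGTNCGT"] + ["TTTTACGT"]
clean = screen_ambiguous(reads)
groups = pre_cluster(dereplicate(clean), max_diffs=2)
print(groups)
```

In this toy input, the read containing `N` is discarded, the one-mismatch variant is absorbed into the most abundant sequence, and the sequence with three mismatches remains a separate group.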
The count list was then used to create FASTQ files of the consensus sequences of each sample: the ‘split.abund’ command sorted the data into rare and abundant groups based on the cut-off value, followed by the ‘split.groups’ command (Supplementary Data S2). Only samples yielding more than 1000 reads (implying sufficient gDNA for accurate amplification) met the cut-off value and were included.
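The read-depth filter and the rare/abundant split can be sketched in the same way. The function names and the example cut-off of 10 reads are hypothetical (the text does not state the abundance cut-off), and treating counts equal to the cut-off as rare is an assumption.

```python
def filter_samples_by_depth(sample_counts, min_reads=1000):
    """Keep samples whose total read count is strictly greater than min_reads.

    sample_counts maps sample name -> {sequence: read count}.
    """
    return {
        sample: counts
        for sample, counts in sample_counts.items()
        if sum(counts.values()) > min_reads
    }


def split_by_abundance(counts, cutoff):
    """Split a {sequence: count} table into (abundant, rare) tables.

    Assumption: counts at or below the cutoff are treated as rare.
    """
    abundant = {s: n for s, n in counts.items() if n > cutoff}
    rare = {s: n for s, n in counts.items() if n <= cutoff}
    return abundant, rare


# Toy per-sample count tables (invented for illustration only).
samples = {
    "cattle_01": {"ACGTACGT": 1500, "ACGTACGA": 8},
    "buffalo_01": {"ACGTACGT": 600, "TTTTACGT": 300},
}
kept = filter_samples_by_depth(samples, min_reads=1000)
abundant, rare = split_by_abundance(kept["cattle_01"], cutoff=10)
print(sorted(kept), abundant, rare)
```

Here only `cattle_01` passes the depth filter (1508 reads), and its low-count variant falls into the rare group.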