Genome Diversity and the Origin of the Arabian Horse
The Arabian horse, one of the world’s oldest breeds of any domesticated animal, is characterized by natural beauty, graceful movement, athletic endurance, and, as a result of its development in the arid Middle East, the ability to thrive in a hot, dry environment. Here we studied 378 Arabian horses from 12 countries using equine single nucleotide polymorphism (SNP) arrays and whole-genome re-sequencing to examine hypotheses about genomic diversity, population structure, and the relationship of the Arabian to other horse breeds. We identified a high degree of genetic variation and complex ancestry in Arabian horses from the Middle East region. Also, contrary to popular belief, we could detect no significant genomic contribution of the Arabian breed to the Thoroughbred racehorse, including Y chromosome ancestry. However, we found strong evidence for recent interbreeding of Thoroughbreds with Arabians used for flat-racing competitions. Genetic signatures suggestive of selective sweeps across the Arabian breed contain candidate genes for combating oxidative damage during exercise, and within the “Straight Egyptian” subgroup, for facial morphology. Overall, our data support an origin of the Arabian horse in the Middle East, show no evidence of reduced global genetic diversity across the breed, and reveal unique genetic adaptations for both physiology and conformation.

This deposition includes sample annotation and PLINK-format filesets for the genotyping data generated for the associated publication.

Full methods can be found in the publication; in summary, genotype calls from each genotyping array batch and the whole-genome sequences were combined sequentially using PLINK v. 1.90. First, Affymetrix and GeneSeek calls were merged using PLINK with filters set to 90% SNP genotyping rate and 1% minor allele frequency. A subset of Arabians of multi-origin (mixed) ancestry was then used to test all SNPs for Hardy-Weinberg equilibrium.
Autosomal variants with P-values < 0.005 in this test (corrected for multiple testing) were removed from the genotype files. SNPs derived from whole-genome sequencing were then merged to generate the final set of variant calls for downstream analysis, removing any variant that had a genotyping rate below 80%, had a minor allele frequency below 1%, or was flagged as multi-allelic. Finally, samples with a genotyping rate below 95% were removed. After applying these filters, the data set included 343,367 SNPs.
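To make the final filtering thresholds concrete, the following is a minimal pure-Python sketch of the variant and sample filters described above (80% variant genotyping rate, 1% minor allele frequency, 95% sample genotyping rate). It is an illustration only, not the PLINK implementation used for the deposited data: the function names, the 0/1/2 genotype coding with None for missing calls, and the toy matrix are all assumptions made for this example. PLINK exposes thresholds of this kind through options such as --geno, --maf, and --mind; the exact commands used are given in the publication, not here.

```python
# Illustrative sketch only: genotypes are coded as alternate-allele counts
# (0, 1, 2) with None for a missing call. Rows are samples, columns are SNPs.
# All names and the toy data below are hypothetical.

def call_rate(values):
    """Fraction of non-missing genotype calls."""
    values = list(values)
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


def minor_allele_frequency(column):
    """Minor allele frequency computed from non-missing calls of one SNP."""
    called = [g for g in column if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def filter_snps(genotypes, min_call_rate, min_maf):
    """Indices of SNP columns meeting both the call-rate and MAF thresholds."""
    keep = []
    for j in range(len(genotypes[0])):
        column = [row[j] for row in genotypes]
        if (call_rate(column) >= min_call_rate
                and minor_allele_frequency(column) >= min_maf):
            keep.append(j)
    return keep


def filter_samples(genotypes, snp_indices, min_call_rate):
    """Indices of samples whose call rate over the kept SNPs is high enough."""
    return [i for i, row in enumerate(genotypes)
            if call_rate(row[j] for j in snp_indices) >= min_call_rate]


# Toy matrix: 6 samples x 5 SNPs.
GENOTYPES = [
    [0, 1, 0, 2,    1],
    [1, 2, 0, None, 0],
    [2, 0, 0, None, 1],
    [1, 1, 0, 1,    2],
    [0, 1, 0, None, None],
    [1, 1, 0, None, 1],
]

# Variant filters first (80% genotyping rate, 1% MAF), then the sample filter (95%).
kept_snps = filter_snps(GENOTYPES, min_call_rate=0.80, min_maf=0.01)
kept_samples = filter_samples(GENOTYPES, kept_snps, min_call_rate=0.95)
print("SNPs kept:", kept_snps)
print("Samples kept:", kept_samples)
```

In this toy example the monomorphic SNP and the poorly genotyped SNP are dropped before sample call rates are computed, mirroring the order of the steps in the summary: variants are filtered first, and samples are assessed on the remaining variants.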