Data for: Ensemble Learning: Predicting Human Pathogenicity of Hematophagous Arthropod Vector-Borne Viruses
Description
Overview: This dataset supports the study on predicting the human pathogenicity of viruses carried by blood-feeding arthropods (mosquitoes, ticks, etc.) using ensemble learning. It integrates large-scale epidemiological data with genomic functional annotations to assess zoonotic spillover risks.

Dataset Components:

Epidemiological Characteristics Dataset: Covers 294 viruses and 37 distinct features categorized into:
  - Virus Properties: Baltimore classification and taxonomy.
  - Vector Host Features: Family and genus of vectors (e.g., Culicidae, Ixodidae).
  - Non-vector Host Diversity: Distribution across 15 groups, emphasizing the impact of Perissodactyla and Carnivora orders on pathogenicity.

Viral Sequence Pathogenic Function Dataset: Includes functional annotations for 71,623 viral sequences. Using SeqScreen, 10 key Functional Signatures of Concern (FunSoCs) were identified, such as:
  - Viral Adhesion: Found in 62% of sequences, crucial for host cell entry.
  - Host Xenophagy & Viral Counter Signaling: Key features for immune evasion.
  - Viral Invasion: Associated with non-pathogenic traits in this specific context.

Technical Application: The data were utilized to develop and validate XGBoost-based models:
  - Regression Model: Achieved an R² of 90.6%, correlating host diversity with pathogenicity.
  - Classification Model: Achieved an F1 score of 96.79% for identifying pathogenic potential at the sequence level.
  - External Validation: Includes predictions for 228 sequences, highlighting potential risks from Palma and Zaliv Terpeniya viruses.

Research Value: This resource allows for the strain-level prediction of pathogenicity within metagenomic data, providing a robust framework for early warning systems of emerging zoonotic threats.
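
For readers unfamiliar with the two reported metrics, the following is a minimal Python sketch of how a sequence's FunSoC annotations could be encoded as binary features and how F1 and R² are defined. It is an illustration only, not the authors' pipeline: the label strings in FUNSOCS, the function names, and all toy values are assumptions made for this example, and the actual column names and file layout of the dataset may differ.

```python
# Illustrative sketch only. The four label strings below are hypothetical
# spellings of FunSoCs named in the description; the real dataset has 10
# FunSoCs and may use different identifiers.
FUNSOCS = [
    "viral_adhesion",
    "host_xenophagy",
    "viral_counter_signaling",
    "viral_invasion",
]


def encode(annotations):
    """Map a set of FunSoC labels to a 0/1 presence vector in FUNSOCS order."""
    return [1 if name in annotations else 0 for name in FUNSOCS]


def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for binary labels (1 = pathogenic)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (coefficient of determination)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values, made up for this example (not taken from the dataset).
features = encode({"viral_adhesion", "viral_invasion"})
toy_f1 = f1_score([1, 1, 0, 0], [1, 0, 0, 1])
toy_r2 = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

In practice, vectors like the one returned by encode would form the rows of a feature matrix passed to a gradient-boosted classifier, and a reported F1 of 96.79% or R² of 90.6% corresponds to values of 0.9679 and 0.906 from functions like these.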