Data for: Ensemble Learning: Predicting Human Pathogenicity of Hematophagous Arthropod Vector-Borne Viruses

Published: 26 January 2026 | Version 1 | DOI: 10.17632/hvk9f2by6k.1
Contributor:
熔熔

Description

Overview: This dataset supports the study on predicting the human pathogenicity of viruses carried by blood-feeding arthropods (mosquitoes, ticks, etc.) using ensemble learning. It integrates large-scale epidemiological data with genomic functional annotations to assess zoonotic spillover risks.

Dataset Components:

Epidemiological Characteristics Dataset: Covers 294 viruses and 37 distinct features categorized into:
Virus Properties: Baltimore classification and taxonomy.
Vector Host Features: Family and genus of vectors (e.g., Culicidae, Ixodidae).
Non-vector Host Diversity: Distribution across 15 groups, emphasizing the impact of Perissodactyla and Carnivora orders on pathogenicity.

Viral Sequence Pathogenic Function Dataset: Includes functional annotations for 71,623 viral sequences. Using SeqScreen, 10 key Functional Signatures of Concern (FunSoCs) were identified, such as:
Viral Adhesion: Found in 62% of sequences, crucial for host cell entry.
Host Xenophagy & Viral Counter Signaling: Key features for immune evasion.
Viral Invasion: Associated with non-pathogenic traits in this specific context.

Technical Application: The data were utilized to develop and validate XGBoost-based models:
Regression Model: Achieved an R² of 90.6%, correlating host diversity with pathogenicity.
Classification Model: Achieved an F1 score of 96.79% for identifying pathogenic potential at the sequence level.
External Validation: Includes predictions for 228 sequences, highlighting potential risks from Palma and Zaliv Terpeniya viruses.

Research Value: This resource allows for the strain-level prediction of pathogenicity within metagenomic data, providing a robust framework for early warning systems of emerging zoonotic threats.
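
Note on the reported metrics: the two scores quoted for the models follow the standard definitions of R² (regression) and F1 (binary classification). The sketch below shows how each is computed, as a reading aid only. It is not the study's evaluation code; the function names and the toy labels and values are hypothetical and are not drawn from the dataset.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (1 = pathogenic in this toy setup)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy data, for illustration only
toy_labels = [1, 1, 1, 0, 0, 1]
toy_predictions = [1, 1, 0, 0, 1, 1]
toy_f1 = f1_score(toy_labels, toy_predictions)

toy_targets = [1.0, 2.0, 3.0, 4.0]
toy_estimates = [1.0, 2.0, 3.0, 5.0]
toy_r2 = r_squared(toy_targets, toy_estimates)
```

Both functions return values on a 0 to 1 scale, so the figures quoted above as percentages (90.6% and 96.79%) would correspond to 0.906 and 0.9679.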

Categories

Genome Annotation, Metadata, Data Reference Materials, Epidemiological Research

Licence