True External Set of 23,592 Compounds for Acute Oral Toxicity Prediction Across Six Typical Drug Scaffolds

Published: 20 March 2026 | Version 1 | DOI: 10.17632/bsty5cw86h.1
Contributor:
Jianing XU

Description

This dataset comprises 23,592 real-world compounds retrieved from the PubChem database and curated to serve as a rigorous "True External Set" for evaluating Quantitative Structure-Toxicity Relationship (QSTR) and Machine Learning (ML) models. It covers six privileged heterocyclic scaffolds that are highly prevalent in modern medicinal chemistry: pyrazine, piperazine, thiazole, thiophene, indole, and benzimidazole, with approximately 2,000 to 4,000 compounds per sub-library. None of the included molecules has an experimentally determined acute oral toxicity value (in rats or mice), and all were strictly excluded from the training and validation phases of the developed predictive models to prevent data leakage. The dataset aims to provide a standardized benchmark for assessing model applicability domains and generalizability in computational toxicology. It also offers a broad chemical space for discovering novel compounds, thereby supporting the green design and life-cycle risk management of modern pharmaceuticals.
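The data-leakage safeguard described above can be checked by users who want to evaluate their own models on this set. The following is a minimal sketch, not the contributor's procedure: it assumes that each compound in both the external set and the user's training set is represented by a structure identifier such as an InChIKey, and the function names `normalize_key` and `find_leaked` are hypothetical. The file layout of this dataset is not specified here, so loading the identifiers is left to the reader.

```python
def normalize_key(key: str) -> str:
    """Strip whitespace and upper-case an identifier so that trivial
    formatting differences do not hide a match."""
    return key.strip().upper()


def find_leaked(external_keys, training_keys):
    """Return the sorted identifiers that occur in both collections.

    Assumption: identifiers (e.g. InChIKeys) are comparable as plain
    strings after normalization. Blank entries are ignored.
    """
    external = {normalize_key(k) for k in external_keys if k.strip()}
    training = {normalize_key(k) for k in training_keys if k.strip()}
    return sorted(external & training)


# Illustrative identifiers only (aspirin, caffeine, ethanol); they are
# not claimed to be part of this dataset.
external_example = [
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    "RYYVLZVUVIJVGH-UHFFFAOYSA-N",
]
training_example = [
    " bsynrymutxbxsq-uhfffaoysa-n ",
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
]

leaked = find_leaked(external_example, training_example)
print(f"{len(leaked)} overlapping compound(s): {leaked}")
```

An empty result suggests that, at the level of the chosen identifier, the external set is disjoint from the training data. Exact string matching will not catch different salt forms or stereoisomers of the same parent structure, so a stricter check may be needed depending on how the model's training data were standardized.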

Categories

Computational Toxicology

Licence