Filtered datasets of benzene, ethanol, formic acid dimer and fomepizole

Published: 25 April 2024 | Version 1 | DOI: 10.17632/ypxnzbjk45.1
Bienfait Isamura


The uploaded datasets contain filtered geometries of benzene (BZ), ethanol (ETL), formic acid dimer (FAD), and fomepizole (FPL). They were used in a recently submitted paper to demonstrate the transferability of hyperparameters in anisotropic Gaussian process regression (GPR) models. These models are trained on atomic energies and charges (and, in general, can be trained on any multipole moment) of topological quantum atoms in roughly 10 000 conformations of each molecule of interest. Once trained, the models were deployed in FFLUX simulations, which suggested that transfer learning models perform as well as direct learning ones while training up to an order of magnitude faster. The datasets can be used to train any machine learning model intended to reproduce atomic energies and multipole moments. They also contain the total electronic energy of each conformation, computed at the B3LYP/aug-cc-pVTZ [BZ and ETL] and B3LYP/6-31+G(d,p) [FAD and FPL] levels of theory. These energies can be used directly to build surrogate potential energy surface (PES) models.


Steps to reproduce

These data were obtained in three main steps:

Step 1: Initial geometries of benzene (BZ) and ethanol (ETL) were extracted from the MD-17 database, while those of formic acid dimer (FAD) and fomepizole (FPL) were obtained via unbiased semi-empirical molecular dynamics simulations at 300 K, using the GFN2-xTB method as implemented in the Atomic Simulation Environment (ASE) Python package.

Step 2: Each geometry was encoded as a fixed-length vector using the atomic local frame (ALF) representation. In addition, topological QTAIM/IQA properties were calculated with the AIMAll19 software. BZ and ETL were treated at the B3LYP/aug-cc-pVTZ level, while FAD and FPL were treated at the B3LYP/6-31+G(d,p) level.

Step 3: The input features (ALF vectors) and the physical properties were combined into a single dataset. This dataset was then filtered by removing every geometry for which the WFN molecular energy and charge could not be reconstructed from the atomic IQA energies and charges to within 1 kJ/mol and 1 me, respectively.
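The filtering criterion in Step 3 can be illustrated with a minimal Python sketch. It is not the authors' script: the function names, the record layout (the keys `wfn_energy`, `iqa_energies`, `charges`), and the units (energies in Hartree, charges in units of e, with "me" read as 0.001 e) are all assumptions made for illustration, and the file format of the uploaded datasets is not assumed here. A geometry is kept only if the sum of its atomic IQA energies recovers the molecular energy within the energy tolerance and the sum of its atomic charges recovers the molecular charge within the charge tolerance.

```python
# Minimal sketch of the Step 3 recovery-error filter.
# Assumptions (not stated in the dataset description): energies are in
# Hartree, charges are in units of e, and the record keys are hypothetical.

HARTREE_TO_KJ_PER_MOL = 2625.4996  # 1 Hartree expressed in kJ/mol


def passes_filter(wfn_energy, iqa_energies, charges, molecular_charge=0.0,
                  energy_tol_kj_per_mol=1.0, charge_tol_me=1.0):
    """Return True if atomic sums reproduce the molecular values within tolerance."""
    # Energy recovery error: |sum of atomic IQA energies - molecular energy|
    energy_error = abs(sum(iqa_energies) - wfn_energy) * HARTREE_TO_KJ_PER_MOL
    # Charge recovery error, converted from e to milli-electrons (me)
    charge_error = abs(sum(charges) - molecular_charge) * 1000.0
    return energy_error <= energy_tol_kj_per_mol and charge_error <= charge_tol_me


def filter_geometries(records):
    """Keep only the records (dicts with hypothetical keys) that pass the filter."""
    return [
        r for r in records
        if passes_filter(r["wfn_energy"], r["iqa_energies"], r["charges"])
    ]


# Toy records with made-up numbers, for illustration only.
toy_records = [
    # Energy error of about 0.26 kJ/mol and charge error of about 0.4 me
    {"wfn_energy": -155.0, "iqa_energies": [-100.0, -55.0001],
     "charges": [0.2, -0.2004]},
    # Energy error of about 2.6 kJ/mol, above the 1 kJ/mol tolerance
    {"wfn_energy": -155.0, "iqa_energies": [-100.0, -55.001],
     "charges": [0.2, -0.2]},
]
kept = filter_geometries(toy_records)
```

In this toy example only the first record survives; the second is discarded because its energy recovery error exceeds the tolerance. The tolerances are exposed as parameters so that stricter or looser thresholds can be tried without changing the logic.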


The University of Manchester


Machine Learning, Computational Chemistry


UK Research and Innovation