GC-EI-MS datasets of trimethylsilyl (TMS) and tert-butyldimethylsilyl (TBDMS) derivatives for development of machine learning-based compound identification approaches

Published: 27 May 2022 | Version 2 | DOI: 10.17632/j3z5bmvmnd.2


In environment and health studies, recent work has focused on the identification of contaminants of emerging concern (CECs). This is a complex and challenging task, because compound databases (DBs) and mass spectral libraries (MSLs) contain very few records for these compounds. This is particularly true for semipolar organic contaminants, which have to be derivatized prior to analysis by gas chromatography-mass spectrometry (GC-MS) with electron impact ionization (EI) and for which it is barely possible to find any records. In particular, there is a severe lack of publicly available datasets of GC-EI-MS spectra for the development, validation and performance evaluation of cheminformatics-assisted compound structure identification (CSI) approaches, including novel machine learning (ML)-based approaches. To fill this gap, we have generated four datasets of GC-EI-MS spectra of trimethylsilyl (TMS) and tert-butyldimethylsilyl (TBDMS) derivatives, both to support ML-assisted compound identification and to aid cheminformatics-assisted identification of silylated derivatives in GC-MS laboratories working in the field of environment and health. The datasets comprise raw (RAW) and processed (BS) GC-EI-MS spectra of TMS and TBDMS derivatives of CECs, along with their corresponding metadata, which contain the IUPAC name, exact mass, molecular formula, InChI, InChIKey, SMILES and PubChemID of each CEC and of the corresponding CEC-TMS or CEC-TBDMS derivative, where available.
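
As a rough illustration of the metadata fields listed above, the following Python sketch models one compound entry and checks that its InChIKey has the standard 14-10-1 letter layout. The class name, field names and helper function are hypothetical and are not taken from the dataset files; the actual file layout and column names should be checked against the downloaded data.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Standard InChIKey layout: 14 uppercase letters, hyphen,
# 10 uppercase letters, hyphen, 1 uppercase letter (27 characters).
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


@dataclass
class CompoundRecord:
    """Hypothetical container for one metadata entry.

    Field names are illustrative assumptions, not the dataset's
    actual column names.
    """
    iupac_name: str
    exact_mass: float
    molecular_formula: str
    inchi: str
    inchikey: str
    smiles: str
    pubchem_id: Optional[int] = None  # the description says "where available"


def has_valid_inchikey(record: CompoundRecord) -> bool:
    """Return True if the record's InChIKey matches the 14-10-1 layout."""
    return bool(INCHIKEY_RE.match(record.inchikey))


# Phenol is used purely as a familiar example; it is not claimed
# to be one of the compounds in the datasets.
phenol = CompoundRecord(
    iupac_name="phenol",
    exact_mass=94.0419,
    molecular_formula="C6H6O",
    inchi="InChI=1S/C6H6O/c7-6-4-2-1-3-5-6/h1-5,7H",
    inchikey="ISWSIDIOOBJBQZ-UHFFFAOYSA-N",
    smiles="Oc1ccccc1",
    pubchem_id=996,
)
```

A format check of this kind only catches malformed keys; it does not confirm that the key corresponds to the stated structure, and entries without a PubChemID can simply leave `pubchem_id` unset.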



Institut Jozef Stefan


Analytical Chemistry, Mass Spectrometry, Environmental Analysis, Machine Learning, Omics