GC-EI-MS datasets of trimethylsilyl (TMS) and tert-butyl dimethylsilyl (TBDMS) derivatives for development of machine learning-based compound identification approaches

Published: 4 July 2022 | Version 3 | DOI: 10.17632/j3z5bmvmnd.3

Description

In environment and health studies, recent work has focused on identifying contaminants of emerging concern (CECs). This is a complex and challenging task because compound database (DB) and mass spectral library (MSL) resources covering these compounds are very poor. This is particularly true for semi-polar organic contaminants that must be derivatized before gas chromatography-mass spectrometry (GC-MS) analysis with electron impact ionization (EI), for which hardly any records can be found. In particular, there is a severe lack of publicly available GC-EI-MS spectral datasets for the development, validation and performance evaluation of cheminformatics-assisted compound structure identification (CSI) approaches, including novel machine learning (ML)-based approaches. To fill this gap, we have generated 12 datasets of GC-EI-MS spectra of trimethylsilyl (TMS) and tert-butyldimethylsilyl (TBDMS) derivatives, which can be used to support machine learning-assisted CSI and to aid cheminformatics-assisted identification of silylated derivatives in GC-MS laboratories working in the field of environment and health.

The datasets include:

- Four test datasets of raw (RAW) and processed (BS) GC-EI-MS spectra of TMS and TBDMS derivatives of CECs.
- Two sets of corresponding metadata (one for TMS and one for TBDMS derivatives), which contain the IUPAC name, exact mass, molecular formula, InChI, InChIKey, SMILES and PubChemID of each CEC and of the corresponding CEC-TMS or CEC-TBDMS derivative, where available.
- Metadata for four datasets of GC-EI-MS spectra of TMS derivatives derived from the NIST 17 Mass Spectral Library: an initial dataset selected from the library and three datasets generated by applying consecutive filtering steps. These metadata files contain the name, InChIKey, molecular formula, CAS number, exact mass, molecular weight, NIST number and ID of each GC-EI-MS spectrum. TMS derivatives_0.1 refers to the original dataset derived from the NIST 17 Mass Spectral Library, TMS derivatives_1.1 to the dataset after the first filtering step, TMS derivatives_2. after the second, and TMS derivatives_3.1 after the third and final filtering step.
- Metadata for four datasets of GC-EI-MS spectra of TBDMS derivatives derived from the NIST 17 Mass Spectral Library: an initial dataset selected from the library and three datasets generated by applying three consecutive filtering steps. These metadata files contain the name, InChIKey, molecular formula, CAS number, exact mass, molecular weight, NIST number and ID of each GC-EI-MS spectrum. TBDMS derivatives_0.1 refers to the original dataset derived from the NIST 17 Mass Spectral Library, TBDMS derivative_1.1 to the dataset after the first filtering step, TBDMS derivative_2.1 after the second, and TBDMS_3.1 after the third and final filtering step.
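
As an illustration of how the NIST-derived metadata might be used, the sketch below reads two metadata tables and lists the InChIKeys that a filtering step removed. It is a minimal example under stated assumptions: the CSV format, the column headers (name, inchikey, formula, cas, exact_mass, mol_weight, nist_number, spectrum_id) and the helper names are hypothetical and are not taken from the dataset files, so they would need to be adapted to the actual files.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class SpectrumRecord:
    """One metadata row. Field names are illustrative, not the files' real headers."""
    name: str
    inchikey: str
    formula: str
    cas: str
    exact_mass: float
    mol_weight: float
    nist_number: str
    spectrum_id: str


def read_records(handle):
    """Parse CSV text that uses the hypothetical column headers named above."""
    return [
        SpectrumRecord(
            name=row["name"],
            inchikey=row["inchikey"],
            formula=row["formula"],
            cas=row["cas"],
            exact_mass=float(row["exact_mass"]),
            mol_weight=float(row["mol_weight"]),
            nist_number=row["nist_number"],
            spectrum_id=row["spectrum_id"],
        )
        for row in csv.DictReader(handle)
    ]


def removed_by_filter(before, after):
    """Return the sorted InChIKeys present before a filtering step but absent after it."""
    kept = {record.inchikey for record in after}
    return sorted({record.inchikey for record in before} - kept)


if __name__ == "__main__":
    # Placeholder rows only; these are not real compounds or real dataset entries.
    header = "name,inchikey,formula,cas,exact_mass,mol_weight,nist_number,spectrum_id\n"
    unfiltered = io.StringIO(
        header
        + "compound A,KEY-A,C,0,1.0,1.0,1,s1\n"
        + "compound B,KEY-B,C,0,2.0,2.0,2,s2\n"
    )
    filtered = io.StringIO(header + "compound A,KEY-A,C,0,1.0,1.0,1,s1\n")
    print(removed_by_filter(read_records(unfiltered), read_records(filtered)))
```

Comparing by InChIKey rather than by name is a design choice made for this sketch, since the InChIKey is the one structure-based identifier listed for all of the NIST-derived metadata files.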

Institutions

Institut Jozef Stefan

Categories

Analytical Chemistry, Mass Spectrometry, Environmental Analysis, Machine Learning, Omics
