Unassigned Mass Spectrometry Data for Machine Learning

Published: 22 November 2021 | Version 1 | DOI: 10.17632/ycw25mpjb6.1
Kristina Malsagova


The development of machine learning methods for the intelligent analysis of raw mass spectrometric HPLC-MS/MS measurements, without data preprocessing and identification, seems promising. In this study, we tested the application of two types of neural networks, 1D-Residual CNN and 3D-CNN, to combined metabolomic and proteomic HPLC-MS/MS data for the classification of three cancer phenotypes. Both neural networks can classify gender-mixed oncological phenotypes (kidney cancer) as well as gender-specific phenotypes (ovarian cancer) and recognize the healthy condition with an accuracy of 0.95 by analyzing ‘omics’ data in the ‘mgf’ data format. The neural network makes it possible to determine the degree of similarity (distance matrix) between the submitted phenotypes, thus overcoming algorithmic barriers in identifying HPLC-MS/MS spectra. The closest distance was shown between ovarian cancer, kidney cancer, and prostate cancer/kidney cancer, whereas the healthy phenotype was the most distant from the cancer phenotypes. Neural networks are versatile and can be applied to standard experimental data formats of different analytical platforms.


Steps to reproduce

The model was built according to the 3D-Convolution architecture, which is frequently used to classify sequences of images, such as video signals (where multiple image frames are concatenated along a temporal dimension to provide a 3D spatial input), medical image slices (MRI), etc. The kernel shape for a 3D convolution is specified along three dimensions: depth, height, and width. When the convolution operation is viewed as a kernel sliding across a multidimensional input array, the kernel slides in three directions. At every step, the dot product is calculated, so the output is also 3D.
The input data for this model were sequences of spectrum images. The model was trained simultaneously on protein and metabolite spectra, which is a distinctive feature of this network. Consequently, the input to this model can be either a protein spectrum or a metabolite spectrum, and the probability of the spectrum belonging to one or another class of pathology is determined at the output layer. The sequence of images was subjected to additional augmentation using one randomly selected transformation (random vertical and horizontal shift of image elements, random zeroing of image elements, or random crop) for 30% of the generated image set. Every image in the sequence was reduced to a size of 256×256 pixels after augmentation and, finally, 3D objects with dimensions of 256×256×98 were assembled and fed to the input of the neural network.
The 3D convolution model consists of a sequence of eight 3D-Convolution layers with sub-sampling layers (maxpool3D). The convolution layers are followed in series by two dense output layers. After each convolution layer, a batch normalization layer was applied. The parametric ReLU (PReLU) function was used as the activation function.
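The input preparation and the sliding-kernel operation described above can be illustrated with a minimal NumPy sketch. It is not taken from the deposited code: the function names, the shift range, the zeroing fraction, the crop ratio, the nearest-neighbour resizing, the wrap-around behaviour of the shift, and the treatment of "30% of the images" as a per-image probability of 0.3 are all assumptions made for illustration.

```python
import numpy as np


def random_shift(img, rng, max_shift=8):
    # Shift pixels vertically and horizontally (wrap-around is an assumption).
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(img, (dy, dx), axis=(0, 1))


def random_zero(img, rng, fraction=0.05):
    # Set a random subset of pixels to zero (the fraction is hypothetical).
    out = img.copy()
    out[rng.random(img.shape) < fraction] = 0
    return out


def random_crop(img, rng, ratio=0.9):
    # Cut out a random window covering `ratio` of each side (hypothetical).
    h, w = img.shape
    ch, cw = int(h * ratio), int(w * ratio)
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    return img[top:top + ch, left:left + cw]


def resize_nearest(img, size=256):
    # Nearest-neighbour resize to size x size (the method is an assumption).
    rows = np.arange(size) * img.shape[0] // size
    cols = np.arange(size) * img.shape[1] // size
    return img[np.ix_(rows, cols)]


def build_volume(images, rng, p_aug=0.3, size=256):
    # Augment roughly 30% of the images with one randomly chosen transform,
    # resize every image, and stack the sequence along the last axis.
    transforms = (random_shift, random_zero, random_crop)
    frames = []
    for img in images:
        if rng.random() < p_aug:
            transform = transforms[rng.integers(len(transforms))]
            img = transform(img, rng)
        frames.append(resize_nearest(img, size))
    return np.stack(frames, axis=-1)


def conv3d_valid(volume, kernel):
    # Naive 3D convolution without padding: the kernel slides along depth,
    # height, and width, and a dot product is taken at every position.
    d, h, w = volume.shape
    kd, kh, kw = kernel.shape
    out = np.zeros((d - kd + 1, h - kh + 1, w - kw + 1))
    for z in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                patch = volume[z:z + kd, y:y + kh, x:x + kw]
                out[z, y, x] = np.sum(patch * kernel)
    return out


rng = np.random.default_rng(0)
# 98 synthetic 300x300 "spectrum images" stand in for real data.
images = [rng.random((300, 300)) for _ in range(98)]
volume = build_volume(images, rng)
print(volume.shape)  # (256, 256, 98)

response = conv3d_valid(np.ones((4, 5, 6)), np.ones((2, 2, 2)))
print(response.shape)  # (3, 4, 5)
```

The explicit triple loop is far too slow for real training and only serves to show why the output of a 3D convolution is itself a 3D array; a deep-learning framework would perform the same operation with optimized kernels.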
The Adam optimizer was used, and the learning rate was reduced whenever the accuracy metric stopped improving (starting from a learning rate of 1×10⁻³, with a reduction factor of 0.5). Multiclass cross-entropy was selected as the loss function. The source code for the pathology classification was deposited in an open-access GitHub repository and is currently available at the following link: https://github.com/Denis21800/Pathology-classification_V2.git
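The learning-rate schedule and the loss function described above can be sketched in plain Python as follows. This is an illustration only, not the repository's implementation: the class and function names are hypothetical, and the patience value (how many non-improving epochs are tolerated before the rate is halved) is an assumption, since it is not stated here. Only the initial rate of 1×10⁻³ and the factor of 0.5 come from the description.

```python
import math


class ReduceOnPlateau:
    """Halve the learning rate when the monitored accuracy stops improving."""

    def __init__(self, lr=1e-3, factor=0.5, patience=2):
        self.lr = lr              # initial learning rate from the description
        self.factor = factor      # reduction factor from the description
        self.patience = patience  # assumption: not given in the description
        self.best = -math.inf
        self.bad_epochs = 0

    def step(self, accuracy):
        # Call once per epoch with the validation accuracy.
        if accuracy > self.best:
            self.best = accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


def cross_entropy(logits, target):
    # Multiclass cross-entropy for one sample: -log softmax(logits)[target],
    # computed with the log-sum-exp trick for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


scheduler = ReduceOnPlateau()
# Hypothetical accuracy values for six consecutive epochs.
accuracies = [0.60, 0.70, 0.70, 0.70, 0.70, 0.75]
rates = [scheduler.step(acc) for acc in accuracies]
print(rates)

# Loss for a sample with three hypothetical class scores and true class 1.
print(cross_entropy([0.0, 0.0, 0.0], 1))
```

With the assumed patience of 2, the rate stays at 1×10⁻³ for the first four epochs and drops to 5×10⁻⁴ at the fifth, after three epochs without improvement. Deep-learning frameworks ship equivalent components, for example `ReduceLROnPlateau` and `CrossEntropyLoss` in PyTorch; whether the deposited code uses those exact classes is not confirmed here.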


V. N. Orekhovich Research Institute of Biomedical Chemistry (Naucno-issledovatel'skij institut biomedicinskoj himii imeni V N Orehovica)


Health Sciences, Learning, Informatics