Machine Learning of MS dataset

Published: 19 November 2021 | Version 1 | DOI: 10.17632/2rcc8488hx.1
Anna Kaysheva


Mass spectrometric profiling provides information on the protein and metabolic composition of biological samples. However, the limited ability of computational algorithms to match tandem spectra to molecular components (proteins and metabolites) dramatically restricts the use of ‘omics’ data for the classification of nosologies. The development of machine learning methods for the intelligent analysis of raw mass spectrometric HPLC-MS/MS measurements, without data preprocessing and identification, seems promising. In this study, we tested the application of two types of neural networks, 1D-Residual CNN and 3D-CNN, to combined metabolomic and proteomic HPLC-MS/MS data for the classification of three cancer phenotypes. Both neural networks can classify gender-mixed oncological phenotypes (kidney cancer) as well as gender-specific phenotypes (ovarian cancer) and recognize the healthy condition with an accuracy of 0.95 by analyzing ‘omics’ data in the ‘mgf’ data format. The neural networks also make it possible to determine the degree of similarity (distance matrix) between the submitted phenotypes, thus overcoming algorithmic barriers in identifying HPLC-MS/MS spectra. The closest distance was shown between ovarian cancer, kidney cancer, and prostate cancer/kidney cancer, whereas the healthy phenotype was the most distant from the cancer phenotypes. Neural networks are versatile and can be applied to standard experimental data formats of different analytical platforms.


Steps to reproduce

The mass spectrometric intensity and mass-to-charge ratios (m/z) encoded in ‘mgf’ files were chosen as key descriptors and were aligned along the retention time (RT) scale with a 0.1-s step. The training dataset comprised 60% of the complete dataset (180 ‘mgf’ files), and the test dataset comprised 40% (120 ‘mgf’ files). Both the training and test datasets contained samples from the collected pathologies and from the control group in equal proportions. Different stages within a particular cancer type were not discriminated because of the small size of the study population. Noise reduction was performed at the very first step of data handling and included the following steps: (a) extraction of retention time, intensity, and mass-to-charge features; (b) elimination of rare m/z features and intensities, such that the frequency of each remaining feature exceeded 2 in each dataset; (c) rounding of each m/z feature to 10 ppm for proteomic data and 100 ppm for metabolomic data; (d) normalization of intensity and m/z features using the min-max scaler, and noise reduction using the elliptic envelope approach with an outlier fraction cut-off not exceeding 0.2, assuming that the weighted average error of mass spectrometric measurements is below 20%. The converted mass spectrometric signal data were saved in a database (the document-oriented MongoDB was used as the environment), and two models with different architectures were developed and tested with distinct options. The first model (1D-Residual CNN) operates on the raw mass spectrometric signal recorded for proteins and metabolites. The input data for this model are presented as four arrays: (1) m/z and (2) intensity for proteomic data, and (3) m/z and (4) intensity for metabolomic data. The second model (3D-CNN) operates on the MS signal represented as a sequence of spectral images, in which each image captures a portion of the initial spectrum signal with a duration of 96.7 seconds.
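The noise-reduction steps (b) to (d) listed above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the deposited code: the function names are hypothetical, "rounding to N ppm" is interpreted as snapping values to a logarithmic grid with a relative spacing of N ppm, and the example m/z readings are made up. The elliptic-envelope outlier filter is left out of the sketch. The demo rounds before filtering rare values, which is an assumption about the order, since unrounded floating-point masses rarely repeat exactly.

```python
import math
from collections import Counter


def round_to_ppm(mz, ppm):
    """Snap an m/z value to a logarithmic grid with a relative spacing of `ppm`.

    Assumption: "rounding to N ppm" is read here as binning on a grid whose
    neighbouring points differ by N parts per million.
    """
    step = math.log1p(ppm * 1e-6)
    return math.exp(round(math.log(mz) / step) * step)


def drop_rare(values, min_count=3):
    """Keep values occurring at least `min_count` times (frequency exceeds 2)."""
    counts = Counter(values)
    return [v for v in values if counts[v] >= min_count]


def min_max_scale(values):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant input: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    # Hypothetical proteomic m/z readings (10 ppm grid, as in the description).
    raw_mz = [445.1200, 445.1201, 445.1199, 512.3004,
              622.0290, 622.0291, 622.0289]
    binned = [round_to_ppm(mz, ppm=10) for mz in raw_mz]
    kept = drop_rare(binned)
    print(min_max_scale(kept) if kept else [])
```

For metabolomic data the same helper would be called with `ppm=100`.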
Each point of the resulting output image is positioned by its retention time and m/z values, and the signal intensity is color-coded. Thus, for a single ‘mgf’ file, a sequence of 98 images with dimensions of 512 × 512 pixels was obtained. The total number of images was 63,069 for the proteomic dataset and 54,752 for the metabolomic dataset. The source code for the 1D-CNN model was deposited on GitHub at the following link: The source code for the 3D-CNN model was deposited on GitHub at the following link:
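The conversion of a run into a sequence of images, as described for the 3D-CNN input, might look roughly like the sketch below. It is a minimal illustration, not the deposited code: the function names, the m/z bounds, the row/column orientation, and the choice to keep the maximum intensity per pixel are all assumptions. The raw intensity is stored per pixel in place of the color coding.

```python
def split_into_windows(points, window_s=96.7):
    """Group (rt, mz, intensity) points by consecutive retention-time window."""
    windows = {}
    for rt, mz, intensity in points:
        windows.setdefault(int(rt // window_s), []).append((rt, mz, intensity))
    return windows


def render_window(points, rt_start, window_s, mz_min, mz_max, size=512):
    """Rasterize one window into a size x size grid.

    Rows index m/z and columns index retention time (assumed orientation).
    Each pixel keeps the highest intensity that falls into it.
    """
    image = [[0.0] * size for _ in range(size)]
    for rt, mz, intensity in points:
        # Skip points that lie outside this window or outside the m/z range.
        if not (rt_start <= rt < rt_start + window_s):
            continue
        if not (mz_min <= mz <= mz_max):
            continue
        col = min(int((rt - rt_start) / window_s * size), size - 1)
        row = min(int((mz - mz_min) / (mz_max - mz_min) * size), size - 1)
        image[row][col] = max(image[row][col], intensity)
    return image


if __name__ == "__main__":
    # Hypothetical points: (retention time in s, m/z, intensity).
    run = [(10.0, 200.0, 5.0), (50.0, 350.0, 2.5), (100.0, 400.0, 7.0)]
    for index, pts in sorted(split_into_windows(run).items()):
        img = render_window(pts, index * 96.7, 96.7, 100.0, 500.0, size=8)
        print(index, sum(1 for row in img for px in row if px > 0))
```

With the default `size=512` and a 96.7-second window, each call yields one 512 × 512 frame. A full run would then produce one frame per window.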


Medicine, Oncology, Machine Learning