CN-120524271-B - Original mass spectrum data classification method based on multichannel embedded representation


Abstract

The invention provides a raw mass spectrum data classification method based on a multi-channel embedded representation, belonging to the technical field of mass spectrum data analysis and deep learning. The method comprises: obtaining raw mass spectrum data and performing binning and normalization on it to obtain standardized intensity data; reducing the global feature dimension of the mass spectrum data with a multi-channel embedding module to generate a global embedded representation; inputting the global embedded representation into a channel embedding sub-module to extract local structure information; performing a channel concatenation operation that fuses the global and local features into the multi-channel embedded representation; and using the module as a learnable front-end to a deep learning classification model, trained jointly end to end so that features are optimized adaptively from the data, to obtain classification results. Through multi-channel feature cooperation and dimension compression, the invention improves classification performance, reduces computational cost, and improves both the efficiency and the accuracy of mass spectrum data classification.
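As an illustration of the binning ("box division") and normalization step named in the abstract, the following minimal Python sketch bins one spectrum on a fixed m/z grid and scales it by its total ion intensity. The function names, the m/z range, the bin width, and the choice to sum intensities that fall in the same bin are assumptions made for illustration; the patent text only states that bins are formed from a user-defined m/z setting and that a total-ion normalization is applied.

```python
import math


def bin_spectrum(mz, intensity, mz_min, mz_max, bin_width):
    """Accumulate peak intensities into fixed-width m/z bins.

    Assumption: intensities landing in the same bin are summed, and peaks
    outside [mz_min, mz_max) are ignored.
    """
    n_bins = math.ceil((mz_max - mz_min) / bin_width)
    binned = [0.0] * n_bins
    for m, i in zip(mz, intensity):
        if mz_min <= m < mz_max:
            # Clamp guards against floating-point rounding at the upper edge.
            idx = min(int((m - mz_min) / bin_width), n_bins - 1)
            binned[idx] += i
    return binned


def total_ion_normalize(binned):
    """Divide every bin by the total ion intensity of the spectrum."""
    total = sum(binned)
    if total <= 0:
        return list(binned)  # empty spectrum: leave unchanged
    return [v / total for v in binned]


if __name__ == "__main__":
    mz = [100.2, 100.7, 101.5, 103.9]
    intensity = [1.0, 2.0, 3.0, 4.0]
    binned = bin_spectrum(mz, intensity, mz_min=100.0, mz_max=104.0, bin_width=1.0)
    print(binned)
    print(total_ion_normalize(binned))
```

After this step every spectrum is a fixed-length vector whose entries sum to 1, which is the kind of standardized intensity data the embedding module takes as input.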

Inventors

ZHANG FENGYI
XIONG XINGCHUANG
WANG YINCHU
GUO LIN
ZHANG WEI
GAO BOYONG

Assignees

National Institute of Metrology, China (中国计量科学研究院)

Dates

Publication Date: 2026-05-12
Application Date: 2025-05-28

Claims (6)

1. A raw mass spectrum data classification method based on a multi-channel embedded representation, characterized by comprising the following steps:
S1, acquiring raw mass spectrum data, and performing binning and normalization on the raw mass spectrum data to obtain standardized intensity data;
S2, constructing a multi-channel embedding module, and structurally connecting the multi-channel embedding module, as a feature representation layer, to a preset classification model to obtain a connected preset classification model, specifically comprising:
S201, constructing the multi-channel embedding module from an encoder sub-module, a channel embedding sub-module and a channel concatenation operation;
S202, based on an end-to-end structure, structurally connecting the multi-channel embedding module, as the feature representation layer, to the preset classification model to obtain the connected preset classification model;
S3, jointly optimizing the parameters of the connected preset classification model on the standardized intensity data, using early stopping, to obtain a trained preset classification model, specifically comprising:
S301, dividing the standardized intensity data into a training set, a validation set and a test set using a stratified random sampling strategy;
S302, inputting the mass spectrum data in the training set into the multi-channel embedding module for processing to generate an optimal multi-channel embedded representation, specifically comprising:
S3021, inputting the mass spectrum data in the training set into the multi-channel embedding module, compressing the input data to an intermediate dimension with a first fully connected layer in the encoder sub-module, and extracting features with layer normalization, a rectified linear unit (ReLU) and a dropout unit to obtain an initial latent embedded representation;
S3022, compressing the data from the intermediate dimension to the embedding dimension with a second fully connected layer in the encoder sub-module to generate a global embedded representation;
S3023, reshaping the global embedded representation into a three-dimensional tensor, extracting features with a first one-dimensional convolution layer in the channel embedding sub-module to obtain an initial first multi-channel embedded representation having half the preset number of channels, and applying batch normalization, a ReLU activation function and dropout to it to obtain an enhanced first multi-channel embedded representation;
S3024, extracting features from the enhanced first multi-channel embedded representation with a second one-dimensional convolution layer in the channel embedding sub-module to obtain an initial second multi-channel embedded representation having the preset number of channels, and applying batch normalization, a ReLU activation function and dropout to it to obtain an enhanced second multi-channel embedded representation;
S3025, concatenating the global embedded representation reshaped into the three-dimensional tensor with the enhanced second multi-channel embedded representation, using the channel residual connection sub-module, to obtain the optimal multi-channel embedded representation;
the concatenation of the reshaped global embedded representation with the enhanced second multi-channel embedded representation is expressed as:
$$E_{mc} = \mathrm{Concat}(E_{g}, E_{c}), \quad E_{g} \in \mathbb{R}^{B \times 1 \times D}, \quad E_{c} \in \mathbb{R}^{B \times C \times D}$$
wherein $E_{mc}$ denotes the optimal multi-channel embedded representation, $\mathrm{Concat}(\cdot)$ denotes the concatenation operation, $E_{g}$ denotes the global embedded representation reshaped into a three-dimensional tensor, $E_{c}$ denotes the enhanced second multi-channel embedded representation, $B \times 1 \times D$ and $B \times C \times D$ denote the sizes of the dimensions of the three-dimensional tensors, $B$ denotes the batch size, $1$ denotes a dimension of size 1, $D$ denotes the embedding dimension, and $C$ denotes the user-defined number of channels;
S303, inputting the optimal multi-channel embedded representation into the preset classification model for joint optimization training, validating on the validation set with early stopping, and testing the classification model with the optimal parameters on the test set to obtain the trained preset classification model;
S4, classifying the standardized intensity data with the trained preset classification model to obtain a classification result.
2. The raw mass spectrum data classification method based on a multi-channel embedded representation of claim 1, wherein S1 comprises the following steps:
S101, acquiring raw mass spectrum data, and dividing the raw mass spectrum data according to the data characteristics of different mass spectrometry techniques to obtain first mass spectrum data and second mass spectrum data;
S102, binning the first mass spectrum data according to a user-defined mass-to-charge ratio (m/z) to form bin intervals, and, based on the bin intervals, taking the mass spectrum data in the first mass spectrum data whose total ion intensity is greater than a preset threshold as first training samples;
S103, binning the second mass spectrum data according to the user-defined mass-to-charge ratio to form bin intervals, aligning the signals of the second mass spectrum data based on the bin intervals using a user-defined retention time as a window, and summing and averaging the signal intensity values within the same bin interval to obtain second training samples;
S104, performing total ion count normalization on each first training sample and each second training sample to obtain the standardized intensity data.
3. The raw mass spectrum data classification method based on a multi-channel embedded representation of claim 2, wherein the first mass spectrum data is water-assisted laser desorption/ionization mass spectrometry data, and the second mass spectrum data is liquid chromatography-mass spectrometry (LC-MS) data.
4. The raw mass spectrum data classification method based on a multi-channel embedded representation of claim 1, wherein S303 comprises the following steps:
S3031, inputting the multi-channel embedded representation into the preset classification model, and updating the parameters of the preset classification model with an adaptive moment estimation (Adam) optimizer based on a staged optimization strategy;
S3032, dynamically adjusting the learning rate of the Adam optimizer on the validation set with a preset learning rate scheduler, decaying the learning rate by a preset decay factor when the validation loss has not decreased for a preset number of consecutive decay rounds;
S3033, calculating the accuracy and F1 score on the validation set and saving the model weights with the best performance; assigning loss-function weights inversely proportional to the class frequencies of the training set and calculating the validation loss on the validation set with a weighted cross-entropy loss function; and, under early stopping, stopping training and proceeding to step S3035 when the validation loss has not decreased for a preset number of consecutive early-stopping rounds;
S3034, judging whether the preset number of training epochs has been reached; if not, returning to step S3031 to continue the joint optimization training; if so, proceeding to step S3035;
S3035, testing the classification model with the optimal parameters on the test set, using the most recently saved best-performing model weights, to obtain the trained preset classification model.
5. The raw mass spectrum data classification method based on a multi-channel embedded representation of claim 1, wherein the preset classification model comprises a convolutional neural network, a long short-term memory (LSTM) network and a Transformer model;
the convolutional neural network takes the multi-channel embedded representation as a multi-channel input, and progressively fuses the global distribution and the local peak details across the channels using hierarchical convolution and pooling operations to complete the classification;
the LSTM network takes the multi-channel embedded representation as a time-series input in which each time step corresponds to the feature vector of one channel, captures cross-channel sequential patterns with an LSTM layer, and maps the hidden state of the last time step to a classification label with a fully connected layer to complete the classification;
the Transformer model unfolds the multi-channel embedded representation along the embedding dimension and takes it as a multi-sequence input, adds a learnable preset token at the start of the sequence to obtain an input sequence, forms an encoder from multiple layers of multi-head self-attention modules to capture the cross-channel dependencies of the input sequence, and extracts the hidden state corresponding to the preset token and maps it through a fully connected layer to obtain the classification result.
6. A raw mass spectrum data classification system based on a multi-channel embedded representation, for performing the raw mass spectrum data classification method based on a multi-channel embedded representation of any one of claims 1-5, comprising:
a data preprocessing subsystem for performing binning and normalization on the raw mass spectrum data;
a multi-channel embedding subsystem for constructing the multi-channel embedding module and structurally connecting the multi-channel embedding module, as a feature representation layer, to the preset classification model;
and a classification subsystem for jointly optimizing the parameters of the connected preset classification model and executing the classification prediction step.
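The data flow of the multi-channel embedding module recited in steps S3021 to S3025 of claim 1 can be sketched in plain Python for a single spectrum (the batch dimension is dropped). This is only an illustrative sketch under stated assumptions, not the patented implementation: the kernel size of 3, the zero padding that keeps the sequence length equal to the embedding dimension, the random initial weights and all function names are hypothetical, and batch normalization and dropout are omitted so that the example shows shapes rather than trained behaviour.

```python
import random


def linear(x, weight, bias):
    """Fully connected layer; weight is [n_out][n_in], x is [n_in]."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]


def conv1d_same(x, kernels, bias):
    """1-D convolution, zero-padded so the output length equals the input length.

    x is [c_in][length]; kernels is [c_out][c_in][k] with odd k (assumption).
    """
    length, k = len(x[0]), len(kernels[0][0])
    pad = k // 2
    padded = [[0.0] * pad + ch + [0.0] * pad for ch in x]
    out = []
    for kern, b in zip(kernels, bias):
        row = []
        for pos in range(length):
            acc = b
            for c, ch in enumerate(padded):
                acc += sum(kern[c][j] * ch[pos + j] for j in range(k))
            row.append(acc)
        out.append(row)
    return out


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def init_params(n_bins, hidden, embed_dim, channels, k=3, seed=0):
    """Hypothetical random parameters; a real model would learn these."""
    rng = random.Random(seed)
    half = channels // 2
    return {
        "fc1": (rand_matrix(rng, hidden, n_bins), [0.0] * hidden),
        "fc2": (rand_matrix(rng, embed_dim, hidden), [0.0] * embed_dim),
        "conv1": ([rand_matrix(rng, 1, k) for _ in range(half)], [0.0] * half),
        "conv2": ([rand_matrix(rng, half, k) for _ in range(channels)], [0.0] * channels),
    }


def multichannel_embed(spectrum, p):
    h = relu(layer_norm(linear(spectrum, *p["fc1"])))       # S3021: [hidden]
    g = linear(h, *p["fc2"])                                # S3022: global embedding [D]
    g3 = [g]                                                # S3023: reshape to [1][D]
    c1 = [relu(ch) for ch in conv1d_same(g3, *p["conv1"])]  # S3023: [C/2][D]
    c2 = [relu(ch) for ch in conv1d_same(c1, *p["conv2"])]  # S3024: [C][D]
    return g3 + c2                                          # S3025: concat -> [C+1][D]


if __name__ == "__main__":
    params = init_params(n_bins=20, hidden=8, embed_dim=6, channels=4)
    emb = multichannel_embed([float(i % 5) for i in range(20)], params)
    print(len(emb), len(emb[0]))  # number of channels, embedding dimension
```

With a batch dimension restored, the output has C + 1 channels of length D: the first channel carries the global embedding unchanged and the remaining C channels carry the convolutional features, which is what the concatenation in S3025 describes.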

Description

Original mass spectrum data classification method based on multichannel embedded representation

Technical Field

The invention belongs to the technical field of mass spectrum data analysis and deep learning, and particularly relates to a raw mass spectrum data classification method based on multi-channel embedded representation.

Background

Mass spectrometry is an efficient and versatile detection technique that can accurately identify, analyze, and quantify a variety of target substances based on mass-to-charge ratio (m/z). As the technology has advanced, both the rate at which mass spectrum data are generated and their structural complexity have increased markedly. The high-dimensional, high-throughput nature of the data poses serious challenges for analysis, particularly in scenarios with extremely high precision requirements, such as cancer screening.

In mass spectrometry data analysis, conventional machine learning algorithms (such as support vector machines, random forests, and XGBoost) are widely used, including in benchmark tests, but they rely on complex preprocessing. For example, denoising, baseline correction, peak identification, and peak alignment are needed to eliminate interference introduced during signal acquisition, so as to improve the stability of subsequent multivariate analysis and the interpretability of the results. These preprocessing operations are not only time-consuming but may also lose critical information through human intervention.

In recent years, deep learning has become a core method of data analysis by virtue of its ability to extract features automatically. Continued innovation in architectures such as the convolutional neural network (CNN), the recurrent neural network (RNN), and the Transformer has significantly expanded the boundary of data processing capability.
However, a core difficulty in mass spectrometry data analysis is how to construct feature representations with high information density. To balance computational efficiency against model performance, existing methods typically map high-dimensional mass spectral vectors to a low-dimensional space, but such strategies are mostly limited to single-channel representations, which makes it difficult to capture the multi-dimensional correlations in the data adequately. In contrast, the image classification field has demonstrated the superiority of multi-channel representations: different channels separately capture complementary information about the target, such as spectrum and texture, and together provide a more comprehensive description. This idea carries over to mass spectrometry: through multi-channel collaborative modeling, features at different levels of the mass spectrum data (such as the global intensity distribution and local peak patterns) can be represented more flexibly, yielding embedding vectors with better discriminative power. Developing an efficient and robust multi-channel representation is therefore critical to improving classification accuracy and optimizing the use of computing resources.

In this context, convolutional neural networks (CNNs) are well suited to multi-channel modeling because of their feature extraction mechanism. CNNs can mine latent features from the raw mass spectrum signal through the spatial invariance of the convolution kernel (for example, translational invariance), and pooling operations further strengthen the identification of critical structures. More importantly, a CNN can learn multi-scale feature representations directly from the data without relying on hand-crafted feature engineering.
By stacking convolution layers and nonlinearities, CNNs can dynamically build multi-channel feature maps while capturing both global trends and local details in mass spectral data. This multi-level feature fusion not only enhances the representational capability of the model but also greatly improves its adaptability to high-dimensional data. Experiments show that a CNN-based multi-channel model can maintain high classification accuracy even in the presence of noise or signal offset. The approach has already shown marked advantages in the direct classification of raw mass spectrum data and has become an important technical path in the field.

Disclosure of Invention

To address the above shortcomings of the prior art, the invention provides a raw mass spectrum data classification method based on multi-channel embedded representation, which solves the problems of low efficiency and low accuracy in classifying raw mass spectrum data. To achieve the above purpose, the technical scheme adopted by the invention is as follows: the raw mass spectrum data classification method based on multi-channel embedded representation comprises the following ste