CN-122017102-A - LC-MS raw data feature detection method

Abstract

The invention relates to a feature detection method for LC-MS raw data that detects features automatically in several steps. First, chromatographic peak regions (regions of interest, ROIs) are detected in the LC-MS raw data of a single batch, the ROIs detected in the samples are retained, and a two-dimensional intensity matrix over samples and scan time is constructed from each ROI. The positions and boundaries of the LC-MS features are then obtained from the two-dimensional matrix through edge detection, dynamic programming and related steps, thereby achieving feature detection for the batch of LC-MS raw data.

Inventors

XU GUOWANG
YANG JUN
LIU XINYU

Assignees

Dalian Institute of Chemical Physics, Chinese Academy of Sciences (中国科学院大连化学物理研究所)

Dates

Publication Date: 20260512
Application Date: 20241111

Claims (10)

1. A feature detection method for LC-MS raw data, characterized by comprising the following steps:
First, acquiring LC-MS raw data files of samples by mass spectrometry, detecting chromatographic peak regions (ROIs) sequentially in all LC-MS raw data files corresponding to M samples, and determining the ROIs in all samples, so as to obtain the samples containing each kind of ROI;
Second, for the M samples having a given ROI, constructing a two-dimensional intensity matrix over samples and scan time from that ROI;
Third, performing edge detection in the vertical direction on the two-dimensional intensity matrix to obtain a two-dimensional gradient matrix;
Fourth, performing non-maximum suppression on the two-dimensional gradient matrix;
Fifth, computing a positive score matrix and a negative score matrix from the non-maximum-suppressed matrix by a dynamic programming algorithm, and determining the start points and end points of the chromatographic peak in the M samples from the positive scores and the negative scores, respectively;
Sixth, integrating the peak area of the chromatographic peak according to the start points and end points obtained for the M samples in the fifth step, so as to obtain the response intensity distribution of a given feature across the M samples;
and repeating the second to sixth steps to obtain the response intensity distribution of each feature in the samples containing the corresponding kind of ROI.
2. The feature detection method for LC-MS raw data according to claim 1, wherein the detection and merging of the chromatographic peak regions (ROIs) comprises the steps of:
detecting the ROIs of all samples sequentially with the centWave algorithm to obtain the m/z mean, start retention time and end retention time of each ROI in each LC-MS raw data file;
and, according to the ROI time range and m/z of each sample, merging in time the ROIs whose m/z means lie within a given range, so as to obtain the different kinds of ROI for each sample.
3. The feature detection method for LC-MS raw data according to claim 1, wherein constructing the two-dimensional intensity matrix over samples and scan time from the chromatographic peak region (ROI) comprises the steps of:
for a given shared ROI, extracting from the M samples the chromatographic intensity signal formed by the scan points within the m/z and time ranges of the ROI, according to the m/z mean and time range of the ROI common to the M samples, wherein the intensity signal comprises three vectors of equal length, namely a time sequence, an m/z sequence and an intensity sequence, and the time sequence is a continuously increasing sequence of scan times;
selecting a specified sample from all samples including the M samples as a reference sample, and linearly interpolating in time the m/z sequences and intensity sequences of the current ROI of all other samples onto the time sequence of the ROI chromatographic signal extracted from the reference sample, so that the ROI sequence length of every other sample equals that of the reference sample;
recording the time sequence of the reference sample, and stacking the interpolated intensity sequences of all other samples to form a two-dimensional intensity matrix with the intensity sequence of each sample as a row, wherein the number of rows of the matrix is the number of samples, and the number of columns is the length of the interpolated intensity sequence, which equals the length of the time sequence of the reference sample.
4. The feature detection method for LC-MS raw data according to claim 1, wherein performing edge detection in the vertical direction on the two-dimensional intensity matrix comprises the steps of:
convolving the two-dimensional intensity matrix with a Gaussian kernel to obtain a smoothed two-dimensional intensity matrix;
and convolving the smoothed two-dimensional intensity matrix with a Sobel kernel in the vertical direction to obtain the two-dimensional gradient matrix.
5. The feature detection method for LC-MS raw data according to claim 1, wherein performing non-maximum suppression on the two-dimensional gradient matrix specifically comprises:
when a value in the two-dimensional gradient matrix is greater than 0 and not greater than the values to its left and right in the same row, setting the value at that position to 0;
and when a value in the two-dimensional gradient matrix is less than 0 and not less than the values to its left and right in the same row, setting the value at that position to 0.
6. The feature detection method for LC-MS raw data according to claim 1, wherein computing the positive score matrix and the negative score matrix from the non-maximum-suppressed matrix by the dynamic programming algorithm, and determining the start points and end points of the chromatographic peak in the M samples from the positive scores and the negative scores, respectively, comprises the following steps:
normalizing the non-maximum-suppressed matrix by dividing its positive values by the maximum of the matrix and its negative values by the minimum of the matrix;
initializing four matrices of the same size as the normalized matrix, namely a positive score matrix, a negative score matrix, a positive direction matrix and a negative direction matrix, each with its first row set to 0;
separating the positive and negative values of the normalized matrix to obtain a positive matrix and a negative matrix, summing each by column, computing an entropy value for each, and taking the entropy values computed from the positive matrix and the negative matrix as the ROI matrix entropy;
for the last row of the positive score matrix, retaining the local maxima, and, if a local maximum is greater than the ROI matrix entropy multiplied by the number of samples, taking its position as a start point of the feature and backtracking from that position through the positive direction matrix to obtain the start points of the feature in all M samples;
for the last row of the negative score matrix, retaining the local minima, and, if a local minimum is smaller than the ROI matrix entropy multiplied by the number of samples, taking its position as an end point of the feature and backtracking from that position through the negative direction matrix to obtain the end points of the feature in all M samples;
and, if the start points and end points obtained can be paired, that is, they are equal in number and each start time is earlier than the corresponding end time, retaining the feature together with its start and end times; otherwise, considering that the time region contains no feature.
7. The feature detection method for LC-MS raw data according to claim 1, wherein integrating the peak area of the chromatographic peak comprises: obtaining the retention times corresponding to the start point and end point of the feature in each of the samples, and integrating the peak area from the intensities within that time range.
8. The method according to claim 1, wherein a feature is the chromatographic peak of a specific compound at a given m/z and retention time, and, within the same batch of LC-MS raw data, the peak area of the feature is proportional to the concentration in the sample of the compound from which the feature is derived.
9. A feature detection system for LC-MS raw data, comprising:
an ROI (region of interest) matrix construction module, configured to acquire LC-MS (liquid chromatography-mass spectrometry) raw data files of samples by mass spectrometry, detect chromatographic peak regions (ROIs) sequentially in the LC-MS raw data files corresponding to all M samples, determine the ROIs in all samples so as to obtain the samples containing each kind of ROI, and construct a two-dimensional intensity matrix from the ROI;
an ROI matrix processing module, configured to perform edge detection in the vertical direction on the two-dimensional intensity matrix to obtain a two-dimensional gradient matrix, perform non-maximum suppression on the two-dimensional gradient matrix, compute a positive score matrix and a negative score matrix from the non-maximum-suppressed matrix by a dynamic programming algorithm, and determine the start points and end points of the chromatographic peak in the M samples from the positive scores and the negative scores, respectively;
and a feature detection and integration module, configured to integrate the peak area of the chromatographic peak according to the start points and end points obtained for the M samples, so as to obtain the response intensity distribution of a given feature across the M samples.
10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the feature detection method for LC-MS raw data according to any one of claims 1-8.
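
The matrix construction of claim 3 can be illustrated with a minimal Python sketch. It assumes each sample's ROI signal is already available as a pair of time and intensity lists, interpolates only the intensity sequence (the m/z sequence would be handled the same way), and holds the nearest edge value outside a sample's time range. The function names `interp_linear` and `build_intensity_matrix` and the example data are hypothetical and not taken from the patent.

```python
from bisect import bisect_right


def interp_linear(x_new, x, y):
    """Linearly interpolate y(x) at each point of x_new; x must be increasing.

    Points outside [x[0], x[-1]] take the nearest edge value (an assumption).
    """
    out = []
    for t in x_new:
        if t <= x[0]:
            out.append(float(y[0]))
        elif t >= x[-1]:
            out.append(float(y[-1]))
        else:
            j = bisect_right(x, t)  # x[j - 1] <= t < x[j]
            w = (t - x[j - 1]) / (x[j] - x[j - 1])
            out.append(y[j - 1] + w * (y[j] - y[j - 1]))
    return out


def build_intensity_matrix(signals, ref_index=0):
    """signals: list of (times, intensities) pairs, one per sample.

    Returns the reference time sequence and a matrix with one row per
    sample and one column per scan time of the reference sample.
    """
    ref_times = signals[ref_index][0]
    matrix = [interp_linear(ref_times, times, inten) for times, inten in signals]
    return ref_times, matrix


# Hypothetical ROI signals from M = 2 samples with different scan times.
signals = [
    ([0.0, 1.0, 2.0, 3.0], [0.0, 10.0, 20.0, 0.0]),  # reference sample
    ([0.5, 1.5, 2.5], [4.0, 8.0, 2.0]),
]
ref_times, matrix = build_intensity_matrix(signals)
```

After interpolation every row has the length of the reference time sequence, so the rows can be stacked directly into the samples-by-scan-time matrix that the later steps operate on.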
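
The edge detection of claim 4 and the non-maximum suppression of claim 5 can be sketched in the same way. The sketch below rests on several assumptions that the claims do not fix: a 3 x 3 Gaussian kernel with sigma 1, edge-replicating borders, a Sobel kernel oriented so that it responds to intensity changes along the time (column) axis, a strict comparison with both neighbours, and row ends treated as having no neighbour. All function names are hypothetical.

```python
import math


def correlate2d(mat, kernel):
    """Apply a small odd-sized kernel (correlation form), replicating edges."""
    rows, cols = len(mat), len(mat[0])
    kr, kc = len(kernel) // 2, len(kernel[0]) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for di, krow in enumerate(kernel):
                for dj, k in enumerate(krow):
                    r = min(max(i + di - kr, 0), rows - 1)
                    c = min(max(j + dj - kc, 0), cols - 1)
                    acc += k * mat[r][c]
            out[i][j] = acc
    return out


def gaussian_kernel(size=3, sigma=1.0):
    """Normalised size x size Gaussian kernel (size and sigma are assumptions)."""
    h = size // 2
    k = [[math.exp(-(x * x + y * y) / (2 * sigma**2)) for x in range(-h, h + 1)]
         for y in range(-h, h + 1)]
    total = sum(map(sum, k))
    return [[v / total for v in row] for row in k]


# Sobel kernel: positive where intensity rises with time, negative where it falls.
SOBEL_TIME = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]


def gradient_matrix(intensity):
    """Gaussian smoothing followed by the Sobel kernel."""
    return correlate2d(correlate2d(intensity, gaussian_kernel()), SOBEL_TIME)


def non_max_suppression(grad):
    """Keep positive row-wise local maxima and negative row-wise local minima."""
    out = [[0.0] * len(row) for row in grad]
    for i, row in enumerate(grad):
        for j, v in enumerate(row):
            if v > 0:
                left = row[j - 1] if j > 0 else -math.inf
                right = row[j + 1] if j + 1 < len(row) else -math.inf
                if v > left and v > right:
                    out[i][j] = v
            elif v < 0:
                left = row[j - 1] if j > 0 else math.inf
                right = row[j + 1] if j + 1 < len(row) else math.inf
                if v < left and v < right:
                    out[i][j] = v
    return out


# Hypothetical matrix: 3 samples, one peak at the third scan in every sample.
intensity = [[0.0, 0.0, 10.0, 0.0, 0.0] for _ in range(3)]
edges = non_max_suppression(gradient_matrix(intensity))
```

In this example each row of `edges` keeps one positive value on the rising side of the peak and one negative value on the falling side, which are the candidates that the score matrices of the fifth step would then link across samples.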
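
The peak area integration of claim 7 can likewise be sketched. The claims state only that the area is integrated from the intensities within the start-to-end time range; the trapezoidal rule used here is an assumption, as are the function names `peak_area` and `feature_areas` and the example values.

```python
def peak_area(times, intensities, start, end):
    """Trapezoidal area between scan indices start and end (inclusive)."""
    area = 0.0
    for k in range(start, end):
        width = times[k + 1] - times[k]
        area += 0.5 * (intensities[k] + intensities[k + 1]) * width
    return area


def feature_areas(times, matrix, starts, ends):
    """One area per sample, using that sample's own start and end index."""
    return [peak_area(times, row, s, e) for row, s, e in zip(matrix, starts, ends)]


# Hypothetical intensity matrix (2 samples x 5 scans) and per-sample boundaries.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
matrix = [
    [0.0, 2.0, 6.0, 2.0, 0.0],
    [0.0, 0.0, 4.0, 8.0, 0.0],
]
areas = feature_areas(times, matrix, starts=[0, 1], ends=[4, 4])
```

The resulting list holds one area per sample, which corresponds to the response intensity distribution of a feature across the M samples described in the sixth step of claim 1.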

Description

LC-MS raw data feature detection method

Technical Field

The invention relates to the field of LC-MS detection and signal processing, and in particular to a feature detection method for LC-MS raw data based on image processing technology.

Background

Non-targeted metabonomics based on LC-MS is an important tool for studying disease, drug metabolism, environmental exposure and health; it aims at comprehensive qualitative and quantitative analysis of small molecules in biological samples. Its data processing and analysis workflow comprises three main steps: detecting features from the LC-MS data, qualitatively analysing the compounds behind the features, and performing statistical analysis and biological interpretation on the relative quantitative data of the features. Accurate feature detection is therefore important for subsequent biological discovery.

Conventional LC-MS feature detection mainly performs peak detection with centWave and a continuous wavelet transform algorithm in XCMS (BMC Bioinformatics 2008, 9, DOI: 10.1186/1471-2105-9-504; Bioinformatics 2006, 22(17), 2059-65, DOI: 10.1093/bioinformatics/btl355), then obtains consistently appearing features through peak matching, and quantifies the features relatively by peak area. However, this approach yields a large number of false positive and false negative results (Analytical Chemistry 2017, 89(17), 8689-95, DOI: 10.1021/acs.analchem.7b01069) and inaccurate relative quantification, which interferes with downstream statistical analysis. Although various feature detection methods have been proposed in recent years, accurate detection and accurate quantification have not yet been achieved at the same time.

Disclosure of Invention

The invention provides a feature detection method for LC-MS raw data based on image processing technology.
In order to achieve this purpose, the chromatographic peak regions (ROIs) obtained from all samples are merged to obtain common ROIs, the signals in all samples are extracted according to each common ROI to construct an ROI matrix, the features and their retention time ranges are obtained through a series of processing operations on the ROI matrix, and peak areas are integrated over the time range of each feature to obtain its relative quantitative data.

The technical scheme adopted by the invention to achieve this purpose is a feature detection method for LC-MS raw data comprising the following steps:
First, acquiring LC-MS raw data files of samples by mass spectrometry, detecting chromatographic peak regions (ROIs) sequentially in all LC-MS raw data files corresponding to M samples, and determining the ROIs in all samples, so as to obtain the samples containing each kind of ROI;
Second, for the M samples having a given ROI, constructing a two-dimensional intensity matrix over samples and scan time from that ROI;
Third, performing edge detection in the vertical direction on the two-dimensional intensity matrix to obtain a two-dimensional gradient matrix;
Fourth, performing non-maximum suppression on the two-dimensional gradient matrix;
Fifth, computing a positive score matrix and a negative score matrix from the non-maximum-suppressed matrix by a dynamic programming algorithm, and determining the start points and end points of the chromatographic peak in the M samples from the positive scores and the negative scores, respectively;
Sixth, integrating the peak area of the chromatographic peak according to the start points and end points obtained for the M samples in the fifth step, so as to obtain the response intensity distribution of a given feature across the M samples;
and repeating the second to sixth steps to obtain the response intensity distribution of each feature in the samples containing the corresponding kind of ROI.

The detection and merging of the chromatographic peak regions (ROIs) comprises the following steps:
detecting the ROIs of all samples sequentially with the centWave algorithm to obtain the m/z mean, start retention time and end retention time of each ROI in each LC-MS raw data file;
and, according to the ROI time range and m/z of each sample, merging in time the ROIs whose m/z means lie within a given range, so as to obtain the different kinds of ROI for each sample.

Constructing the two-dimensional intensity matrix over samples and scan time from the chromatographic peak region (ROI) comprises the steps of: For a certain coexisting chromat