EP-4361624-B1 - METHOD FOR ESTIMATING CONTENT RATIO OF COMPONENTS CONTAINED IN SAMPLE, COMPOSITION ESTIMATING DEVICE, AND PROGRAM

Inventors

NAITO, Masanobu
HIBI, Yusuke

Dates

Publication Date: 2026-05-06
Application Date: 2022-06-06

Claims (15)

1. A method of inferring a content ratio of a component in an inference target sample containing at least one type of the component selected from K types of components while K is an integer equal to or greater than 1, the method comprising:
   preparing learning samples of a number equal to or greater than K containing at least one type of component selected from the K types of components and having compositions different from each other, and a background sample not containing the component;
   sequentially ionizing gas components generated by thermal desorption and/or pyrolysis while heating each sample in a sample set including the inference target sample, the learning samples, and the background sample, and observing mass spectra continuously;
   storing, into respective rows, mass spectra acquired for a plurality of heating temperatures to acquire two-dimensional mass spectra of the respective samples, and merging at least two or more of the two-dimensional spectra and converting the spectra into a data matrix;
   performing NMF process by which the data matrix is subjected to non-negative matrix factorization to be factorized into the product of a normalized base spectrum matrix and a corresponding intensity distribution matrix;
   extracting a noise component in the intensity distribution matrix through analysis on canonical correlation between the base spectrum matrix and the data matrix, and correcting the intensity distribution matrix so as to reduce influence by the noise component, thereby acquiring a corrected intensity distribution matrix;
   partitioning the corrected intensity distribution matrix into a submatrix corresponding to each of the samples, and expressing each of the samples in vector space using the submatrix as a feature vector;
   defining a K-1 dimensional simplex including all of the feature vectors and determining K end members in the K-1 dimensional simplex; and
   calculating a Euclidean distance between each of the K end members and the feature vector of the inference target sample, and inferring a content ratio of the component in the inference target sample on the basis of a ratio of the Euclidean distance,
   wherein if the K is equal to or greater than 3, at least one of the feature vectors of the learning samples is present in each region external to a hypersphere inscribed in the K-1 dimensional simplex or the learning samples contain at least one of the end members.
2. The method according to claim 1, wherein the K is an integer equal to or greater than 2, and the inference target sample is a mixture of the components.
3. The method according to claim 1 or 2, wherein the learning sample contains the end member.
4. The method according to claim 3, wherein the end member is determined on the basis of a determination label given to the learning sample.
5. The method according to claim 3, wherein the end member is determined through vertex component analysis on the feature vector of the learning sample.
6. The method according to claim 1, wherein the end member is determined on the basis of an algorithm by which a vertex is defined in such a manner that the K-1 dimensional simplex has a minimum volume.
7. The method according to claim 6, wherein the end member is determined by second NMF process by which the corrected intensity distribution matrix is subjected to non-negative matrix factorization to be factorized into the product of a matrix representing the weight fractions of the K types of components in the sample and a matrix representing an individual fragment abundance of each of the K types of components.
8. The method according to claim 6 or 7, wherein the learning sample does not contain the end member.
9. The method according to any one of claims 1 to 8, wherein acquiring the corrected intensity distribution matrix further includes making intensity correction on the intensity distribution matrix.
10. The method according to claim 9, wherein the sample set further includes a calibration sample, the calibration sample contains all the K types of components and has a known composition, and the intensity correction includes normalizing the intensity distribution matrix.
11. The method according to claim 9, wherein the intensity correction includes allocating at least part of the intensity distribution matrix using the product of the mass of the corresponding sample in the sample set and an environmental variable, and the environmental variable is a variable representing influence on ionization efficiency of the component during the observation.
12. The method according to claim 11, wherein the environmental variable is a total value of peaks in a mass spectrum of a compound having a molecular weight of 50-1500 contained in a predetermined quantity in an atmosphere during the observation or an organic low-molecular compound having a molecular weight of 50-500 contained in a predetermined quantity in each sample in the sample set.
13. A composition inference device that infers a content ratio of a component in an inference target sample containing at least one type of the component selected from K types of components while K is an integer equal to or greater than 1, the composition inference device comprising:
   a mass spectrometer that sequentially ionizes gas components generated by thermal desorption and/or pyrolysis while heating each of samples in a sample set including learning samples of a number equal to or greater than K containing at least one type of component selected from the K types of components and having compositions different from each other, a background sample not containing the component, and the inference target sample, and observes mass spectra continuously; and
   an information processing device that processes the observed mass spectra,
   wherein the information processing device comprises:
   a data matrix generating part that stores, into respective rows, mass spectra acquired for a plurality of heating temperatures to acquire two-dimensional mass spectra of the respective samples, and merges at least two or more of the two-dimensional spectra and converts the spectra into a data matrix;
   an NMF processing part that performs NMF process by which the data matrix is subjected to non-negative matrix factorization to be factorized into the product of a normalized base spectrum matrix and a corresponding intensity distribution matrix;
   a correction processing part that extracts a noise component in the intensity distribution matrix through analysis on canonical correlation between the base spectrum matrix and the data matrix, and corrects the intensity distribution matrix so as to reduce influence by the noise component, thereby generating a corrected intensity distribution matrix;
   a vector processing part that partitions the corrected intensity distribution matrix into a submatrix corresponding to each of the samples in the sample set, and expresses each of the samples in vector space using the submatrix as a feature vector;
   an end member determining part that defines a K-1 dimensional simplex including all of the feature vectors and determines K end members in the K-1 dimensional simplex; and
   a content ratio calculating part that calculates a Euclidean distance between each of the K end members and the feature vector of the inference target sample, and infers a content ratio of the component in the inference target sample on the basis of a ratio of the Euclidean distance, and
   if the K is equal to or greater than 3, at least one of the feature vectors of the learning samples is present in each region external to a hypersphere inscribed in the K-1 dimensional simplex or the learning samples contain at least one of the end members.
14. The composition inference device according to claim 13, wherein the end member determining part determines the end member on the basis of a determination label given to the learning sample.
15. A program used in a composition inference device that infers a content ratio of a component in an inference target sample containing at least one type of the component selected from K types of components while K is an integer equal to or greater than 1, the composition inference device comprising:
   a mass spectrometer that sequentially ionizes gas components generated by thermal desorption and/or pyrolysis while heating each of samples in a sample set including learning samples of a number equal to or greater than K containing at least one type of component selected from the K types of components and having compositions different from each other, a background sample not containing the component, and the inference target sample, and observes mass spectra continuously; and
   an information processing device that processes the observed mass spectra,
   the program comprising:
   a data matrix generating function of storing, into respective rows, mass spectra acquired for a plurality of heating temperatures by the mass spectrometer to acquire two-dimensional mass spectra of the respective samples, and merging at least two or more of the two-dimensional spectra and converting the spectra into a data matrix;
   an NMF processing function of performing NMF process by which the data matrix is subjected to non-negative matrix factorization to be factorized into the product of a normalized base spectrum matrix and a corresponding intensity distribution matrix;
   a correction processing function of extracting a noise component in the intensity distribution matrix through analysis on canonical correlation between the base spectrum matrix and the data matrix, and correcting the intensity distribution matrix so as to reduce influence by the noise component, thereby generating a corrected intensity distribution matrix;
   a vector processing function of partitioning the corrected intensity distribution matrix into a submatrix corresponding to each of the samples, and expressing each of the samples in vector space using the submatrix as a feature vector;
   an end member determining function of defining a K-1 dimensional simplex including all of the feature vectors and determining K end members in the K-1 dimensional simplex; and
   a content ratio calculating function of calculating a Euclidean distance between each of the K end members and the feature vector of the inference target sample, and inferring a content ratio of the component in the inference target sample on the basis of a ratio of the Euclidean distance,
   wherein if the K is equal to or greater than 3, at least one of the feature vectors of the learning samples is present in each region external to a hypersphere inscribed in the K-1 dimensional simplex or the learning samples contain at least one of the end members.
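
The data-matrix, NMF, feature-vector, and distance-ratio steps recited in the independent claims can be illustrated with a minimal Python sketch on synthetic data. It is not the patented implementation, and several points are assumptions made only for illustration: the solver is a plain multiplicative-update NMF (the claims do not name a solver); the background sample and the canonical-correlation noise correction are omitted; the end members are taken directly from two labeled pure learning samples; and, for K = 2, the "ratio of the Euclidean distance" is read as the distance to the other end member divided by the sum of both distances. All function names (`nmf_normalized`, `feature_vectors`, `content_ratio_k2`) and the Gaussian test spectra are hypothetical.

```python
import numpy as np


def nmf_normalized(X, n_components, n_iter=1000, seed=0, eps=1e-12):
    """Factorize X ~ C @ S with non-negative C and S.

    Rows of X are mass spectra, rows of S are base spectra normalized to
    unit sum, and C holds the corresponding intensity distribution.
    """
    rng = np.random.default_rng(seed)
    C = rng.random((X.shape[0], n_components)) + eps
    S = rng.random((n_components, X.shape[1])) + eps
    for _ in range(n_iter):
        # Multiplicative updates for the Frobenius-norm objective
        # (an assumed solver choice, not taken from the claims).
        C *= (X @ S.T) / (C @ (S @ S.T) + eps)
        S *= (C.T @ X) / ((C.T @ C) @ S + eps)
        # Normalize each base spectrum and move its scale into C,
        # which leaves the product C @ S unchanged.
        scale = S.sum(axis=1, keepdims=True) + eps
        S /= scale
        C *= scale.T
    return C, S


def feature_vectors(C, n_samples):
    """Split C into one equally sized row block per sample and flatten
    each block into a single feature vector."""
    return C.reshape(n_samples, -1)


def content_ratio_k2(x, end_a, end_b):
    """Assumed K = 2 reading of the distance ratio: the share of a
    component is the distance to the other end member over the sum."""
    d_a = np.linalg.norm(x - end_a)
    d_b = np.linalg.norm(x - end_b)
    return np.array([d_b, d_a]) / (d_a + d_b)


def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)


# Synthetic stand-ins for measured data (hypothetical values).
temps = np.arange(20.0)  # heating steps: rows of each 2D spectrum
mz = np.arange(30.0)     # m/z channels: columns
spec_a, spec_b = gauss(mz, 8, 2), gauss(mz, 20, 2)
prof_a, prof_b = gauss(temps, 7, 2), gauss(temps, 13, 2)


def two_d_spectrum(w_a):
    """Temperature-by-m/z spectrum of a mixture with weight fraction w_a of A."""
    return w_a * np.outer(prof_a, spec_a) + (1 - w_a) * np.outer(prof_b, spec_b)


# Pure A, pure B (learning samples), then the inference target.
weights_a = [1.0, 0.0, 0.3]
X = np.vstack([two_d_spectrum(w) for w in weights_a])  # merged data matrix

C, S = nmf_normalized(X, n_components=2)
F = feature_vectors(C, n_samples=len(weights_a))
ratio = content_ratio_k2(F[2], end_a=F[0], end_b=F[1])
print("estimated content ratio (A, B):", ratio)
```

Because the synthetic target is an exact linear mixture, its feature vector lies on the segment between the two end members, so the estimate should come out close to the 0.3 weight fraction used to build it. Real measurements would additionally need the noise correction and the intensity correction described in the claims, which this sketch leaves out.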

Description

Technical Field

The present invention relates to a method of inferring a content ratio of a component in a sample, a composition inference device, and a program.

Background Art

As a method for inferring the component composition of a sample that is a multicomponent mixture system, there is a conventionally known method of separating the components in the sample by some process and then identifying and quantifying each fractionated component. One such identification method uses a mass spectrum acquired by a mass spectrometer. However, the information acquired from mass spectra is a huge volume of multidimensional data with a large number of correlations, which makes it difficult to extract, intuitively or empirically, a distinctive signal that leads to component identification. As described in Non-Patent Literature 1, a method of searching for a feature by reducing the dimension of such multidimensional data through principal component analysis has been used recently.

However, even if such a known data analysis technique is used, it is still inherently difficult to achieve quantification using a mass spectrum whose peak intensity depends on ionization efficiency. Quantitative analysis first requires determining the ionization efficiency, which is an unknown variable for each compound. To achieve this, isotopic samples having known sample volumes and the same ionization efficiency should be used as benchmark materials.

US 2007/0288174 A1 discloses the following: "A system is provided for analyzing metabolomics data received from an analytical device across a group of samples. The system automatically receives a data matrix corresponding to each of the samples, wherein the data matrix includes rows corresponding to each of the samples and columns corresponding to a group of ions present in the respective samples. A processor is provided for determining a characteristic value corresponding to at least one of a group of components present in the samples, wherein the components are made up of at least a portion of the group of ions, using at least one of a correlation function and a factorization function. A user interface is in communication with the processor for displaying a visual indication of the characteristic value such that a user may receive a visual indication of the types of components present in the samples."

US 2009/0121125 A1 discloses the following: "A method of obtaining pure component mass spectra or pure peak elution profiles from mass spectra of a mixture of components involves estimating number of components in the mixture, filtering noise, and extracting individual component mass spectra or pure peak elution profiles using blind entropy minimization with direct optimization (e.g. downhill simplex minimization). The method may be applied to deconvolution of pure GC/MS spectra of overlapping or partially overlapping isotopologues or other compounds, separation of overlapping or partially overlapping compounds in proteomics or metabolomics mass spectrometry applications, peptide sequencing using high voltage fragmentation followed by deconvolution of the obtained mixture mass spectra, deconvolution of MALDI mass spectra in the separation of multiple components present in a single solution, and specific compound monitoring in security and/or environmentally sensitive areas."

Non-Patent Literature 2 discloses the following: "Metabolic profiling of biological samples involves nuclear magnetic resonance (NMR) spectroscopy and mass spectrometry coupled with powerful statistical tools for complex data analysis. Here, we report a robust, sparseness-based method for the blind separation of analytes from mixtures recorded in spectroscopic and spectrometric measurements. The advantage of the proposed method in comparison to alternative blind decomposition schemes is that it is capable of estimating the number of analytes, their concentrations, and the analytes themselves from available mixtures only. The number of analytes can be less than, equal to, or greater than the number of mixtures. The method is exemplified on blind extraction of four analytes from three mixtures in 2D NMR spectroscopy and five analytes from two mixtures in mass spectrometry. The proposed methodology is of widespread significance for natural products research and the field of metabolic studies, whereupon mixtures represent samples isolated from biological fluids or tissue extracts."

Citation List

Non-Patent Literature

Non-Patent Literature 1: Analytical Chemistry, Vol. 92, Issue 2, pp. 1925-1933 (2020)
Non-Patent Literature 2: Ivica Kopriva and Ivanka Jeric, Analytical Chemistry, Vol. 82, pp. 1911-1920 (2010)

Summary of Invention

Technical Problem

According to the method described in Non-Patent Literature 1, much of the information contained in the original multidimensional data is lost in the course of the dimension reduction. This makes quantitative analysis of a component impossible even though qualitative analysis of it is possible. F