CN-115472227-B - Marking method based on multidimensional intestinal flora characteristics and application thereof
Abstract
The invention relates to a marking method based on multidimensional intestinal flora characteristics and its application, in the technical field at the intersection of microbiology and artificial intelligence. First, the first occurrence frequency and the sum of the relative abundances of each first genus are calculated, and all first genera are screened to obtain second genera. Next, the average relative abundance of each second genus is calculated, and all second genera are screened to obtain third genera. Then the average relative abundance difference coefficient of each third genus is calculated, and all third genera are screened to obtain fourth genera. Finally, the second and third occurrence frequencies of each fourth genus are calculated, and all fourth genera are screened to obtain the differential genera, which completes the marking of the intestinal flora characteristics. This stepwise screening determines the differential genera accurately and thereby improves prediction efficiency. In addition, a sample set constructed from the selected differential genera allows a classifier model to be established quickly and evaluated, which improves prediction accuracy.
Inventors
- YE PENGPENG
- CHEN XIAOCHUN
Assignees
- 广西爱生生命科技有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2022-08-31
Claims (8)
- 1. A marking method based on multidimensional intestinal flora characteristics, the method comprising: obtaining an absolute abundance of each first genus in the intestinal flora of each of a plurality of samples, the samples comprising healthy samples and disease samples; for each first genus, converting the absolute abundance of the first genus to a relative abundance, and calculating, from the relative abundance, a first occurrence frequency of the first genus in all of the samples and a sum of the relative abundances of the first genus in all of the samples; screening all the first genera according to the first occurrence frequency and the sum of the relative abundances to obtain second genera; for each second genus, calculating an average relative abundance of the second genus in all of the disease samples from the relative abundance of the second genus; screening all the second genera according to the average relative abundance to obtain third genera; for each third genus, calculating an average relative abundance difference coefficient of the third genus according to the average relative abundance of the third genus, and screening all the third genera according to the average relative abundance difference coefficient to obtain fourth genera; for each fourth genus, calculating, from the relative abundance of the fourth genus, a second occurrence frequency of the fourth genus in all the disease samples and a third occurrence frequency of the fourth genus in all the healthy samples, and screening all the fourth genera according to the average relative abundance difference coefficient, the second occurrence frequency and the third occurrence frequency to obtain differential genera, wherein the differential genera are the marking result of the intestinal flora characteristics; wherein calculating the average relative abundance difference coefficient of the third genus according to the average relative abundance of the third genus specifically comprises applying a difference coefficient calculation formula to the average relative abundance of the third genus; the difference coefficient calculation formula comprises: [formula not reproduced in the source text]; wherein the formula relates three quantities for each third genus: its average relative abundance difference coefficient, its average relative abundance (in the disease samples, as calculated above), and its average relative abundance in the healthy group.
- 2. The marking method of claim 1, wherein converting the absolute abundance of the first genus to a relative abundance comprises: for each of the samples, calculating a ratio of the absolute abundance of the first genus to the sum of the absolute abundances of all the first genera in the sample to obtain an intermediate abundance of the first genus; determining whether the intermediate abundance is less than a first predetermined threshold; if so, setting the intermediate abundance to 0, and otherwise keeping the intermediate abundance unchanged, to obtain an adjusted abundance of the first genus; and calculating the ratio of the adjusted abundance of the first genus to the sum of the adjusted abundances of all the first genera in the sample to obtain the relative abundance of the first genus.
- 3. The marking method of claim 1, wherein screening all the first genera according to the first occurrence frequency and the sum of the relative abundances to obtain second genera comprises: removing, from all the first genera, each first genus whose first occurrence frequency is less than a second predetermined threshold and whose sum of relative abundances is less than a third predetermined threshold, the remaining first genera being the second genera.
- 4. The marking method of claim 1, wherein screening all the second genera according to the average relative abundance to obtain third genera comprises: selecting, as the third genera, each second genus whose average relative abundance is greater than a fourth predetermined threshold or less than a fifth predetermined threshold.
- 5. The marking method of claim 1, wherein screening all the third genera according to the average relative abundance difference coefficient to obtain fourth genera comprises: selecting, as the fourth genera, each third genus for which the absolute value of the average relative abundance difference coefficient is greater than a sixth predetermined threshold.
- 6. The marking method of claim 1, wherein screening all the fourth genera according to the average relative abundance difference coefficient, the second occurrence frequency and the third occurrence frequency to obtain differential genera comprises: selecting each fourth genus whose average relative abundance difference coefficient is greater than 0 as a first dominant genus of the disease samples; selecting each fourth genus whose average relative abundance difference coefficient is less than 0 as a second dominant genus of the healthy samples; selecting, as a differential genus, each first dominant genus whose second occurrence frequency is greater than or equal to its third occurrence frequency, or for which the third occurrence frequency exceeds the second occurrence frequency by less than a seventh predetermined threshold; and selecting, as a differential genus, each second dominant genus whose third occurrence frequency is greater than or equal to its second occurrence frequency, or for which the second occurrence frequency exceeds the third occurrence frequency by less than the seventh predetermined threshold.
- 7. A classifier modeling and evaluation method for intestinal flora characteristics, the method comprising: obtaining differential genera by using the marking method of any one of claims 1-6, and constructing a sample set by taking the relative abundance of the differential genera of each of a plurality of samples as sample data; dividing the sample set into a training set and a test set, and, taking the training set as input, modeling separately with each of a plurality of machine learning algorithms to obtain a classifier model corresponding to each machine learning algorithm, wherein the machine learning algorithms comprise a random forest algorithm, a linear regression algorithm, a K-nearest neighbor algorithm and a decision tree algorithm; and, taking the test set as input, evaluating the classifier model corresponding to each machine learning algorithm according to evaluation indices, wherein the evaluation indices comprise accuracy, recall, F1-score and an ROC curve.
- 8. A terminal device comprising a processor and a computer-readable storage medium, the storage medium storing a plurality of instructions and the processor executing each of the instructions, the instructions being adapted to be loaded by the processor to perform the following: obtaining differential genera by using the marking method of any one of claims 1-6, and constructing a sample set by taking the relative abundance of the differential genera of each of a plurality of samples as sample data; dividing the sample set into a training set and a test set, and, taking the training set as input, modeling separately with each of a plurality of machine learning algorithms to obtain a classifier model corresponding to each machine learning algorithm, wherein the machine learning algorithms comprise a random forest algorithm, a linear regression algorithm, a K-nearest neighbor algorithm and a decision tree algorithm; and, taking the test set as input, evaluating the classifier model corresponding to each machine learning algorithm according to evaluation indices, wherein the evaluation indices comprise accuracy, recall, F1-score and an ROC curve.
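As an illustration of the split, train and evaluate steps in claims 7 and 8, the following is a minimal Python sketch, not the patented implementation. It assumes each sample is a tuple of relative abundances of the differential genera with a 0/1 label (1 = disease), uses a hand-written k-nearest-neighbour classifier as a stand-in for the listed algorithms, and computes accuracy, recall and F1-score only; the ROC curve is left out. All function names and defaults (`split`, `knn_predict`, `evaluate`, `test_ratio`, `k`) and the sample data are hypothetical.

```python
import math
import random
from collections import Counter


def split(rows, labels, test_ratio=0.25, seed=0):
    """Shuffle sample indices and split them into training and test parts."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_ratio))
    train, test = idx[:cut], idx[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])


def knn_predict(train_x, train_y, x, k=3):
    """Return the majority label among the k training rows nearest to x."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda pair: math.dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def evaluate(y_true, y_pred, positive=1):
    """Compute accuracy, recall and F1-score for a binary prediction."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "recall": recall, "f1": f1}


# Hypothetical data: relative abundance of two differential genera per sample.
rows = [(0.9, 0.1), (0.8, 0.2), (0.7, 0.3), (0.85, 0.15),
        (0.1, 0.9), (0.2, 0.8), (0.3, 0.7), (0.15, 0.85)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = disease sample, 0 = healthy sample

train_x, train_y, test_x, test_y = split(rows, labels)
predicted = [knn_predict(train_x, train_y, x) for x in test_x]
print(evaluate(test_y, predicted))
```

In practice a library such as scikit-learn would typically supply the models and metrics; the dependency-free form is used here only to keep the sketch self-contained. Repeating the same fit and evaluate loop for each algorithm would give the per-model comparison the claims describe.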
Description
Marking method based on multidimensional intestinal flora characteristics and application thereof
Technical Field
The invention relates to the technical field at the intersection of microbiology and artificial intelligence, and in particular to a marking and screening method based on multidimensional intestinal flora characteristics and its application.
Background
Recent epidemiological, pathological, histological, cellular and animal studies have shown that gut microorganisms mediate metabolic health to a considerable extent, that the intestinal flora affects host metabolic homeostasis, and that a disturbed intestinal flora leads to the development of a variety of common metabolic diseases, including obesity, type 2 diabetes, non-alcoholic liver disease, metabolic heart disease and malnutrition. Intestinal microbiology is expected to play an important role in developing non-invasive stool-based tests, dynamic monitoring and health prediction: by monitoring significant changes in the abundance of their own intestinal flora, people can better understand their health status and choose a suitable health intervention. At present, however, prediction relies on monitoring the abundance of all bacteria in the intestinal tract; the differential genera cannot be determined accurately, so prediction efficiency is low. There is therefore a need for a method of marking multidimensional intestinal flora characteristics based on flora abundance and occurrence frequency, and for applications of such a method.
Disclosure of Invention
The invention aims to provide a marking method based on multidimensional intestinal flora characteristics and its application, which can accurately determine the differential genera and thereby improve prediction efficiency.
To achieve the above object, the present invention provides the following solutions:
A marking method based on multidimensional intestinal flora characteristics, the method comprising: obtaining an absolute abundance of each first genus in the intestinal flora of each of a plurality of samples, the samples comprising healthy samples and disease samples; for each first genus, converting the absolute abundance of the first genus to a relative abundance, and calculating, from the relative abundance, a first occurrence frequency of the first genus in all of the samples and a sum of the relative abundances of the first genus in all of the samples; screening all the first genera according to the first occurrence frequency and the sum of the relative abundances to obtain second genera; for each second genus, calculating an average relative abundance of the second genus in all of the disease samples from the relative abundance of the second genus; screening all the second genera according to the average relative abundance to obtain third genera; for each third genus, calculating an average relative abundance difference coefficient of the third genus according to the average relative abundance of the third genus, and screening all the third genera according to the average relative abundance difference coefficient to obtain fourth genera; for each fourth genus, calculating, from the relative abundance of the fourth genus, a second occurrence frequency of the fourth genus in all the disease samples and a third occurrence frequency of the fourth genus in all the healthy samples, and screening all the fourth genera according to the average relative abundance difference coefficient, the second occurrence frequency and the third occurrence frequency to obtain differential genera, wherein the differential genera are the marking result of the intestinal flora characteristics.
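The four screening stages described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the source text does not reproduce the difference coefficient formula, so `diff_coefficient` uses an assumed normalised difference (d - h) / (d + h); the zeroing rule in `to_relative` is assumed to be "below a first threshold"; and all function names, threshold names (`t1` to `t7`), default values and sample data are hypothetical.

```python
from statistics import mean


def to_relative(counts, t1=1e-4):
    """Convert one sample's absolute genus counts to relative abundances."""
    total = sum(counts.values())
    shares = {g: (c / total if total else 0.0) for g, c in counts.items()}
    # ASSUMPTION: shares below the first threshold are zeroed, then renormalised.
    kept = {g: (s if s >= t1 else 0.0) for g, s in shares.items()}
    kept_total = sum(kept.values())
    return {g: (s / kept_total if kept_total else 0.0) for g, s in kept.items()}


def freq(samples, genus):
    """Fraction of samples in which the genus has non-zero relative abundance."""
    return sum(s.get(genus, 0.0) > 0 for s in samples) / len(samples)


def avg(samples, genus):
    """Mean relative abundance of the genus over a group of samples."""
    return mean(s.get(genus, 0.0) for s in samples)


def diff_coefficient(d, h):
    """ASSUMED formula: positive when the disease-group mean d exceeds the healthy mean h."""
    return (d - h) / (d + h) if d + h > 0 else 0.0


def screen_genera(disease, healthy, t2=0.3, t3=0.01, t4=0.01,
                  t5=0.001, t6=0.3, t7=0.1):
    """Apply the four screening stages and return the differential genera."""
    samples = disease + healthy
    first = sorted({g for s in samples for g in s})
    # Stage 1: remove genera that are both infrequent and low in summed abundance.
    second = [g for g in first
              if not (freq(samples, g) < t2
                      and sum(s.get(g, 0.0) for s in samples) < t3)]
    # Stage 2: keep genera whose disease-group mean is above t4 or below t5.
    third = [g for g in second
             if avg(disease, g) > t4 or avg(disease, g) < t5]
    # Stage 3: keep genera whose absolute difference coefficient exceeds t6.
    coef = {g: diff_coefficient(avg(disease, g), avg(healthy, g)) for g in third}
    fourth = [g for g in third if abs(coef[g]) > t6]
    # Stage 4: the group in which a genus dominates must also contain it
    # at least as often, or nearly as often (gap below t7).
    differential = []
    for g in fourth:
        fd, fh = freq(disease, g), freq(healthy, g)
        if coef[g] > 0 and (fd >= fh or fh - fd < t7):
            differential.append(g)
        elif coef[g] < 0 and (fh >= fd or fd - fh < t7):
            differential.append(g)
    return differential


# Hypothetical relative abundances for two disease and two healthy samples.
disease = [{"A": 0.6, "B": 0.395, "D": 0.005}, {"A": 0.7, "B": 0.3}]
healthy = [{"A": 0.1, "B": 0.4, "C": 0.5}, {"A": 0.2, "B": 0.3, "C": 0.5}]
print(screen_genera(disease, healthy))
```

In this toy data, genus D is dropped at stage 1 as rare and sparse, B is dropped at stage 3 because its group means are nearly equal, and A and C pass all four stages.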
A classifier modeling and evaluation method for intestinal flora characteristics, the method comprising: obtaining differential genera by using the marking method, and constructing a sample set by taking the relative abundance of the differential genera of each of a plurality of samples as sample data; dividing the sample set into a training set and a test set, and, taking the training set as input, modeling separately with each of a plurality of machine learning algorithms to obtain a classifier model corresponding to each machine learning algorithm, wherein the machine learning algorithms comprise a random forest algorithm, a linear regression algorithm, a K-nearest neighbor algorithm and a decision tree algorithm; and, taking the test set as input, evaluating the classifier model corresponding to each machine learning algorithm according to evaluation indices, wherein the evaluation indices comprise accuracy, recall, F1-score and an ROC curve.
A classifier model of intestinal flora characteristics, the classifier model being constructed by the following classifier modeling and evaluation method for intestinal flora characteristics: obtaining differential genera by using the marking method, and constructing a sample set