CN-120526862-B - Infection pathogen type detecting system based on 5 markers


Abstract

The invention provides an infection pathogen type detection system based on 5 markers: GSDMD, p-MLKL, IL-6, PCT and CRP. The system comprises (1) a detection module for measuring the levels of GSDMD, p-MLKL, IL-6, PCT and CRP in a sample, (2) an analysis module for analyzing the data with a random forest algorithm and a neural network algorithm to construct a bacterial infection diagnosis model, and (3) a diagnosis module for judging whether a sample to be diagnosed shows bacterial infection based on the machine learning model. The diagnostic accuracy of the system reaches 0.909, and its sensitivity, specificity, positive predictive value and negative predictive value are 0.95, 0.85, 0.90 and 0.92, respectively.
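The five figures quoted in the abstract are standard confusion-matrix metrics. The sketch below shows how such values are usually computed; the function name `diagnostic_metrics` and the counts passed to it are illustrative assumptions, invented for this example and not taken from the patent's test data.

```python
def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts.

    Positive class = bacterial infection, negative class = non-bacterial.
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Hypothetical counts (an assumption for illustration, NOT the patent's data).
metrics = diagnostic_metrics(tp=19, fp=2, tn=11, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting all five values together matters because accuracy alone can hide an imbalance between missed bacterial infections (low sensitivity) and false alarms (low specificity).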

Inventors

CHEN XIAOPING
ZHENG HAO
ZHANG BIKE
WU YUAN
DONG YUJUN
LIANG ZEYIN

Assignees

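National Institute for Communicable Disease Control and Prevention, Chinese Center for Disease Control and Prevention (中国疾病预防控制中心传染病预防控制所)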

Dates

Publication Date: 20260508
Application Date: 20250724

Claims (9)

1. A pathogen infection type detection system for non-diagnostic purposes based on 5 markers, wherein the 5 markers comprise GSDMD, p-MLKL, IL-6, PCT and CRP, the system comprising the following modules: (1) a training data set input module for inputting sample data and constructing a training data set, wherein each sample in the training data set comprises 5 input features, namely the detection values of GSDMD, p-MLKL, IL-6, PCT and CRP, and a binary diagnosis label of bacterial infection or non-bacterial infection; (2) a training module for training on the training data set constructed by the training data set input module through a machine learning method and establishing a binary prediction model of bacterial infection versus non-bacterial infection based on the 5 input features; (3) a prediction module for predicting the binary diagnosis label of bacterial infection or non-bacterial infection for an input sample to be detected, based on the prediction model obtained by the training module and the 5 input features of that sample; wherein the machine learning method in the training module uses a random forest classification model whose base learners are 100 decision trees, each decision tree draws a training data subset from the training data set by bootstrap sampling, 2 features are randomly selected from all features for optimal splitting at each node split, the minimum number of samples required to split a node is set to 2, each decision tree is trained independently, the class weights are automatically adjusted to balance the influence of each class during model training, and the binary diagnosis label is predicted by majority voting; and in the prediction module, the binary diagnosis label of the sample to be detected is predicted from its 5 input features by the same majority voting method as in the training module.
2. The 5-marker based pathogen infection type detection system of claim 1, wherein in the training module, each decision tree uses the Gini index as the splitting criterion at node splits, the Gini index of node t being defined as: $\mathrm{Gini}(t) = 1 - \sum_{k=1}^{K} p_k^2$, wherein $p_k$ is the proportion of class-k samples in node t and K is the total number of classes, and each split selects the feature and split point that minimize the weighted average Gini index.
3. The pathogen infection type detection system based on 5 markers of claim 1, wherein in the training module the class weight adjustment is set to: $w_c = \frac{n_{\text{samples}}}{K \cdot n_c}$, wherein $n_{\text{samples}}$ is the total number of samples, K is the number of classes, and $n_c$ is the number of samples of class c.
4. The pathogen infection type detection system based on 5 markers according to claim 1, wherein in the training module, the predicted class is determined by majority voting: $\hat{y} = \arg\max_{c \in C} \sum_{i} I(y_i = c)$, wherein $\hat{y}$ is the predicted class, C is the set of all possible classes, $y_i$ is the class predicted by the i-th decision tree, and I(·) is the indicator function, taking 1 when $y_i = c$ and 0 otherwise.
5. A pathogen infection type detection system for non-diagnostic purposes based on 5 markers, wherein the 5 markers comprise GSDMD, p-MLKL, IL-6, PCT and CRP, the system comprising the following modules: (1) a training data set input module for inputting sample data and constructing a training data set, wherein each sample in the training data set comprises 5 input features, namely the detection values of GSDMD, p-MLKL, IL-6, PCT and CRP, and a binary diagnosis label of bacterial infection or non-bacterial infection; (2) a training module for training on the training data set constructed by the training data set input module through a machine learning method and establishing a binary prediction model of bacterial infection versus non-bacterial infection based on the 5 input features; (3) a prediction module for predicting the binary diagnosis label of bacterial infection or non-bacterial infection for an input sample to be detected, based on the prediction model obtained by the training module and the 5 input features of that sample; wherein the machine learning method uses a multi-layer feedforward neural network, the training module comprising an input layer, 1 to 2 hidden layers and 1 output layer, wherein the input layer contains 5 nodes for the detection values of GSDMD, p-MLKL, IL-6, PCT and CRP, the output layer has 1 node whose activation function is a Sigmoid function to produce a binary-classification probability output, the multi-layer feedforward neural network is trained by back propagation with stochastic gradient descent, the learning rate is set to 0.001, L2 regularization is used to prevent overfitting, a cross entropy loss function is computed, the model is considered converged and training is stopped when the loss function is observed to stabilize, and an independent test data set is established; and in the prediction module, the 5 input features of the sample to be detected are input into the neural network, a probability between 0 and 1 is computed layer by layer and output, a decision threshold is set, and the sample is judged as positive class 1, namely bacterial infection, if the output probability is greater than or equal to the threshold, and otherwise as negative class 0, namely non-bacterial infection.
6. The 5-marker based pathogen infection type detection system of claim 5, wherein each hidden layer comprises 10-32 neurons, and the activation function is a Sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$; the output of a neuron is: $a = f\left(\sum_{i} w_i x_i + b\right)$, wherein $x_i$ is the output of the i-th neuron of the previous layer, $w_i$ is the weight, $b$ is the bias, and $f$ is the activation function.
7. The 5-marker based pathogen infection type detection system of claim 5, wherein the cross entropy loss function is: $L = -\left[y \log \hat{p} + (1 - y) \log(1 - \hat{p})\right]$, wherein $y$ is the true diagnosis label, either 0 or 1, and $\hat{p}$ is the probability predicted by the model.
8. The pathogen infection type detection system based on 5 markers according to claim 5, wherein the prediction module applies to the sample to be predicted the same feature normalization as in the training phase, inputs the sample into the neural network, computes layer by layer an output probability between 0 and 1, sets a decision threshold, and judges the sample as positive, namely bacterial infection, if the output probability is greater than or equal to the threshold, and as negative, namely non-bacterial infection, if the output probability is less than the threshold.
9. The 5-marker based pathogen infection type detection system of claim 8, wherein the decision threshold is 0.5.
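The prediction path of claims 5 to 9 (sigmoid neurons, a single probability output, a cross entropy loss and a 0.5 threshold) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the helper names (`dense`, `forward`, `classify`), the 10-neuron hidden layer, the all-zero placeholder weights and the sample feature values are hypothetical. Training by back propagation with stochastic gradient descent and L2 regularization is not shown.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function 1 / (1 + e^-z)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dense(inputs, weights, biases):
    """One fully connected layer: a_j = sigmoid(sum_i w_ji * x_i + b_j)."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(features, layers):
    """Propagate the 5 normalized marker values through every layer."""
    activations = features
    for weights, biases in layers:
        activations = dense(activations, weights, biases)
    return activations[0]  # the single output node: probability in (0, 1)


def cross_entropy(y: int, p: float, eps: float = 1e-12) -> float:
    """Binary cross entropy for one sample; p is clipped to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def classify(p: float, threshold: float = 0.5) -> int:
    """1 = bacterial infection if p >= threshold, else 0 = non-bacterial."""
    return 1 if p >= threshold else 0


# Placeholder network: 5 inputs -> 10 hidden neurons -> 1 output.
# All-zero weights are an untrained stand-in, not a fitted model.
layers = [
    ([[0.0] * 5 for _ in range(10)], [0.0] * 10),
    ([[0.0] * 10], [0.0]),
]
x = [0.2, 0.7, 0.1, 0.4, 0.9]  # hypothetical normalized marker values
p = forward(x, layers)
print(p, classify(p), cross_entropy(1, p))
```

With untrained zero weights every neuron outputs 0.5, so the example only exercises the data flow. Note that the `>=` comparison sends a probability exactly at the threshold to the positive class, as the claims specify.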

Description

Infection pathogen type detecting system based on 5 markers

Technical Field

The invention discloses a pathogen type detection system and belongs to the technical fields of microbiology and artificial intelligence.

Background

Early and accurate diagnosis of bloodstream infection (BSI) is key to improving prognosis. The conventional biomarkers currently used in clinical testing include procalcitonin (PCT), the acute-phase protein C-reactive protein (CRP), and interleukin-6 (IL-6). PCT is currently recognized as the best biomarker of bacterial infection and has the highest diagnostic accuracy, but its clinical performance still has limitations: ① its overall specificity for identifying the pathogen type is only about 70% (meta-analysis data); ② because its rise is delayed within the first 24 hours of infection, it is better suited to assessing sepsis severity than to early diagnosis of bloodstream infection; and ③ notably, non-infectious factors (such as burns, intestinal ischemia and postoperative trauma) can also cause abnormal PCT elevation. The sensitivity (generally < 85%) and specificity (generally < 90%) of the existing markers therefore do not meet the diagnostic technical standards for bloodstream infection recommended by ESCMID (European Society of Clinical Microbiology and Infectious Diseases), and novel markers need to be discovered and validated.
The object of the present invention is to provide a more sensitive and specific pathogen type detection system.

Disclosure of Invention

Based on the above object, the present invention provides a pathogen infection type detection system based on 5 markers, namely GSDMD, p-MLKL, IL-6, PCT and CRP, the system comprising the following modules:
(1) a training data set input module for inputting sample data and constructing a training data set, wherein each sample in the training data set comprises 5 input features, namely the detection values of GSDMD, p-MLKL, IL-6, PCT and CRP, and a binary diagnosis label of bacterial infection or non-bacterial infection;
(2) a training module for training on the training data set constructed by the training data set input module through a machine learning method and establishing a binary prediction model of bacterial infection versus non-bacterial infection based on the 5 input features;
(3) a prediction module for predicting bacterial infection or non-bacterial infection for an input sample to be detected, based on the prediction model obtained by the training module and the 5 input features of that sample.
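As an illustration of module (1), the sketch below shows one way a training data set with 5 input features and a binary label could be represented. The names `Sample`, `MARKERS` and `build_training_set`, the 1/0 label encoding at this stage, and the numeric values are all assumptions made for this example; the patent does not prescribe a data layout or units.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Fixed feature order shared by the training and prediction modules.
MARKERS = ("GSDMD", "p-MLKL", "IL-6", "PCT", "CRP")


@dataclass(frozen=True)
class Sample:
    values: Dict[str, float]     # marker name -> detection value
    label: Optional[int] = None  # 1 = bacterial, 0 = non-bacterial, None = unknown


def to_features(sample: Sample) -> List[float]:
    """Return the 5 detection values in the fixed MARKERS order."""
    return [sample.values[m] for m in MARKERS]


def build_training_set(samples: List[Sample]) -> Tuple[List[List[float]], List[int]]:
    """Module (1): collect feature rows X and binary labels y."""
    X, y = [], []
    for s in samples:
        if s.label not in (0, 1):
            raise ValueError("every training sample needs a 0/1 diagnosis label")
        X.append(to_features(s))
        y.append(s.label)
    return X, y


# Placeholder values for illustration only (not measured data).
samples = [
    Sample({"GSDMD": 1.0, "p-MLKL": 2.0, "IL-6": 3.0, "PCT": 4.0, "CRP": 5.0}, 1),
    Sample({"GSDMD": 0.5, "p-MLKL": 0.4, "IL-6": 0.3, "PCT": 0.2, "CRP": 0.1}, 0),
]
X, y = build_training_set(samples)
print(X, y)
```

Keeping a single fixed feature order in one place helps ensure that the prediction module sees the markers in the same positions as the training module did.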
In a preferred embodiment, the machine learning method in the training module uses a random forest classification model comprising 100 decision trees as base learners. Each decision tree draws a training data subset from the training data set by bootstrap sampling, 2 features are randomly selected from all features for optimal splitting at each node split, the minimum number of samples required to split a node is set to 2, each decision tree is trained independently, the class weights are automatically adjusted to balance the influence of each class during model training, and the binary diagnosis label is predicted by majority voting. In the prediction module, the binary diagnosis label of the sample to be detected is predicted from its 5 input features (the detection values of GSDMD, p-MLKL, IL-6, PCT and CRP) by the same majority voting method as in the training module.
In a more preferred embodiment, in the training module, each decision tree uses the Gini index as the splitting criterion at node splits, the Gini index of a node t being defined as: $\mathrm{Gini}(t) = 1 - \sum_{k=1}^{K} p_k^2$, where $p_k$ is the proportion of class-k samples in node t and K is the total number of classes, and each split selects the feature and split point that minimize the weighted average Gini index.
In another more preferred embodiment, in the training module, the class weight adjustment is set to: $w_c = \frac{n_{\text{samples}}}{K \cdot n_c}$, where $n_{\text{samples}}$ is the total number of samples, K is the number of classes, and $n_c$ is the number of samples of class c.
In yet another more preferred embodiment, in the training module, the predicted class is determined by majority voting: $\hat{y} = \arg\max_{c \in C} \sum_{i} I(y_i = c)$, where $\hat{y}$ is the predicted class, C is the set of all possible classes, $y_i$ is the class predicted by the i-th decision tree, and I(·) is the indicator function, taking 1 when $y_i = c$ and 0 otherwise.
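The building blocks named in the random forest embodiments (Gini index, balanced class weights, bootstrap sampling, random feature choice and majority voting) can be sketched in plain Python as follows. This is an illustrative sketch, not the patent's implementation: all function names are hypothetical, tree growing itself is omitted, and the tie-breaking behavior of the vote is an assumption of this example.

```python
import random
from collections import Counter


def gini(labels):
    """Gini index of a non-empty node: 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Weighted average Gini index of a candidate split (lower is better)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def balanced_class_weights(labels):
    """Class weights w_c = n_samples / (K * n_c)."""
    counts = Counter(labels)
    n_samples, k = len(labels), len(counts)
    return {c: n_samples / (k * n_c) for c, n_c in counts.items()}


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (one bootstrap sample per tree)."""
    return [rng.randrange(n) for _ in range(n)]


def candidate_features(n_features, rng, k=2):
    """Pick k of the features at random for one node split (k = 2 here)."""
    return rng.sample(range(n_features), k)


def majority_vote(tree_predictions):
    """Class with the most votes; ties go to the class seen first (assumption)."""
    return Counter(tree_predictions).most_common(1)[0][0]


labels = [1, 1, 1, 0]                       # toy labels, 1 = bacterial
print(gini(labels))                         # 0.375
print(weighted_gini([1, 1, 1], [0]))        # 0.0 for a perfectly pure split
print(balanced_class_weights(labels))       # rarer class gets the larger weight
rng = random.Random(0)                      # fixed seed for repeatability
print(bootstrap_indices(len(labels), rng), candidate_features(5, rng))
print(majority_vote([1, 0, 1, 1, 0]))       # 1
```

The class weight formula gives the minority class a proportionally larger weight, which is why it helps when bacterial and non-bacterial samples are unevenly represented in the training set.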
In another preferred embodiment, the machine learning method in the training module is performed using a multi-layer feedforward neural network, the training module includes an input layer containing 5 nodes of detection values of GSDMD, p-MLKL, IL-6, PCT and CRP, 1 to 2 hidden layers and 1 output layer, the outp