CN-122024835-A - Prediction model of microbial targeted sequencing positive threshold value, construction method and application thereof


Abstract

The invention relates to a prediction model for the positive threshold of microbial targeted sequencing, together with its construction method and application. The construction method comprises the following steps: (S1) collecting information on known microorganism samples and their tNGS detection results as a dataset; (S2) screening tNGS feature quantities from the detection results to form a feature quantity set; and (S3) dividing the dataset into a training set and a test set and training a machine learning model, wherein the prediction label of the training set is the strain identification result of the known microorganism sample and the feature quantity set serves as the features. In the invention, interference signals are removed through data cleaning, key information is extracted through feature selection and extraction, and the machine learning model learns the optimal threshold rule, so that the detection result is highly consistent with the gold standard. At the same time, the contribution of each feature quantity to the threshold judgment is made explicit, which supports the interpretability of the detection result.

Inventors

CAO BIN
WU CHUNQIU
WANG YEMING
LIU JUN
YAN MENGWEI
HU CHAOHUI
CHEN DAN
Lu Binghuai
ZHU PENGYUAN
ZHANG YULIN
WU JING
Pu Dongya
WANG YANG
LIU QI

Assignees

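China-Japan Friendship Hospital (China-Japan Friendship Institute of Clinical Medical Sciences) (中日友好医院(中日友好临床医学研究所))
Guangzhou Kingcreate Biotechnology Co., Ltd. (广州市金圻睿生物科技有限责任公司)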

Dates

Publication Date: 20260512
Application Date: 20260121

Claims (10)

1. A method of constructing a predictive model of a microbial targeted sequencing positive threshold, the method comprising the steps of: (S1) collecting information on known microorganism samples and their tNGS detection results as a dataset; (S2) screening tNGS feature quantities from the detection results to form a feature quantity set; and (S3) dividing the dataset into a training set and a test set and training a machine learning model, wherein the prediction label of the training set is the strain identification result of the known microorganism sample and the feature quantity set serves as the features.
2. The method of constructing a predictive model of a microbial targeted sequencing positive threshold according to claim 1, wherein the information on the known microorganism sample in step (S1) includes a strain identification result; preferably, the tNGS detection result in step (S1) includes the number of raw sequences of the sample, the Q30 value of the sequences, the normalized sequence number of the internal reference, the normalized sequence number of the pathogen, and the sequence number of background pathogens.
3. The method of constructing a predictive model of a microbial targeted sequencing positive threshold according to claim 1 or 2, wherein step (S1) further comprises a step of cleaning the dataset, the cleaning comprising: completing the data by interpolation when only a few tNGS detection results are missing; and rejecting a sample when its strain identification result is missing or when too many of its detection results are missing.
4. The method of constructing a predictive model of a microbial targeted sequencing positive threshold according to any one of claims 1-3, wherein step (S2) specifically comprises: (S2-1) constructing an initial feature quantity set by selecting features related to the negative/positive judgment of microorganisms, based on the biological principle of microorganism detection; and (S2-2) eliminating redundant feature quantities, namely: calculating the variance of each feature quantity in the initial feature quantity set and eliminating low-variance feature quantities that do not contribute to distinguishing samples; and calculating the correlation between every two feature quantities and, of two feature quantities whose correlation is lower than a preset threshold value, eliminating the one with the lower contribution degree; preferably, the method for calculating the correlation and the contribution degree comprises any one of a random forest contribution degree and Pearson correlation coefficient method, a variance inflation factor method, a LASSO regression method, or a mutual information method; preferably, the initial feature quantity set includes raw_reads_num, clean_reads_num, raw_Q30, valid_reads_num, valid_micro_reads_num, sum_nc, valid_nail_rpk, valid_host_rpk, patho_reads, patho_genus_reads, patho_RPK, patho_specific_rpk, patho_ratio_by_clean, patho_ratio_by_valid, patho_ratio_by_micro, patho_ratio_by_genus, amp_hit_num, amp_all_num, amp_cov_ratio, patho_median_rpk, pos_ppv, patho_mean_clean, patho_iqr_mean, bio_ppv, bio_se, pos_se, pos_neg_ratio, LightGBM_ShapV, LightGBM_pathotype, patho_clinical_level, patho_intra_ntc_coef, patho_inter_ntc_coef, patho_unique_rpk, patho_intra_samnum, patho_intra_samratio, patho_other_runsample, patho_ratio_by_run, patho_pvalue, neighbor_size, neighbor_mean, neighbor_ratio, high_neighbor_position, high_neighbor_value and high_neighbor_ratio.
5. The method of constructing a predictive model of a microbial targeted sequencing positive threshold according to any one of claims 1-4, wherein the machine learning model comprises at least one of a multi-layer perceptron, logistic regression, random forest, XGBoost, naive Bayes, support vector machine, or K-nearest neighbor algorithm.
6. The method of constructing a predictive model of a microbial targeted sequencing positive threshold according to any one of claims 1-5, wherein the training of step (S3) comprises: iteratively training candidate models by cross-validation, and optimizing the model hyperparameters with a preset performance index on the validation set as the optimization target; taking the average performance index obtained by cross-validation as the evaluation basis, and selecting, by a feature selection algorithm, an optimal combination of feature quantities from all feature quantities as a key feature quantity set; and comparing the average performance of the machine learning models under cross-validation, and selecting the model with the highest average performance, together with its corresponding key feature quantity set, to form the prediction model; preferably, the feature selection algorithm comprises a recursive feature elimination method; preferably, the average performance index comprises at least one of average AUC, average accuracy, average F1 score, average recall, or average precision.
7. A model for predicting a microbial targeted sequencing positive threshold, wherein the model is constructed by the method for constructing a model for predicting a microbial targeted sequencing positive threshold according to any one of claims 1 to 6.
8. Use of the method of constructing a predictive model of a microbial targeted sequencing positive threshold according to any one of claims 1-6 or the predictive model of a microbial targeted sequencing positive threshold according to claim 7 in microbial targeted sequencing.
9. An electronic device comprising one or more processors and memory for storing executable instructions, wherein the one or more processors are configured to invoke the executable instructions stored by the memory to implement the steps in the method of constructing a predictive model of a microbial targeted sequencing positive threshold of any of claims 1-6.
10. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the steps in the method of constructing a predictive model of a microbial targeted sequencing positive threshold of any one of claims 1-6.
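The redundant-feature elimination of step (S2-2) in claim 4 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the sample values and contribution scores are invented, the function name `prune_features` and both cutoffs are hypothetical, and the sketch follows the common convention of treating a pair as redundant when its absolute Pearson correlation exceeds the cutoff. Only the feature names are taken from the claim.

```python
from math import sqrt
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def prune_features(columns, contribution, min_variance=1e-8, max_abs_corr=0.9):
    """Drop low-variance features, then one feature of each redundant pair.

    columns      -- dict: feature name -> list of values (one per sample)
    contribution -- dict: feature name -> importance score (assumed given,
                    e.g. from a random forest; hypothetical numbers here)
    """
    # Step 1: remove features whose variance is too low to separate samples.
    kept = [name for name, values in columns.items()
            if pvariance(values) > min_variance]

    # Step 2: within each redundant pair, drop the lower-contribution feature.
    # Assumption: "redundant" means |Pearson r| above max_abs_corr.
    dropped = set()
    for i, a in enumerate(kept):
        for b in kept[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if abs(pearson(columns[a], columns[b])) > max_abs_corr:
                dropped.add(a if contribution[a] < contribution[b] else b)
    return [name for name in kept if name not in dropped]


# Invented toy data for six samples; only the feature names come from claim 4.
columns = {
    "patho_reads": [10, 200, 35, 800, 5, 420],
    "patho_RPK": [1.0, 20.0, 3.5, 80.0, 0.5, 42.0],  # proportional to patho_reads
    "raw_Q30": [0.93] * 6,                           # constant across samples
    "amp_cov_ratio": [0.2, 0.9, 0.5, 0.7, 0.1, 0.4],
}
contribution = {"patho_reads": 0.30, "patho_RPK": 0.45,
                "raw_Q30": 0.01, "amp_cov_ratio": 0.25}

selected = prune_features(columns, contribution)
print(selected)
```

In this toy run, `raw_Q30` is removed for having zero variance, and `patho_reads` is removed because it is almost perfectly correlated with `patho_RPK` and has the lower assumed contribution score.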

Description

Prediction model of microbial targeted sequencing positive threshold value, construction method and application thereof

Technical Field

The invention belongs to the technical fields of molecular detection of pathogenic microorganisms and bioinformatics, and relates to a prediction model of a microbial targeted sequencing positive threshold value, a construction method and application thereof.

Background

Targeted sequencing (tNGS) is a next-generation sequencing technology that combines targeted pathogen enrichment with high-throughput sequencing. Nucleic acid in the sample to be tested is enriched either by ultra-multiplex polymerase chain reaction (Polymerase Chain Reaction, PCR) amplification, using a large number of primers directed at specific gene sequences, or by probe capture. This yields a large number of target nucleic acid fragments, which amplifies the target pathogen signal and reduces host interference. Bioinformatics analysis of the resulting sequences then enables highly sensitive, high-resolution identification. In microbial molecular detection, the judgment threshold is the core basis for distinguishing positive from negative results and directly determines the accuracy of the detection result.
Traditional threshold setting mostly relies on manual experience (for example, manually drawing a threshold line based on the number of pathogen sequences detected in a reference standard) or on fixed statistical rules (for example, taking 2-3 times the blank control signal as the threshold). These approaches have the following notable defects: (1) poor adaptability: signal sensitivity and background noise differ between detection batches (for example, different operators or different instruments), so applying a fixed threshold to the same test item on different instruments easily causes false positives or false negatives; (2) weak resistance to interference: when inhibitors are present in the sample (such as hemoglobin in blood or humic acid in soil), or when the amplification signal of a low-concentration target microorganism is weak, a traditional threshold cannot effectively distinguish a valid signal from an interfering one; and (3) strong subjectivity: a manually set threshold depends on operator experience, so different operators may judge the threshold differently for the same batch of data, and the repeatability of detection results is poor. As machine learning (Machine Learning, ML) technology has matured in the fields of data mining and pattern recognition, it has gained the ability to learn feature patterns from massive detection data and to adaptively adjust judgment rules.
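The batch sensitivity of a fixed statistical rule, noted in defect (1), can be illustrated with a small sketch. The read counts below are invented, and the helper names `fixed_threshold` and `call_positive` are hypothetical. The multiplier of 3 is one value from the "2-3 times the blank control signal" range mentioned in the text.

```python
from statistics import mean


def fixed_threshold(blank_signals, multiplier=3.0):
    """Fixed rule: threshold = multiplier x mean blank-control signal."""
    return multiplier * mean(blank_signals)


def call_positive(pathogen_reads, threshold):
    """A sample is called positive when its read count exceeds the threshold."""
    return pathogen_reads > threshold


# Invented blank-control read counts for one pathogen in two batches.
batch_a_blanks = [2, 3, 4]      # low-background run
batch_b_blanks = [10, 12, 14]   # noisier run (e.g. a different instrument)

# Threshold derived once from batch A and then reused everywhere.
threshold_a = fixed_threshold(batch_a_blanks)

# Batch A blanks stay negative, but every batch B blank is now called
# positive: the fixed threshold produces false positives on the noisier run.
print([call_positive(x, threshold_a) for x in batch_a_blanks])
print([call_positive(x, threshold_a) for x in batch_b_blanks])
```

Recomputing the threshold per batch would avoid this particular failure, but it still relies on a single hand-chosen multiplier, which is the limitation the learned threshold model is meant to address.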
According to differences in learning paradigm and model structure, machine learning models can be broadly divided into several families. Classical linear models such as logistic regression (Logistic Regression) are widely applied to binary classification problems because they are transparent and computationally efficient. Probabilistic Bayesian models, represented by naive Bayes (Naive Bayes), rely on a conditional independence assumption and perform well in areas such as text classification. Instance-based K-nearest neighbor algorithms (K-Nearest Neighbors, KNN) are conceptually intuitive and perform classification and regression by measuring similarity among samples. As the demand for modeling nonlinear problems grew, a series of more powerful models was proposed. The support vector machine (Support Vector Machine, SVM) uses the kernel trick to map a low-dimensional nonlinear problem into a high-dimensional feature space in which it is linearly separable, and performs well in small-sample classification. Decision trees and their ensemble learning frameworks further improve expressive power and generalization: random forest (Random Forest, RF) builds multiple decision trees through bootstrap sampling and reduces the risk of overfitting through a voting mechanism, while XGBoost (eXtreme Gradient Boosting), a representative gradient boosting algorithm, iteratively trains a series of weak learners (such as decision trees), each correcting the prediction error of the previous round, and has shown excellent predictive accuracy in many data science competitions. Furthermore, artificial neural networks (Artificial Neural Network, ANN), inspired by biological neural networks, are powerful tools for handling highly nonlinear relationships.
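As a concrete instance of the instance-based idea described above, the following is a minimal K-nearest neighbor classifier in plain Python. It is a sketch for illustration only: the two features (log10 pathogen reads and amplicon coverage ratio), the training points, and the function name `knn_predict` are assumptions, not data or code from the invention.

```python
from collections import Counter
from math import dist


def knn_predict(train_x, train_y, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    # Rank training samples by Euclidean distance to the query.
    ranked = sorted(zip(train_x, train_y), key=lambda pair: dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented training set: (log10 pathogen reads, amplicon coverage ratio).
train_x = [
    (0.5, 0.05), (0.8, 0.10), (1.0, 0.08),   # labelled negative
    (2.5, 0.70), (3.0, 0.85), (2.8, 0.60),   # labelled positive
]
train_y = ["negative"] * 3 + ["positive"] * 3

print(knn_predict(train_x, train_y, (2.6, 0.65)))
print(knn_predict(train_x, train_y, (0.7, 0.07)))
```

Because the vote depends directly on distances, features on very different scales would normally be standardized first. The toy values here were chosen to be on comparable scales so that step could be left out.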
The multi-layer perceptron (Multilayer Perceptron, MLP), which has a basic structure yet powerful capabilities, can approximate any complex continuous function to arbitrary precision, in combination with a nonlinear activation function, through a h