CN-121545597-B - Method for predicting hepatocellular carcinoma prognosis and immune response based on multi-omics machine learning
Abstract
The invention discloses a method for predicting hepatocellular carcinoma prognosis and immune response based on multi-omics machine learning, in the technical field at the intersection of biomedicine and artificial intelligence. The method comprises: S1, constructing a multi-omics dataset for model training and validation; S2, performing feature integration and cluster analysis on the multi-omics dataset to obtain the corresponding distribution of hepatocellular carcinoma molecular subtypes; S3, constructing a hepatocellular carcinoma prognosis model based on Cox regression combined with a random survival forest, identifying 11 core immune genes and their weight coefficients by training the model, and constructing an immunotherapy response index, the IMLIRI score; and S4, applying the IMLIRI score clinically. Through multi-omics data integration and external validation in multiple cohorts, the invention reduces the influence of data bias and cohort heterogeneity on model performance, so that the method shows stable predictive performance in hepatocellular carcinoma cohorts of different origins and etiological backgrounds, improving the reliability and generalizability of the model in clinical application.
Inventors
- JIANG TAO
- GUAN JINGBO
- ZHANG YUJIE
- LI YINGLONG
- XU LIN
- ZHANG LINSHUAI
- LI CHEN
Assignees
- Chengdu University of Traditional Chinese Medicine (成都中医药大学)
Dates
- Publication Date
- 20260505
- Application Date
- 20260119
Claims (10)
- 1. A method for predicting prognosis and immune response of hepatocellular carcinoma based on multi-omics machine learning, comprising: S1, collecting multi-omics data related to hepatocellular carcinoma from The Cancer Genome Atlas (TCGA) database to form a TCGA-LIHC cohort, collecting external validation cohorts from the Gene Expression Omnibus (GEO) database, and constructing a dataset for model training and validation in combination with dedicated immunotherapy cohorts; S2, performing feature integration and cluster analysis on the multi-omics data to obtain the corresponding distribution of hepatocellular carcinoma molecular subtypes, and judging, based on that distribution, whether the TCGA-LIHC cohort exhibits data shift; S3, constructing a hepatocellular carcinoma prognosis model based on Cox regression combined with a random survival forest, identifying 11 core immune genes and their weight coefficients by training the model, and constructing the following immunotherapy response index, the IMLIRI score: IMLIRI = (-0.099 × EEF1B2 expression level) + (-0.308 × FHL3 expression level) + (-0.087 × MMP1 expression level) + (-0.128 × MAPK7 expression level) + (-0.259 × MPHOSPH6 expression level) + (-0.248 × NR0B1 expression level) + (0.190 × RLF expression level) + (0.203 × RGS7 expression level) + (0.434 × GSTM1 expression level) + (0.341 × PIK3IP1 expression level) + (0.273 × PPP1R1A expression level); S4, performing multi-dimensional validation of the IMLIRI score given by the hepatocellular carcinoma prognosis model using the external validation cohorts and the dedicated immunotherapy cohorts; wherein in S1, the multi-omics data comprise mRNA abundance quantified in TPM, normalized long non-coding RNA levels, log2-transformed microRNA data, DNA methylation data from Illumina Human Methylation arrays, somatic mutation profiles, survival time, vital status, tumor-node-metastasis (TNM) stage, and treatment regimen; when the hepatocellular carcinoma prognosis model is trained, the screening result after each iteration is evaluated using a three-level evaluation criterion: in the primary evaluation, the average concordance index from 5-fold cross-validation in the TCGA-LIHC cohort, C-index I, is taken as the immediate optimization target for the screening result after each iteration; the primary evaluation is judged to pass when the average C-index I is not lower than a preset performance threshold C1 and either its decrease relative to the previous iteration does not exceed a preset minimum performance-reduction threshold δ1 or its improvement relative to the previous iteration does not exceed a preset minimum performance-improvement threshold δ2; otherwise, the model-structure hyperparameters are adaptively adjusted and the next iteration begins; in the secondary evaluation, generalization ability is assessed by the average concordance index in the five independent GEO validation cohorts, C-index II; the secondary evaluation is judged to pass when C-index II is not lower than a preset generalization threshold C2 and the performance fluctuation among the validation cohorts does not exceed a preset allowable range; otherwise, model training is performed again; in the tertiary evaluation, the calibrated cross-model consistency index is checked; the tertiary evaluation is judged to pass when this consistency index reaches or exceeds a preset consistency threshold C3, whereupon iteration terminates and the final core immune gene set is output; otherwise, joint optimization continues by alternating iteration.
- 2. The method for predicting prognosis and immune response of hepatocellular carcinoma based on multi-omics machine learning of claim 1, wherein the external validation cohorts comprise GSE76427, GSE15654, GSE10143, GSE14520 and GSE116174 downloaded from GEO, the data items in the external validation cohorts are consistent with the content of the multi-omics data, and the external validation cohorts are integrated into a META cohort after batch effect correction; the dedicated immunotherapy cohorts comprise the IMvigor-210 cohort and GSE78220, GSE135222 and GSE91061 downloaded from GEO, wherein the IMvigor-210 cohort consists of hepatocellular carcinoma patients treated with a programmed death ligand 1 inhibitor, and the GSE78220, GSE135222 and GSE91061 cohorts contain data from patients treated with immune checkpoint inhibitors; the inclusion criteria of the TCGA-LIHC cohort and the external validation cohorts comprise a pathologically confirmed diagnosis of hepatocellular carcinoma, complete survival time and vital status information, and at least mRNA abundance expression data, and the exclusion criteria cover patients with other concurrent malignant tumors and patients diagnosed only by autopsy or death certificate.
- 3. The method for predicting prognosis and immune response of hepatocellular carcinoma based on multi-omics machine learning of claim 2, wherein in S1, the batch effect correction removes batch effects from the mRNA abundance expression data of the external validation cohorts, with cohort source as the batch variable and survival status as the covariate; wherein the correction of the mRNA abundance expression data comprises: for numerical variables such as mRNA abundance, filling missing values with the median of the variable in the corresponding cohort; for categorical variables such as tumor stage, filling missing values with the most frequent category; and directly removing variables with a missing rate of more than 20%.
- 4. The method for predicting prognosis and immune response of hepatocellular carcinoma based on multi-omics machine learning of claim 1, wherein in S2, the distribution of hepatocellular carcinoma molecular subtypes is obtained as follows: S20, screening the data in the TCGA-LIHC cohort so as to select immune-related gene features from the mRNA expression data, select prognostic features I, being the top-m features by expression variation, from the miRNA, lncRNA and methylation data, and select prognostic features II, being the top-n features by mutation frequency, from the somatic mutation data, and combining the gene features, prognostic features I and prognostic features II to obtain a multi-omics integrated feature set; S21, performing integrative cluster analysis on the multi-omics integrated feature set using multiple unsupervised clustering algorithms, determining cancer subtype I and cancer subtype II by means of the cluster prediction index and the gap statistic, and determining the final cancer subtype by majority voting over the clustering results of the multiple unsupervised clustering algorithms; S22, evaluating data shift in the TCGA-LIHC cohort through subtype validation and characterization analysis; wherein, in the subtype validation, subtyping reliability is assessed by nearest template prediction and the partitioning around medoids (PAM) algorithm, and the subtyping is deemed reliable when all Kappa coefficients are greater than 0.6; differences in molecular distribution between subtypes are shown by dimensionality-reduction visualization using principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP); in the characterization analysis, differences in pathway enrichment between subtypes are determined by gene set variation analysis (GSVA), wherein cancer subtype I is enriched in metabolic pathways and cancer subtype II is enriched in proliferation-related pathways; the characterization analysis further comprises analyzing the tumor microenvironment by calculating immune cell infiltration scores and immune signature scores, and determining that cancer subtype I is an immune-activated type and cancer subtype II is an immune-suppressed type.
- 5. The method for predicting prognosis and immune response of hepatocellular carcinoma based on multi-omics machine learning of claim 1, wherein in S3, the hepatocellular carcinoma prognosis model is the optimal algorithm combination, Cox regression combined with a random survival forest, selected from 303 algorithm combinations generated as base algorithm + integration strategy; the integration strategy comprises a primary evaluation metric centered on the average concordance index across the training cohort and the validation cohorts, and auxiliary metrics covering the area under the curve and the log-rank test p-value.
- 6. The method for predicting prognosis and immune response of hepatocellular carcinoma based on multi-omics machine learning of claim 1, wherein in S3, the 11 core immune genes are obtained as follows: S30, performing univariate Cox regression in the TCGA-LIHC cohort and in the external validation cohorts respectively, to select a candidate immune-related gene set whose genes are associated with prognosis and have consistent hazard ratio directions in all cohorts; S31, inputting the candidate immune-related gene set into a joint training framework of Cox regression combined with a random survival forest, and using alternating iteration together with adaptive adjustment of the model-structure hyperparameters to build a closed optimization loop of feature selection, hyperparameter optimization and survival prediction, which drives the candidate immune-related gene set to converge, so as to obtain the 11 core immune genes after convergence.
- 7. The method for predicting prognosis and immune response of hepatocellular carcinoma based on multi-omics machine learning of claim 6, wherein the adaptive adjustment of the model-structure hyperparameters is performed by: setting wide initial intervals for the key hyperparameters of the two types of models, performing a first round of coarse global search, and recording performance with the 5-fold cross-validation concordance index C-index I as the objective function; ranking the first-round results by C-index I, extracting the high-potential clusters within several top percentiles, fitting a parameter-performance response surface by locally weighted regression, and combining density clustering to locate the region where performance is concentrated as the potential region; and performing adaptive shrinkage optimization of the hyperparameters by automatically shrinking the parameter boundaries to the potential region and carrying out a high-resolution dense local search, dynamically updating the center and range of the potential region according to the latest performance in each round until the convergence conditions are met.
- 8. The method for predicting prognosis and immune response of hepatocellular carcinoma based on multi-omics machine learning of claim 6, wherein the alternating iteration comprises, in each round, extracting feature importance indices from the Cox regression and from the random survival forest respectively and calculating a comprehensive contribution degree; when the marginal contribution of any gene to the concordance index C-index during cross-validation is positive and exceeds a preset minimum contribution threshold, the sampling weight or retention probability of the corresponding gene is increased in the next round; otherwise, the gene is gradually down-weighted or eliminated; the importance indices are regression coefficient stability and selection frequency in the Cox regression algorithm, and split gain or variable importance in the random survival forest algorithm.
- 9. The method for predicting prognosis and immune response of hepatocellular carcinoma based on multi-omics machine learning of claim 1, wherein in S3, the optimal cut-off value of IMLIRI is determined based on the overall survival of the TCGA-LIHC cohort, and the patients are divided into a low-IMLIRI group and a high-IMLIRI group.
- 10. The method for predicting prognosis and immune response of hepatocellular carcinoma based on multi-omics machine learning of claim 1, wherein in S4, the multi-dimensional validation comprises: dimension I, comparing the concordance index of the IMLIRI score given by the hepatocellular carcinoma prognosis model with those of clinical parameters and existing models; and dimension II, dividing patients into a low-risk group and a high-risk group according to the IMLIRI score given by the hepatocellular carcinoma prognosis model, and analyzing and validating the two groups with respect to immunotherapy response probability, immune cell exclusion and myeloid-derived suppressor cell infiltration, using the Tracking Tumor Immunophenotype (TIP) algorithm and the Tumor Immune Dysfunction and Exclusion (TIDE) framework.
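For illustration only, and not as part of the claims, the weighted sum recited in claim 1 and the two-group split of claim 9 could be sketched as below. The coefficients are copied from claim 1; the function names, the dictionary input format, the `cutoff` parameter and the choice of a strict `>` comparison are assumptions, since the patent text shown here gives no numeric cut-off value and no tie rule.

```python
from typing import Dict

# Weight coefficients as listed in claim 1 (gene symbol -> coefficient).
# NR0B1 is the HGNC symbol; some renderings of the claim spell it "NROB1".
IMLIRI_WEIGHTS: Dict[str, float] = {
    "EEF1B2": -0.099,
    "FHL3": -0.308,
    "MMP1": -0.087,
    "MAPK7": -0.128,
    "MPHOSPH6": -0.259,
    "NR0B1": -0.248,
    "RLF": 0.190,
    "RGS7": 0.203,
    "GSTM1": 0.434,
    "PIK3IP1": 0.341,
    "PPP1R1A": 0.273,
}


def imliri_score(expression: Dict[str, float]) -> float:
    """Return the weighted sum of the 11 gene expression levels for one patient."""
    missing = [gene for gene in IMLIRI_WEIGHTS if gene not in expression]
    if missing:
        raise KeyError(f"missing expression values for: {missing}")
    return sum(weight * expression[gene] for gene, weight in IMLIRI_WEIGHTS.items())


def assign_group(score: float, cutoff: float) -> str:
    """Split patients into low/high IMLIRI groups.

    `cutoff` is a hypothetical input: claim 9 derives it from overall survival
    in the TCGA-LIHC cohort but does not state its value. Using `>` rather
    than `>=` is likewise an assumption.
    """
    return "high" if score > cutoff else "low"
```

A caller would pass one patient's expression levels, in whatever normalization the trained model used, to `imliri_score` and then compare the result against the cohort-derived cut-off with `assign_group`.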
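The missing-value rules of claim 3 (median for numerical variables, most frequent category for categorical variables, removal above a 20% missing rate) can be illustrated with a minimal standard-library sketch. The table layout (variable name mapped to a list of per-patient values, with `None` marking a missing value) and the function name `impute_cohort` are assumptions made for this example; the batch effect correction step itself is not shown.

```python
from collections import Counter
from statistics import median
from typing import Dict, List, Union

Value = Union[float, str, None]
MAX_MISSING_RATE = 0.20  # claim 3: drop variables with more than 20% missing


def impute_cohort(table: Dict[str, List[Value]]) -> Dict[str, List[Value]]:
    """Impute one cohort; `table` maps variable name -> per-patient values."""
    cleaned: Dict[str, List[Value]] = {}
    for name, values in table.items():
        if not values:
            continue  # nothing to impute for an empty variable
        observed = [v for v in values if v is not None]
        missing_rate = (len(values) - len(observed)) / len(values)
        if missing_rate > MAX_MISSING_RATE:
            continue  # variable removed entirely
        if all(isinstance(v, (int, float)) for v in observed):
            fill: Value = median(observed)  # numerical: within-cohort median
        else:
            fill = Counter(observed).most_common(1)[0][0]  # categorical: mode
        cleaned[name] = [fill if v is None else v for v in values]
    return cleaned
```

Because the median and mode are computed per call, running the function once per cohort keeps the imputation within the corresponding cohort, as the claim requires.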
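As a rough illustration of two steps in claim 4, the sketch below shows majority voting across clustering results (S21) and Cohen's Kappa as an agreement measure for the reliability check (S22, threshold 0.6). It assumes that the labels produced by the different algorithms have already been aligned to the same subtype names and that an odd number of algorithms is used, so that ties cannot occur with two subtypes; the patent text does not say how labels are aligned or how ties are broken, and it does not state which Kappa variant is used.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(label_sets: Sequence[Sequence[str]]) -> List[str]:
    """Return the per-sample consensus subtype across clustering algorithms."""
    n_samples = len(label_sets[0])
    if any(len(labels) != n_samples for labels in label_sets):
        raise ValueError("every algorithm must label the same samples")
    consensus = []
    for sample_votes in zip(*label_sets):
        # Most frequent label among the algorithms for this sample.
        consensus.append(Counter(sample_votes).most_common(1)[0][0])
    return consensus


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's Kappa between two labelings of the same samples."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("labelings must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Agreement expected by chance from the marginal label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelings use a single identical label
    return (observed - expected) / (1.0 - expected)
```

Under these assumptions, a subtyping would be treated as reliable when `cohen_kappa` between the consensus labels and each validation labeling exceeds 0.6.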
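The evaluation criterion in claim 1 relies on the concordance index (C-index). A minimal sketch of Harrell's C-index, followed by a gate resembling the secondary evaluation, is given below. The patent does not state which C-index estimator is used or how "performance fluctuation" is measured; this example assumes Harrell's estimator without special handling of tied survival times, measures fluctuation as the maximum minus the minimum across cohorts, and treats the thresholds `c2` and `max_spread` as hypothetical inputs.

```python
from typing import Sequence


def concordance_index(times: Sequence[float],
                      events: Sequence[int],
                      risks: Sequence[float]) -> float:
    """Harrell's C-index: higher risk should go with shorter survival."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had the event strictly
            # before patient j's observed time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def passes_secondary_evaluation(c_indices: Sequence[float],
                                c2: float,
                                max_spread: float) -> bool:
    """Mean C-index across validation cohorts >= c2 and spread within range.

    Both thresholds are hypothetical; claim 1 only calls them "preset".
    """
    mean_c = sum(c_indices) / len(c_indices)
    spread = max(c_indices) - min(c_indices)
    return mean_c >= c2 and spread <= max_spread
```

In this sketch, `c_indices` would hold one `concordance_index` value per GEO validation cohort.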
Description
Method for predicting hepatocellular carcinoma prognosis and immune response based on multi-omics machine learning
Technical Field
The invention relates to the technical field at the intersection of biomedicine and artificial intelligence. More particularly, the invention relates to a method for predicting prognosis and immune response of hepatocellular carcinoma based on multi-omics machine learning, for use in scenarios such as clinical decision support for tumor diagnosis and treatment, research on tumor molecular mechanisms, and the development of precision medicine regimens.
Background
Hepatocellular carcinoma is a malignant tumor that ranks sixth in incidence and third in mortality worldwide, and is the most important pathological subtype of liver cancer (accounting for 75%-85%). Treatment options include surgery, interventional therapy, radiotherapy, chemotherapy, immunotherapy and targeted therapy, but most patients are already at an intermediate or advanced stage when diagnosed and have missed the opportunity for radical surgery, so immune checkpoint inhibitors have become a key treatment option for these patients. In practice, immune checkpoint inhibitors reverse T-cell exhaustion and restore the anti-tumor immune response by blocking immune checkpoint pathways such as programmed death receptor 1/programmed death ligand 1 (PD-1/PD-L1). For example, compared with sorafenib, the median overall survival of patients is prolonged from 13.4 months to 19.2 months and the objective response rate is improved from 5% to 30%; the regimen combining durvalumab with bevacizumab has also become a standard first-line therapy for unresectable liver cancer, with overall survival superior to that of sorafenib.
However, clinical data show that only about 30% of hepatocellular carcinoma patients respond to immune checkpoint inhibitors and only 20%-30% achieve long-term survival, and treatment response is highly heterogeneous among patients at a similar clinical stage. This highlights the limitation of relying purely on clinical stage to guide immunotherapy, and accurate biomarkers and prediction tools are urgently needed. Existing techniques for predicting hepatocellular carcinoma prognosis and treatment response suffer from a single data dimension, insufficient algorithm robustness, incomplete functional coverage and limited clinical practicality, specifically:
1. Most existing hepatocellular carcinoma prediction models depend on single-omics data (for example, only the mRNA expression profile) and ignore key molecular features such as DNA methylation, copy number variation and somatic mutation. They cannot comprehensively reflect tumor heterogeneity and the regulatory mechanisms of the immune microenvironment, and in external validation cohorts their concordance index falls below 0.5. For example, some gene-expression-based prediction models focus only on genes related to immune cell infiltration and do not incorporate epigenetic or mutation-level information, so they have difficulty explaining why patients with the same gene expression pattern respond differently.
2. Most existing hepatocellular carcinoma prediction models adopt a single machine learning algorithm with weak generalization ability (such as logistic regression or a single random forest) and do not account for the complexity of high-dimensional multi-omics data. They are prone to overfitting and susceptible to data bias, and perform poorly when validated in multi-center cohorts. For example, the concordance index of some models can reach 0.6 in the training cohort but drops sharply to below 0.5 in external validation cohorts, so their generalization ability is weak and they cannot meet the requirements of multi-center clinical application.
3. Existing hepatocellular carcinoma prediction models are functionally fragmented and cannot deliver prognosis evaluation and immunotherapy response prediction at the same time. They focus on a single endpoint, either prognosis or treatment response, and lack an integrated "prognosis-treatment response" evaluation capability. For example, some models can predict patient survival but cannot determine whether the patient is suitable for immunotherapy, while some immune response prediction models are not linked to long-term survival outcomes, so clinicians must use several tools together, which is inefficient. Some models depend on gene testing or tissue biopsy, which increases medical costs and patient burden, lack intuitive clinical tools (such as nomograms), and are difficult to popularize because doctors can interpret the results only if they have professional bioinformati