CN-119479814-B - Method for optimizing molecular marker algorithm based on machine learning and immune escape mechanism
Abstract
The invention discloses a method for optimizing a molecular marker algorithm based on machine learning and immune escape mechanisms. The method comprises: obtaining NGS sequencing data of tumor tissue and normal tissue; performing quality control and alignment on the NGS sequencing data of the tumor tissue and the paired normal tissue to obtain a raw alignment file; preprocessing the raw alignment file to obtain a final alignment file; detecting tumor somatic single-nucleotide variations (SNVs) and small insertions/deletions (INDELs) in the final alignment file with software to obtain somatic variant information; constructing a multiple linear regression model and calculating the neoantigen prediction capability of SNV sites and INDEL sites; and optimizing the tumor mutation burden (TMB) calculation method by combining two factors, the differing neoantigen prediction capabilities and the immune escape mechanism of the tumor patient. The invention reflects the actual immune state of the patient more faithfully and better indicates the patient's immunotherapy response and prognosis.
Inventors
- CHENG XUEYAN
- XU LIDI
- ZHANG JIAO
- HUANG YU
- CHEN WEIZHI
Assignees
- 无锡臻和生物科技有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20241126
Claims (4)
- 1. A method for optimizing a molecular marker algorithm based on machine learning and immune escape mechanisms, comprising the steps of:
  S1, acquiring NGS sequencing data of tumor tissue and normal tissue, wherein the NGS sequencing data comprise whole-genome sequencing data or whole-exome capture sequencing data based on second-generation sequencing;
  S2, performing quality control and alignment on the NGS sequencing data of the tumor tissue and the paired normal tissue to obtain a raw alignment file, and preprocessing the raw alignment file to obtain a final alignment file;
  S3, detecting tumor somatic single-nucleotide variations (SNVs) and small insertions/deletions (INDELs) in the final alignment file, and filtering the raw variant list obtained by the detection to obtain somatic variant information;
  S4, performing HLA typing on the NGS sequencing data of the normal tissue of the tumor patient to obtain a four-digit HLA typing result for the patient;
  S5, predicting tumor neoantigen peptides of the patient using the NetMHCpan deep learning algorithm based on the somatic variant information filtered in S3, and screening the tumor neoantigen peptides;
  S6, evaluating the HLA loss of heterozygosity (LOH) status of the tumor sample based on the final alignment file obtained in S2;
  S7, counting the numbers of SNV and INDEL mutations and the number of predicted neoantigen peptides in the TCGA database to construct a multiple linear regression model, and calculating the neoantigen prediction capability of SNV sites and INDEL sites;
  S8, optimizing the TMB calculation method by combining two factors, the differing neoantigen prediction capabilities and the immune escape mechanism of the tumor patient;
  wherein S7 specifically comprises the following steps:
  (1) obtaining TCGA mutation data, classifying the mutation data according to TCGA rules, filtering the mutation data to retain mutations whose FILTER column is PASS, then converting the maf file of the mutation data into a standard vcf file using vcf2maf software, and sorting the mutations in the vcf file by chromosomal position;
  (2) obtaining HLA class I typing data of the patients in the TCGA database;
  (3) after matching the mutation information of each sample with its HLA typing data, predicting and filtering the tumor neoantigen peptides of the sample;
  (4) counting the numbers of SNV and INDEL mutations of the patients in the TCGA database and the number of predicted tumor neoantigen peptides, constructing a multiple linear regression model, and calculating the neoantigen prediction capability of SNV sites and INDEL sites, wherein, in the model, the dependent variable is the number of neoantigens predicted for the patient samples included in the model; the two independent variables are the number of SNV sites and the number of INDEL sites used for neoantigen prediction in those samples; the two regression coefficients are the weight of the capability of SNVs to produce neoantigens and the weight of the capability of INDELs to produce neoantigens; and the remaining term is the intercept of the multiple linear regression model;
  wherein, in S8, optimizing the TMB calculation method by combining the two factors specifically comprises the following:
  the original TMB is calculated as the number of mutation sites finally detected in the patient, including SNVs and INDELs, divided by the size of the genomic region covered by the detection kit;
  combining the differing neoantigen prediction capabilities and the immune escape mechanism of the tumor patient, the TMB calculation method is optimized as follows:
  (1) combining the patient's list of predicted neoantigen peptides obtained in S5 with the HLA LOH status obtained in S6, calculating the ratio of the number of neoantigens bound by the lost HLA types to the number of neoantigens bound by all alleles of the patient, and taking this ratio as the neoantigen failure state caused by HLA LOH in the tumor patient;
  (2) based on the neoantigen prediction capabilities of SNVs and INDELs obtained by linear regression, counting the SNVs and INDELs of the tumor patient with different production weights to obtain a mutation-type-corrected TMB value;
  (3) combining the above two steps to correct the neoantigen estimation bias arising at two levels, immune escape and mutation type, wherein the calculation formula is expressed in terms of the following quantities: the TMB corrected on the basis of the immune escape mechanism; the corrected tumor mutation burden value; the number of high-quality neoantigen peptides, retained after screening, that bind the patient's HLA allele g; hlaloh, the list of HLA alleles judged in S6 to have undergone loss of heterozygosity; gi, an allele in the hlaloh list; total, the list of all HLA alleles of the patient obtained in S4; gj, an allele in the total list; the number of single-nucleotide mutations finally detected in the patient; the number of INDEL mutations finally detected in the patient; and the size of the genomic region covered by the detection kit;
  S8 further comprises optimizing the TMB calculation method based on the differing neoantigen prediction capabilities of SNVs and INDELs: based on the neoantigen prediction capabilities of SNVs and INDELs obtained by linear regression, the SNVs and INDELs of the tumor patient are counted with different production weights to obtain a mutation-type-corrected TMB value, wherein the calculation formula is expressed in terms of the number of single-nucleotide mutations finally detected in the patient, the number of INDEL mutations finally detected in the patient, and the size of the genomic region covered by the detection kit.
- 2. The method for optimizing a molecular marker algorithm based on machine learning and immune escape mechanisms according to claim 1, wherein, in S2, performing quality control and alignment on the NGS sequencing data of the tumor tissue and the paired normal tissue to obtain a raw alignment file, and preprocessing the raw alignment file to obtain a final alignment file, specifically comprises: performing quality control on the NGS sequencing data of the tumor tissue and the paired normal tissue, removing low-quality sequences, low-complexity sequences and adapter sequences from the NGS sequencing data using fastp software; then aligning the filtered clean reads to the human reference genome hg19 using BWA-MEM software to obtain the raw alignment file; and preprocessing the raw alignment file by (1) sorting with samtools software, (2) marking duplicate reads with Picard software, and (3) performing local realignment and base quality score recalibration with GATK software, to obtain the final alignment file.
- 3. The method for optimizing a molecular marker algorithm based on machine learning and immune escape mechanisms according to claim 1, wherein, in S3, detecting tumor somatic single-nucleotide variations and small insertions/deletions in the final alignment file, and filtering the raw variant list obtained by the detection to obtain somatic variant information, specifically comprises: detecting tumor somatic single-nucleotide variations and small insertions/deletions using MuTect software based on the final alignment file; and filtering the raw variant list obtained by the detection to obtain the somatic variant information, wherein the filtering steps comprise (1) retaining non-silent mutations, including missense mutations, nonsense mutations, frameshift mutations and splice-site mutations; (2) retaining high-confidence mutation sites, namely sites with a mapping quality greater than 30, a base sequencing quality greater than 20, no strand bias among the reads supporting the mutation, a site depth greater than 50X, and a mutant allele frequency greater than 0.03; and (3) removing mutation sites with a high mutation frequency, retaining only sites whose frequency in the TCGA database is less than 0.01.
- 4. The method for optimizing a molecular marker algorithm based on machine learning and immune escape mechanisms according to claim 1, wherein, in S5, predicting tumor neoantigen peptides of the patient using the NetMHCpan deep learning algorithm based on the somatic variant information filtered in S3, and screening the tumor neoantigen peptides, specifically comprises: based on the somatic variant information filtered in S3, predicting the tumor neoantigen peptides of the patient using the NetMHCpan deep learning algorithm; calculating, for each site, the binding-affinity inhibitory concentrations of the predicted mutant peptide and of the paired wild-type peptide; calculating the differential affinity fold change of each site from these inhibitory concentrations; and then screening the predicted tumor neoantigen peptides.
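The formulas referenced in claim 1 are not reproduced in this text, so the following Python sketch is only one plausible reading of steps S7 and S8, not the patented formula itself. It assumes (a) an ordinary least-squares fit of neoantigen count on SNV count and INDEL count with an intercept, (b) a mutation-type-corrected TMB equal to the weighted mutation count divided by the covered region in Mb, and (c) an immune-escape correction that scales that value by one minus the fraction of neoantigen binding attributable to lost HLA alleles. All function and parameter names are hypothetical, and forms (b) and (c) are assumptions inferred from the variable definitions in the claim.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def _solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a small dense linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (too few samples or collinear counts)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_neoantigen_weights(
    n_snv: Sequence[float], n_indel: Sequence[float], n_neo: Sequence[float]
) -> Tuple[float, float, float]:
    """Least-squares fit of n_neo ~ w_snv * n_snv + w_indel * n_indel + intercept."""
    rows = [[float(s), float(i), 1.0] for s, i in zip(n_snv, n_indel)]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[p] * r[q] for r in rows) for q in range(3)] for p in range(3)]
    xty = [sum(r[p] * float(y) for r, y in zip(rows, n_neo)) for p in range(3)]
    w_snv, w_indel, intercept = _solve(xtx, xty)
    return w_snv, w_indel, intercept


def original_tmb(n_snv: int, n_indel: int, panel_mb: float) -> float:
    """Original TMB: all detected mutations divided by the covered region size (Mb)."""
    return (n_snv + n_indel) / panel_mb


def mutation_type_corrected_tmb(
    n_snv: int, n_indel: int, w_snv: float, w_indel: float, panel_mb: float
) -> float:
    """ASSUMED form: weighted mutation count divided by the covered region size."""
    return (w_snv * n_snv + w_indel * n_indel) / panel_mb


def hla_loh_lost_fraction(
    binding_counts: Dict[str, int], loh_alleles: Iterable[str]
) -> float:
    """Share of neoantigen binding events that fall on alleles lost through HLA LOH."""
    total = sum(binding_counts.values())
    if total == 0:
        return 0.0
    lost = sum(binding_counts.get(allele, 0) for allele in loh_alleles)
    return lost / total


def escape_corrected_tmb(corrected_tmb: float, lost_fraction: float) -> float:
    """ASSUMED form: discount the corrected TMB by the lost-binding fraction."""
    return corrected_tmb * (1.0 - lost_fraction)


if __name__ == "__main__":
    # Toy cohort, not real TCGA data.
    w_snv, w_indel, intercept = fit_neoantigen_weights(
        [10, 20, 5, 8], [1, 3, 0, 4], [26, 56, 11, 37]
    )
    corrected = mutation_type_corrected_tmb(30, 4, w_snv, w_indel, panel_mb=2.0)
    lost = hla_loh_lost_fraction(
        {"A*02:01": 6, "A*24:02": 2, "B*07:02": 2}, ["A*24:02"]
    )
    print(round(corrected, 3), round(escape_corrected_tmb(corrected, lost), 3))
```

The regression is solved through the normal equations only to keep the sketch free of third-party dependencies; a statistics library would be the usual choice for a real cohort. Whether the patented formula normalizes the weights or treats the intercept differently cannot be determined from this text.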
Description
Method for optimizing molecular marker algorithm based on machine learning and immune escape mechanism
Technical Field
The invention relates to the field of molecular marker algorithm optimization, in particular to a method for optimizing a molecular marker algorithm based on machine learning and immune escape mechanisms.
Background
The study of tumor mutation burden (TMB) as a prognostic biomarker for immunotherapy of non-small cell lung cancer originated from a 2015 article in Science, which found that non-small cell lung cancer patients with TMB above the median had longer progression-free survival (PFS). Subsequently, several large studies, including CheckMate-026 and CheckMate-227, demonstrated that TMB can serve as an independent biomarker for immunotherapy. In 2020, pembrolizumab, a therapeutic drug targeting the immune checkpoint programmed cell death protein 1 (PD-1), was approved by the US FDA for the treatment of patients with solid tumors that have a high tumor mutation burden, and TMB has since been widely used as a biomarker for tumor immunotherapy. In theory, a higher mutation burden increases the likelihood of generating neoantigens, and hence the likelihood that the immune system recognizes and kills tumor cells, thereby affecting the efficacy of immunotherapy. The focus of existing TMB-related research and clinical detection products is therefore to detect patient mutations accurately by optimizing sequencing strategies and mutation detection algorithms, to count the detected high-quality mutations, and to evaluate each patient's tumor mutation burden in combination with the coverage of the detection product, so as to provide an appropriate immunotherapy regimen for tumor patients. The primary way in which a mutation-derived tumor neoantigen exerts an immune effect is through presentation of the neoantigen peptide by HLA class I molecules to CD8+ T cells, which elicits a downstream immune response.
However, in advanced stages of cancers such as lung cancer, a large number of tumor patients undergo loss of heterozygosity (LOH) at HLA genes; the lost HLA types can no longer bind the original mutant peptides, so not all predicted neoantigen peptides are presented to CD8+ T cells. Traditional methods consider only the number of neoantigens generated by mutations and do not consider the patient's HLA LOH status; building on the prior art, the biomarker can therefore be corrected on the basis of the immune escape mechanism.
Disclosure of Invention
The invention aims to provide a method for optimizing a molecular marker algorithm based on machine learning and immune escape mechanisms, which reflects the actual immune state of a patient more faithfully and better indicates the patient's immunotherapy response and prognosis.
In order to achieve the above object, the invention provides the following solution:
A method for optimizing a molecular marker algorithm based on machine learning and immune escape mechanisms, comprising the steps of:
S1, acquiring NGS sequencing data of tumor tissue and normal tissue, wherein the NGS sequencing data comprise whole-genome sequencing data or whole-exome capture sequencing data based on second-generation sequencing;
S2, performing quality control and alignment on the NGS sequencing data of the tumor tissue and the paired normal tissue to obtain a raw alignment file, and preprocessing the raw alignment file to obtain a final alignment file;
S3, detecting tumor somatic single-nucleotide variations (SNVs) and small insertions/deletions (INDELs) in the final alignment file, and filtering the raw variant list obtained by the detection to obtain somatic variant information;
S4, performing HLA typing on the NGS sequencing data of the normal tissue of the tumor patient to obtain a four-digit HLA typing result for the patient;
S5, predicting tumor neoantigen peptides of the patient using the NetMHCpan deep learning algorithm based on the somatic variant information filtered in S3, and screening the tumor neoantigen peptides;
S6, evaluating the HLA loss of heterozygosity (LOH) status of the tumor sample based on the final alignment file obtained in S2;
S7, counting the numbers of SNV and INDEL mutations and the number of predicted neoantigen peptides in the TCGA database to construct a multiple linear regression model, and calculating the neoantigen prediction capability of SNV sites and INDEL sites;
S8, optimizing the TMB calculation method by combining two factors, the differing neoantigen prediction capabilities and the immune escape mechanism of the tumor patient.
Preferably, in S2, quality control and comparison are performed on NGS sequencing data of a tumor tissue and a paired normal tissue to obtain an original comparison file, and preprocessing is performed on the original comparison file to obtain a final compar