CN-122024840-A - Baseline modeling method, MSI state detection method and storage medium
Abstract
The application relates to a baseline modeling method, an MSI state detection method and a storage medium. A baseline sample and target MSI sites are selected, and the baseline sample is subjected to second generation sequencing to obtain sequencing sequences. The sequencing sequences are grouped according to unique molecular tags, base correction is performed on the sequencing sequences in each group, and the optimal sequence of the target MSI site is screened out for each group. Multiple deletion lengths are extracted from the optimal sequences, and the deletion proportion of each deletion length is determined for each target MSI site. A probability mass function is determined according to these deletion proportions, and a detection statistic threshold is constructed based on the probability mass function to form a baseline model for MSI state detection. The method can adapt to different sequencing platforms and can be applied to both tissue and blood sample types.
Inventors
- Qiu Changliang
- Wang Jianqing
- Yang Shuang
- Luo Jiemin
- Zheng Fangke
- Zheng Limou
Assignees
- 厦门艾德生物医药科技股份有限公司
- 上海厦维医学检验实验室有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20260204
Claims (13)
- 1. A method of baseline modeling, the method comprising: selecting a baseline sample and a target MSI site, and performing second generation sequencing on the baseline sample to obtain sequencing sequences; grouping the sequencing sequences according to unique molecular tags, performing base correction on the sequencing sequences in each group, and screening out the optimal sequence of the target MSI site of each group; extracting a plurality of deletion lengths from the optimal sequence of the target MSI site of each group, and determining the deletion proportion of each deletion length of each target MSI site; and determining a probability mass function according to the deletion proportion of each deletion length of each target MSI site, and constructing a detection statistic threshold based on the probability mass function to form a baseline model for MSI state detection.
- 2. The method of claim 1, wherein grouping the sequencing sequences according to unique molecular tags and performing base correction on the sequencing sequences within each group comprises: grouping the sequencing sequences according to the unique molecular tags and sequencing start coordinates, and screening out effective groups in which the number of sequencing sequences meets a preset number threshold; and performing base correction on the sequencing sequences within each effective group.
- 3. The method of claim 2, wherein screening out the optimal sequence of the target MSI site of each group comprises: for each effective group, performing a multi-layer priority selection within the group to screen out the optimal sequence of the target MSI site of that group, wherein the multi-layer priority selection comprises progressive comparison within each group according to the priority order of a plurality of preset indicators.
- 4. The method of claim 3, wherein the plurality of preset indicators comprises a number of sequences, an alignment quality value, an average base quality value, and an alignment length, and wherein the multi-layer priority selection comprises: selecting the sequencing sequence with the largest number of sequences supporting the same repeat base length; when the numbers of sequences are the same, selecting the sequencing sequence with the higher alignment quality value; when the alignment quality values are the same, selecting the sequencing sequence with the higher average base quality value; and when the average base quality values are the same, selecting the sequencing sequence with the longer alignment length.
- 5. The method of claim 1, wherein determining a probability mass function according to the deletion proportion of each deletion length of each target MSI site, and constructing a detection statistic threshold based on the probability mass function, comprises: determining the average value of the deletion proportion of each deletion length of each target MSI site, taking the average value as a binomial distribution probability parameter, and determining the probability mass function; performing statistical processing on the probability mass function to obtain a statistic for each target MSI site; determining the mean and standard deviation of the statistic of each target MSI site in the baseline samples, and calculating the detection statistic of each baseline sample; and computing the mean and standard deviation of the detection statistics of the baseline samples, and determining the detection statistic threshold.
- 6. The method of claim 5, wherein performing statistical processing on the probability mass function to obtain a statistic for each target MSI site comprises: determining a cumulative distribution function according to the probability mass function and applying logarithmic processing to obtain a processing result; and accumulating the processing results of the different deletion lengths to determine the statistic for each target MSI site.
- 7. The method of claim 1, wherein extracting a plurality of deletion lengths from the optimal sequence of the target MSI site of each group, and determining the deletion proportion of each deletion length of each target MSI site, comprises: determining the repeat sequence length of the optimal sequence of the target MSI site of each group; comparing the repeat sequence length with a basic reference sequence, and classifying each optimal sequence to obtain multiple deletion lengths; and, for each target MSI site, calculating the proportion of deletion sequences of each deletion length among all the optimal sequences of that target MSI site to obtain the deletion proportion of each deletion length.
- 8. The method of claim 7, wherein comparing the repeat sequence length with a basic reference sequence and classifying each optimal sequence to obtain multiple deletion lengths comprises: comparing the repeat sequence length with the basic reference sequence, and classifying each optimal sequence as a normal sequence or a deletion sequence; and determining the deletion length according to the difference between the repeat sequence length of the deletion sequence and the repeat sequence length of the basic reference sequence, wherein the deletion length ranges from 1 to 9 bases.
- 9. The method according to claim 1, wherein the method further comprises: screening out, from the sequencing sequences obtained by the second generation sequencing, effective sequences that completely cover the target MSI site; and grouping the effective sequences according to the unique molecular tags, performing base correction on the effective sequences in each group, and screening out the optimal sequence of the target MSI site of each group.
- 10. A method of MSI status detection, the method comprising: extracting multiple deletion lengths for each MSI site of a sample to be tested, and determining the deletion proportion of each deletion length of each MSI site; determining a probability mass function according to the deletion proportion of each deletion length of each MSI site, and constructing an MSI detection statistic of the sample to be tested based on the probability mass function; and comparing the detection statistic threshold with the MSI detection statistic of the sample to be tested to determine the MSI state result of the sample to be tested.
- 11. The method according to claim 10, wherein the method further comprises: performing second generation sequencing on the sample to be tested and screening out effective sequences that completely cover a target MSI site; and grouping the effective sequences according to unique molecular tags, performing base correction on the effective sequences in each group, and screening out the optimal sequence of each target MSI site.
- 12. An electronic device, the device comprising: one or more processors; and a memory storing computer readable instructions that, when executed, cause the one or more processors to perform the operations of the method of baseline modeling of any one of claims 1 to 9, or the method of MSI status detection of claim 10 or 11.
- 13. A computer readable storage medium having stored thereon computer readable instructions executable by a processor to implement a method of baseline modeling according to any one of claims 1 to 9 or a method of MSI status detection according to claim 10 or 11.
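The grouping and multi-layer priority selection recited in claims 2 to 4 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the `Read` record, its field names, and the default `min_reads=2` threshold are hypothetical, and the base correction step is omitted because the claims do not specify how it is performed.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    """Hypothetical per-read record; all field names are illustrative."""
    umi: str            # unique molecular tag
    start: int          # sequencing start coordinate
    repeat_len: int     # repeat base length observed at the MSI site
    mapq: int           # alignment quality value
    mean_baseq: float   # average base quality value
    aln_len: int        # alignment length


def group_reads(reads, min_reads=2):
    """Group reads by (UMI, start) and keep groups meeting the size threshold.

    min_reads stands in for the 'preset number threshold'; its value here
    is an assumption.
    """
    groups = defaultdict(list)
    for r in reads:
        groups[(r.umi, r.start)].append(r)
    return {k: v for k, v in groups.items() if len(v) >= min_reads}


def pick_optimal(group):
    """Pick one read per group by comparing the four indicators in order."""
    support = defaultdict(int)
    for r in group:
        support[r.repeat_len] += 1
    # Tuples compare lexicographically, so each later field only breaks
    # ties left by the earlier ones.
    return max(
        group,
        key=lambda r: (support[r.repeat_len], r.mapq, r.mean_baseq, r.aln_len),
    )


reads = [
    Read("AAC", 100, 20, 60, 35.0, 150),
    Read("AAC", 100, 20, 60, 36.0, 150),
    Read("AAC", 100, 19, 60, 38.0, 150),
    Read("GGT", 200, 20, 60, 30.0, 150),  # singleton group, filtered out
]
optimal = {key: pick_optimal(g) for key, g in group_reads(reads).items()}
```

Encoding the priority order as a tuple key keeps the progressive comparison in one place; changing the order of indicators only requires reordering the tuple.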
Description
Baseline modeling method, MSI state detection method and storage medium

Technical Field

The application mainly relates to the technical field of biological detection, in particular to a baseline modeling method, an MSI state detection method and a storage medium.

Background

When mismatch repair gene function is defective (defective mismatch repair, dMMR), microsatellite DNA accumulates mismatches, insertions or deletions during replication owing to factors such as strand slippage, which alters the structure of the microsatellite; this phenomenon is called microsatellite instability (MSI). MSI is one of the important biomarkers for PD-1/PD-L1 inhibitor treatment and can independently predict the efficacy of immunotherapy. MSI-H/dMMR occurs in various solid tumors. The positive rate is higher in patients with endometrial cancer, colorectal cancer and gastric cancer, at 16%-33%, 6%-22% and 9%-22% respectively. The MSI-H/dMMR positive rate of other tumor types such as liver cancer, ampullary cancer, ovarian cancer, cervical cancer, esophageal adenocarcinoma, soft tissue tumor, head and neck cancer, kidney cancer and Ewing sarcoma is higher than 2%, and the positive rate of prostate cancer, lung cancer, breast cancer and the like is lower than 2%.

Current MSI detection based on second generation sequencing technology has the following problems:

(1) A panel of normals (PoN) is required for constructing baselines, but because different sequencing platforms have different degrees of sequencing error background, it is often necessary to construct different baselines for different sequencing platforms.

(2) Existing methods place certain requirements on the tumor content of the sample, are only suitable for detecting tissue samples, and cannot truly be applied to blood sample detection scenarios in which the tumor content is as low as 0.5%.

(3) Verification of accuracy typically focuses on only a single cancer type.

(4) Because hybrid capture data and amplicon sequencing data differ in nature, few second generation sequencing MSI detection algorithms can be used for both hybrid capture and amplicon sequencing data.

Disclosure of Invention

An object of the present application is to provide a baseline modeling method, an MSI state detection method and a storage medium, which solve the problem that the prior art cannot adapt to different sequencing platforms and cannot be applied to the two main sample types, tissue and blood, at the same time.

According to one aspect of the present application, there is provided a method of baseline modeling, the method comprising: selecting a baseline sample and a target MSI site, and performing second generation sequencing on the baseline sample to obtain sequencing sequences; grouping the sequencing sequences according to unique molecular tags, performing base correction on the sequencing sequences in each group, and screening out the optimal sequence of the target MSI site of each group; extracting a plurality of deletion lengths from the optimal sequence of the target MSI site of each group, and determining the deletion proportion of each deletion length of each target MSI site; and determining a probability mass function according to the deletion proportion of each deletion length of each target MSI site, and constructing a detection statistic threshold based on the probability mass function to form a baseline model for MSI state detection.
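The last two steps of the method above (deletion proportions, then a binomial-based statistic and threshold) can be sketched in Python as follows. The text does not give exact formulas, so several choices here are assumptions: the upper-tail probability P(X >= k) is used as the cumulative quantity, the logarithm is a negated base-10 logarithm, and the threshold is the baseline mean plus `k` standard deviations with a hypothetical default of `k=3.0`. All function names are illustrative, and the per-site normalization across baseline samples is left out for brevity.

```python
import math
from collections import Counter
from statistics import mean, stdev

MAX_DEL = 9  # deletion lengths of 1 to 9 bases, as stated in claim 8


def deletion_proportions(repeat_lens, ref_len):
    """Share of a site's optimal sequences showing each deletion length.

    Assumes repeat_lens is non-empty.
    """
    counts = Counter(ref_len - n for n in repeat_lens)
    total = len(repeat_lens)
    return {d: counts[d] / total for d in range(1, MAX_DEL + 1)}


def binom_pmf(i, n, p):
    """Binomial probability mass, computed in log space for large n."""
    if p <= 0.0:
        return 1.0 if i == 0 else 0.0
    if p >= 1.0:
        return 1.0 if i == n else 0.0
    log_c = math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
    return math.exp(log_c + i * math.log(p) + (n - i) * math.log1p(-p))


def binom_sf(k, n, p):
    """Upper tail P(X >= k) for X ~ Binomial(n, p)."""
    if k <= 0:
        return 1.0
    return min(1.0, sum(binom_pmf(i, n, p) for i in range(k, n + 1)))


def site_statistic(del_counts, n, baseline_p):
    """Sum of -log10 tail probabilities over deletion lengths (assumed form).

    del_counts maps deletion length to observed count, n is the number of
    optimal sequences at the site, and baseline_p maps deletion length to
    the mean deletion proportion in the baseline samples.
    """
    total = 0.0
    for d, p in baseline_p.items():
        tail = binom_sf(del_counts.get(d, 0), n, p)
        total += -math.log10(max(tail, 1e-300))  # floor avoids log10(0)
    return total


def detection_threshold(baseline_stats, k=3.0):
    """Mean plus k standard deviations of baseline statistics (k assumed)."""
    return mean(baseline_stats) + k * stdev(baseline_stats)
```

Under these assumptions, a sample to be tested would be flagged when its statistic exceeds `detection_threshold(...)` computed from the baseline samples. The direction of that comparison is also an assumption, since the text only states that the two values are compared.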
Optionally, grouping the sequencing sequences according to unique molecular tags and performing base correction on the sequencing sequences within each group comprises: grouping the sequencing sequences according to the unique molecular tags and sequencing start coordinates, and screening out effective groups in which the number of sequencing sequences meets a preset number threshold; and performing base correction on the sequencing sequences in each effective group.

Optionally, screening out the optimal sequence of the target MSI site of each group comprises: for each effective group, performing a multi-layer priority selection within the group to screen out the optimal sequence of the target MSI site of that group, wherein the multi-layer priority selection comprises progressive comparison within each group according to the priority order of a plurality of preset indicators.

Optionally, the plurality of preset indicators includes a number of sequences, an alignment quality value, an average base quality value, and an alignment length, and the multi-layer priority selection includes: selecting the sequencing sequence with the largest number of sequences supporting the same repeat base length; when the numbers of sequences are the same, selecting the sequencing sequence with the higher alignment quality value; when the alignment quality values are the same, selecting a sequencing sequen