CN-122024856-A - Multi-omics cancer subtype identification method, system, device and medium based on density-aware cluster-structure-guided contrastive learning

Abstract

A multi-omics cancer subtype identification method, system, device and medium based on density-aware cluster-structure-guided contrastive learning, belonging to the intersection of bioinformatics and artificial intelligence. The method comprises the following steps: multi-omics data acquisition and preprocessing; construction of omics-specific autoencoders and learning of latent representations; construction of density-aware cluster blocks in the latent space; construction of cross-omics positive and negative sample pairs based on the sample overlap of cluster blocks; hard negative sample mining; construction of a cluster-block-level cross-omics contrastive learning objective with training updates; introduction of a self-supervised soft refinement mechanism that dynamically strengthens the cluster structure; and multi-loss joint optimization with iterative model training. The method can improve the robustness and stability of cancer subtype identification results, can efficiently model and analyze high-dimensional, multi-source, noisy multi-omics data, has good generalization capability and application potential, and can provide reliable technical support for cancer typing research, patient stratification analysis, and precision-medicine decision support.

Inventors

ZHU SHUWEI
WEI JIAXIN
YI JUNFENG
Yang Shuaihao
Sheng Zifan
LU HENGYANG
FANG WEI

Assignees

Jiangnan University (江南大学)

Dates

Publication Date: 20260512
Application Date: 20260323

Claims (9)

1. A multi-omics cancer subtype identification method based on density-aware cluster-structure-guided contrastive learning, characterized by comprising the following steps:
Step 1, multi-omics data acquisition and preprocessing: acquiring multiple omics data sets for the same batch of patient samples to form the multi-omics matrices $\{X^{(v)}\}_{v=1}^{V}$ with $X^{(v)} \in \mathbb{R}^{N \times d_v}$, where $N$ is the number of samples, $V$ is the number of omics, $X^{(v)}$ is the data matrix of the $v$-th omics, and $d_v$ is the feature dimension of the $v$-th omics; performing missing-value imputation and normalization on each omics so that the omics features are scaled to a common interval; aligning the sample index IDs; removing samples without survival information and duplicate samples; and outputting the aligned multi-omics matrices;
Step 2, constructing an omics-specific autoencoder for each omics and learning latent representations: obtaining, for each sample, a latent representation and a reconstructed input, $z_i^{(v)} = f_v(x_i^{(v)})$ and $\hat{x}_i^{(v)} = g_v(z_i^{(v)})$, where $z_i^{(v)}$ is the latent representation vector obtained by mapping the $i$-th sample of the $v$-th omics through the encoder $f_v$, $x_i^{(v)}$ is the original input feature vector of the $i$-th sample of the $v$-th omics, and $\hat{x}_i^{(v)}$ is the feature vector of the $i$-th sample of the $v$-th omics reconstructed by the decoder $g_v$ from the latent representation vector $z_i^{(v)}$; defining a multi-omics reconstruction loss $\mathcal{L}_{rec}$; and outputting the set of latent representations $Z^{(v)}$ of each omics;
Step 3, constructing density-aware cluster blocks in the latent space: performing density clustering on the latent representations $Z^{(v)}$ of each omics to obtain sample cluster labels, in which one label value marks noise points and $K_v$ denotes the number of valid cluster blocks obtained by density clustering of the $v$-th omics; computing, for each valid cluster block, its center vector $\mu_k^{(v)}$ and radius $r_k^{(v)}$, where $n_k^{(v)}$ is the number of samples contained in the $k$-th valid cluster block of the $v$-th omics, $i$ is the sample index, and $k$ is the cluster block index; constructing the cross-omics cluster block set $\mathcal{C} = \{\mathcal{C}^{(v)}\}_{v=1}^{V}$, where $\mathcal{C}^{(v)}$ is the set of valid cluster blocks obtained by density clustering in the latent representation space of the $v$-th omics; and outputting the cross-omics cluster block set;
Step 4, constructing cross-omics positive and negative sample pairs based on cluster block sample overlap;
Step 5, hard negative sample mining;
Step 6, constructing a cluster-block-level cross-omics contrastive learning objective and performing training updates;
Step 7, introducing a self-supervised soft refinement mechanism to dynamically strengthen the cluster structure;
Step 8, multi-loss joint optimization and iterative model training: performing a weighted summation of the multi-omics reconstruction loss, the contrastive loss, and the soft refinement loss to obtain the total objective; updating the encoder and decoder parameters of all omics by backpropagation with the Adam optimizer; training iteratively up to the maximum number of epochs; and outputting the trained omics encoders;
Step 9, constructing a unified representation and outputting cancer subtype labels: concatenating the latent representations of all omics to obtain a unified representation $Z$; clustering $Z$ and outputting the final subtype labels; performing survival analysis and clinical-label enrichment tests based on the subtype labels to verify the clinical significance of the subtypes; and outputting the cancer subtype results together with optional evaluation indices and visualization results.
2. The multi-omics cancer subtype identification method based on density-aware cluster-structure-guided contrastive learning according to claim 1, wherein constructing cross-omics positive and negative sample pairs based on cluster block sample overlap in step 4 comprises:
Step 4.1, for any valid cluster block $C_k^{(v)}$, recording the set $S_k^{(v)}$ of sample index IDs it contains;
Step 4.2, for any two different omics $v \neq u$, taking valid cluster blocks $C_k^{(v)}$ and $C_l^{(u)}$ respectively and computing the sample overlap rate of the two;
Step 4.3, presetting an overlap threshold; when the overlap rate reaches the threshold, judging $(C_k^{(v)}, C_l^{(u)})$ to be a cross-omics positive sample pair and adding it to the positive pair set $\mathcal{P}$;
Step 4.4, when the overlap rate is below the threshold, taking $(C_k^{(v)}, C_l^{(u)})$ as a candidate negative sample pair and adding it to the candidate negative pair set $\mathcal{N}$;
Step 4.5, if density clustering of an omics produces no valid cluster block, treating all samples of that omics as a single cluster block that takes part in the overlap computation of steps 4.2 to 4.4, so that the construction of positive and negative sample pairs can always be executed.
3. The multi-omics cancer subtype identification method based on density-aware cluster-structure-guided contrastive learning according to claim 2, wherein the hard negative sample mining in step 5 comprises:
Step 5.1, defining a cosine similarity measure between cluster block center vectors to quantify the representation similarity between cluster blocks of different omics;
Step 5.2, for each anchor cluster block, traversing its candidate negative pair set and computing the cosine similarity between the center of the anchor cluster block and the center of each candidate negative cluster block;
Step 5.3, sorting the candidate negative cluster blocks by cosine similarity from high to low and selecting the Top-$k$ candidate negative cluster blocks with the highest cosine similarity to form the hard negative sample set $\mathcal{H}$;
Step 5.4, to avoid the influence of manually set hyperparameters, setting the Top-$k$ value to the statistical median of the numbers of candidate negative cluster blocks in the candidate negative pair sets, or to an integer not exceeding a preset upper limit, thereby determining the number of hard negative samples adaptively.
4. The multi-omics cancer subtype identification method based on density-aware cluster-structure-guided contrastive learning according to claim 3, wherein constructing the cluster-block-level cross-omics contrastive learning objective and performing training updates in step 6 comprises:
Step 6.1, presetting a temperature coefficient $\tau$, applying temperature scaling to the cosine similarity between cluster block center vectors, and constructing the positive pair score and the negative pair scores: $s^{+} = \cos(\mu_a^{(v)}, \mu_{+}^{(u)}) / \tau$ and $s_j^{-} = \cos(\mu_a^{(v)}, \mu_j^{(u)}) / \tau$, where $\cos(\mu_a^{(v)}, \mu_{+}^{(u)})$ is the cosine similarity between the anchor cluster block center vector $\mu_a^{(v)}$ in the $v$-th omics and the positive sample cluster block center vector $\mu_{+}^{(u)}$ in the $u$-th omics, $\cos(\mu_a^{(v)}, \mu_j^{(u)})$ is the cosine similarity between the anchor cluster block center vector in the $v$-th omics and the hard negative sample cluster block center vector $\mu_j^{(u)}$ in the $u$-th omics, $\mu_{+}^{(u)}$ is the center vector of the cluster block in the $u$-th omics that forms a positive sample pair with the anchor cluster block, and $\mu_j^{(u)}$ is the $j$-th cluster block center vector in the $u$-th omics belonging to the hard negative sample set $\mathcal{H}$;
Step 6.2, using the positive pair set $\mathcal{P}$ obtained in step 4 as the supervision signal and the hard negative sample set $\mathcal{H}$ obtained in step 5 as the negative sample set, constructing the cluster-block-level InfoNCE contrastive loss: $\mathcal{L}_{con} = -\log \frac{\exp(s^{+})}{\exp(s^{+}) + \sum_{j \in \mathcal{H}} \exp(s_j^{-})}$, where $\exp(s^{+})$ is the exponential similarity between the anchor cluster block and the positive sample cluster block after temperature scaling, $\exp(s_j^{-})$ is the exponential similarity between the anchor cluster block and a hard negative sample cluster block after temperature scaling, and $j$ indexes the cluster blocks in the set $\mathcal{H}$;
Step 6.3, updating the parameters of the omics-specific autoencoders by gradient descent, so that the cosine similarity between the cluster block center vectors of cross-omics positive sample pairs increases and the cosine similarity of hard negative sample pairs decreases, thereby strengthening cross-omics semantic alignment and boundary discrimination.
5. The multi-omics cancer subtype identification method based on density-aware cluster-structure-guided contrastive learning according to claim 4, wherein the self-supervised soft refinement mechanism in step 7 comprises:
Step 7.1, denoting the cluster block centers within an omics as $\mu_k^{(v)}$ and constructing the soft assignment probability $q$ based on the Student-$t$ distribution, where $\alpha$ is the degrees-of-freedom parameter of the Student-$t$ distribution and the cluster block indices denote, respectively, the current cluster block, the cluster block for which the soft assignment probability with respect to the current cluster block is computed, and the cluster block variable of the normalizing sum;
Step 7.2, constructing a target distribution $p$ from the soft assignment probability $q$ and strengthening the cluster structure by amplifying high-confidence assignments, where the cluster block index variable of the normalizing sum traverses all cluster blocks of the $v$-th omics;
Step 7.3, computing the KL divergence between the target distribution $p$ and the current distribution $q$ to obtain the soft refinement loss $\mathcal{L}_{soft}$;
Step 7.4, using the soft refinement loss in network training so that the cluster block distribution within each omics becomes more concentrated, improving intra-cluster compactness and inter-cluster separability and thereby the stability and robustness of the subsequent clustering output.
6. The multi-omics cancer subtype identification method based on density-aware cluster-structure-guided contrastive learning according to claim 5, wherein the multi-loss joint optimization in step 8 comprises: forming the total objective $\mathcal{L} = \lambda_1 \mathcal{L}_{rec} + \lambda_2 \mathcal{L}_{con} + \lambda_3 \mathcal{L}_{soft}$, where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weights of the respective loss functions; and using the Adam optimizer to update the encoder and decoder parameters of all omics by backpropagation.
7. A multi-omics cancer subtype identification system based on density-aware cluster-structure-guided contrastive learning, comprising: a data acquisition and preprocessing module, a representation learning module, a cluster block construction module, a positive and negative sample pair construction module, a hard negative sample mining module, a contrastive learning training module, a soft refinement optimization module, a multi-loss joint optimization module, and a subtype output module, wherein each module is configured to perform the method steps of any one of claims 1 to 6.
8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the multi-omics cancer subtype identification method based on density-aware cluster-structure-guided contrastive learning according to any one of claims 1 to 6.
9. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the multi-omics cancer subtype identification method based on density-aware cluster-structure-guided contrastive learning according to any one of claims 1 to 6.
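
As a non-limiting illustration of steps 4 to 6 (claims 2 to 4), the following Python sketch builds cross-omics positive pairs and candidate negatives from cluster-block overlap, selects hard negatives by center similarity, and evaluates an InfoNCE-style loss for one anchor. Several details are assumptions, because the extracted text does not preserve the formulas: the overlap rate is taken to be the Jaccard index, the threshold test is read as "greater than or equal", and the loss is written for a single anchor in the standard InfoNCE form. All function names, variable names, and toy values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two cluster-block center vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def overlap_rate(ids_a, ids_b):
    """ASSUMPTION: Jaccard overlap of two sample-ID sets (formula not in the source)."""
    union = ids_a | ids_b
    return len(ids_a & ids_b) / len(union) if union else 0.0


def build_pairs(blocks_v, blocks_u, threshold):
    """Steps 4.2-4.4: split cross-omics block pairs by the overlap threshold."""
    positives, candidates = [], []
    for k, ids_k in blocks_v.items():
        for j, ids_j in blocks_u.items():
            # ASSUMPTION: "reaches the threshold" is read as >=.
            if overlap_rate(ids_k, ids_j) >= threshold:
                positives.append((k, j))
            else:
                candidates.append((k, j))
    return positives, candidates


def hard_negatives(anchor, candidates, centers_v, centers_u, top_k):
    """Steps 5.2-5.3: keep the top_k candidate negatives most similar to the anchor."""
    own = [j for k, j in candidates if k == anchor]
    own.sort(key=lambda j: cosine(centers_v[anchor], centers_u[j]), reverse=True)
    return own[:top_k]


def info_nce(anchor, positive, negatives, tau):
    """Step 6.2 for one anchor: -log(e^{s+} / (e^{s+} + sum_j e^{s_j-}))."""
    pos = math.exp(cosine(anchor, positive) / tau)
    neg = sum(math.exp(cosine(anchor, c) / tau) for c in negatives)
    return -math.log(pos / (pos + neg))


# Toy data: cluster blocks (sample-ID sets) and centers for two omics.
blocks_mrna = {0: {"p1", "p2", "p3"}, 1: {"p4", "p5", "p6"}}
blocks_meth = {0: {"p1", "p2", "p4"}, 1: {"p5", "p6"}, 2: {"p3"}}
centers_mrna = {0: [1.0, 0.0], 1: [0.0, 1.0]}
centers_meth = {0: [0.9, 0.1], 1: [0.1, 0.9], 2: [0.7, 0.7]}

positives, candidates = build_pairs(blocks_mrna, blocks_meth, threshold=0.5)
hard = hard_negatives(0, candidates, centers_mrna, centers_meth, top_k=1)
loss = info_nce(centers_mrna[0], centers_meth[0],
                [centers_meth[j] for j in hard], tau=0.5)
print(positives, hard, loss)
```

In this sketch the negative whose center lies closest to the anchor in cosine terms is the one retained, which is what makes it "hard": it contributes the largest term to the denominator and therefore the largest gradient pressure to separate the two cluster blocks.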

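The soft refinement of step 7 (claim 5) can likewise be illustrated with a minimal sketch. The formulas are not preserved in the extracted text, so the code below assumes the widely used deep-embedded-clustering form: a Student-t kernel between each latent vector and each cluster block center, a target distribution obtained by squaring and renormalizing the soft assignments, and a KL divergence between the two. Whether the claimed assignment is computed per sample or per cluster block cannot be confirmed from the text; the per-sample form, the function names, and the toy values are all assumptions.

```python
import math


def soft_assign(latents, centers, alpha=1.0):
    """Student-t soft assignment q[i][k] of each latent vector to each center."""
    q = []
    for z in latents:
        kernel = [
            (1.0 + sum((a - b) ** 2 for a, b in zip(z, mu)) / alpha)
            ** (-(alpha + 1.0) / 2.0)
            for mu in centers
        ]
        total = sum(kernel)
        q.append([w / total for w in kernel])
    return q


def target_distribution(q):
    """Sharpened target p[i][k] proportional to q[i][k]**2 / f[k], f[k] = sum_i q[i][k]."""
    freq = [sum(row[k] for row in q) for k in range(len(q[0]))]
    p = []
    for row in q:
        weights = [row[k] ** 2 / freq[k] for k in range(len(row))]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


def kl_loss(p, q):
    """KL(P || Q) summed over samples and cluster blocks."""
    return sum(
        pk * math.log(pk / qk)
        for p_row, q_row in zip(p, q)
        for pk, qk in zip(p_row, q_row)
        if pk > 0
    )


# Toy latent vectors near two cluster block centers (hypothetical values).
latents = [[0.1, 0.0], [0.0, 0.2], [1.0, 0.9], [0.9, 1.1]]
centers = [[0.0, 0.0], [1.0, 1.0]]

q = soft_assign(latents, centers)
p = target_distribution(q)
soft_loss = kl_loss(p, q)
print(soft_loss)
```

Squaring the soft assignments before renormalizing pushes each row of the target distribution toward its most confident cluster block, so minimizing the KL term draws latent vectors toward their nearest center, consistent with the stated aim of tighter and better separated clusters.
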
Description

Multi-omics cancer subtype identification method, system, device and medium based on density-aware cluster-structure-guided contrastive learning

Technical Field

The invention relates to a multi-omics cancer subtype identification method, system, device and medium based on density-aware cluster-structure-guided contrastive learning.

Background

Cancer is a class of highly heterogeneous major diseases. Even when tumors originate from the same organ or tissue, different patients may differ significantly at the molecular level and in cell regulation mechanisms, tumor microenvironment, and disease progression trajectory. This heterogeneity is reflected not only in differences in gene mutation profiles and epigenetic regulatory states, but also in differing patient responses to treatment regimens and inconsistent clinical prognoses. Traditional typing methods based on a single clinical index or a single molecular feature therefore often fail to characterize the inherent complexity of cancer comprehensively. Identifying the molecular subtypes of cancer groups patients at a finer granularity, which helps reveal differences in underlying biological mechanisms and provides an important basis for clinical stratification of diagnosis and treatment, prediction of therapeutic efficacy, and formulation of personalized treatment strategies; it has become one of the core research directions in precision medicine.

With the rapid development of high-throughput sequencing technology and multi-omics profiling platforms, researchers can simultaneously acquire multidimensional molecular data such as gene expression (mRNA), DNA methylation, and miRNA from the same batch of patient samples and analyze them systematically in combination with survival follow-up and clinical phenotype information.
These multi-omics data describe the tumor state at different biological levels, are highly complementary, and provide an important basis for identifying cancer subtypes more comprehensively and accurately. However, multi-omics data are typically high-dimensional, noisy, inconsistently distributed, and limited in sample size, which places higher demands on the robustness, stability, and generalization capability of subtype identification methods.

Existing multi-omics subtype identification methods are generally classified by the stage at which integration takes place into early integration, late integration, and intermediate integration.

Early integration methods directly concatenate or weight and fuse the features of the different omics into a unified data matrix and then perform typing with conventional clustering or dimensionality reduction methods. For example, Consensus Clustering builds consensus matrices by resampling and repeated clustering to evaluate stability and help determine the number of clusters; multiple factor analysis (MFA) weights the different omics at the feature level and reduces their dimensionality jointly; and JIVE decomposes the multi-omics matrices into shared and omics-specific parts to balance consistency and variability. These methods have an intuitive workflow, but when the omics differ greatly in dimensionality, have inconsistent scales, and carry different noise levels, they are prone to accumulation of high-dimensional redundancy, noise amplification, and fusion bias, which impairs clustering stability and clinical interpretability.

Late integration methods perform clustering or modeling independently within each omics and then fuse the results in a second stage.
For example, COCA first obtains clustering results on each omics platform and then fuses them at the result level to obtain cross-platform consistent subtypes, while PINSPlus improves clustering robustness through perturbation and fuses the results based on connectivity matrices. Such methods preserve omics specificity to a certain extent, but because they lack cross-omics joint representation learning and explicit structural constraints, their performance often depends on the quality of the single-omics clustering and on the late fusion strategy. As a result, they have difficulty fully capturing the complementary information among the omics, and the fusion process is sensitive to noise and parameter selection.

Intermediate integration methods integrate multiple omics through joint learning in a latent space, capturing the shared structure while retaining omics-specific information as far as possible. Typical representatives include iCluster and its extensions (e.g., iClusterBayes) and the factor-model-based MOFA. In addition to latent variable modeling, there are also intermediate integration methods based on similarity or kernels, such as SNF, which constructs a similarity network for each omics and generates a fused network through an iterative diffusion-based fusion process,