CN-121998939-A - Pathological image processing method and system based on progressive cross-modal semantic interaction


Abstract

The invention relates to the technical field of medical image processing and discloses a pathological image processing method and system based on progressive cross-modal semantic interaction. The method comprises the following steps. HE and mIF images are preprocessed and deeply registered, and a high-quality paired data set is constructed through mutual-information screening. A progressive cross-modal semantic interaction module is then constructed. It fuses HE morphological features with mIF molecular features through a two-stage cross-attention mechanism to generate semantically enhanced features. Pre-training adopts a dual-branch framework that combines two parts: masked autoencoding, which reconstructs local structure, and global contrastive learning, which aligns semantics across modalities. In addition, a perceptual auxiliary loss and a fine-grained auxiliary loss are introduced. They constrain the learned features in terms of multi-scale perceptual quality and local region correspondence, respectively. The invention effectively addresses cross-modal alignment, and the generated features capture both anatomical structure and molecular semantics, thereby improving the accuracy and generalization capability of pathological image analysis.

Inventors

WANG SHUO
WANG CHENGHAO
LIU CHENGCAI
WU GE
HUANG XINGYU
SANG HAOLIN
XING HAO

Assignees

Beihang University (北京航空航天大学)

Dates

Publication Date: 2026-05-08
Application Date: 2026-01-27

Claims (10)

1. A pathological image processing method based on progressive cross-modal semantic interaction, characterized by comprising the following steps: S1, performing channel rearrangement, pseudo-color rendering and correction on multiplex immunofluorescence (mIF) images to generate standardized mIF images, and spatially registering hematoxylin-eosin (HE) images with the standardized mIF images using a deep learning framework; S2, traversing the registered whole-slide images on a grid to extract paired image patches in batches, calculating the mutual information value between paired patches, and retaining the pairs that meet a threshold; S3, extracting raw features of the HE and mIF image patches with encoders, and generating semantically enhanced features with a semantic injector through a two-stage progressive cross-attention mechanism that first establishes inter-modal dependence and then incorporates joint context; S4, pre-training the semantically enhanced features with a dual-branch framework comprising a masked autoencoding branch, which performs masked reconstruction of the HE features, and a global contrastive learning branch, which aligns global semantics between modalities; S5, extracting features to calculate a perceptual auxiliary loss that constrains high-level semantic structure, and extracting local patch features to calculate a fine-grained contrastive loss for learning local cross-modal correspondence.
2. The pathological image processing method based on progressive cross-modal semantic interaction according to claim 1, characterized in that, in step S1, the preprocessing of the mIF image comprises: establishing a predefined list of standardized channel names and their order, parsing all channels from the raw mIF image data, traversing the predefined target channel order, searching the raw image data for each matching channel, and filling any channel without a match with all-zero values to generate a standardized channel sequence; assigning each standardized channel a pseudo-color composed of red, green and blue primary components, multiplying the intensity-normalized image data by the pseudo-color components pixel by pixel, and applying brightness and gamma correction through a linear multiplication factor and a power-law function; the deep registration comprises resampling and color-normalizing the HE image and the preprocessed mIF image, calculating a rough global spatial correspondence between the source image and the target image for initial alignment, calculating a non-rigid deformation field for deformable registration, and applying the deformation field at the original resolution to complete the final warping.
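The mIF preprocessing in claim 2 can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions. The channel panel `TARGET_CHANNELS`, the colors in `PSEUDO_COLORS`, the `gain`/`gamma` defaults, the min-max intensity normalization and the additive blending of colored channels into one RGB image are all hypothetical choices, because the claim does not specify them. The registration step is not covered.

```python
import numpy as np

# Hypothetical channel panel and pseudo-colors (not specified in the patent).
TARGET_CHANNELS = ["DAPI", "CD3", "CD8", "PanCK"]
PSEUDO_COLORS = {
    "DAPI": (0.0, 0.0, 1.0),
    "CD3": (1.0, 0.0, 0.0),
    "CD8": (0.0, 1.0, 0.0),
    "PanCK": (1.0, 1.0, 0.0),
}


def standardize_channels(image, channel_names, target=TARGET_CHANNELS):
    """Reorder a (C, H, W) mIF stack into `target` order; missing channels become all zeros."""
    lookup = {name: i for i, name in enumerate(channel_names)}
    _, h, w = image.shape
    out = np.zeros((len(target), h, w), dtype=np.float32)
    for k, name in enumerate(target):
        if name in lookup:  # matching channel found in the raw data
            out[k] = image[lookup[name]]
    return out


def render_pseudo_color(stack, target=TARGET_CHANNELS, gain=1.2, gamma=0.8):
    """Color each normalized channel, blend additively (assumption), then apply gain and gamma."""
    rgb = np.zeros(stack.shape[1:] + (3,), dtype=np.float32)
    for k, name in enumerate(target):
        ch = stack[k]
        lo, hi = ch.min(), ch.max()
        norm = (ch - lo) / (hi - lo) if hi > lo else np.zeros_like(ch)  # intensity normalization
        rgb += norm[..., None] * np.asarray(PSEUDO_COLORS[name], dtype=np.float32)  # pixel-wise multiply
    rgb = np.clip(rgb * gain, 0.0, 1.0)  # linear brightness factor
    return rgb ** gamma                  # power-law (gamma) correction
```

Zero-filling missing markers keeps every standardized image at the same channel count and order. Downstream encoders can therefore assume a fixed input layout regardless of the staining panel a slide was scanned with.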
3. The pathological image processing method based on progressive cross-modal semantic interaction according to claim 1, wherein, in step S2, calculating the mutual information value comprises: normalizing the pixel values of the images and quantizing them to a preset number of levels to construct discrete probability distributions, and calculating the joint histogram of the two images and the marginal histogram of each image; calculating the entropy of each image from its marginal histogram and the joint entropy of the two images from the joint histogram, and obtaining the mutual information value by subtracting the joint entropy from the sum of the two individual entropies, i.e. I(A;B) = H(A) + H(B) - H(A,B).
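The histogram-based mutual information of claim 3 and the grid-traversal screening of step S2 can be sketched as below. Several details are assumptions rather than the patented settings:

- Both slides are taken to be already registered and converted to single-channel (grayscale) arrays.
- The values of `levels`, `patch` and `threshold` are placeholders.
- The grid uses non-overlapping tiles.

```python
import numpy as np


def mutual_information(a, b, levels=32):
    """Histogram-based mutual information (in nats) between two equally sized grayscale patches."""
    def quantize(x):
        x = x.astype(np.float64)
        lo, hi = x.min(), x.max()
        if hi == lo:
            return np.zeros(x.shape, dtype=np.int64)
        return np.minimum(((x - lo) / (hi - lo) * levels).astype(np.int64), levels - 1)

    qa, qb = quantize(a).ravel(), quantize(b).ravel()
    joint = np.zeros((levels, levels))
    np.add.at(joint, (qa, qb), 1)              # joint histogram
    p_ab = joint / joint.sum()
    p_a, p_b = p_ab.sum(axis=1), p_ab.sum(axis=0)  # marginal histograms

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    # I(A;B) = H(A) + H(B) - H(A,B)
    return entropy(p_a) + entropy(p_b) - entropy(p_ab.ravel())


def screen_patch_pairs(he_gray, mif_gray, patch=256, threshold=0.5):
    """Tile two registered slides on a grid and keep (y, x, mi) where MI meets the threshold."""
    h, w = he_gray.shape
    kept = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            mi = mutual_information(he_gray[y:y + patch, x:x + patch],
                                    mif_gray[y:y + patch, x:x + patch])
            if mi >= threshold:
                kept.append((y, x, mi))
    return kept
```

Because MI depends only on the co-occurrence of intensity levels, not on their values, it can score alignment between two stains whose intensities are unrelated. This is why it suits HE/mIF pairs better than a pixel-wise difference.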
4. The pathological image processing method based on progressive cross-modal semantic interaction according to claim 1, wherein, in step S3, the HE encoder and the mIF encoder both adopt a Vision Transformer architecture comprising self-attention layers and attention heads and are loaded with pre-trained initial weights; before the raw feature sequences are extracted, the input image patches are augmented and normalized by applying horizontal flipping, random rotation within a preset angle range and random brightness and contrast perturbations, resizing to a uniform size, mapping pixel values to the range 0 to 1, and normalizing with the mean and standard deviation of a preset dataset.
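The augmentation and normalization pipeline of claim 4 can be sketched in plain NumPy. The following are hypothetical choices, since the patent only refers to a "preset" angle range and a "preset dataset":

- ImageNet-style `MEAN`/`STD`.
- A ±15° rotation range.
- ±10% brightness/contrast jitter.
- A 224-pixel output size.
- Nearest-neighbor resampling.

In practice a library such as torchvision would usually supply these transforms.

```python
import numpy as np

# Hypothetical normalization statistics (ImageNet-style); the patent says only "preset dataset".
MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])


def rotate_nearest(img, angle_deg):
    """Rotate an (H, W, C) image about its center with nearest-neighbor sampling; outside is zero."""
    h, w = img.shape[:2]
    theta = np.deg2rad(angle_deg)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    yy, xx = np.mgrid[0:h, 0:w]
    # Inverse mapping: find the source pixel for every output pixel.
    ys = np.cos(theta) * (yy - cy) - np.sin(theta) * (xx - cx) + cy
    xs = np.sin(theta) * (yy - cy) + np.cos(theta) * (xx - cx) + cx
    ys, xs = np.rint(ys).astype(int), np.rint(xs).astype(int)
    valid = (ys >= 0) & (ys < h) & (xs >= 0) & (xs < w)
    out = np.zeros_like(img)
    out[valid] = img[ys[valid], xs[valid]]
    return out


def resize_nearest(img, size):
    """Nearest-neighbor resize of an (H, W, C) image to (size, size, C)."""
    h, w = img.shape[:2]
    rows = (np.arange(size) * h / size).astype(int)
    cols = (np.arange(size) * w / size).astype(int)
    return img[rows][:, cols]


def augment_and_normalize(img_uint8, rng, max_angle=15.0, jitter=0.1, size=224):
    img = img_uint8.astype(np.float32)
    if rng.random() < 0.5:
        img = img[:, ::-1]                                    # horizontal flip
    img = rotate_nearest(img, rng.uniform(-max_angle, max_angle))
    contrast = 1.0 + rng.uniform(-jitter, jitter)
    brightness = 1.0 + rng.uniform(-jitter, jitter)
    mean_val = img.mean()
    img = (img - mean_val) * contrast + mean_val              # contrast around the mean
    img = np.clip(img * brightness, 0.0, 255.0)               # brightness scaling
    img = resize_nearest(img, size) / 255.0                   # uniform size, values in [0, 1]
    return (img - MEAN) / STD                                 # dataset mean/std normalization
```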
5. The pathological image processing method based on progressive cross-modal semantic interaction according to claim 1, wherein, in step S3, the first stage of the two-stage progressive cross-attention mechanism comprises projecting the raw feature sequence of the HE image into a query space, projecting the raw feature sequence of the mIF image into a key space and a value space respectively, and updating the HE features by calculating attention weights; the second stage comprises concatenating the raw feature sequences of the HE image and the mIF image, generating a joint context sequence through a linear projection layer, projecting the joint context sequence into the query space, projecting the HE features and mIF features updated in the first stage into the key space and the value space respectively, performing cross-attention, and outputting a cross-modal semantic feature sequence; the semantic injector transforms the semantic information contained in the cross-modal semantic feature sequence through a multilayer perceptron and injects it into the raw feature sequences of the HE image and the mIF image respectively, generating semantically enhanced hematoxylin-eosin features and multiplex immunofluorescence features.
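The two-stage interaction and the semantic injector of claim 5 can be sketched with single-head attention and random weights. Several assumptions are involved, none of which the claim fixes:

- The HE and mIF sequences have equal length (paired patches of the same size).
- The "concatenation" for the joint context is along the feature axis.
- Stage 1 updates only the HE features, with a residual connection; the mIF values in stage 2 are therefore the raw mIF features.
- Injection is a residual addition of a two-layer ReLU MLP output.
- `ProgressiveSemanticInjector` is a hypothetical name.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(q, k, v):
    """Scaled dot-product attention: queries (Nq, d) attend over keys/values (Nk, d)."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v


class ProgressiveSemanticInjector:
    """Single-head, randomly initialized sketch of the two-stage interaction plus injector."""

    def __init__(self, dim, rng):
        def w(n_in, n_out):
            return rng.normal(0.0, n_in ** -0.5, size=(n_in, n_out))
        self.q1, self.k1, self.v1 = w(dim, dim), w(dim, dim), w(dim, dim)
        self.ctx = w(2 * dim, dim)            # linear layer producing the joint context
        self.q2, self.k2, self.v2 = w(dim, dim), w(dim, dim), w(dim, dim)
        self.mlp_he = (w(dim, 4 * dim), w(4 * dim, dim))
        self.mlp_mif = (w(dim, 4 * dim), w(4 * dim, dim))

    @staticmethod
    def _mlp(x, weights):
        w1, w2 = weights
        return np.maximum(x @ w1, 0.0) @ w2   # two-layer ReLU MLP

    def __call__(self, he, mif):
        # Stage 1: HE queries attend to mIF keys/values; residual update of HE features.
        he_upd = he + cross_attention(he @ self.q1, mif @ self.k1, mif @ self.v1)
        # Stage 2: joint context (feature-axis concat of raw sequences) as queries,
        # stage-1 HE features as keys and mIF features as values.
        joint = np.concatenate([he, mif], axis=-1) @ self.ctx
        semantic = cross_attention(joint @ self.q2, he_upd @ self.k2, mif @ self.v2)
        # Semantic injector: MLP-transformed semantics added to each raw sequence.
        return he + self._mlp(semantic, self.mlp_he), mif + self._mlp(semantic, self.mlp_mif)
```

The progression is visible in the query sources. Stage 1 lets morphology query molecular features directly. Stage 2 queries with a context that already mixes both modalities.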
6. The pathological image processing method based on progressive cross-modal semantic interaction according to claim 1, characterized in that, in step S4, constructing the masked autoencoding branch comprises: extracting the patch features that represent local image regions from the semantically enhanced hematoxylin-eosin features and randomly masking them with a preset probability; feeding the remaining unmasked patch features into a decoder, which reduces their dimension through a linear layer, processes them with self-attention layers, projects them back to the original dimension, and predicts the original content of the masked patches; and calculating, as the reconstruction loss, the pixel-level mean squared error between the masked-patch content predicted by the decoder and the true content of the corresponding patches in the original HE image.
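The masked-reconstruction loss of claim 6 can be sketched as follows. The sketch follows common masked-autoencoder (MAE) practice, so several details are assumptions:

- Masked positions are filled with a shared mask token so the decoder knows where to predict.
- The decoder has one residual self-attention layer.
- The final projection maps to the flattened patch-pixel size.
- A fixed fraction of patches is masked.
- `init_decoder` and its random weights are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def patchify(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches, flattened to (N, p*p*C)."""
    h, w, c = img.shape
    x = img.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, p * p * c)


def init_decoder(dim, dec_dim, patch_dim, rng):
    """Hypothetical lightweight decoder parameters (random init, illustration only)."""
    def w(*shape):
        return rng.normal(0.0, shape[0] ** -0.5, size=shape)
    return {
        "mask_token": np.zeros(dim),
        "down": w(dim, dec_dim), "q": w(dec_dim, dec_dim),
        "k": w(dec_dim, dec_dim), "v": w(dec_dim, dec_dim),
        "up": w(dec_dim, patch_dim),
    }


def masked_reconstruction_loss(patch_feats, he_img, p, mask_ratio, params, rng):
    """Mask a fraction of HE patch tokens, decode, and score pixel MSE on masked patches only."""
    n = patch_feats.shape[0]
    n_mask = max(1, int(round(mask_ratio * n)))
    mask = np.zeros(n, dtype=bool)
    mask[rng.permutation(n)[:n_mask]] = True
    # Masked positions are replaced by a shared mask token (MAE-style assumption).
    tokens = np.where(mask[:, None], params["mask_token"], patch_feats)
    z = tokens @ params["down"]                             # linear dimension reduction
    q, k, v = z @ params["q"], z @ params["k"], z @ params["v"]
    z = z + softmax(q @ k.T / np.sqrt(z.shape[1])) @ v      # one self-attention layer (residual)
    pred = z @ params["up"]                                 # project to patch-pixel dimension
    target = patchify(he_img, p)
    return float(np.mean((pred[mask] - target[mask]) ** 2))
```

Scoring only the masked patches means the decoder cannot lower the loss by copying visible content. It has to infer the hidden tissue structure from context.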
7. The pathological image processing method based on progressive cross-modal semantic interaction according to claim 1, wherein, in step S4, constructing the global contrastive learning branch comprises: extracting the class tokens that represent overall semantics from the semantically enhanced hematoxylin-eosin features and from the semantically enhanced multiplex immunofluorescence features, respectively, as global features; calculating the cosine similarity between the HE global features and the mIF global features to form a similarity matrix; and applying a symmetric cross-entropy loss that compares each HE global feature against all mIF global features and each mIF global feature against all HE global features, maximizing the similarity between paired images and minimizing the similarity between unpaired images.
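The symmetric loss of claim 7 has the same form as the CLIP-style InfoNCE objective, and it can be sketched as below. The `temperature` value is an assumption, since the claim does not mention one. Row i of both inputs is assumed to come from the same registered HE/mIF pair, so positives lie on the diagonal of the similarity matrix.

```python
import numpy as np


def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def global_contrastive_loss(he_cls, mif_cls, temperature=0.07):
    """Symmetric cross-entropy over a cosine-similarity matrix of (B, D) class tokens."""
    he = he_cls / np.linalg.norm(he_cls, axis=1, keepdims=True)
    mif = mif_cls / np.linalg.norm(mif_cls, axis=1, keepdims=True)
    logits = (he @ mif.T) / temperature        # (B, B) cosine similarities, temperature-scaled
    idx = np.arange(logits.shape[0])           # paired samples sit on the diagonal
    loss_he_to_mif = -log_softmax(logits, axis=1)[idx, idx].mean()  # each HE vs all mIF
    loss_mif_to_he = -log_softmax(logits, axis=0)[idx, idx].mean()  # each mIF vs all HE
    return float(0.5 * (loss_he_to_mif + loss_mif_to_he))
```

For example, with orthonormal class tokens, perfectly paired inputs give a loss close to zero. Cyclically shifting the mIF rows moves the positives off the diagonal and yields a much larger loss.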
8. The pathological image processing method based on progressive cross-modal semantic interaction according to claim 1, wherein, in step S5, constructing the perceptual auxiliary loss comprises: using a pre-trained residual network with frozen parameters as a feature extractor to obtain the feature maps output at different intermediate layers for the original HE image and for the reconstructed HE image, respectively; fusing the extracted multi-layer features to obtain a perceptual feature of the original HE image and a perceptual feature of the reconstructed HE image; and subtracting the two fused feature tensors element by element and averaging the squared differences, which gives the feature-space L2 distance used as the perceptual auxiliary loss.
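The perceptual loss of claim 8 can be sketched without a deep learning framework. The pre-trained, frozen ResNet is replaced by a hypothetical stand-in, `FrozenStandInExtractor`, whose fixed random stages each apply 1x1 channel mixing, ReLU and 2x average pooling; the sketch only mimics a multi-stage extractor and is not a ResNet. Fusion is assumed to be nearest-neighbor upsampling of every stage to the finest resolution, followed by channel concatenation. The input size must be divisible by 2 raised to the number of stages.

```python
import numpy as np


def avg_pool2(x):
    """2x2 average pooling of an (H, W, C) feature map."""
    h, w, c = x.shape
    return x[: h // 2 * 2, : w // 2 * 2].reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))


class FrozenStandInExtractor:
    """Hypothetical stand-in for a frozen pre-trained ResNet: fixed weights, multi-stage outputs."""

    def __init__(self, in_ch=3, widths=(8, 16, 32), seed=0):
        rng = np.random.default_rng(seed)
        chans = (in_ch,) + tuple(widths)
        self.weights = [rng.normal(0.0, 1.0 / np.sqrt(a), size=(a, b))
                        for a, b in zip(chans[:-1], chans[1:])]

    def __call__(self, img):
        feats, x = [], img
        for w in self.weights:                 # each stage: 1x1 mixing -> ReLU -> downsample
            x = avg_pool2(np.maximum(x @ w, 0.0))
            feats.append(x)
        return feats


def fuse(feats):
    """Upsample every stage to the finest stage's resolution and concatenate channels."""
    h, w = feats[0].shape[:2]
    ups = [np.repeat(np.repeat(f, h // f.shape[0], axis=0), w // f.shape[1], axis=1) for f in feats]
    return np.concatenate(ups, axis=-1)


def perceptual_loss(original, reconstructed, extractor):
    """Mean squared difference between fused multi-layer features (feature-space L2)."""
    diff = fuse(extractor(original)) - fuse(extractor(reconstructed))
    return float(np.mean(diff ** 2))
```

In a real setup the stand-in would be swapped for intermediate outputs of a pre-trained ResNet with gradients disabled. The loss computation itself would stay the same.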
9. The pathological image processing method based on progressive cross-modal semantic interaction according to claim 1, wherein, in step S5, constructing the fine-grained contrastive loss comprises: removing the class token and register tokens from the semantically enhanced hematoxylin-eosin feature sequence and the semantically enhanced multiplex immunofluorescence feature sequence and retaining the patch tokens; flattening the retained local feature tensors into a list of feature vectors and applying L2 normalization; computing the inner products between the flattened HE feature matrix and the transposed mIF feature matrix by matrix multiplication to produce a similarity matrix, and multiplying the similarity matrix by a learnable scaling factor; and treating each HE patch as an anchor, the mIF patch at the corresponding position as the positive sample and the remaining patches as negative samples, computing the cross-entropy loss in both directions and averaging the two to obtain the patch-level fine-grained contrastive loss.
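The patch-level loss of claim 9 can be sketched as below. The following are assumptions:

- The class and register tokens are taken to occupy the first `num_prefix_tokens` positions of each (B, T, D) sequence. The default of 5, i.e. one class token plus four registers, is hypothetical.
- The learnable scaling factor is passed in as a plain number, `logit_scale`.
- Negatives are all other patches across the batch.

```python
import numpy as np


def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def fine_grained_contrastive_loss(he_seq, mif_seq, num_prefix_tokens=5, logit_scale=10.0):
    """Patch-level symmetric InfoNCE between (B, T, D) HE and mIF token sequences."""
    # Drop class/register tokens and flatten the remaining patch tokens to (B*N, D).
    he = he_seq[:, num_prefix_tokens:].reshape(-1, he_seq.shape[-1])
    mif = mif_seq[:, num_prefix_tokens:].reshape(-1, mif_seq.shape[-1])
    he = he / np.linalg.norm(he, axis=1, keepdims=True)       # L2 normalization
    mif = mif / np.linalg.norm(mif, axis=1, keepdims=True)
    logits = logit_scale * (he @ mif.T)                       # scaled similarity matrix
    idx = np.arange(logits.shape[0])                          # positive: same position
    loss_he = -log_softmax(logits, axis=1)[idx, idx].mean()   # HE anchors vs all mIF patches
    loss_mif = -log_softmax(logits, axis=0)[idx, idx].mean()  # mIF anchors vs all HE patches
    return float(0.5 * (loss_he + loss_mif))
```

Unlike the global branch, which has one positive per image pair, this loss has one positive per patch position. It therefore penalizes local misalignment that a matching class token would not reveal.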
10. A pathological image processing system based on progressive cross-modal semantic interaction, comprising: a preprocessing and deep registration module for standardizing and spatially aligning the HE image and the mIF image; a high-quality paired data set construction module for constructing a patch data set for model training and screening paired data based on mutual information; a progressive cross-modal semantic interaction module for extracting and fusing the cross-modal features of HE and mIF images with a two-stage cross-attention mechanism; a dual-branch contrastive pre-training module for feature learning and model optimization through masked autoencoding and global contrastive learning; and an auxiliary loss calculation module for calculating the perceptual auxiliary loss and the fine-grained contrastive loss to constrain model training.

Description

Pathological image processing method and system based on progressive cross-modal semantic interaction

Technical Field

The invention relates to the technical field of medical image processing, in particular to a pathological image processing method and system based on progressive cross-modal semantic interaction.

Background

At present, pathological image analysis plays a decisive role in tumor microenvironment analysis and clinical prognosis evaluation. Hematoxylin-eosin (HE) staining, the gold standard for morphological diagnosis, provides rich tissue-structure information and is the main means of routine clinical examination. Meanwhile, multiplex immunofluorescence (mIF) staining can label multiple biomarkers simultaneously, revealing complex molecular expression and the functional states of cells. HE images are inexpensive to acquire and widely available, but they show only morphological characteristics and can hardly reflect deeper molecular-biological information directly. mIF images, by contrast, provide a molecular-level gold standard, but their complex preparation and high cost make them difficult to apply in large-scale clinical screening.

To exploit such multi-modal data, most existing solutions focus on mining the spatial association between HE and mIF images. Conventional pipelines typically use image registration algorithms to align the two differently stained sections anatomically. In deep learning applications, mainstream methods follow one of two approaches. They either use convolutional neural networks or Transformer architectures to extract HE image features independently, or attempt to learn a mapping from HE to mIF with generative adversarial networks.
Some studies adopt contrastive learning strategies to reduce the distance between the two modalities in a latent feature space, aiming to help the model learn shared representations from paired data.

However, existing solutions still have limitations when processing cross-modal pathology images. First, conventional registration methods often struggle to handle the non-rigid deformations introduced during slide preparation. They also lack effective quantitative screening of registration quality, leaving substantial noise and mismatches in the training data. Second, existing cross-modal fusion strategies mostly remain at shallow feature concatenation or simple global alignment. Lacking a progressive interaction mechanism, they cannot fully exploit the gradual guidance that molecular semantics can give to morphological features, so feature fusion is insufficient. In addition, a single pre-training task can hardly capture both global semantics and local details. Existing loss functions also often ignore multi-scale perceptual quality constraints and fine-grained regional correspondence. As a result, the features extracted by the model lack biological interpretability and generalize poorly to downstream tasks.

Therefore, the invention provides a pathological image processing method and system based on progressive cross-modal semantic interaction to overcome these defects of the prior art.

Disclosure of Invention

In view of the defects of the prior art, the invention provides a pathological image processing method and system based on progressive cross-modal semantic interaction. They solve three problems in existing pathological image analysis: HE images lack molecular semantic information, cross-modal data are difficult to align spatially, and generalization is weak because feature fusion is too shallow.
To achieve the above purpose, the invention is realized through the following technical solution. In a first aspect, the invention provides a pathological image processing method based on progressive cross-modal semantic interaction, comprising the following steps: S1, performing channel rearrangement, pseudo-color rendering and correction on multiplex immunofluorescence (mIF) images to generate standardized mIF images, and spatially registering hematoxylin-eosin (HE) images with the standardized mIF images using a deep learning framework; S2, traversing the registered whole-slide images on a grid to extract paired image patches in batches, calculating the mutual information value between paired patches, and retaining the pairs that meet a threshold; S3, extracting original features of HE and mIF image blocks by using an encoder, and generating semantic enhancement features by a semantic injector through firstly establishing modality dependence and then establishing a two-stage progressive cross-attention mechanism comb