CN-122023798-A - Visual language model-based optic disc atrophy arc weak supervision segmentation method

Abstract

A visual language model-based weakly supervised segmentation method for the optic disc atrophy arc, belonging to the field of fundus retinal image processing. The method constructs a bimodal encoder comprising a visual feature encoder and a text encoder and introduces a visual-language multimodal feature alignment mechanism: prior knowledge in medical text descriptions guides visual representation learning, cross-modal associations between images and lesion semantics are established in a unified feature space, and visual and text features are accurately aligned. Lesion attention heat maps are generated from the aligned features, and an adaptive multi-channel prompt generation network automatically mines high-confidence positive and negative prompt points from the heat maps. The generated prompt points are fed as positional guidance to the segmentation foundation model SAM, whose strong zero-shot segmentation capability enables fine extraction of optic disc atrophy arc boundaries under weak supervision with only text descriptions.

Inventors

  • LU SHUAI
  • LI HUIQI
  • YUAN GUODONG
  • ZHANG WEIHANG
  • WANG NINGLI
  • LIU HANRUO

Assignees

  • 北京理工大学
  • 北京理工大学唐山研究院

Dates

Publication Date
2026-05-12
Application Date
2026-01-23

Claims (7)

  1. A visual language model-based weakly supervised segmentation method for the optic disc atrophy arc, characterized by comprising the following steps:
     S1, constructing a bimodal encoder comprising a visual encoder and a text encoder, configuring the visual encoder to extract multi-scale visual features of fundus images and the text encoder to extract semantic features of text describing optic disc atrophy arc lesions, thereby establishing a feature extraction basis;
     S2, performing contrastive-learning fine-tuning of the bimodal encoder on a fundus image-text dataset so that visual features and text features are semantically aligned in a unified feature space, obtaining a fine-tuned bimodal encoder;
     S3, constructing an adaptive multi-channel prompt generation network, using the fine-tuned bimodal encoder to generate self-supervised pseudo-labels from lesion attention heat maps, and training the network to map heat maps to prompt-point coordinates, obtaining a trained prompt generation network;
     S4, acquiring a fundus image to be examined, computing the similarity matrix between the image and preset text features with the fine-tuned bimodal encoder, generating a lesion attention heat map, and feeding the heat map to the trained prompt generation network to obtain segmentation prompt points;
     S5, inputting the fundus image to be examined and the segmentation prompt points into a segmentation foundation model, using the prompt points as positional guidance to extract the optic disc atrophy arc boundary, thereby achieving automatic segmentation of the optic disc atrophy arc under weak text-only supervision.
  2. The method according to claim 1, wherein the visual encoder in S1 is constructed as follows: a Vision Transformer is used as a frozen backbone network; the input fundus image is divided into image patches of a fixed size, each image patch is mapped into a patch embedding by a linear projection layer, and a visual adapter is inserted in parallel between every two Transformer encoder blocks of the backbone, yielding the visual encoder.
  3. The method according to claim 1, wherein the text encoder in S1 is constructed as follows: a pre-trained large language model is adopted as a frozen backbone network; input text descriptions related to the optic disc atrophy arc (peripapillary atrophy, PPA) are tokenized, each token is mapped by an embedding layer into a D-dimensional embedding vector to form a text vector sequence, the sequence is fed into the Transformer encoding layers of the backbone, and text adapters are inserted between Transformer layers of the language model, yielding the text encoder.
  4. The method of claim 1, wherein the cross-modal feature alignment mechanism and fine-tuning optimization in S2 are implemented by jointly optimizing the image features and text features obtained in S1 under a composite loss function L, which consists of a global feature alignment loss, a local feature alignment loss, a cross-level consistency loss, and a regularization term, computed as follows:
     Step S2.1, constructing the global feature alignment loss L_global: at the whole-image level, contrastive learning raises the similarity of positive image-text pairs, with the loss defined as
        L_global = -(1/B) · Σ_{i=1}^{B} log [ exp(sim(v_i, t_i)/τ) / Σ_{j=1}^{B} exp(sim(v_i, t_j)/τ) ]
     where B is the batch size, v_i is the global feature vector of the i-th image, t_i is the positive text feature vector paired with the i-th image, t_j is any text feature vector in the batch, sim(·,·) is the cosine similarity function, and τ is a temperature hyperparameter;
     Step S2.2, constructing the local feature alignment loss L_local: at the local level, local image features and text token features are weakly matched with a multiple-instance learning strategy; since no pixel-level labels are used, the loss automatically selects the highest-responding image region as a potential lesion region for optimization:
        L_local = -(1/|W|) · Σ_{w∈W} max_p sim(f_p, t_w)
     where W is the set of text tokens describing PPA lesions, f_p is the visual feature vector at spatial position p, and t_w is the text feature vector of token w; by maximizing the similarity between the highest-response region and the lesion text, the loss drives the model to attend to and localize the lesion region;
     Step S2.3, constructing the cross-level consistency loss L_cons: at the cross-layer constraint level, the difference between shallow and deep feature alignment distributions is minimized:
        L_cons = Σ_{l∈S} λ_l · KL(P_l ‖ P_ref)
     where S is the set of network levels participating in alignment, P_l is the similarity distribution between layer-l visual features and the text features, P_ref is a reference similarity distribution, KL is the Kullback-Leibler divergence, and λ_l is the weight coefficient of layer l;
     Step S2.4, computing the composite loss function
        L = α·L_global + β·L_local + γ·L_cons + μ·‖θ‖²
     where α, β, and γ are weight hyperparameters satisfying a normalization condition, and ‖θ‖² is a norm regularization term over the set of adapter parameters θ.
  5. The method of claim 1, wherein the adaptive multi-channel prompt generation network in S3 is constructed and trained as follows:
     Step 3.1, constructing the network structure: the adaptive multi-channel prompt generation network adopts an encoder-decoder structure; its input is a single-channel attention heat map used for training (derived as in step S2.2), and its output is a dual-channel prompt probability map comprising a positive prompt probability map P_pos and a negative prompt probability map P_neg, corresponding respectively to the lesion boundary region predicted by the model and to high-confidence background regions, so that boundary recognition and background suppression are learned cooperatively within one network;
     Step 3.2, constructing self-supervised pseudo-labels: a pseudo-label set is generated automatically from the attention heat map with image processing operations; a gradient magnitude map G is computed from the heat map with a gradient operator (e.g., Sobel or Scharr), and G is thresholded and morphologically skeletonized to a single-pixel-wide boundary contour serving as the positive prompt pseudo-label Y_pos; the heat map is normalized and inverted to obtain a background confidence map, which is thresholded, retaining the high-confidence region as the negative prompt pseudo-label Y_neg;
     Step 3.3, end-to-end training: the network is optimized with a dual-channel composite loss function with positive and negative weights:
        L_prompt = w_pos · L_seg(P_pos, Y_pos) + w_neg · L_seg(P_neg, Y_neg)
     where L_seg is selected from Dice loss or Focal loss to alleviate the foreground-background pixel imbalance problem, and w_pos and w_neg are weight coefficients.
  6. The method of claim 1, wherein the lesion attention heat map and the prompt points in S4 are generated as follows:
     Step 4.1, extracting bimodal features at inference: the fundus image to be examined is fed to the visual encoder of the bimodal encoder fine-tuned in step S2 to extract a local visual feature map F; simultaneously, a preset target text prompt for the optic disc atrophy arc is fed to the text encoder of the fine-tuned bimodal encoder to obtain the corresponding global text feature vector t;
     Step 4.2, computing the image-text feature similarity matrix: the cosine similarity between t and the feature F(p) at each spatial position p of the local visual feature map is computed, giving a two-dimensional similarity matrix S with S(p) = sim(F(p), t); the higher the value at a position, the stronger the correlation between that local image region and the preset "optic disc atrophy arc" text semantics, and hence the higher the probability that the region is a PPA lesion;
     Step 4.3, generating the normalized attention heat map: the two-dimensional similarity matrix is upsampled to the resolution of the input image and normalized, yielding the final lesion attention heat map, which reflects the spatial probability distribution of the PPA lesion and provides coarse lesion localization;
     Step 4.4, generating prompt points: the attention heat map of the image to be examined is fed to the trained prompt generation network to obtain the dual-channel output; non-maximum suppression (NMS) and Top-K screening are applied to the positive prompt probability map, selecting the highest-response points as positive prompt points, and likewise to the negative prompt probability map, selecting high-confidence background points as negative prompt points.
  7. The method of claim 1, wherein pixel-level fine segmentation with the prompt points in S5 is implemented as follows:
     Step 5.1, constructing multi-modal input feature embeddings: the fundus image preprocessed in step S1 is fed to the image encoder of the segmentation foundation model SAM, which maps the high-resolution image into a low-resolution image feature embedding; simultaneously, the positive and negative prompt point sets extracted in step S4 are fed to the prompt encoder of the foundation model, which maps the spatial coordinates of the prompt points into sparse prompt feature embeddings using learnable positional encodings, positive prompt points being encoded as foreground features and negative prompt points as background features, so that lesion and non-lesion regions are distinguished in the feature space;
     Step 5.2, feature fusion and mask decoding: the image feature embedding and the prompt feature embeddings are fed to the mask decoder of the segmentation foundation model for feature fusion and interaction, and the model outputs a predicted lesion probability map;
     Step 5.3, a Sigmoid activation is applied to the probability map, which is then converted into a binary segmentation mask by a probability threshold; the connected regions with pixel value 1 in the binary mask are the final predicted optic disc atrophy arc (PPA) lesion regions; with positive prompts localizing the lesion body and negative prompts suppressing background noise, a lesion segmentation result with a fine edge structure is output.
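The global alignment loss of claim 4 (step S2.1) has the standard contrastive form. A minimal NumPy sketch follows; the function name, signature, and temperature default are illustrative, not taken from the patent:

```python
import numpy as np

def global_alignment_loss(img_feats, txt_feats, tau=0.07):
    """Contrastive loss between paired image/text features.

    img_feats, txt_feats: (B, D) arrays; row i of each forms a positive pair.
    tau: temperature hyperparameter.
    """
    # L2-normalise so dot products equal cosine similarities
    v = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    t = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    logits = v @ t.T / tau  # (B, B) image-text similarity matrix
    # log-softmax over all texts for each image; positives sit on the diagonal
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Matched pairs drive the diagonal similarities toward the row maximum, so the loss drops as alignment improves.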
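The pseudo-label construction of claim 5 (step 3.2) can be sketched in NumPy: a Sobel gradient magnitude thresholded into boundary pixels, and an inverted, normalized heat map thresholded into confident background. The helper names and threshold values are assumptions, and the claim's morphological skeletonization step is only noted in a comment:

```python
import numpy as np

def _conv2_same(img, k):
    """Tiny 'same' 2-D convolution (zero padding) for 3x3 kernels."""
    H, W = img.shape
    pad = np.pad(img, 1)
    out = np.zeros((H, W), dtype=float)
    for i in range(3):
        for j in range(3):
            out += k[i, j] * pad[i:i + H, j:j + W]
    return out

def pseudo_labels_from_heatmap(heat, edge_thr=0.5, bg_thr=0.8):
    """Derive positive/negative prompt pseudo-labels from an attention heat map.

    heat: (H, W) array in [0, 1].
    Returns (pos, neg) boolean masks: high-gradient boundary pixels and
    high-confidence background pixels.
    """
    sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    gx = _conv2_same(heat, sobel_x)
    gy = _conv2_same(heat, sobel_x.T)
    grad = np.hypot(gx, gy)
    grad = grad / (grad.max() + 1e-8)
    pos = grad > edge_thr  # candidate boundary pixels
    # (claim 5 additionally skeletonises `pos` to single-pixel width)
    rng_ = heat.max() - heat.min()
    bg = 1.0 - (heat - heat.min()) / (rng_ + 1e-8)  # inverted heat map
    neg = bg > bg_thr  # confident background
    return pos, neg
```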
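The heat-map and prompt-point steps of claim 6 (steps 4.2 and 4.4) reduce to per-position cosine similarity plus Top-K coordinate picking. A hedged NumPy sketch, with NMS omitted for brevity and all names illustrative:

```python
import numpy as np

def lesion_heatmap(feat_map, text_vec):
    """Cosine similarity between each spatial feature and the text vector.

    feat_map: (H, W, D) local visual features; text_vec: (D,) text feature.
    Returns an (H, W) heat map normalised to [0, 1].
    """
    f = feat_map / (np.linalg.norm(feat_map, axis=-1, keepdims=True) + 1e-8)
    t = text_vec / (np.linalg.norm(text_vec) + 1e-8)
    sim = f @ t
    return (sim - sim.min()) / (sim.max() - sim.min() + 1e-8)

def topk_points(prob_map, k=3):
    """Pick the k highest-response (row, col) coordinates as prompt points."""
    flat = np.argsort(prob_map, axis=None)[::-1][:k]
    return np.stack(np.unravel_index(flat, prob_map.shape), axis=1)
```

Applying `topk_points` to the positive and negative probability maps yields the positive and negative prompt coordinates of step 4.4.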
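Claim 7's prompt packing (step 5.1) and mask binarization (step 5.3) can be sketched as follows. SAM's prompt encoder conventionally takes point coordinates together with labels, 1 marking foreground and 0 marking background; the function names here are illustrative, and the actual SAM encoders/decoder are not reproduced:

```python
import numpy as np

def build_sam_prompts(pos_pts, neg_pts):
    """Pack positive/negative prompt points into SAM-style coordinate/label arrays."""
    coords = np.concatenate([pos_pts, neg_pts]).astype(float)
    labels = np.concatenate([np.ones(len(pos_pts)),   # 1 = foreground point
                             np.zeros(len(neg_pts))]) # 0 = background point
    return coords, labels

def decode_mask(logits, thr=0.5):
    """Sigmoid-activate decoder logits and threshold to a binary lesion mask."""
    prob = 1.0 / (1.0 + np.exp(-logits))
    return (prob >= thr).astype(np.uint8)
```

In a real pipeline the coordinate/label pair would be passed to the foundation model's prompt encoder, and `decode_mask` would postprocess the mask decoder's output into the final binary PPA mask.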

Description

Visual language model-based optic disc atrophy arc weak supervision segmentation method

Technical Field

The invention belongs to the field of fundus retinal image processing and relates to a visual language model-based weakly supervised segmentation method for the optic disc atrophy arc.

Background

The optic disc atrophy arc (peripapillary atrophy, PPA) is an important pathological feature of ocular fundus diseases such as glaucoma and high myopia, manifested as degenerative changes of the retinal pigment epithelium and choroid around the optic disc. Its morphological features (such as atrophy area, boundary morphology, and Beta-zone extent) are closely related to the occurrence and progression of disease. Accurate segmentation of the optic disc atrophy arc therefore has critical clinical significance for early glaucoma screening, disease course monitoring, and personalized treatment planning. Traditional clinical diagnosis, however, relies on doctors manually delineating lesion regions in fundus images, which is inefficient, subjective, and poorly reproducible, so automatic segmentation is needed to assist diagnosis. Current automatic segmentation methods for the optic disc atrophy arc fall into two major categories: traditional image processing techniques and fully supervised deep learning models. Traditional methods are mostly based on edge detection, region growing, or threshold segmentation, combined with prior shape constraints such as ellipse fitting. They rely on hand-crafted features, are extremely sensitive to image noise, uneven illumination, and contrast variation, and struggle with blurred atrophy arc boundaries and variable morphology.
Fully supervised deep learning models (e.g., U-Net and its variants), while significantly superior in performance to traditional approaches, depend heavily on large-scale, high-quality pixel-level annotations for training. In fundus image analysis, acquiring pixel-level labels not only demands a great deal of time from professional ophthalmologists, but also suffers from obvious inconsistency across doctors, and even for the same doctor at different times, owing to the complex spatial relationship between PPA regions and structures such as the optic disc and retinal vessels and the gradual, fuzzy transition of lesion boundaries. The high cost of data acquisition and the inconsistency of annotation severely restrict the generalization capability and practical clinical deployment of fully supervised models. To address this, weakly supervised learning (Weakly Supervised Learning) is becoming a research hotspot; it aims to achieve segmentation with low-cost labels such as image-level labels, bounding boxes, or sparse points. However, existing weakly supervised segmentation methods still face serious challenges when applied directly to the optic disc atrophy arc: on one hand, the localization maps produced by existing Class Activation Mapping (CAM) based methods have low resolution and rough boundaries and can hardly capture the fine edge structure of PPA; on the other hand, simple attention mechanisms struggle to model the complex multi-level contextual dependence between PPA and anatomical structures such as the optic disc and vessels, so segmentation results are prone to topological errors (such as holes or cracks).
Therefore, how to achieve high-precision, robust segmentation of the optic disc atrophy arc using only low-cost weak supervision signals combined with clinical semantic prior knowledge is the technical problem currently to be solved.

Disclosure of Invention

To overcome the heavy dependence of existing optic disc atrophy arc (PPA) segmentation methods on costly pixel-level annotation, and the inaccurate lesion boundary identification and insufficient generalization capability of traditional weakly supervised methods, the invention aims to provide a visual language model-based weakly supervised segmentation method for the optic disc atrophy arc that achieves high-precision, robust segmentation using low-cost weak supervision signals combined with clinical semantic prior knowledge. The aim of the invention is realized by the following technical scheme. The invention discloses a visual language model-based weakly supervised segmentation method for the optic disc atrophy arc that adopts a framework of semantic feature alignment, adaptive prompt generation, and foundation model fine segmentation. First, a bimodal encoder comprising a visual feature encoder and a text encoder is constructed, a visual-language multimodal feature alignment mechanism is introduced, and visual feature learning is guided by prior knowledge (such as focus posi