
CN-122023793-A - Text-guided image segmentation method and system

CN 122023793 A

Abstract

The invention provides a text-guided image segmentation method and system in the field of computer vision and image segmentation. The method processes a training image set to generate a corresponding training data unit for each foreground category in each training image, trains a dual-semantic focus alignment model on those training data units, performs iterative automatic segmentation on a given input image and a text instruction describing the target to be segmented, and outputs a final segmentation mask. In the system, a data preprocessing module is communicatively connected in sequence to a dual-semantic focus alignment module and an iterative automatic segmentation module.
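
As a reading aid, the following is a minimal Python sketch of the per-category training data unit named in the abstract. The field names, types, and array shapes are illustrative assumptions and are not specified by the patent.

    # Hypothetical container for one foreground category of one training image.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class TrainingDataUnit:
        image_features: np.ndarray      # image visual features, e.g. an (H, W, C) deep feature map
        foreground_mask: np.ndarray     # binary foreground category mask, (H, W)
        background_mask: np.ndarray     # complement of all foreground masks, (H, W)
        structural_points: np.ndarray   # cluster centres of in-mask pixel features, (K, C)
        text_embedding: np.ndarray      # embedding of the category's descriptive text, (C,)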

Inventors

  • LIU XUEYU
  • SHI GUANGZE
  • XIA SHUQI
  • WEI MINGQIANG
  • WU YONGFEI
  • WANG RUI

Assignees

  • Taiyuan University of Technology (太原理工大学)

Dates

Publication Date
2026-05-12
Application Date
2026-01-09

Claims (10)

  1. A text-guided image segmentation method, characterized by comprising the steps of: Step S1, processing a training image set and generating a corresponding training data unit for each foreground category in each training image, wherein the training data unit comprises image visual features, a foreground category mask, a background area mask, a structural semantic point set, and a text embedding vector; Step S2, training a dual-semantic focus alignment model based on the training data units, wherein the dual-semantic focus alignment model is configured to align the overall semantic encoding of the foreground category mask with the aggregate representation of the structural semantic point set so as to generate a visual semantic focus vector representing both the foreground category as a whole and its internal structural semantics; Step S3, for a given input image and a text instruction describing the target to be segmented, executing the following substeps: S31, obtaining, based on the trained dual-semantic focus alignment model, the unified semantic focus vector corresponding to the text instruction; S32, computing the similarity between the unified semantic focus vector and each local region feature in the image visual features of the input image to form a semantic similarity response map, and screening out an initial candidate region set based on a preset text similarity threshold; S33, performing iterative segmentation: in each iteration round, selecting the position with the highest semantic similarity from the current candidate region set as a prompt point, feeding it into a segmentation model to obtain a candidate segmentation mask, computing the maximum similarity between the region features of the candidate segmentation mask and the region features of the accepted mask set, and accepting the candidate segmentation mask if the maximum similarity exceeds a preset consistency threshold; and S34, repeating step S33 until a preset termination condition is met, and outputting the final segmentation mask.
  2. The text-guided image segmentation method according to claim 1, wherein generating the structural semantic point set in step S1 specifically comprises performing cluster analysis on the visual features corresponding to all pixels in each foreground category mask region and taking the resulting cluster centers as the structural semantic point set.
  3. The text-guided image segmentation method according to claim 1, characterized in that the joint optimization in step S2 specifically comprises: using a text-vision alignment loss function to reduce the distance between correctly paired text embedding vectors and visual semantic focus vectors while increasing the distance between incorrectly paired ones; and using a structural constraint loss function to constrain the visual semantic focus vectors so that, simultaneously, each visual semantic focus vector is as close as possible to its corresponding structural semantic point set, visual semantic focus vectors of different categories are as far apart as possible, and each visual semantic focus vector is as far as possible from the background semantic representation obtained based on the background area mask.
  4. The text-guided image segmentation method according to claim 3, characterized in that the background semantic representation obtained based on the background area mask is obtained by selecting a plurality of sampling points from the image region corresponding to the background area mask and obtaining the visual semantic focus vectors corresponding to those sampling points.
  5. The text-guided image segmentation method according to claim 1, wherein calculating the region features of the candidate segmentation mask in step S33 comprises average-pooling the image visual features of the region covered by the candidate segmentation mask to obtain the region features of the candidate segmentation mask.
  6. The text-guided image segmentation method according to claim 1, wherein the preset termination condition in step S34 comprises at least one of the following: the highest semantic similarity in the current candidate region set is lower than the text similarity threshold; the maximum similarity of a newly generated candidate segmentation mask is below the consistency threshold; or the number of iterations reaches a preset maximum number of iteration rounds.
  7. A text-guided image segmentation system, comprising: a data preprocessing module, configured to process a training image set and generate a corresponding training data unit for each foreground category in each training image, wherein the training data unit comprises image visual features, a foreground category mask, a background area mask, a structural semantic point set, and a text embedding vector; a dual-semantic focus alignment module, connected to the data preprocessing module, configured to receive the training data units and to align the overall semantic encoding of the foreground category mask with the aggregate representation of the structural semantic point set so as to generate a visual semantic focus vector for each foreground category; and an iterative automatic segmentation module, connected to the dual-semantic focus alignment module and used to execute segmentation in the inference stage, which receives an input image, a text instruction describing the target to be segmented, and the unified semantic focus vector corresponding to the text instruction output by the trained dual-semantic focus alignment module, and which is configured to iteratively generate prompt points according to the similarity between the unified semantic focus vector and each local region feature in the image visual features of the input image, feed the prompt points into a segmentation model, verify and fuse the generated segmentation results, and finally output a segmentation mask.
  8. The text-guided image segmentation system according to claim 7, wherein the data preprocessing module comprises: a mask labeling unit, configured to generate a binary mask of each foreground category as the foreground category mask based on pixel-level annotations, and to generate the background area mask as the complement of all foreground category masks; a text encoding unit, configured to encode the descriptive text of each foreground category using a text encoder and generate the text embedding vector; a feature extraction and pooling unit, configured to extract a deep feature map of the training image using a visual backbone network as the image visual features; and a structure clustering unit, configured to cluster all pixel features in each foreground category mask region, generate cluster centers representing the internal components of the category, and form the structural semantic point set.
  9. The text-guided image segmentation system according to claim 7, wherein the dual-semantic focus alignment module further comprises a structural constraint unit configured to apply structural constraints to the visual semantic focus vectors during the training phase, the structural constraints comprising: an intra-class compactness constraint that pulls in the distance between each visual semantic focus vector and its corresponding structural semantic point set; an inter-class separation constraint that pushes apart the visual semantic focus vectors of different categories; and a foreground-background suppression constraint that pushes each visual semantic focus vector away from the background semantic representation obtained based on the background area mask.
  10. The text-guided image segmentation system according to claim 7, wherein the iterative automatic segmentation module comprises: a similarity calculation unit, configured to calculate the similarity between the unified semantic focus vector and each local region feature in the image visual features of the input image and generate a semantic similarity response map; a prompt point selection unit, configured to select, in each iteration, the region with the highest semantic similarity from the current unsegmented region set and map it to pixel coordinates to serve as a prompt point; a segmentation and verification unit, configured to feed the prompt point into a segmentation model to obtain a candidate segmentation mask and to pool the image visual features of the region covered by the candidate segmentation mask to obtain the region features of the candidate segmentation mask; and a region updating unit, configured to fuse an accepted candidate segmentation mask into the final segmentation mask and remove the region covered by it from the current unsegmented region set.
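
Claims 2 and 8 obtain the structural semantic point set by clustering the visual features of all pixels inside a foreground category mask and keeping the cluster centres. Below is a minimal sketch of that step, assuming k-means as the clustering algorithm and a fixed number of centres; the claims only require a cluster analysis, so these choices are illustrative.

    import numpy as np
    from sklearn.cluster import KMeans

    def structural_semantic_points(image_features: np.ndarray,
                                   foreground_mask: np.ndarray,
                                   num_points: int = 8) -> np.ndarray:
        """image_features: (H, W, C) feature map; foreground_mask: (H, W) binary mask."""
        in_mask = image_features[foreground_mask.astype(bool)]   # (N, C) features of in-mask pixels
        k = min(num_points, len(in_mask))                        # guard against very small masks
        return KMeans(n_clusters=k, n_init=10).fit(in_mask).cluster_centers_   # (k, C) point set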
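
The structural constraints of claims 3 and 9 (intra-class compactness, inter-class separation, foreground-background suppression) can be instantiated in many ways; the claims do not fix a distance metric or a loss form. The sketch below is one assumed instantiation using cosine similarity and a margin in PyTorch.

    import torch
    import torch.nn.functional as F

    def _cos(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        """Cosine similarity between one vector a of shape (C,) and a batch b of shape (N, C)."""
        return F.normalize(b, dim=-1) @ F.normalize(a, dim=-1)   # (N,)

    def structural_constraint_loss(focus_vec, own_points, other_focus, background_repr, margin=0.2):
        """focus_vec: (C,) visual semantic focus vector of one category;
        own_points: (K, C) its structural semantic point set;
        other_focus: (M, C) focus vectors of the other categories;
        background_repr: (B, C) background semantic representations."""
        intra = (1 - _cos(focus_vec, own_points)).mean()                  # pull in: intra-class compactness
        inter = F.relu(_cos(focus_vec, other_focus) - margin).mean()      # push apart: inter-class separation
        fg_bg = F.relu(_cos(focus_vec, background_repr) - margin).mean()  # push apart: foreground-background
        return intra + inter + fg_bg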
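
The iterative automatic segmentation of claims 1 (steps S32-S34), 5, 6, and 10 can be summarised as the loop sketched below. Here segment_at_point stands in for any promptable segmentation model (for example a SAM-style model) and is a hypothetical callable; the threshold values, the cosine similarity metric, and the assumption that the feature map is aligned with the pixel grid are illustrative details not fixed by the claims.

    import numpy as np

    def region_feature(image_features, mask):
        """Average-pool the image visual features covered by a mask (claim 5)."""
        return image_features[mask.astype(bool)].mean(axis=0)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def iterative_segmentation(image_features, focus_vec, segment_at_point,
                               sim_thresh=0.5, consistency_thresh=0.7, max_rounds=20):
        """image_features: (H, W, C); segment_at_point((x, y)) -> (H, W) binary mask."""
        H, W, C = image_features.shape
        # S32: semantic similarity response map and initial candidate region set
        feats = image_features / (np.linalg.norm(image_features, axis=-1, keepdims=True) + 1e-8)
        response = feats @ (focus_vec / (np.linalg.norm(focus_vec) + 1e-8))   # (H, W)
        unsegmented = response >= sim_thresh
        final_mask = np.zeros((H, W), dtype=bool)
        accepted_feats = []
        for _ in range(max_rounds):                                 # S34: bounded number of rounds
            masked = np.where(unsegmented, response, -np.inf)
            if masked.max() < sim_thresh:                           # stop: no confident candidate left
                break
            y, x = np.unravel_index(masked.argmax(), masked.shape)  # S33: highest-similarity prompt point
            cand_mask = segment_at_point((x, y)).astype(bool)       # candidate mask from the segmenter
            cand_feat = region_feature(image_features, cand_mask)
            if accepted_feats:
                max_sim = max(cosine(cand_feat, f) for f in accepted_feats)
                if max_sim < consistency_thresh:                    # stop: consistency check failed (claim 6)
                    break
            final_mask |= cand_mask                                 # fuse accepted mask (claim 10)
            accepted_feats.append(cand_feat)
            unsegmented &= ~cand_mask                               # remove covered region from candidates
        return final_mask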

Description

Text-guided image segmentation method and system

Technical Field

The invention provides a text-guided image segmentation method and system, belonging to the technical field of computer vision and image segmentation.

Background

In recent years, large-scale foundation models have been widely used in the field of computer vision, with the DINO series, SAM (Segment Anything Model), and its improved version SAM2 showing prominence in multi-scene image segmentation tasks. However, the segmentation process of these models still depends heavily on prompt signals such as points, boxes, or masks provided by users, and inconsistent prompting strategies often lead to obvious fluctuations in segmentation quality. In application scenarios that process large numbers of images or run unattended, reliance on manual prompting is inefficient, time-consuming, and labor-intensive, which restricts the practical deployment value of these models. On the other hand, the performance of foundation models generally degrades in specialized fields such as medical imaging because of large differences in data distribution. To improve cross-domain adaptability, researchers have proposed parameter-efficient fine-tuning methods such as Adapters and low-rank adaptation (LoRA), which improve model performance by introducing domain prior knowledge. However, these methods typically rely on tens to hundreds of annotated images for optimization, so the annotation cost is high; they do not fundamentally remove the dependence on manual prompting, and the influence of prompt quality on the segmentation result remains significant.

To reduce the burden of manual prompting, some studies have attempted to automatically generate prompt points based on visual feature matching. For example, methods such as Matcher and PPO select the prompt location by computing local feature similarity between a reference image and the target image. However, local feature matching is susceptible to structural differences, texture variations, and visual noise, which makes the generated prompt points unstable. When the shape of the target object differs greatly from that of the reference image, the accuracy of automatic prompting drops markedly, and in specialized fields such as medical imaging the limited feature expression capability of the pre-trained model further reduces the reliability of automatic prompting. In addition, text-driven segmentation methods (e.g., Trident and Talk2DINO) attempt to achieve target segmentation from text descriptions using an image-text alignment mechanism. However, such methods generally depend on a complete category vocabulary or an accurate text description and have difficulty adapting to scenes with unknown categories or ambiguous instructions; some of them also need to first generate a coarse mask with models such as CLIP and then refine it with SAM, so the final segmentation quality is severely limited by the quality of the initial mask. Moreover, these methods rely on large-scale image-text data for pre-training, so migrating them to a new task is costly and it is difficult to adapt them quickly to practical application requirements.
In summary, the prior art has the limitations of strong dependence on manual prompts, weak cross-domain generalization, insufficient stability of automatic prompt generation, and oversensitivity to the quality of text descriptions. Achieving stable guidance toward the target structure from a text instruction alone, without relying on manual prompts, and obtaining robust and consistent segmentation in multi-domain scenarios therefore remains an important challenge for current image segmentation technology.

Disclosure of Invention

The invention provides a text-guided image segmentation method and system based on instruction-focus-prompt cooperation, intended to solve the technical problems of existing image segmentation methods, namely strong prompt dependence, weak cross-domain generalization, unstable automatic prompt generation, and excessive reliance on precise text descriptions. To solve these technical problems, the technical scheme adopted by the invention is a text-guided image segmentation method comprising the following steps: Step S1, processing a training image set and generating a corresponding training data unit for each foreground category in each training image, wherein the training data unit comprises image visual features, a foreground category mask, a background area mask, a structural semantic point set, and a text embedding vector; Step S2, training a dual-semantic focus alignment model based on the training data units, wherein the dual-semantic focus alignment model is configured to align the overall semantic encoding of the foreground category mask