CN-121982321-A - Image segmentation method and device based on VLM and language interaction
Abstract
The invention provides an image segmentation method and device based on VLM and language interaction. The method comprises: performing semantic understanding on natural language instructions to generate a plurality of language query vectors containing the instruction intent; performing multi-scale feature extraction on the scene image corresponding to the natural language instructions to generate multi-scale initial feature maps; inputting the language query vectors and the initial feature maps into a cross-modal bridging module of an image segmentation model to obtain the semantic feature maps and aligned query vectors output after a cross-modal operation on the language query vectors and the initial feature maps; inputting the semantic feature maps and the aligned query vectors into a mask decoder of the image segmentation model to obtain the first segmentation mask logits of each language query vector, output on the basis of cross-attention inference over the aligned query vectors; and outputting instance-level or semantic pixel-level segmentation masks, enabling more accurate grasping or assembly.
Inventors
- LAI YUEHUI
- SUN JING
- WANG GUN
- TAN DONG
Assignees
- 广电运通集团股份有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20260130
Claims (10)
- 1. An image segmentation method based on VLM and language interaction, comprising: inputting natural language instructions into a semantic analysis module of an image segmentation model to obtain a plurality of language query vectors output by the semantic analysis module that contain the instruction intent; inputting a scene image corresponding to the natural language instruction into a multi-scale feature extraction module of the image segmentation model to obtain a multi-scale initial feature map output by the feature extraction module; inputting each language query vector and the initial feature map into a cross-modal bridging module of the image segmentation model to obtain feature-aligned query vectors and a plurality of semantic feature maps, wherein the aligned query vectors are output by the cross-modal bridging module after performing a cross-modal operation on the language query vectors and the initial feature map, and the semantic feature maps are in one-to-one correspondence with the aligned query vectors; and inputting the plurality of semantic feature maps and the corresponding aligned query vectors into a mask decoder of the image segmentation model to obtain first segmentation mask logits corresponding to each language query vector, output by the mask decoder based on cross-attention inference of each aligned query vector over its semantic feature map; wherein the image segmentation model is trained on natural language instruction samples, scene image samples and corresponding segmentation mask labels.
- 2. The image segmentation method based on VLM and language interaction according to claim 1, wherein the cross-modal bridging module comprises a cross-attention layer, a multi-layer perceptron, a modulation module and a fusion module; and inputting each language query vector and the initial feature map into the cross-modal bridging module of the image segmentation model to obtain the feature-aligned query vectors and the plurality of semantic feature maps comprises: inputting each language query vector and the initial feature map into the cross-attention layer, the cross-attention layer using each language query vector as the query and the initial feature map as the keys and values to perform a cross-attention operation, obtaining each feature-aligned query vector; inputting each aligned query vector into the multi-layer perceptron to obtain the FiLM parameters generated by the multi-layer perceptron from each aligned query vector; inputting each set of FiLM parameters and the initial feature map into the modulation module to obtain the modulated feature maps produced by the modulation module after modulating the initial feature map with each set of FiLM parameters; and inputting the modulated feature maps corresponding to the aligned query vectors into the fusion module, which up-samples the modulated feature maps to a uniform spatial resolution and performs weighted fusion to obtain the semantic feature maps.
- 3. The image segmentation method based on VLM and language interaction according to claim 2, wherein the cross-attention layer comprises a first cross-attention sub-layer and a second cross-attention sub-layer connected in series; and performing the cross-attention operation to obtain each feature-aligned query vector comprises: inputting each language query vector and the initial feature map into the first cross-attention sub-layer, which uses each language query vector as the query and the initial feature map as the keys and values to perform a cross-attention operation, obtaining each intermediate query vector; and inputting each intermediate query vector and the initial feature map into the second cross-attention sub-layer, which uses each intermediate query vector as the query and the initial feature map as the keys and values to perform a cross-attention operation, obtaining each feature-aligned query vector.
- 4. The image segmentation method based on VLM and language interaction according to claim 1, wherein the mask decoder comprises a shallow Transformer decoder group and a mask generation module; and inputting the semantic feature maps and the corresponding aligned query vectors into the mask decoder to obtain the first segmentation mask logits corresponding to each language query vector comprises: inputting each semantic feature map and the corresponding aligned query vector into the shallow Transformer decoder group, which performs a self-attention operation on each aligned query vector and a cross-attention operation between each aligned query vector and its corresponding semantic feature map, obtaining aligned query vectors with refined spatial responses; and inputting each spatially refined aligned query vector and its corresponding semantic feature map into the mask generation module, which performs a pixel-wise dot product between each refined aligned query vector and its semantic feature map to obtain the first segmentation mask logits corresponding to each language query vector.
- 5. The image segmentation method based on VLM and language interaction according to claim 4, wherein the mask decoder further comprises an open-vocabulary scoring module, and after obtaining the first segmentation mask logits corresponding to each language query vector, the method further comprises: computing the similarity between each spatially refined aligned query vector and the text vectors in an expandable text embedding library to obtain a similarity matching score, wherein the expandable text embedding library stores text vectors expressing target category names and attributes and their synonyms; computing geometric attribute information, in the scene image, of the candidate instances corresponding to each first segmentation mask logits; converting the orientation, quantity and relation constraints contained in the natural language instruction into geometric constraint rules; and selecting, among the candidates whose geometric attribute information satisfies the geometric constraint rules, the first segmentation mask logits with the highest similarity matching score to generate the segmentation mask.
- 6. The image segmentation method based on VLM and language interaction according to claim 1, further comprising, after obtaining the first segmentation mask logits corresponding to each language query vector output by the mask decoder based on cross-attention inference of each aligned query vector over the semantic feature map: inputting an editing query vector corresponding to an editing-type language instruction, the first segmentation mask logits and the semantic feature map into an interactive re-segmentation module of the image segmentation model to obtain second segmentation mask logits, produced after the interactive re-segmentation module converts the editing-type language instruction into a gated, incremental modification of the first segmentation mask logits.
- 7. The image segmentation method based on VLM and language interaction according to claim 6, wherein the interactive re-segmentation module comprises a shallow network, a shallow cross-attention head and a mask update module; and inputting the editing query vector corresponding to the editing-type language instruction, the first segmentation mask logits and the semantic feature map into the interactive re-segmentation module to obtain the second segmentation mask logits comprises: inputting the editing query vector corresponding to the editing-type language instruction, the first segmentation mask logits and the semantic feature map into the shallow network to obtain a gating map output by the shallow network at the same resolution as the scene image; inputting the editing query vectors and the semantic feature map into the shallow cross-attention head, which uses each editing query vector as the query and the semantic feature map as the keys and values to perform a cross-attention operation, obtaining a mask increment map; and inputting the gating map, the mask increment map and the first segmentation mask logits into the mask update module, which adds the element-wise product of the gating map and the mask increment map to the first segmentation mask logits to obtain the second segmentation mask logits.
- 8. The image segmentation method based on VLM and language interaction according to any one of claims 1 to 7, wherein the image segmentation model is trained as follows: inputting the natural language instruction samples and the scene image samples into the image segmentation model to obtain the segmentation prediction mask logits predicted by the image segmentation model; and converting the segmentation prediction mask logits into a segmentation probability mask, substituting the segmentation probability mask and the corresponding segmentation mask labels into a loss function, and completing training of the image segmentation model when the loss function converges.
- 9. An image segmentation apparatus based on VLM and language interaction, comprising: a language query vector generation unit, configured to input natural language instructions into a semantic analysis module of an image segmentation model to obtain a plurality of language query vectors output by the semantic analysis module that contain the instruction intent; an initial feature map generation unit, configured to input the scene image corresponding to the natural language instruction into a multi-scale feature extraction module of the image segmentation model to obtain a multi-scale initial feature map output by the feature extraction module; a cross-modal processing unit, configured to input each language query vector and the initial feature map into a cross-modal bridging module of the image segmentation model to obtain feature-aligned query vectors and a plurality of semantic feature maps, wherein the aligned query vectors are output after the cross-modal bridging module performs a cross-modal operation on the language query vectors and the initial feature map, and the semantic feature maps are in one-to-one correspondence with the aligned query vectors; and a segmentation mask output unit, configured to input the plurality of semantic feature maps and the corresponding aligned query vectors into a mask decoder of the image segmentation model to obtain the first segmentation mask logits corresponding to each language query vector, output by the mask decoder based on cross-attention inference of each aligned query vector over the semantic feature map; wherein the image segmentation model is trained on natural language instruction samples, scene image samples and corresponding segmentation mask labels.
- 10. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the image segmentation method based on VLM and language interaction according to any one of claims 1 to 6 when executing the computer program.
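The cross-modal bridging described in claims 2 and 3 can be sketched in a minimal NumPy form. This is an illustrative assumption of how the pieces fit together, not the patented implementation: the two serial cross-attention sub-layers align each language query vector to the visual features, a stand-in linear layer plays the role of the multi-layer perceptron that emits FiLM parameters, and each (gamma, beta) pair modulates the shared initial feature map into one semantic feature map per query. All shapes and names here are invented for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, feats):
    # queries: (Q, D) language query vectors; feats: (N, D) flattened feature map
    # used as both keys and values (single-head, no projections, for brevity).
    scores = queries @ feats.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ feats          # (Q, D) attended vectors

def film_modulate(feat_map, gamma, beta):
    # FiLM: per-channel affine modulation of a (D, H, W) feature map.
    return gamma[:, None, None] * feat_map + beta[:, None, None]

rng = np.random.default_rng(0)
D, H, W, Q = 16, 8, 8, 3                             # toy sizes (assumed)
feat_map = rng.standard_normal((D, H, W))            # multi-scale map stand-in
queries = rng.standard_normal((Q, D))                # language query vectors
feats_flat = feat_map.reshape(D, -1).T               # (H*W, D) keys/values

# Claim 3: two cross-attention sub-layers connected in series.
intermediate = cross_attention(queries, feats_flat)
aligned = cross_attention(intermediate, feats_flat)  # feature-aligned queries

# Claim 2: an MLP maps each aligned query to FiLM (gamma, beta); a single
# random linear layer stands in for the multi-layer perceptron here.
W_film = rng.standard_normal((D, 2 * D)) * 0.1
gammas, betas = np.split(aligned @ W_film, 2, axis=-1)

# One modulated (semantic) feature map per query, in one-to-one correspondence.
semantic_maps = [film_modulate(feat_map, g, b) for g, b in zip(gammas, betas)]
```

A real system would add learned Q/K/V projections, multiple heads, and the fusion module's up-sampling and weighted merge across scales; the sketch only shows the query-conditioned modulation that gives each language query its own semantic feature map.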
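The mask generation of claim 4 and the gated update of claim 7 reduce to two small operations, sketched below under assumed toy shapes: first-pass logits come from a per-pixel dot product between a refined query vector and its semantic feature map, and an edit instruction then changes those logits only where a gating map opens, by adding the gate-weighted mask increment. Function names and sizes are illustrative, not from the patent.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mask_logits(query, semantic_map):
    # Claim 4: pixel-wise dot product between a refined query vector (D,)
    # and the semantic feature map (D, H, W) yields mask logits (H, W).
    return np.einsum('d,dhw->hw', query, semantic_map)

def interactive_update(first_logits, gate_map, delta_map):
    # Claim 7: second logits = first logits + gate * increment, so the edit
    # only alters pixels where the (sigmoid-squashed) gating map is open.
    return first_logits + sigmoid(gate_map) * delta_map

rng = np.random.default_rng(1)
D, H, W = 16, 8, 8
query = rng.standard_normal(D)                 # spatially refined aligned query
sem = rng.standard_normal((D, H, W))           # its semantic feature map

first = mask_logits(query, sem)                # (H, W) first segmentation logits
gate = np.full((H, W), -20.0)                  # gate ~ 0 everywhere: "no edit"
delta = rng.standard_normal((H, W))            # mask increment map
second = interactive_update(first, gate, delta)
```

With the gate driven to zero the second logits coincide with the first, which is the desired behavior when an editing instruction does not apply to a region.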
Description
Image segmentation method and device based on VLM and language interaction

Technical Field

The invention relates to the technical field of vision-language models, and in particular to an image segmentation method and device based on VLM and language interaction.

Background

With the development of multi-modal large models, Vision-Language Models (VLMs) are gradually becoming the core foundation for human-computer interaction and environmental understanding. VLMs represented by LLaVA and CLIP can align natural language and visual content in a unified semantic space, providing general semantic understanding capability for scenes such as quality inspection, medical image analysis, remote sensing interpretation, security monitoring, and service/industrial robots. Although existing open-source VLMs (such as LLaVA and CLIP-based variants) have progressed in language understanding and image-level semantic alignment and can understand and classify high-level instructions through question answering, they remain at the question-answering and coarse-grained localization level, or rely on a separate segmentation pipeline. Language understanding and pixel-level segmentation cannot be organically coupled within the same model to output the pixel-level masks required by fine robot operations (such as grasping and assembly); the capability to recognize and separate small targets, low-contrast regions and adhering targets is weak, giving insufficient support for precision operations such as grasping and assembly; moreover, language understanding cannot drive the segmenter in reverse to perform secondary segmentation or boundary refinement, making online interaction and closed-loop control difficult to achieve.
Disclosure of Invention

The invention provides an image segmentation method and device based on VLM and language interaction, to solve the prior-art problems that language understanding and pixel-level segmentation cannot be organically coupled within the same model to output a pixel-level mask of the region to be segmented, so that grasping and assembly precision is low. The invention provides an image segmentation method based on VLM and language interaction, which comprises the following steps: According to the image segmentation method based on VLM and language interaction provided by the invention, the cross-modal bridging module comprises a cross-attention layer, a multi-layer perceptron, a modulation module and a fusion module; and inputting each language query vector and the initial feature map into the cross-modal bridging module of the image segmentation model to obtain the feature-aligned query vectors and the plurality of semantic feature maps, wherein the aligned query vectors are output by the cross-modal bridging module after performing a cross-modal operation on the language query vectors and the initial feature map, and the semantic feature maps are in one-to-one correspondence with the aligned query vectors, comprises: inputting each language query vector and the initial feature map into the cross-attention layer, the cross-attention layer using each language query vector as the query and the initial feature map as the keys and values to perform a cross-attention operation, obtaining each feature-aligned query vector; inputting each aligned query vector into the multi-layer perceptron to obtain the FiLM parameters generated by the multi-layer perceptron from each aligned query vector; inputting each set of FiLM parameters and the initial feature map into the modulation module to obtain the modulated feature maps produced by the modulation module after modulating the initial feature map with each set of FiLM parameters; and inputting the modulated feature maps corresponding to the aligned query vectors into the fusion module, which up-samples the modulated feature maps to a uniform spatial resolution and performs weighted fusion to obtain the semantic feature maps. According to the image segmentation method based on VLM and language interaction provided by the invention, the cross-attention layer comprises a first cross-attention sub-layer and a second cross-attention sub-layer connected in series; and performing the cross-attention operation to obtain each feature-aligned query vector comprises: inputting each language query vector and the initial feature map into the first cross-attention sub-layer, which uses each language query vector as the query, uses the initial feature map as the keys and values, and executes