
CN-120852767-B - Edge enhancement and bottleneck vector-based image segmentation method and system

CN120852767B

Abstract

The invention discloses a referring image segmentation method and system based on edge enhancement and bottleneck vectors, relating to the technical field of referring image segmentation. The method comprises: training a progressive cross-modal bottleneck fusion network on a plurality of sample images and a text description of each sample image; embedding the text description of a preset sample image through a text encoder in the progressive cross-modal bottleneck fusion network to obtain an embedded representation; during multi-stage visual extraction of the preset sample image by a visual encoder, introducing the embedded representation into the visual features of each visual extraction stage through a cross-modal attention fusion module; and, during multi-stage decoding of all visual features by a decoder, generating local edge-enhanced features at a plurality of decoding stages through a local edge refinement module. The trained progressive cross-modal bottleneck fusion network can then accurately segment the referred target in an image.

Inventors

  • ZHAO SHULEI
  • ZHENG XIAOYUN
  • CHEN JIANCUN
  • WANG DAFENG
  • ZHANG YANG
  • XIA QINQIN
  • LI BINQI
  • WANG BEN
  • ZHENG MING

Assignees

  • Wenzhou Electric Power Design Co., Ltd., Puhua Bidding Consulting Branch (温州电力设计有限公司普华招标咨询分公司)

Dates

Publication Date
2026-05-08
Application Date
2025-06-25

Claims (7)

  1. A referring image segmentation method based on edge enhancement and bottleneck vectors, comprising:
     training a progressive cross-modal bottleneck fusion network on a plurality of sample images and a text description of each sample image to obtain a trained progressive cross-modal bottleneck fusion network, wherein the progressive cross-modal bottleneck fusion network comprises a text encoder, a visual encoder, a cross-modal attention fusion module, a local edge refinement module and a decoder; the text description of a preset sample image is embedded through the text encoder to obtain an embedded representation; during multi-stage decoding of all visual features by the decoder, local edge-enhanced features of a plurality of decoding stages are generated through the local edge refinement module, the local edge-enhanced feature of each decoding stage is taken as the input of the next decoding stage until the output of the last decoder is obtained, and a referred-target segmentation map corresponding to the preset sample image is generated from the output of the last decoder, the preset sample image being any one of the sample images;
     inputting a target image and a text description of the target image into the trained progressive cross-modal bottleneck fusion network to obtain a referring image segmentation result corresponding to the target image;
     wherein the embedded representation comprises a vocabulary-level embedding, and the visual encoder comprises N sequentially arranged network layers; during multi-stage visual extraction of the preset sample image by the visual encoder, introducing the embedded representation into the visual features of each visual extraction stage through the cross-modal attention fusion module to obtain fused visual features of each visual extraction stage comprises: after the visual features of the 1st visual extraction stage are obtained through the 1st network layer, the 1st cross-modal attention fusion module fuses the vocabulary-level embedding with the visual features of the 1st visual extraction stage based on an attention mechanism and a bottleneck vector mechanism to obtain the 1st guide feature, extracts contextual information features from the 1st guide feature based on a gating mechanism, and fuses the contextual information features with the visual features of the 1st visual extraction stage to obtain the fused visual features of the 1st visual extraction stage; the fused visual features of the 1st visual extraction stage are taken as the input of the 2nd network layer, the visual features of the 2nd visual extraction stage are obtained through the 2nd network layer, and the 2nd cross-modal attention fusion module fuses the vocabulary-level embedding with the visual features of the 2nd visual extraction stage based on the attention mechanism and the bottleneck vector mechanism, and so on until the fused visual features of every visual extraction stage are obtained;
     the cross-modal attention fusion module specifically performs the steps of: performing cross-modal attention calculation between the visual features output by a visual encoder layer and the word embedding features output by the text encoder, taking the visual features as queries and the word embedding features as keys and values, to achieve preliminary cross-modal feature fusion; obtaining an initial bottleneck vector from the sentence-level embedding output by the first visual encoder layer and the sentence-level embedding output by the text encoder through pixel-level multiplication and average pooling; splicing the bottleneck vector of each visual encoding stage with the visual features and with the text features obtained by cross-attention calculation between the visual features and the word embeddings, respectively; performing feature transformation on each spliced feature through a Transformer layer to filter redundant information; and splitting the two features processed by the Transformer layer, applying pixel-level multiplication to the visual feature part for merging into the next visual encoding layer, and taking the pixel-level mean of the bottleneck vector part as the bottleneck vector of the next layer;
     the local edge refinement module performs a dual-branch convolution operation on the received visual features to generate two intermediate features, processes one intermediate feature and the received decoder output based on a residual spatial attention mechanism to obtain a first processing result, processes the other intermediate feature and the first processing result based on the residual spatial attention mechanism and a residual channel attention mechanism to obtain a second processing result, determines edge information from the first processing result and the second processing result, and adaptively fuses the edge information with the two intermediate features to generate a local edge-enhanced feature.
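The bottleneck-vector fusion steps recited in claim 1 can be sketched at the shape level. The following is a minimal NumPy illustration, not the patented implementation: the Transformer layer is stood in for by an identity, the gating step is omitted, and all function names, dimensions, and the single-head attention form are assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key, value):
    """Scaled dot-product attention: visual features (queries) attend to
    word embeddings (keys/values), as recited in the claim."""
    d = query.shape[-1]
    weights = softmax(query @ key.T / np.sqrt(d), axis=-1)
    return weights @ value

def bottleneck_fusion(visual, words, bottleneck):
    """One fusion stage: cross-attend, splice the bottleneck vector onto the
    attended text features, split back into a visual part (multiplied into
    the visual stream pixel-wise) and the next stage's bottleneck vector.
    visual: (HW, C) flattened feature map; words: (L, C); bottleneck: (C,)."""
    attended = cross_attention(visual, words, words)                    # (HW, C)
    tiled = bottleneck[None, :].repeat(attended.shape[0], 0)            # (HW, C)
    spliced = np.concatenate([tiled, attended], axis=-1)                # (HW, 2C)
    # a Transformer layer would filter redundancy here; identity stands in
    bottleneck_part, visual_part = np.split(spliced, 2, axis=-1)
    fused_visual = visual * visual_part          # pixel-level multiplication
    next_bottleneck = bottleneck_part.mean(axis=0)  # pixel-level mean
    return fused_visual, next_bottleneck
```

The sketch only demonstrates the data flow (attend, splice, transform, split, merge); channel sizes and the tiling of the bottleneck vector are illustrative choices.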
  2. The referring image segmentation method based on edge enhancement and bottleneck vectors according to claim 1, wherein the local edge-enhanced features of the plurality of decoding stages generated by the local edge refinement module comprise the local edge-enhanced features of every decoding stage or of part of the decoding stages; when they comprise the local edge-enhanced features of part of the decoding stages, generating the local edge-enhanced features of the plurality of decoding stages through the local edge refinement module and taking the local edge-enhanced feature of each decoding stage as the input of the next decoding stage until the output of the last decoder is obtained comprises: inputting the fused visual feature of the Nth visual extraction stage and the guide feature generated by the Nth cross-modal attention fusion module into the 1st decoder; inputting the output of the 1st decoder and the guide feature generated by the (N-1)th cross-modal attention fusion module into the 2nd decoder, and so on until the nth decoder is reached; inputting the output of the nth decoder and the visual feature of the (N-n)th visual extraction stage into the 1st local edge refinement module to generate the 1st local edge-enhanced feature; and inputting the 1st local edge-enhanced feature and the guide feature generated by the (N-n)th cross-modal attention fusion module into the (n+1)th decoder, and so on until the output of the Nth decoder is obtained, wherein the Nth decoder is the last decoder.
  3. The referring image segmentation method based on edge enhancement and bottleneck vectors according to any one of claims 1 to 2, further comprising: in the process of training the progressive cross-modal bottleneck fusion network, performing supervision with a cross-entropy loss.
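The cross-entropy supervision of claim 3 can be written compactly. The pixel-wise binary form below is a common choice for mask supervision and is an assumption here; the claim does not specify whether the binary or multi-class variant is used.

```python
import numpy as np

def segmentation_cross_entropy(pred, target, eps=1e-7):
    """Pixel-wise binary cross-entropy between a predicted mask of
    probabilities in [0, 1] and a 0/1 ground-truth referring mask,
    averaged over all pixels. `eps` clips predictions away from 0 and 1
    so the logarithms stay finite."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-(target * np.log(pred)
                   + (1.0 - target) * np.log(1.0 - pred)).mean())
```

A perfect prediction drives the loss toward zero, while a fully inverted mask yields a large loss, which is the gradient signal the training process relies on.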
  4. A referring image segmentation system based on edge enhancement and bottleneck vectors, characterized by comprising a model training module and a referring image segmentation module;
     the model training module is configured to train a progressive cross-modal bottleneck fusion network on a plurality of sample images and a text description of each sample image to obtain a trained progressive cross-modal bottleneck fusion network, wherein the progressive cross-modal bottleneck fusion network comprises a text encoder, a visual encoder, a cross-modal attention fusion module, a local edge refinement module and a decoder; the text description of a preset sample image is embedded through the text encoder to obtain an embedded representation; during multi-stage decoding of all visual features by the decoder, local edge-enhanced features of a plurality of decoding stages are generated through the local edge refinement module, the local edge-enhanced feature of each decoding stage is taken as the input of the next decoding stage until the output of the last decoder is obtained, and a referred-target segmentation map corresponding to the preset sample image is generated from the output of the last decoder, the preset sample image being any one of the sample images;
     the referring image segmentation module is configured to input a target image and a text description of the target image into the trained progressive cross-modal bottleneck fusion network to obtain a referring image segmentation result corresponding to the target image;
     when the visual features of the 1st visual extraction stage are obtained through the 1st network layer, the 1st cross-modal attention fusion module fuses the vocabulary-level embedding with the visual features of the 1st visual extraction stage based on an attention mechanism and a bottleneck vector mechanism to obtain the 1st guide feature; contextual information features are extracted from the 1st guide feature based on a gating mechanism and fused with the visual features of the 1st visual extraction stage to obtain the fused visual features of the 1st visual extraction stage; the fused visual features of the 1st visual extraction stage are taken as the input of the 2nd network layer, and the 2nd cross-modal attention fusion module fuses the vocabulary-level embedding with the visual features of the 2nd visual extraction stage based on the attention mechanism and the bottleneck vector mechanism, and so on until the fused visual features of every visual extraction stage are obtained;
     the cross-modal attention fusion module specifically performs the steps of: performing cross-modal attention calculation between the visual features output by a visual encoder layer and the word embedding features output by the text encoder, taking the visual features as queries and the word embedding features as keys and values, to achieve preliminary cross-modal feature fusion; obtaining an initial bottleneck vector from the sentence-level embedding output by the first visual encoder layer and the sentence-level embedding output by the text encoder through pixel-level multiplication and average pooling; splicing the bottleneck vector of each visual encoding stage with the visual features and with the text features obtained by cross-attention calculation between the visual features and the word embeddings, respectively; performing feature transformation on each spliced feature through a Transformer layer to filter redundant information; and splitting the two features processed by the Transformer layer, applying pixel-level multiplication to the visual feature part for merging into the next visual encoding layer, and taking the pixel-level mean of the bottleneck vector part as the bottleneck vector of the next layer;
     the local edge refinement module performs a dual-branch convolution operation on the received visual features to generate two intermediate features, processes one intermediate feature and the received decoder output based on a residual spatial attention mechanism to obtain a first processing result, processes the other intermediate feature and the first processing result based on the residual spatial attention mechanism and a residual channel attention mechanism to obtain a second processing result, determines edge information from the first processing result and the second processing result, and adaptively fuses the edge information with the two intermediate features to generate a local edge-enhanced feature.
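The local edge refinement module recited above can also be sketched at the shape level. In this NumPy illustration the dual convolution branches are stood in for by identity copies, and the particular attention formulas, the difference-based edge estimate, and the sigmoid fusion gate are all illustrative assumptions; the claims specify only the module's data flow, not these operators.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def residual_spatial_attention(feat, guide):
    """Spatial attention map derived from `guide` (channel-averaged),
    applied to `feat` with a residual connection. Shapes: (C, H, W)."""
    attn = sigmoid(guide.mean(axis=0, keepdims=True))       # (1, H, W)
    return feat + feat * attn

def residual_channel_attention(feat):
    """Channel attention from global average pooling, applied with a
    residual connection. Shape: (C, H, W)."""
    attn = sigmoid(feat.mean(axis=(1, 2), keepdims=True))   # (C, 1, 1)
    return feat + feat * attn

def local_edge_refinement(visual, decoder_out):
    """Data flow of the claimed module: two branches -> first result from
    residual spatial attention with the decoder output -> second result from
    spatial then channel attention -> edge information from both results ->
    adaptive fusion back into the intermediates."""
    branch_a, branch_b = visual.copy(), visual.copy()        # conv stand-ins
    first = residual_spatial_attention(branch_a, decoder_out)
    second = residual_channel_attention(
        residual_spatial_attention(branch_b, first))
    edge = second - first                    # assumed edge-information estimate
    gate = sigmoid(edge)                     # assumed adaptive fusion weight
    return gate * edge + (1.0 - gate) * (branch_a + branch_b) / 2.0
```

Only the wiring (which inputs feed which attention step, and where the results merge) is taken from the claim text; every arithmetic choice is a placeholder.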
  5. The referring image segmentation system according to claim 4, wherein the local edge-enhanced features of the plurality of decoding stages generated by the local edge refinement module comprise the local edge-enhanced features of every decoding stage or of part of the decoding stages; when they comprise the local edge-enhanced features of part of the decoding stages, the fused visual feature of the Nth visual extraction stage and the guide feature generated by the Nth cross-modal attention fusion module are input into the 1st decoder; the output of the 1st decoder and the guide feature generated by the (N-1)th cross-modal attention fusion module are input into the 2nd decoder, and so on until the nth decoder is reached; the output of the nth decoder and the visual feature of the (N-n)th visual extraction stage are input into the 1st local edge refinement module to generate the 1st local edge-enhanced feature; and the 1st local edge-enhanced feature and the guide feature generated by the (N-n)th cross-modal attention fusion module are input into the (n+1)th decoder, and so on until the output of the Nth decoder, which is the last decoder, is obtained.
  6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the referring image segmentation method based on edge enhancement and bottleneck vectors as claimed in any one of claims 1 to 3.
  7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the referring image segmentation method based on edge enhancement and bottleneck vectors as claimed in any one of claims 1 to 3.
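The top-down decoder wiring described in claims 2 and 5 can be abstracted as a loop over callables. In the sketch below the decoders and refinement modules are opaque functions, indexing is 0-based, and the parameter `n` marks the first decoder (0-based) whose input is refined; these conventions are assumptions for illustration, not part of the claims.

```python
def progressive_decode(fused_top, guide_feats, visual_feats,
                       decoders, refiners, n):
    """Top-down decoding: decoder i receives the previous stage's output plus
    the guide feature of the matching encoder stage (counted from the deepest
    stage down); from decoder index n onward, the previous output is first
    refined together with the matching encoder-stage visual feature."""
    N = len(decoders)
    x = decoders[0](fused_top, guide_feats[N - 1])   # deepest stage first
    for i in range(1, N):
        if i >= n:                                   # refinement kicks in here
            x = refiners[i - n](x, visual_feats[N - 1 - i])
        x = decoders[i](x, guide_feats[N - 1 - i])
    return x                                         # output of the last decoder
```

With toy scalar "features" and additive stand-in modules, the loop visits guide features from deepest to shallowest and splices in refinement only for the later stages, which is exactly the staging the claims recite.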

Description

Edge enhancement and bottleneck vector-based image segmentation method and system

Technical Field

The invention relates to the technical field of referring image segmentation, and in particular to a referring image segmentation method and system based on edge enhancement and bottleneck vectors.

Background

In recent years, with breakthrough progress in deep learning and intelligent perception technologies, research on image refinement and analysis has become a hot topic, with steadily rising practical demand in fields such as medical image diagnosis, autonomous driving, industrial quality inspection and remote-sensing land-cover analysis. Referring image segmentation (Referring Image Segmentation, RIS) is one of the key tasks in the field of computer vision. Its core objective is to achieve pixel-level region segmentation and accurate region classification by integrating multi-modal features, contextual semantics and boundary-sensitive object information. The technique must effectively distinguish foreground objects from complex backgrounds while preserving the original spatial structure of the image, and must locate and segment entities with specific semantics (such as biological organs, road obstacles, industrial defects or land covers) from complex scene images with pixel-level precision. The segmentation results contain both the specific contour morphology of each object and rich semantic hierarchy information. Referring image segmentation therefore serves as a key visual processing module, providing important structured data support for subsequent tasks such as target recognition, three-dimensional reconstruction, scene semantic understanding and intelligent decision-making.
According to model architecture, existing referring image segmentation techniques fall mainly into two categories. 1) Two-stage segmentation first locates a bounding box of the referred target according to the text description, and then further segments the target within the box. However, this approach is cumbersome to implement, and segmentation performance depends heavily on whether the bounding box accurately frames the referred object. 2) End-to-end segmentation learns features directly from the raw input and outputs a segmentation mask without a hand-designed feature extractor, which effectively improves efficiency. Early methods adopted a CNN-LSTM architecture, using a convolutional neural network (Convolutional Neural Network, CNN) and a Long Short-Term Memory network (LSTM) to extract the features of the two modalities separately, aligning the resulting feature vectors through a multi-modal feature alignment module to achieve semantic consistency, and then predicting the final segmentation result from the aligned features with an existing semantic segmentation network. However, the limitations of CNN and LSTM make it difficult for this architecture to handle long-range interactions between semantic entities within a modality and to obtain accurate segmentation results. Moreover, the multi-modal feature fusion strategy in an end-to-end model is critical to accuracy: traditional schemes such as concatenation and dot product applied only to the deepest multi-modal features cannot finely segment all targets when the referring description involves multi-scale targets and complex boundaries.
Attention-based network frameworks can selectively fuse effective information across modalities to improve segmentation, but current public data sets lack the corresponding attention annotations, so correct attention weights cannot be learned. Transformer-based fusion can realize long-range interaction between semantic entities within a modality, but cannot flexibly handle a large number of complex segmentation scenes and easily introduces a large amount of irrelevant information, leading to suboptimal performance. The current technology therefore struggles to achieve an ideal segmentation effect.

Disclosure of the Invention

The invention aims to overcome the defects of the prior art, and provides a referring image segmentation method and system based on edge enhancement and bottleneck vectors. In a first aspect, the invention provides a referring image segmentation method based on edge enhancement and bottleneck vectors, the specific technical scheme of which is as follows: training a progressive cross-modal bottleneck fusion network on a plurality of sample images and a text description of each sample image to obtain a trained progressive cross-modal bottleneck fusion network