
CN-121999211-A - Underwater organism segmentation method based on improved SegFormer with scene style generalization and dynamic spatial attention fusion

CN 121999211 A

Abstract

The invention belongs to the field of underwater biological image segmentation and provides an underwater organism segmentation method based on an improved SegFormer with scene style generalization and dynamic spatial attention fusion. The constructed model is based on SegFormer and incorporates a multi-branch dynamic spatial attention module, an improved style mixing module, and a detail-preserving context feature fusion module. Under the unfavorable conditions of dark, complex underwater environments, the proposed model markedly improves the segmentation precision of target boundaries and shows superior segmentation performance in complex scenes. The model effectively addresses the poor performance and rough segmentation boundaries of existing semantic segmentation models in underwater scenes, and provides important value for research on semantic segmentation of underwater target images.

Inventors

  • ZHOU ZHIYU
  • LI ZHENG

Assignees

  • Zhejiang Sci-Tech University (浙江理工大学)

Dates

Publication Date
2026-05-08
Application Date
2025-12-31

Claims (6)

  1. An underwater organism segmentation method based on an improved SegFormer with scene style generalization and dynamic spatial attention fusion, characterized by comprising the following steps:

     Step 1: acquire an underwater target image.

     Step 2: construct an underwater target feature extraction model based on the backbone of the SegFormer network to perform feature extraction. A multi-branch dynamic spatial attention module is inserted after the output of each Stage of the SegFormer Encoder (MiT), and an improved style mixing module is inserted into the first two layers of the Encoder. The multi-branch dynamic spatial attention module (MH-DDSA) comprises three parallel branches: a multi-head deformable attention branch, a spatial frequency-domain approximation branch, and a depthwise convolution enhancement branch, which respectively capture dynamic spatial attention, frequency-domain features, and local details. The module takes the features output by each Stage as its input feature map and adaptively fuses the outputs of the three parallel branches:

     The input feature map enters the multi-head deformable attention branch, where a convolution layer learns offsets to generate dynamic sampling points for each pixel, so that the model can adaptively focus on key spatial regions of the underwater image. Pyramid pooling then builds a multi-scale receptive field and aggregates global context information, from which a low-rank dynamic convolution kernel is generated. This kernel performs weighted aggregation of the features extracted at the dynamic offset sampling points to produce a spatial attention map, which is multiplied with the input feature map, dynamically emphasizing key features and suppressing noise.

     The spatial frequency-domain approximation branch applies a spatial-domain convolution approximation strategy to the input feature map: convolution kernels initialized to the Laplacian operator and the mean operator separate the high-frequency and low-frequency components directly in the spatial domain. The high-frequency features strengthen edges and textures, while the low-frequency features capture the overall structure and background. The high- and low-frequency features are concatenated and then fused by a 1×1 convolution, avoiding heavy computational overhead while compensating for the high-frequency detail lost to underwater scattering.

     The depthwise convolution enhancement branch applies lightweight depthwise separable convolution to the input feature map, using a 3×3 kernel to capture local spatial details efficiently, reducing the parameter count while enhancing the model's expression of underwater-organism surface texture.

     The outputs of the three branches are sent to a gating unit for adaptive fusion, yielding a fused feature map. The gating unit generates dynamic weights from the semantic content at each pixel position, so that each pixel autonomously selects its enhancement mode. The fused feature map then undergoes channel attention enhancement through an SE (Squeeze-and-Excitation) module: global average pooling compresses the spatial dimensions to produce channel weights, which are then recalibrated by a fully connected layer and an activation function, improving the model's robustness to the color distortion and illumination variation of underwater images.

     The improved style mixing module is based on the MixStyle module. It introduces a semantics-aware sample pairing strategy to ensure that the mixing operation is performed between features with similar semantics, and adopts a channel-level mixing strength λ, allowing each channel to adjust its mixing ratio independently. The module takes the feature maps of the first two Encoder (MiT) layers as input, first applies semantics-aware pairing to them, computing sample similarity within the batch, and selects the optimal pairing.

     Step 3: feature fusion. The All-MLP Decoder of the SegFormer network is replaced with detail-preserving context feature fusion modules. A cascaded detail-preserving decoding network is built from 3 such modules along a Top-Down path; each fusion level uses a detail-preserving context feature fusion module, exploiting high-level features as context guidance to dynamically screen effective textures in the low-level features and suppress noise. The detail-preserving context feature fusion module contains a Grouped Adaptive Combiner with a global semantic path and a local detail path. The global semantic path takes the high-level feature as input and generates a global weight vector through AdaptiveAvgPool(1) followed by an MLP, adjusting the importance of different channels. The local detail path takes the concatenation (Concat) of the high-level and low-level features as input, extracts edge information through a 3×3 convolution, BN, and ReLU, and generates a pixel-level Spatial Attention Map. The fusion weight A of the Grouped Adaptive Combiner is obtained by adding the two path outputs and applying a Sigmoid: A = Sigmoid(W_g + W_l), where W_g is the weight term of the global semantic path and W_l is the weight term of the local detail path. The output features of the Grouped Adaptive Combiner are then obtained based on the fusion weight A.

     Step 4: acquire the underwater organism picture to be segmented and input it into the underwater target segmentation model to obtain the underwater segmentation result.
  2. The underwater organism segmentation method based on the improved SegFormer with scene style generalization and dynamic spatial attention fusion as claimed in claim 1, wherein in step 2 the semantics-aware sample pairing strategy, which ensures that the mixing operation is performed between similar semantic features, and the channel-level mixing strength λ, which allows each channel to adjust its mixing ratio independently, specifically comprise: in the improved style mixing module, the random sample selection strategy of the MixStyle module is replaced as follows. For each sample, a semantic representation is first extracted by global average pooling (GAP): z_i = GAP(F_i). A cosine similarity matrix is calculated over the semantic representations of all samples within the batch: S_ij = (z_i · z_j) / (‖z_i‖ ‖z_j‖), where z_i and z_j represent the semantic representations of two different samples, and S_ij represents their cosine similarity. Then, for each sample, the most similar other sample is selected based on the cosine similarity matrix: j* = argmax_{j ≠ i} S_ij. The improved style mixing module generates a λ matrix of shape [C, 1, 1] according to the channel dimension C of the input feature map, used to control the mixing ratio of each channel independently.
  3. The underwater organism segmentation method based on the improved SegFormer with scene style generalization and dynamic spatial attention fusion as claimed in claim 1, wherein in step 3 the Top-Down path specifically runs from Stage 4 to Stage 3 to Stage 2 to Stage 1. For the current layer Stage N, the high-level features are the output features from the encoder's deeper level, the Stage N+1 layer, or from the higher-level detail-preserving context feature fusion module, and the low-level features are the output features from the encoder's shallower level, the Stage N layer.
  4. The underwater organism segmentation method based on the improved SegFormer with scene style generalization and dynamic spatial attention fusion as claimed in claim 1, wherein in step 1 the types of underwater targets comprise sea cucumbers, sea urchins, scallops, and starfish.
  5. The underwater organism segmentation method based on the improved SegFormer with scene style generalization and dynamic spatial attention fusion as claimed in claim 2, wherein step 1 further comprises resizing the underwater target image to 512 × 512 and, through normalization, mapping the underwater target image under a normal distribution function.
  6. The underwater organism segmentation method based on the improved SegFormer with scene style generalization and dynamic spatial attention fusion as set forth in claim 1, wherein obtaining the output features of the Grouped Adaptive Combiner based on the fusion weight A specifically comprises: the output features are computed from the fusion weight A applied to the two paths. For background noise regions containing suspended particles, the weight term of the local detail path is positive because it captures the texture of the suspended particles, while the weight term of the global semantic path is negative because the region is judged to be background lacking semantic support, so the weight A mapped through the Sigmoid function tends to 0. For an organism contour region, i.e., a target edge region, the weight term of the local detail path is positive because it captures significant gradient changes, and the weight term of the global semantic path is also positive because the region is judged to have an object semantic boundary, so the weight A mapped through the Sigmoid function tends to 1.
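The spatial frequency-domain approximation branch of claim 1 — separating high- and low-frequency components directly in the spatial domain with convolution kernels initialized to the Laplacian and mean operators — can be illustrated with a minimal numpy sketch. Function and variable names here are illustrative, not from the patent:

```python
import numpy as np

# Kernel initializations named in the claim: Laplacian (high-pass,
# edges/texture) and mean operator (low-pass, structure/background).
LAPLACIAN = np.array([[0.0,  1.0, 0.0],
                      [1.0, -4.0, 1.0],
                      [0.0,  1.0, 0.0]])
MEAN = np.full((3, 3), 1.0 / 9.0)

def conv3x3(img, kernel):
    """Same-size 3x3 convolution on a 2D array, plain numpy, edge padding."""
    p = np.pad(img, 1, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out

def split_frequencies(img):
    """Separate high- and low-frequency components in the spatial domain
    (no FFT), as the branch does with its two initialized conv kernels."""
    return conv3x3(img, LAPLACIAN), conv3x3(img, MEAN)
```

In the module itself the two components would then be concatenated and fused by a learned 1×1 convolution; this sketch only shows the separation step.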
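The semantics-aware pairing and channel-level mixing of claim 2 can be sketched as follows. This is a minimal numpy illustration that assumes the standard MixStyle form of mixing per-channel mean/std statistics with a Beta-sampled λ; the α parameter and all function names are assumptions, not specified by the patent:

```python
import numpy as np

def semantic_pairing(feats):
    """For each sample in the batch, pick the most similar *other* sample.

    feats: (B, C, H, W). GAP gives a (B, C) semantic representation; a
    cosine similarity matrix over the batch selects each sample's partner.
    """
    z = feats.mean(axis=(2, 3))                       # GAP -> (B, C)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit-normalize
    sim = z @ z.T                                     # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)                    # forbid self-pairing
    return sim.argmax(axis=1)                         # j* = argmax_{j != i} S_ij

def channelwise_mixstyle(x, partner_idx, rng, alpha=0.1):
    """Mix per-channel feature statistics with the paired partner, using an
    independent mixing weight per channel (lambda of shape (B, C, 1, 1))."""
    B, C, H, W = x.shape
    mu = x.mean(axis=(2, 3), keepdims=True)
    sig = x.std(axis=(2, 3), keepdims=True) + 1e-6
    lam = rng.beta(alpha, alpha, size=(B, C, 1, 1))   # channel-level strength
    mu_mix = lam * mu + (1 - lam) * mu[partner_idx]
    sig_mix = lam * sig + (1 - lam) * sig[partner_idx]
    return sig_mix * (x - mu) / sig + mu_mix
```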
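The Grouped Adaptive Combiner behavior described in claims 1 and 6 can be sketched as below. The patent's exact output expression is not reproduced in this text, so the sketch assumes a common form, output = A·F_low + (1−A)·F_high, which is consistent with the stated behavior (A → 1 keeps low-level detail at object contours, A → 0 falls back on high-level context over background noise such as suspended particles). All names are illustrative:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def grouped_adaptive_combine(high, low, w_global, w_local):
    """Fuse high-level (context) and low-level (detail) features.

    high, low: (B, C, H, W) feature maps at the same resolution.
    w_global:  (C,) weight term from the global semantic path
               (AdaptiveAvgPool(1) + MLP over `high`).
    w_local:   (B, 1, H, W) pixel-level spatial attention from the local
               detail path (3x3 conv + BN + ReLU over concat(high, low)).
    A = Sigmoid(w_global + w_local), broadcast to (B, C, H, W).
    """
    A = sigmoid(w_global[None, :, None, None] + w_local)
    # Assumed output form: keep low-level detail where A -> 1,
    # fall back on high-level context where A -> 0.
    return A * low + (1.0 - A) * high
```

With a strongly negative local term (background lacking detail support), A collapses toward 0 and the output is dominated by the high-level context feature, matching the noise-suppression behavior in claim 6.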

Description

Underwater organism segmentation method based on improved SegFormer with scene style generalization and dynamic spatial attention fusion

Technical Field

The invention belongs to the field of underwater biological image segmentation, and particularly relates to an underwater organism segmentation method based on an improved SegFormer with scene style generalization and dynamic spatial attention fusion.

Background

Underwater semantic segmentation is an important component of underwater visual perception and plays an irreplaceable role in tasks such as marine geological survey, marine ranch monitoring, and underwater robot navigation and operation. Compared with natural land images, the imaging process of underwater images is affected by a complex optical medium, and their visual appearance has significant domain characteristics. Light is strongly scattered and absorbed by the water body, causing brightness attenuation, color distortion, and contour blurring in distant regions. Meanwhile, large amounts of particles, plankton, and other matter suspended in seawater produce random speckles known as "marine snow", which markedly raise the image noise level. Together, these complex factors lead to unstable underwater image quality, uneven illumination distribution, and severe texture degradation, making it difficult for semantic segmentation models built for natural images to transfer directly while maintaining stable performance. To address the challenges posed by underwater imaging degradation, researchers have explored extensively around enhanced feature expression, multi-scale context aggregation, and robustness to noise. The lightweight structure CSPLITENET proposed by Shi et al. enhances multi-scale feature extraction, enabling the network to identify target boundaries more accurately in weak-texture regions.
Li et al. construct a multi-scale receptive field using dilated convolution, enhancing the network's perception of complex backgrounds through combinations of different dilation rates. Wang et al. design USNet, which improves underwater image quality through a multi-scale water-net module, alleviating illumination attenuation and blurring at the feature level. DINet, proposed by Ge, adopts a dual-iteration enhancement mechanism to explicitly model degradation factors, making the network better suited to real underwater environments. Meanwhile, the improved DeepLabV3+-based structure of Liu et al. effectively preserves the edge contours of marine targets by introducing depthwise separable convolution and an enhanced multi-scale fusion strategy. In complex underwater scenes, target contours are more severely disturbed by noise, so noise immunity and detail recovery have become important research topics. RUSNet, proposed by Zhang, relieves the influence of motion optical-flow noise on segmentation through adaptive filtering and improves robustness on fast-moving targets. In addition, noise in sonar or low-definition images is reduced through non-local means filtering and superpixel aggregation, improving feature consistency. The Transformer architecture has also been introduced into underwater semantic segmentation in recent years; for example, Pavithra et al. use Swin Transformer to construct a segmentation framework with stronger global modeling capability, showing better structural feature extraction in low-light scenes. Graph Neural Networks (GNNs) have also begun to be used for modeling correlations between underwater targets, with good results in scenes with complex structures or finely divided targets.
Although existing methods have made some progress in feature enhancement, multi-scale information fusion, and noise resistance, underwater images still suffer from uncontrollable illumination, weakened target boundaries, and missing texture information, so segmentation results often exhibit blurred edges and missed small targets. Therefore, constructing a model that can effectively cope with the blurring, color cast, and detail loss of underwater images, markedly improving the accuracy and robustness of underwater-scene semantic segmentation, is of great significance for advancing underwater semantic segmentation technology.

Disclosure of Invention

The invention discloses an image semantic segmentation network for underwater scenes, improved on the basis of the SegFormer architecture. The core of the network comprises a multi-branch dynamic spatial attention module, an improved style mixing module, and a detail-preserving context feature fusion module, wherein the multi-branch dynamic spatial attention module is used to enhance the model's ability to focus on key spatial regions and to extract texture features