CN-121982508-A - Underwater target detection method based on differential attention

Abstract

The invention discloses an underwater target detection method based on differential attention. The method comprises: acquiring an underwater image to be detected; constructing, based on an underwater image dataset, a lightweight underwater target detection network model based on a differential attention mechanism; constructing a joint loss function and training the lightweight underwater target detection network model end to end to obtain a trained lightweight underwater target detection network model; performing mathematically equivalent fusion on the lightweight underwater target detection network model to generate a lightweight inference model; and inputting the underwater image to be detected into the lightweight inference model to detect targets in the image. By introducing a differential attention feature interaction mechanism, the invention uses differential computation to strip common-mode noise and, combined with a feature pyramid network that integrates dynamic scale perception and semantic gating, significantly improves detection accuracy and real-time performance for blurred underwater scenes and tiny targets.

Inventors

YUAN GUOLIANG
LI JUNCHI
YAO JUNHAO
CHEN HONGMING
FU XIANPING
WANG JIE

Assignees

Dalian Maritime University (大连海事大学)

Dates

Publication Date: 20260505
Application Date: 20260206

Claims (7)

1. An underwater target detection method based on differential attention, characterized by comprising the following steps: acquiring an underwater image to be detected; constructing, based on an underwater image dataset, a lightweight underwater target detection network model based on a differential attention mechanism; constructing a joint loss function and performing end-to-end training on the lightweight underwater target detection network model to obtain a trained lightweight underwater target detection network model; performing mathematically equivalent fusion on the lightweight underwater target detection network model to generate a lightweight inference model; and inputting the acquired underwater image to be detected into the generated lightweight inference model to realize target detection on the underwater image.
2. The underwater target detection method based on differential attention according to claim 1, wherein the lightweight underwater target detection network model comprises: a backbone network, configured to extract multi-scale feature maps from the original underwater image to be detected, adopting an HGNetv architecture improved by structural re-parameterization, and to output three feature maps of different scales, namely a first feature map, a second feature map and a third feature map; an encoder, configured to receive the first feature map, the second feature map and the third feature map output by the backbone network, to perform feature enhancement and multi-scale feature fusion on the three feature maps of different scales, to establish cross-level context aggregation and semantic alignment among features of different scales while suppressing underwater environmental noise, and finally to output enhanced multi-scale fused features; and a decoder, configured to receive the fused multi-scale features from the encoder and to output the category and bounding-box position of each underwater target through a prediction head.
3. The underwater target detection method based on differential attention according to claim 1, wherein the encoder comprises: an anti-noise feature encoding network, configured to receive the first feature map output by the backbone network, to separate signal from noise in the features using a differential attention mechanism, and to output a denoised first feature map; and an AD-FPN module, configured to receive the denoised first feature map from the anti-noise feature encoding network and the denoised second feature map and denoised third feature map transmitted by the backbone network, to construct a feature pyramid based on an aggregation-diffusion framework, to perform two rounds of aggregation and diffusion on the input multi-scale features, repeatedly extracting global context information and distributing it to the branches of different scales so as to achieve cross-scale semantic alignment and feature enhancement, to dynamically select effective information during feature fusion using a semantically guided gating mechanism, suppressing the accumulation of background noise in cross-level fusion, and finally to output fused features with scale robustness and a high signal-to-noise ratio.
4. The underwater target detection method based on differential attention according to claim 1, wherein the backbone network comprises a feature module and a structural re-parameterization convolution module; the feature module comprises a Stem module, a first HG-Stage module, a second HG-Stage module, a third HG-Stage module and a fourth HG-Stage module connected in sequence; the first HG-Stage module comprises an HG-Block feature extraction unit; the second HG-Stage module and the fourth HG-Stage module each comprise a downsampling unit and an HG-Block feature extraction unit connected in sequence; the third HG-Stage module comprises a downsampling unit and two HG-Block feature extraction units connected in sequence; the downsampling unit reduces the spatial resolution of the feature map, and the HG-Block feature extraction unit extracts and integrates deep semantic features; each HG-Block feature extraction unit consists of a plurality of stacked structural re-parameterization convolution modules; the structural re-parameterization convolution module is configured to adopt a different topology and computation logic in each stage, specifically: in the model training stage, a parallel multi-branch structure is constructed, comprising a 3×3 convolution layer, a 1×1 convolution layer and an identity mapping branch; in the model inference stage, using the linear additivity of the convolution operation, the identity mapping branch is first converted into a 1×1 convolution kernel, each 1×1 convolution kernel is then zero-padded into a 3×3 convolution kernel, and finally the kernel weight matrices and bias vectors of all branches are added element-wise to produce a single 3×3 convolution weight and bias; a single-path convolution structure is then built from this single 3×3 convolution weight and bias to process the input feature map and produce the output.
5. The underwater target detection method based on differential attention according to claim 1, wherein the anti-noise feature encoding network comprises a differential attention feature interaction module that introduces a channel-aware differential attention mechanism and uses differential computation to emulate the characteristics of a band-pass filter: by computing the difference between two groups of attention weights, low-frequency common-mode background noise in the feature map is automatically cancelled and the high-frequency differential signal at target edges is enhanced, so that encoded features with a high signal-to-noise ratio are output.
6. The underwater target detection method based on differential attention according to claim 1, wherein the AD-FPN module comprises a dilated selective aggregation module and a semantic gating fusion module; the AD-FPN module adopts a parallel three-branch structure based on an aggregation-diffusion framework, takes the branch of the second feature map as the aggregation center, extracts and integrates global context information using the dilated selective aggregation module, and diffuses the aggregated features to the branch of the first feature map and the branch of the third feature map through up-sampling and down-sampling operations; the dilated selective aggregation module is configured to build a dynamic multi-scale perception system and comprises a plurality of parallel processing branches, one of which is an identity mapping branch while the others are dilated convolution branches with different dilation rates, used to capture context information over receptive fields ranging from local to global; the semantic gating fusion module is configured to perform anti-noise feature alignment: it receives semantic features from a deep layer as guiding features and features to be processed from a shallow layer; it first applies a convolution and a Sigmoid activation to the semantic features to generate a spatial gating weight map with values in the range 0 to 1; it then multiplies the weight map element-wise with the features to be processed to suppress background noise; it next adds the result to the original features to be processed through a residual connection to produce denoised shallow features; and it finally concatenates the denoised shallow features with the deep semantic features along the channel dimension and outputs the fused features.
7. The underwater target detection method based on differential attention according to claim 5, wherein the anti-noise feature encoding network achieves accurate separation of signal and noise through the differential attention mechanism, implemented as follows: Step 1-1: input the first feature map extracted by the backbone network; after preliminary processing, the first feature map contains rich image information but also noise introduced by the underwater imaging environment; Step 1-2: logically divide the attention heads into two complementary subspaces H_1 and H_2, where H_1 preserves the target signal and H_2 acts as a noise estimator, so that back-scattering noise can be stripped by a subtraction operation; after spatial position encoding is introduced to enhance geometric perception, the differential attention feature interaction module computes the attention responses of the two groups separately and suppresses the common-mode background using the differential property; given the query Q, the key K and the value V, the attention output of H_1 is defined as O_s and the attention output of H_2 is defined as O_n [equations missing in source], where d_k is the dimension of the key and the softmax operation computes the attention scores so as to focus on relevant regions; a learnable channel attenuation vector λ is introduced for differential fusion, and the final differential attention output O_d is defined as [equation missing in source], where the product is the Hadamard product along the channel dimension and λ is the learnable channel attenuation vector that adjusts the noise suppression strength; the values of λ are constrained to the interval [0.05, 0.95] to keep gradients stable, and λ adaptively adjusts the suppression strength applied to the background noise in O_n according to the signal-to-noise ratios of different feature channels such as color and texture, thereby highlighting the high-frequency target details in O_s; Step 1-3: after clean target features are generated by the differential attention mechanism, the differential attention feature interaction (DAFI) module adopts a residual structure to preserve the feature reconstruction capability of the deep network: the output of the differential attention feature interaction module undergoes a residual connection and layer normalization and is then fed into a feed-forward neural network for nonlinear transformation; the computation flow of the whole DAFI module takes the following form: [equation missing in source].
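
For illustration only, the differential fusion of step 1-2 can be sketched in NumPy as follows. The sketch assumes that O_s and O_n are ordinary scaled dot-product attention outputs computed with separate query/key projections and a shared value projection, and that the fused output has the form O_d = O_s - λ ⊙ O_n with λ clipped to [0.05, 0.95]. The claim text does not reproduce these equations, so that form and all names used here (`attention`, `differential_attention`, `w_q1`, `lam_raw`, and so on) are assumptions, not the patented implementation. Position encoding and the residual, layer-normalization and feed-forward steps of step 1-3 are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d_k)) v."""
    d_k = k.shape[-1]
    return softmax(q @ k.T / np.sqrt(d_k), axis=-1) @ v


def differential_attention(x, w_q1, w_k1, w_q2, w_k2, w_v, lam_raw):
    """Assumed form O_d = O_s - lam * O_n (hypothetical helper).

    x:       (N, C) flattened feature-map tokens
    w_q*/k*: (C, d_k) projections for head groups H_1 and H_2
    w_v:     (C, C_out) shared value projection (an assumption)
    lam_raw: (C_out,) unconstrained per-channel attenuation vector
    """
    v = x @ w_v
    o_s = attention(x @ w_q1, x @ w_k1, v)  # H_1: signal branch
    o_n = attention(x @ w_q2, x @ w_k2, v)  # H_2: noise-estimator branch
    lam = np.clip(lam_raw, 0.05, 0.95)      # keep lambda inside [0.05, 0.95]
    return o_s - lam * o_n                  # per-channel (Hadamard) scaling
```

Under these assumptions, when both head groups produce the same response (a purely common-mode input), the output reduces to (1 - λ) times that response, so the shared component is attenuated channel by channel.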

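The branch fusion recited in claim 4 can be illustrated with a minimal NumPy sketch. It assumes stride-1 convolutions with equal input and output channel counts (so that the identity branch is valid) and no batch normalization in any branch, none of which the claim specifies. `conv2d`, `multi_branch_forward` and `fuse_branches` are hypothetical helper names introduced here.

```python
import numpy as np


def conv2d(x, w, b, pad):
    """Naive stride-1 convolution (cross-correlation).

    x: (C_in, H, W), w: (C_out, C_in, k, k), b: (C_out,)
    """
    c_out, _, k, _ = w.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h = xp.shape[1] - k + 1
    wd = xp.shape[2] - k + 1
    out = np.zeros((c_out, h, wd))
    for i in range(k):
        for j in range(k):
            patch = xp[:, i:i + h, j:j + wd]  # shifted input window
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], patch)
    return out + b[:, None, None]


def multi_branch_forward(x, w3, b3, w1, b1):
    """Training-time structure: 3x3 branch + 1x1 branch + identity branch."""
    return conv2d(x, w3, b3, pad=1) + conv2d(x, w1, b1, pad=0) + x


def fuse_branches(w3, b3, w1, b1):
    """Merge the three branches into one 3x3 kernel and one bias vector."""
    c_out, c_in = w3.shape[:2]
    assert c_out == c_in, "identity branch needs matching channel counts"
    # Identity mapping written as a 1x1 kernel (one-hot per channel).
    w_id = np.eye(c_out).reshape(c_out, c_in, 1, 1)
    # Zero-pad each 1x1 kernel to 3x3 so all kernels share one shape.
    pad = ((0, 0), (0, 0), (1, 1), (1, 1))
    w_fused = w3 + np.pad(w1, pad) + np.pad(w_id, pad)
    b_fused = b3 + b1  # the identity branch contributes no bias
    return w_fused, b_fused
```

Under these assumptions, `conv2d(x, *fuse_branches(w3, b3, w1, b1), pad=1)` gives the same output as `multi_branch_forward(x, w3, b3, w1, b1)` up to floating-point rounding, which is what makes the fusion mathematically equivalent.
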
Description

Underwater target detection method based on differential attention

Technical Field

The invention belongs to the technical field of target detection, and particularly relates to an underwater target detection method based on differential attention.

Background

Underwater target detection is a core sensing technology for building seabed observation networks and for enabling autonomous operation of underwater robots (AUV/ROV) and marine ecological monitoring. However, unlike the atmosphere on land, water is a complex, optically heterogeneous medium. As light propagates underwater, wavelength-selective absorption by water molecules gives images a severe blue-green color cast and reduced brightness, while backscattering of light by suspended particles produces a veiling (curtain) effect, so that images are blurred, contrast is extremely low, and high-frequency background noise is abundant. This severe "domain shift" means that generic detection models trained on land scenes face a major feature extraction barrier in underwater scenes. Existing mainstream target detection algorithms (such as the CNN-based YOLO series or standard Transformer architectures) have significant limitations when processing such degraded images. Traditional convolutional networks are limited by their local receptive fields and struggle to use global information to distinguish targets from noise against a turbid background. Transformer-based detectors have strong global context modeling capability, but in a highly scattering underwater environment global modeling has a negative effect: high-energy backscattering noise dominates the similarity computation, so attention spreads to the background and is hard to focus on the target signal, which leads to scattered attention and frequent false alarms.

Multi-scale fusion failure. Existing feature fusion networks (such as PANet and FPN) typically rely on static scale priors and simple linear addition or concatenation for cross-level interaction. In complex underwater scenes, however, this mechanism faces two challenges:

1. Extreme scale variation aggravated by refraction. Because of differences in shooting distance, target type and the complex underwater environment, targets in underwater images show extreme scale variation, and the refraction of light underwater distorts targets and further aggravates this variation. The fixed receptive fields of existing networks cannot adapt dynamically to such extreme multi-scale characteristics, so tiny or distorted targets are perceived poorly.
2. Cross-level noise accumulation and contamination. Shallow features of underwater images contain geometric details but are also heavily affected by high-energy backscattering noise. Existing coarse fusion easily causes shallow high-frequency noise to be wrongly amplified during cross-scale transmission and to contaminate deep semantic features, so that the feature signals of weak targets are drowned out as they pass between levels.

Furthermore, to cope with harsh imaging conditions, existing studies tend to improve accuracy by stacking deeper networks or introducing complex image-enhancement preprocessing modules. However, this strategy sharply increases model parameters and computational cost; it not only adds inference latency but also makes the model harder to deploy on the resource-constrained embedded computing platforms (such as the Jetson series) carried by underwater robots, and it cannot meet the strict requirements of underwater autonomous operation for high accuracy, low latency and low power consumption.

Therefore, developing a target detection method that resists underwater noise at the level of the underlying algorithmic mechanism, fuses multi-scale features efficiently and is lightweight is a key technical problem urgently awaiting solution in the field of ocean engineering. As ocean exploration and development deepen, underwater target detection has become a core supporting technology for key tasks such as seabed resource exploration, inspection of underwater engineering facilities and marine ecological monitoring. However, constrained by the particular physics of underwater optical imaging, light undergoes significant wavelength-selective absorption and non-uniform scattering as it travels through water, so the acquired raw underwater images inevitably suffer from severe color shift, contrast degradation and blurred detail. In turbid water in particular, the backscattering noise produced by suspended particles is aliased with the foreground target signal, which greatly reduces the signal-to-noise ratio of the image. In the face of such severely degraded visual input, the existing general object detection model often faces the double dilem