CN-122023188-A - Unpaired underwater image enhancement method based on structure-aware multi-modal fusion

CN 122023188 A

Abstract

The invention discloses an unpaired underwater image enhancement method based on structure-aware multi-modal fusion, belonging to the technical field of underwater image processing. The method constructs a generator network based on a U-shaped Transformer, trains the generator network on an unpaired data set under the joint constraint of a generative adversarial loss and a multi-scale Patch NCE contrast loss, concatenates the underwater RGB image to be processed with a spatial depth map and a structure prior map along the channel dimension to construct a multi-modal input tensor, and then feeds the multi-modal input tensor into the trained generator network, which outputs the enhanced underwater image through forward propagation. By introducing the depth map and the structure prior map, the invention strengthens the network's perception of geometric contours; combined with the Transformer's global modeling and unpaired contrastive learning, it achieves high-fidelity restoration of highly turbid underwater images without paired data, so the method can be widely applied in fields such as marine exploration and underwater monitoring.

Inventors

  • Kong Xiangfeng
  • Zhang Zhicheng
  • Liu Fengqing
  • Liu Yan
  • Ma Ran
  • Jiang Qinglin
  • Yi Yuanquan

Assignees

  • Qilu University of Technology (Shandong Academy of Sciences)
  • Institute of Oceanographic Instrumentation, Shandong Academy of Sciences

Dates

Publication Date
2026-05-12
Application Date
2026-04-14

Claims (10)

  1. An unpaired underwater image enhancement method based on structure-aware multi-modal fusion, characterized by comprising the following steps: S1, constructing a generator network based on a U-shaped Transformer, wherein the generator network comprises an encoder, a bottleneck layer and a decoder; the encoder receives a multi-modal input tensor, performs hierarchical encoding through a window-based multi-head self-attention mechanism and a locally-enhanced feed-forward network, and extracts multi-scale deep features; the decoder upsamples the deep features layer by layer and, via skip connections, concatenates them with same-scale high-frequency features from the encoder to reconstruct an enhanced underwater image; S2, training the generator network with a pre-constructed unpaired data set, with the optimization jointly constrained by a generative adversarial loss and a multi-scale Patch NCE contrast loss sampled at multi-level feature nodes, until the model converges, to obtain a trained generator network; S3, acquiring an underwater RGB image to be processed, retrieving the corresponding spatial depth map and structure prior map, and concatenating and fusing the three modalities (the RGB image, the spatial depth map and the structure prior map) along the channel dimension to construct a multi-modal input tensor; S4, feeding the multi-modal input tensor into the trained generator network and outputting the enhanced underwater image after forward propagation.
  2. The method for unpaired underwater image enhancement based on structure-aware multi-modal fusion according to claim 1, wherein after the encoder of the generator network completes feature extraction, spatial downsampling and channel expansion are performed with a stride-2 convolution, smoothly halving the spatial resolution of the feature map and doubling the number of channels to construct multi-scale hierarchical features.
  3. The method for unpaired underwater image enhancement based on structure-aware multi-modal fusion according to claim 1, wherein the window-based multi-head self-attention mechanism introduces a learnable relative position bias B when computing attention, and the attention is computed as: $\mathrm{Attention}(Q,K,V)=\mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}+B\right)V$; where Q, K and V are the query matrix, key matrix and value matrix respectively, Softmax is the normalized exponential function, $d_k$ is the channel dimension of a single attention head, and T denotes matrix transposition.
  4. The method for unpaired underwater image enhancement based on structure-aware multi-modal fusion according to claim 1, wherein the computation of the locally-enhanced feed-forward network is as follows: after channel expansion, the features are reshaped into a two-dimensional tensor, local edge details are captured with a 3×3 depth-wise separable convolution in the spatial domain, and the channel weights are then adaptively recalibrated by an efficient channel attention mechanism before output; the transformation expressions are: $h = zW_1$; $h_{loc} = \mathrm{DWConv}_{3\times 3}(\mathrm{Reshape2D}(h))$; $z' = \mathrm{Flatten}(\mathrm{ECA}(h_{loc}))\,W_2$; where h is the intermediate hidden-layer feature after the first-step linear mapping; z is the one-dimensional multi-modal feature sequence produced by the window-based multi-head self-attention mechanism; $W_1$ is the first-layer fully-connected weight matrix for channel expansion; $h_{loc}$ denotes the intermediate feature extracted by the local structure; $\mathrm{DWConv}_{3\times 3}$ is the 3×3 depth-wise separable convolution operation for capturing local structural details; $\mathrm{Reshape2D}$ denotes reshaping into a two-dimensional map; ECA denotes the efficient channel attention function used to adaptively recalibrate the fused features; $\mathrm{Flatten}$ denotes re-flattening the two-dimensional features into a one-dimensional sequence; and $W_2$ is the second-layer fully-connected weight matrix for channel reduction.
  5. The method for unpaired underwater image enhancement based on structure-aware multi-modal fusion according to claim 1, wherein an input projection module is further arranged before the encoder to perform cross-modal feature aggregation and high-dimensional mapping on the multi-modal input tensor, and the input projection module adopts a 3×3 bias-free two-dimensional convolution layer combined with a LeakyReLU activation function whose negative slope is set to 0.2.
  6. The method for unpaired underwater image enhancement based on structure-aware multi-modal fusion according to claim 1, wherein the layer-by-layer upsampling of the decoder is realized by bilinear interpolation, and the upsampled features are merged with same-scale features from the encoder to restore the spatial resolution of the image.
  7. The method for unpaired underwater image enhancement based on structure-aware multi-modal fusion according to claim 1, wherein the adversarial loss is constructed in the least-squares GAN form, with the generator loss function $\mathcal{L}_G$ and the discriminator loss function $\mathcal{L}_D$ defined respectively as: $\mathcal{L}_G = \mathbb{E}_{x\sim X}\left[(D(G(x))-1)^2\right]$; $\mathcal{L}_D = \mathbb{E}_{y\sim Y}\left[(D(y)-1)^2\right] + \mathbb{E}_{x\sim X}\left[D(G(x))^2\right]$; where $\mathbb{E}$ denotes the mathematical expectation, $y\sim Y$ denotes a reference sample y drawn from the real image domain Y, D is the confidence score output by the discriminator, $x\sim X$ denotes a source image x drawn from the turbid underwater image domain X, and $G(x)$ is the enhanced image output by the generator.
  8. The method for unpaired underwater image enhancement based on structure-aware multi-modal fusion according to claim 1, wherein the multi-scale Patch NCE contrast loss is computed as follows: (1) define the set of multi-scale features in the generator network that participate in the joint contrast space, $\mathcal{F}=\{F_0, F_1, F_2, F_3, F_4\}$, where $F_0$ is the initial feature output by the input projection module, $F_1$ is a shallow feature, $F_2$ is a middle-layer feature, $F_3$ is a deep feature, and $F_4$ is the feature output by the bottleneck layer; (2) for each feature map in the set, randomly sample a fixed number of discrete spatial coordinate points in the spatial dimension as patch anchors; define the feature vector extracted at an anchor from the enhanced image features output by the generator as the query sample z, the feature vector extracted from the source-domain turbid input features at the same spatial coordinate as the positive sample $z^{+}$, and the feature vectors extracted from the source-domain turbid input features at other random coordinates of the same feature map as the negative samples $z^{-}$; the multi-scale Patch NCE contrast loss is then computed from the query samples, positive samples, and negative samples.
  9. The method for unpaired underwater image enhancement based on structure-aware multi-modal fusion according to claim 8, wherein the multi-scale Patch NCE contrast loss is computed as: $\mathcal{L}_{NCE} = \sum_{m=1}^{M}\sum_{s} -\log\frac{\exp(z_s\cdot z_s^{+}/\tau)}{\exp(z_s\cdot z_s^{+}/\tau)+\sum_{n=1}^{N}\exp(z_s\cdot z_n^{-}/\tau)}$; where $\mathcal{L}_{NCE}$ is the multi-scale Patch NCE contrast loss; M is the total number of feature layers extracted in the joint contrast space; m is the feature-layer index in the joint contrast space; s is the current spatial position index; $\tau$ is the temperature hyper-parameter controlling the smoothness of the distribution; and N is the total number of negative samples.
  10. The method for unpaired underwater image enhancement based on structure-aware multi-modal fusion according to claim 9, wherein the total loss function $\mathcal{L}_{total}$ of the training process is defined as: $\mathcal{L}_{total} = \mathcal{L}_G + \lambda_{NCE}\mathcal{L}_{NCE} + \lambda_{idt}\mathcal{L}_{idt}$; where $\mathcal{L}_G$ is the generator loss; $\lambda_{NCE}$ and $\lambda_{idt}$ are the weights of $\mathcal{L}_{NCE}$ and $\mathcal{L}_{idt}$ respectively; and $\mathcal{L}_{idt}$ is the identity contrast loss computed on the target-domain image Y, whose computation is identical to that of $\mathcal{L}_{NCE}$ except that the forward-propagation input is replaced with the target-domain image Y.
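The channel-wise fusion of step S3 in claim 1 can be sketched in a few lines of NumPy. The resolution and the 5-channel layout (3 RGB + 1 depth + 1 structure-prior channel) are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

H, W = 64, 64
rgb = np.random.rand(H, W, 3).astype(np.float32)    # underwater RGB image
depth = np.random.rand(H, W, 1).astype(np.float32)  # spatial depth map
prior = np.random.rand(H, W, 1).astype(np.float32)  # structure prior map

# Concatenate the three modalities along the channel dimension
# to form the multi-modal input tensor of shape (H, W, 5).
multimodal = np.concatenate([rgb, depth, prior], axis=-1)
print(multimodal.shape)
```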
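For a single head over a single window, the attention of claim 3 reduces to the scaled dot-product form with an additive relative position bias. The NumPy sketch below assumes a 4×4 window and head dimension 8; these sizes are illustrative, not the patent's configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(Q, K, V, B):
    """Single-head attention over one window with a learnable relative
    position bias B: Attention(Q, K, V) = Softmax(Q K^T / sqrt(d) + B) V.
    Q, K, V: (n, d) token matrices for one window; B: (n, n) bias."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + B
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
n, d = 16, 8                      # 4x4 window, head dim 8 (illustrative)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
B = rng.standard_normal((n, n)) * 0.02
out = window_attention(Q, K, V, B)
print(out.shape)                  # (16, 8)
```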
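The locally-enhanced feed-forward computation of claim 4 (linear expansion, 2-D reshape, 3×3 depth-wise convolution, ECA gating, flatten, projection back) can be mocked up in plain NumPy. All weights are random, and the ECA channel-kernel size of 3 is an assumption, so this is a shape-level sketch rather than the patent's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dwconv3x3(x, k):
    """Depthwise 3x3 convolution with zero padding.
    x: (H, W, C); k: (3, 3, C), one kernel per channel."""
    H, W, C = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.einsum('ijc,ijc->c', xp[i:i+3, j:j+3], k)
    return out

def eca(x, w):
    """Efficient channel attention: global average pool, 1-D conv of
    size 3 across channels (circular pad), sigmoid gate, rescale."""
    g = x.mean(axis=(0, 1))                        # (C,)
    gp = np.concatenate([g[-1:], g, g[:1]])        # pad for kernel size 3
    a = sigmoid(np.convolve(gp, w, mode='valid'))  # (C,) channel weights
    return x * a

def leff(z, W1, k, w_eca, W2, H, Wd):
    """Locally-enhanced feed-forward sketch: linear expansion ->
    2-D reshape -> depthwise 3x3 conv -> ECA -> flatten -> projection."""
    h = z @ W1                                     # channel expansion
    h2 = h.reshape(H, Wd, -1)                      # reshape to 2-D map
    h2 = eca(dwconv3x3(h2, k), w_eca)              # local detail + gating
    return h2.reshape(z.shape[0], -1) @ W2         # flatten and project

rng = np.random.default_rng(0)
H, Wd, C, Ce = 8, 8, 16, 32
z = rng.standard_normal((H * Wd, C))
out = leff(z, rng.standard_normal((C, Ce)),
           rng.standard_normal((3, 3, Ce)),
           rng.standard_normal(3) * 0.1,
           rng.standard_normal((Ce, C)),
           H, Wd)
print(out.shape)                                   # (64, 16)
```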
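Claim 6's decoder step, bilinear upsampling followed by concatenation with same-scale encoder features over the skip connection, might look as follows. The align_corners-style sampling grid and the feature shapes are simplifying assumptions.

```python
import numpy as np

def upsample2x_bilinear(x):
    """Factor-2 bilinear upsampling of an (H, W, C) feature map
    (align_corners=True convention, a simplifying assumption)."""
    H, W, C = x.shape
    rows = np.linspace(0, H - 1, 2 * H)
    cols = np.linspace(0, W - 1, 2 * W)
    r0, c0 = np.floor(rows).astype(int), np.floor(cols).astype(int)
    r1, c1 = np.minimum(r0 + 1, H - 1), np.minimum(c0 + 1, W - 1)
    fr, fc = rows - r0, cols - c0
    top = x[r0][:, c0] * (1 - fc)[None, :, None] + x[r0][:, c1] * fc[None, :, None]
    bot = x[r1][:, c0] * (1 - fc)[None, :, None] + x[r1][:, c1] * fc[None, :, None]
    return top * (1 - fr)[:, None, None] + bot * fr[:, None, None]

# Decoder step: upsample deep features, then concatenate the same-scale
# encoder features via the skip connection (channel dimension).
deep = np.random.rand(16, 16, 64)
skip = np.random.rand(32, 32, 32)
up = upsample2x_bilinear(deep)            # (32, 32, 64)
fused = np.concatenate([up, skip], -1)    # (32, 32, 96)
print(fused.shape)
```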
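The least-squares adversarial objectives named in claim 7 are straightforward to express numerically; the discriminator scores below are made up purely for illustration.

```python
import numpy as np

def lsgan_g_loss(d_fake):
    """Least-squares generator loss: E[(D(G(x)) - 1)^2]."""
    return np.mean((d_fake - 1.0) ** 2)

def lsgan_d_loss(d_real, d_fake):
    """Least-squares discriminator loss:
    E[(D(y) - 1)^2] + E[D(G(x))^2]."""
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)

# Illustrative discriminator scores (not from a real model).
d_real = np.array([0.9, 0.8])
d_fake = np.array([0.1, 0.2])
print(lsgan_g_loss(d_fake))           # mean of 0.81 and 0.64 = 0.725
print(lsgan_d_loss(d_real, d_fake))   # 0.025 + 0.025 = 0.05
```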
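The multi-scale Patch NCE loss of claims 8 and 9 is an InfoNCE term accumulated over sampled patches of several feature layers. The sketch below assumes unit-normalized features, M = 2 layers, 4 anchors per layer, and 8 negatives per anchor; averaging rather than summing the patch terms is a normalization choice for the sketch, not stated in the patent.

```python
import numpy as np

def patch_nce_loss(q, pos, negs, tau=0.07):
    """InfoNCE loss for one sampled patch: the query q is pulled toward
    its positive pos and pushed from the N negatives (rows of negs).
    tau is the temperature hyper-parameter."""
    l_pos = np.exp(q @ pos / tau)
    l_neg = np.exp(negs @ q / tau).sum()
    return -np.log(l_pos / (l_pos + l_neg))

def multiscale_patch_nce(layers, tau=0.07):
    """Accumulate the per-patch InfoNCE over every sampled position s of
    every feature layer m in the joint contrast space. `layers` is a list
    of (queries, positives, negatives) triples, all unit-normalized."""
    total, count = 0.0, 0
    for qs, ps, ns in layers:
        for s in range(len(qs)):
            total += patch_nce_loss(qs[s], ps[s], ns[s], tau)
            count += 1
    return total / count

rng = np.random.default_rng(0)
def unit(v): return v / np.linalg.norm(v, axis=-1, keepdims=True)

layers = [(unit(rng.standard_normal((4, 16))),     # queries  z_s
           unit(rng.standard_normal((4, 16))),     # positives z_s^+
           unit(rng.standard_normal((4, 8, 16))))  # 8 negatives per patch
          for _ in range(2)]                       # M = 2 feature layers
loss = multiscale_patch_nce(layers)
print(loss > 0)                                    # loss is positive
```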

Description

Unpaired underwater image enhancement method based on structure-aware multi-modal fusion

Technical Field

The invention belongs to the technical field of underwater image enhancement, and particularly relates to an unpaired underwater image enhancement method based on structure-aware multi-modal fusion.

Background

Underwater imaging has important application value in fields such as marine resource exploration, environmental monitoring, and underwater archaeology. However, owing to the selective absorption of light by the water body and the strong scattering caused by suspended particles, underwater images commonly suffer from color distortion, low contrast, and blurred details. In highly turbid water in particular, severe scattering produces uneven haze and occlusion, greatly weakening object edges and geometric structure information and seriously constraining subsequent visual analysis and intelligent recognition tasks. To improve underwater image quality, the prior art falls mainly into two categories: restoration methods based on physical models and enhancement methods based on deep learning. Physical-model-based methods (e.g., the dark channel prior and underwater light propagation models) recover sharp images by inverting the underwater imaging process. Such methods rely on specific physical assumptions and parameter estimation; yet the real underwater environment is complex and changeable, and under high turbidity or dynamic scattering it is extremely difficult to estimate parameters such as transmittance and background light accurately, easily causing incomplete dehazing, color distortion, or edge artifacts.
In recent years, image enhancement methods based on deep learning, particularly convolutional neural networks (CNNs), have made remarkable progress in this field. Through end-to-end learning, a CNN can effectively fit the mapping between degraded underwater images and clear images. However, the inherently local receptive field of CNNs makes it difficult to model wide-range global scattering distributions, giving poor results on non-uniform haze. In addition, to alleviate the difficulty of acquiring paired training data, some studies have introduced unpaired image translation frameworks (e.g., CycleGAN). However, existing unpaired methods often lack explicit constraints on image structure information, and in extremely turbid scenes the geometric contours of objects are easily distorted or lost. Recently, the Transformer architecture has been introduced into underwater image enhancement owing to its strong long-range dependency modeling capability. However, conventional Transformer-based methods still rely mainly on single RGB input and ignore the implicit physical and geometric cues in turbid water (such as depth information and local edge structures). In a highly hazy medium, object edge structures tend to be severely masked by scattered light, and it is difficult to recover exact geometric contours from RGB features alone. In view of the foregoing, there is a need for an underwater image enhancement method that can integrate multi-modal geometric cues, possesses global scattering perception capability, and maintains structural consistency under unpaired data conditions, so as to overcome the incomplete dehazing and detail loss that the prior art suffers in strong-scattering and non-uniform haze scenes.
Disclosure of Invention

To solve the above technical problems, the invention provides an unpaired underwater image enhancement method based on structure-aware multi-modal fusion, so as to fully exploit scene geometric depth and edge structure cues without paired training data, overcome non-uniform haze interference in highly turbid, strongly scattering environments, and achieve high-fidelity restoration of underwater image colors and high-frequency geometric details. To achieve the above purpose, the technical scheme of the invention is as follows. An unpaired underwater image enhancement method based on structure-aware multi-modal fusion comprises the following steps: S1, constructing a generator network based on a U-shaped Transformer, wherein the generator network comprises an encoder, a bottleneck layer and a decoder; the encoder receives a multi-modal input tensor, performs hierarchical encoding through a window-based multi-head self-attention mechanism and a locally-enhanced feed-forward network, and extracts multi-scale deep features; the decoder upsamples the deep features layer by layer and, via skip connections, concatenates them with the same-scale high-f