CN-121982087-A - Self-supervision monocular depth estimation method and system for complex underwater environment
Abstract
The application belongs to the technical field of underwater visual perception and intelligent information processing, and particularly relates to a self-supervision monocular depth estimation method and system for a complex underwater environment, wherein the method comprises a data acquisition module, a depth estimation module, a pose estimation module and a joint loss optimization module; according to the application, by constructing the physical perception polarization multi-scale cooperative attention network PPMCAN, the depth prediction of the complex underwater scene is realized under the condition of not depending on the real depth label. The method comprises six main stages of data construction, network construction, feature enhancement, cross-scale fusion, self-supervision training and reasoning application.
Inventors
- DUAN LIYA
- WANG XINYUE
- Ren Zhikao
- GUO YING
Assignees
- 青岛科技大学
Dates
- Publication Date
- 20260505
- Application Date
- 20260409
Claims (10)
- 1. The self-supervision monocular depth estimation method for the complex underwater environment is characterized by comprising the following steps of: S1, acquiring or constructing monocular underwater image sequence data; S2, constructing a depth estimation main network, wherein the depth estimation main network adopts an encoder-decoder structure, and the encoder extracts multi-scale features layer by layer for an input image; S3, constructing a pose estimation network and completing view reconstruction; s4, constructing a total loss function and performing multi-scale training.
- 2. The self-supervision monocular depth estimation method for a complex underwater environment according to claim 1, wherein step S2 introduces WPPAM modules in the encoding feature extraction stage, and sets an input feature map as: The implementation mode is as follows: step one, introducing a spectrum attention mechanism, and inputting a characteristic diagram Performing global average pooling, and generating wavelength perception weight coefficient by using a multi-layer perceptron and a Sigmoid activation function : Step two, inputting the weighted characteristic diagram into three parallel convolution branches to respectively output pseudo Stokes parameters , And Wherein, the method comprises the steps of, Representing a pseudo-total light intensity component, And Characterizing potential linear polarization differences in the feature space; step three, calculating the pseudo polarization degree based on the pseudo Stokes parameters And pseudo polarization angle : For characterizing the intensity of the local scattering, For characterizing polarization direction prior; in the spatial modulation branch, pseudo-polarization degree is mapped Input convolution layer to generate space saliency map In the channel modulation branch, input characteristic diagram and Performing global average pooling after multiplying element by element to obtain a polarization perception global descriptor: ; Step four, generating the channel attention weight through two full connection layers The method is used for adaptively recalibrating the response of each channel; step five, splicing the spatial modulation characteristics and the channel modulation characteristics in the channel dimension, and fusing the spatial modulation characteristics and the channel modulation characteristics through convolution to obtain WPPAM module output : ; Wherein, the Representing an element-by-element multiplication, Representing stitching along the channel dimension.
- 3. The self-supervision monocular depth estimation method for a complex underwater environment according to claim 1, wherein step S2 introduces AMDGC modules in an intermediate feature enhancement stage for realizing multi-scale dynamic receptive field modeling, and specifically comprises the following steps: step one, uniformly dividing input features into three subgroups along a channel dimension: Each group of channels has a number of C/3, and expansion convolution with different expansion rates is applied to each group: ; Wherein the expansion ratio ; Step two, introducing a frequency sensitive gating branch, namely extracting local gradient intensity from input features by using a Laplacian operator, representing high-frequency structural information, and inputting the information into a lightweight weight predictor after splicing the information with original feature FFF to obtain pixel-level multi-scale weight coefficients: ; wherein, at any pixel position (i, j), the gating coefficient of the kth scale branch The method meets the following conditions: ; step three, weighting each scale branch output pixel by pixel according to the corresponding gating coefficient, and obtaining the fusion characteristic again : ; Step four, after the splicing is completed, fusion is carried out through convolution and batch normalization, and residual error connection is introduced to output final characteristics : ; F is an input feature map.
- 4. The self-supervision monocular depth estimation method for the complex underwater environment according to claim 1, wherein step S2 introduces CSCA a module in a cross-layer fusion stage of a decoder for performing cooperative interaction of deep semantics and shallow details, specifically: Step one, obtaining the deep layer characteristics Shallow features Spatial and channel alignment is carried out on deep features through bilinear interpolation and convolution to obtain Simultaneously downsampling the shallow feature convolution to obtain downsampled shallow features ; Step two, introducing a polarization degree diagram obtained by the WPPAM module as a spatial modulation priori to obtain deep fusion characteristics Collaborative modeling is carried out on the current layer characteristics and the aligned shallow layer characteristics in the same mode to obtain shallow layer fusion characteristics ; Finally, splicing the original features, the deep fusion features and the shallow fusion features along the channel dimension, and performing convolution fusion dimension reduction to obtain CSCA output features : ; Let the current decoding layer feature be The corresponding deep layer is characterized by The shallow layer is characterized by 。
- 5. The self-supervision monocular depth estimation method for a complex underwater environment according to claim 1, wherein step S3 constructs a pose estimation network and completes view reconstruction: s31, the current frame image is processed And adjacent frame images Inputting pose network, predicting relative pose transformation ; S32, combining current frame depth map predicted by depth network Depth map of adjacent frames And relative pose transformation between the current frame and the adjacent frame, reconstructing the adjacent frame image to the current frame view angle to generate a reconstructed image And re-projecting the predicted depth of the adjacent frame to the current frame view to generate a reference depth map The difference between the reconstructed image and the original current frame image is used for forming a self-supervision training error, and the reference depth map is used for constructing a subsequent depth consistency constraint.
- 6. The method of self-supervising monocular depth estimation of a complex underwater environment of claim 5, wherein step S4 the total loss function comprises enhancing luminosity loss: Loss of standard luminosity The definition is as follows: ; Wherein, the Is a weight coefficient; to introduce physical constraints of rapid decay of underwater red light, define red channel decay loss : ; Wherein, the Is the red decay factor; Is global background light; for red channel information in sharp images, enhancing photometric losses The method comprises the following steps: ; Wherein, the Is a weight coefficient.
- 7. The method for estimating depth of self-supervising monocular in a complex underwater environment according to claim 6, wherein the step S4 total loss function further comprises edge-aware smoothing loss : ; Wherein, the Represents the normalized inverse depth of the mean value, The inverse depth map is represented and the depth map is displayed, Is used for the average value of (a), And Representing spatial gradients in the horizontal and vertical directions, respectively.
- 8. The method for self-supervising monocular depth estimation of a complex underwater environment of claim 7, wherein the total loss function further comprises a pose consistency loss : ; Wherein, the 、 The resulting rotation matrix and translation vector are estimated for the underwater image sequence, 、 The resulting rotation matrix and translation vector are estimated for the original image sequence.
- 9. The method of claim 8, wherein to characterize depth prediction uncertainty, the network outputs a logarithmic variance: ; screening reliable regions in combination with geometry consistency masks and low texture exclusion masks Defining joint uncertainty depth correction loss The method comprises the following steps: ; Wherein: ; Total loss function The writing is as follows: ; Wherein, the Are all the weight parameters of the weight-based material, Representing a predicted depth map of a current frame at a pixel location A depth value at which the depth value is to be determined, Representing reference depth values obtained by prediction depth re-projection of adjacent frames and geometric consistency screening.
- 10. A self-supervision monocular depth estimation system facing a complex underwater environment, which is adapted to the self-supervision monocular depth estimation method facing the complex underwater environment according to any of claims 1 to 9, and is characterized by comprising a data acquisition module, a depth estimation module, a pose estimation module and a joint loss optimization module; the data acquisition module is used for acquiring monocular underwater image sequence data; the depth estimation module adopts an encoder-decoder type structure, a wavelength perception pseudo polarization attention module WPPAM and a self-adaptive multi-scale dynamic gating convolution module AMDGC are introduced in the encoding and feature enhancement stage, and a cross-scale cooperative attention module CSCA is introduced in the cross-layer feature fusion stage; the pose estimation module is used for estimating the relative motion relation between the current frame and the adjacent frame; And the output module is used for carrying out joint training on the model from multiple angles such as luminosity consistency, depth smoothness, pose consistency and uncertainty constraint and outputting a prediction result.
Description
Self-supervision monocular depth estimation method and system for complex underwater environment Technical Field The application belongs to the technical field of underwater visual perception and intelligent information processing, and particularly relates to a self-supervision monocular depth estimation method and system for a complex underwater environment. Background Depth estimation is a fundamental problem in computer vision, the goal of which is to recover the three-dimensional geometry of a scene from a two-dimensional image. In the tasks of autonomous underwater vehicle navigation, submarine topography mapping, underwater target detection, underwater engineering structure detection and the like, accurate depth information is an important foundation for realizing environment-aware path planning and structure analysis. The existing depth acquisition mode mainly comprises methods of multi-view stereoscopic vision, sonar, structured light and the like, but the methods generally have the defects of high hardware cost, complex installation and deployment, strong dependence on environment and the like, and are difficult to widely deploy in complex underwater engineering environments. In comparison, monocular depth estimation only depends on a single camera, and has the advantages of low hardware requirements, flexible layout, strong adaptability and the like, so that the monocular depth estimation becomes an important research direction of underwater visual perception. The existing self-supervision monocular depth estimation method is mostly built on the luminosity consistency constraint under the condition of a land scene, namely, the depth information can be learned without an explicit depth label through reconstructing an error training network by a view between adjacent frames. The method has good effect in a common natural scene, but the performance of the method is often obviously reduced when the method is applied to an underwater environment. The reason is that the underwater image is commonly affected by various degradation factors such as light absorption, scattering, color shift, low contrast, local overexposure, shadow shielding and the like, so that the photometric consistency assumption between adjacent frames is seriously destroyed, and the self-supervision reconstruction error is not reliable any more. Disclosure of Invention Based on the problems, the method can more accurately and stably predict the scene depth in a complex underwater engineering environment, and improve the boundary structure restoration capability, depth continuity and robustness of a model to a strong scattering region, and the technical scheme is as follows: A self-supervision monocular depth estimation method facing to a complex underwater environment comprises the following steps: S1, acquiring or constructing monocular underwater image sequence data; S2, constructing a depth estimation main network, wherein the depth estimation main network adopts an encoder-decoder structure, and the encoder extracts multi-scale features layer by layer for an input image; S3, constructing a pose estimation network and completing view reconstruction; s4, constructing a total loss function and performing multi-scale training. Preferably, in step S2, a WPPAM module is introduced in the encoding feature extraction stage, and an input feature map is set as follows: The implementation mode is as follows: step one, introducing a spectrum attention mechanism, and inputting a characteristic diagram Performing global average pooling, and generating wavelength perception weight coefficient by using a multi-layer perceptron and a Sigmoid activation function: (1); Step two, inputting the weighted feature map into three parallel convolution branches to estimate pseudo Stokes parameters: (2); Wherein, the Representing a pseudo-total light intensity component,AndCharacterizing potential linear polarization differences in the feature space; step three, calculating the pseudo polarization degree based on the pseudo Stokes parameters And pseudo polarization angle: (3); (4); Wherein, the To prevent the denominator from being zero,For characterizing the intensity of the local scattering,For characterizing polarization direction prior; in the spatial modulation branch, pseudo-polarization degree is mapped Input convolution layer to generate space saliency mapIn the channel modulation branch, input characteristic diagram andPerforming global average pooling after multiplying element by element to obtain a polarization perception global descriptor: (5); Step four, generating the channel attention weight through two full connection layers The method is used for adaptively recalibrating the response of each channel; step five, splicing the spatial modulation characteristics and the channel modulation characteristics in the channel dimension, and fusing the spatial modulation characteristics and the channel modulation characteristics through convolution to