CN-122023494-A - Monocular depth estimation method based on edge perception multi-scale attention
Abstract
The invention discloses a monocular depth estimation method based on edge-aware multi-scale attention, comprising the steps of: S1, acquiring historical image data collected by sensors, wherein the sensors comprise a stereo camera, an RGB-D camera, and a lidar; S2, training a monocular depth estimation model, EMCA-Depth, based on an edge-aware multi-scale attention mechanism on the historical image data; and S3, inputting the sensed data of a target to be measured into the trained model for depth estimation. The technical scheme of the invention markedly improves the accuracy of depth estimation.
Inventors
- LI CHUNQUAN
- TAO LING
- ZHAO QINGMIN
- LIU HONGHAO
- LIU YUNYU
- ZHANG DEXIN
- YU JUNZHI
- ZHANG ZHIJUN
- WU JUNYUN
- LI YACHAO
- CHEN LIMIN
Assignees
- Nanchang University (南昌大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-02-24
Claims (5)
- 1. A monocular depth estimation method based on edge-aware multi-scale attention, comprising: step S1, acquiring historical image data collected by a sensor, wherein the sensor comprises a stereo camera, an RGB-D camera, and a lidar; step S2, training a monocular depth estimation model based on an edge-aware multi-scale attention mechanism on the historical image data; and step S3, inputting the sensed data of a target to be measured into the trained monocular depth estimation model based on the edge-aware multi-scale attention mechanism to perform depth estimation.
- 2. The monocular depth estimation method based on edge-aware multi-scale attention of claim 1, wherein the monocular depth estimation model based on the edge-aware multi-scale attention mechanism comprises an edge-aware multi-scale convolutional attention decoder and a lightweight single-frame model; the decoder comprises a multi-scale convolutional attention module (MSCAM), a large-kernel group attention gate (LGAG), an efficient up-convolution block (EUCB), and a segmentation head (SH); and the lightweight single-frame model serves as prior knowledge that guides the multi-frame model and strengthens its perception of key features.
- 3. The monocular depth estimation method based on edge-aware multi-scale attention of claim 2, wherein the edge-aware multi-scale convolutional attention decoder employs a triplet loss function based on edge-aware semantic guidance to refine and constrain edge features (a hedged sketch of one such loss follows the claims).
- 4. The monocular depth estimation method based on edge-aware multi-scale attention of claim 3, wherein the stereo camera generates a dense depth map by synchronously acquiring left and right view image pairs with a fixed baseline and applying a stereo matching algorithm; the RGB-D camera directly and synchronously acquires color images and the corresponding pixel-level depth maps using structured-light or time-of-flight principles; and the lidar acquires sparse three-dimensional point cloud data of the scene by emitting laser pulses and receiving the reflected signals.
- 5. The monocular depth estimation method based on edge-aware multi-scale attention of claim 4, wherein the edge-aware multi-scale convolutional attention decoder hierarchically aggregates context information from multi-frame inputs and explicitly enhances edge feature expression, and wherein: the multi-scale convolutional attention module MSCAM integrates channel attention, large-kernel spatial attention, and multi-scale depthwise separable convolution to capture and sharpen edge and structural features under different receptive fields while enhancing cross-channel interaction, addressing the depth estimation challenges posed by multi-scale objects; the large-kernel group attention gate LGAG uses high-level semantic information as a gating signal to dynamically adjust and fuse skip-connection features from the encoder, selectively strengthening feature transfer related to depth edges, suppressing irrelevant background interference, and optimizing feature fusion during decoding; the efficient up-convolution block EUCB combines depthwise separable convolution with an upsampling operation, raising the spatial resolution of the feature map while preserving the integrity of depth boundaries and avoiding the boundary blurring introduced by upsampling; the segmentation head SH projects the multi-level, multi-scale features output by the decoder into a final full-resolution depth map; and the lightweight single-frame model acts as a parallel, efficient prior-knowledge extractor that receives the current single-frame image and rapidly encodes it into robust initial depth features and multi-scale geometric priors, which are injected into the multi-frame model to guide and strengthen its perception of key edge and structural features, improving the robustness and accuracy of its predictions in weak-texture regions and at complex boundaries, and yielding depth estimates with clear edges and consistent scales (an illustrative sketch of these modules is given after the claims).
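The claims name the decoder blocks but do not publish their internals. The following PyTorch sketch is one plausible reading of claims 2 and 5, not the patented design: every channel count, kernel size, and wiring choice (e.g. the 11x11 spatial-attention kernel and the squeeze-and-excitation-style channel attention) is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSCAM(nn.Module):
    """Multi-scale convolutional attention (sketch): channel attention,
    multi-scale depthwise-separable branches, large-kernel spatial attention."""
    def __init__(self, ch, scales=(3, 5, 7)):
        super().__init__()
        self.ca = nn.Sequential(                      # squeeze-and-excite style
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // 4, ch, 1), nn.Sigmoid())
        self.branches = nn.ModuleList(                # different receptive fields
            nn.Sequential(nn.Conv2d(ch, ch, k, padding=k // 2, groups=ch),
                          nn.Conv2d(ch, ch, 1))
            for k in scales)
        self.sa = nn.Sequential(                      # large-kernel spatial gate
            nn.Conv2d(ch, ch, 11, padding=5, groups=ch), nn.Sigmoid())

    def forward(self, x):
        x = x * self.ca(x)                    # cross-channel interaction
        x = sum(b(x) for b in self.branches)  # multi-scale aggregation
        return x * self.sa(x)                 # sharpen edges / structure

class LGAG(nn.Module):
    """Large-kernel group attention gate (sketch): high-level semantics gate
    the encoder skip connection. Channels must be divisible by `groups`."""
    def __init__(self, skip_ch, gate_ch, groups=4):
        super().__init__()
        self.w_skip = nn.Conv2d(skip_ch, skip_ch, 7, padding=3, groups=groups)
        self.w_gate = nn.Conv2d(gate_ch, skip_ch, 7, padding=3, groups=groups)
        self.psi = nn.Sequential(nn.Conv2d(skip_ch, 1, 1), nn.Sigmoid())

    def forward(self, skip, gate):
        gate = F.interpolate(gate, size=skip.shape[-2:], mode="bilinear",
                             align_corners=False)
        attn = self.psi(F.relu(self.w_skip(skip) + self.w_gate(gate)))
        return skip * attn       # keep depth-edge features, mute background

class EUCB(nn.Module):
    """Efficient up-convolution block (sketch): upsample, then a
    depthwise-separable convolution to preserve depth boundaries."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dw = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pw = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pw(F.relu(self.dw(self.up(x))))

if __name__ == "__main__":
    x = torch.randn(1, 64, 32, 32)                      # coarse decoder feature
    skip = torch.randn(1, 64, 64, 64)                   # encoder skip feature
    print(EUCB(64, 32)(MSCAM(64)(x)).shape)             # torch.Size([1, 32, 64, 64])
    print(LGAG(skip_ch=64, gate_ch=64)(skip, x).shape)  # torch.Size([1, 64, 64, 64])
```

In a decoder of this kind, LGAG would typically gate each encoder skip feature with the coarser decoder feature before it is upsampled by EUCB, with MSCAM refining the fused result at every scale.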
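Claim 3's edge-aware, semantically guided triplet loss is likewise undisclosed in detail. The sketch below is the standard triplet formulation under one possible sampling assumption: anchors are feature vectors at edge pixels, positives share the anchor's semantic region, negatives come from across the depth boundary, and the margin value is arbitrary.

```python
import torch
import torch.nn.functional as F

def edge_triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss over edge-pixel feature vectors of shape (N, C):
    pull features from the same semantic region together, push features
    from across the depth boundary apart by at least `margin` (assumed 0.2)."""
    d_pos = F.pairwise_distance(anchor, positive)  # (N,) same-region distances
    d_neg = F.pairwise_distance(anchor, negative)  # (N,) cross-edge distances
    return F.relu(d_pos - d_neg + margin).mean()
```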
Description
Monocular depth estimation method based on edge perception multi-scale attention

Technical Field

The invention belongs to the technical field of computer vision processing, and particularly relates to a monocular depth estimation method based on edge-aware multi-scale attention.

Background

Depth estimation is a core upstream task in computer vision that provides geometric information for a wide range of applications, notably 3D reconstruction, robot navigation, augmented reality, and autonomous driving. Accurate monocular depth estimation is a cornerstone of geometric perception: the goal is to predict, from a single input image, the distance of each pixel to the camera and to output a depth map. Although physical sensors such as stereo cameras, RGB-D cameras, and lidar can provide pixel-wise depth information, they have significant limitations. Stereo cameras perform stereo matching on binocular images with a fixed baseline to generate dense depth maps, but their performance depends heavily on the richness of scene texture and requires accurate camera calibration. RGB-D cameras measure depth using structured-light or time-of-flight principles, and lidar measures depth by timing emitted laser pulses; both suffer from high hardware cost and sensitivity to environmental conditions. In contrast, self-supervised monocular depth estimation, with its low hardware dependency and high computational efficiency, has become a viable alternative for dense depth prediction and has attracted attention in recent years.

Previous work can be divided into two categories. Self-supervised single-frame depth estimation methods use the photometric reprojection error between successive video frames during training but rely on a single frame at inference. In contrast, multi-frame depth estimation methods model multi-view geometric consistency by constructing cost volumes from successive video frames in both the training and inference phases. Notably, some studies have attempted to guide self-supervised depth estimation with pre-trained semantic segmentation networks, hoping to improve depth prediction accuracy while solving the edge dilation problem. However, such methods generally suffer from depth boundary blurring in object edge regions. This markedly reduces depth prediction accuracy for small objects and structure edges, causing their depth values to merge into the background or adjacent objects, breaking the geometric integrity of the scene and potentially triggering cascading errors in downstream tasks. Moreover, when handling objects and structures of different scales, these methods often exhibit scale inconsistency (Scale Inconsistency): it is difficult to maintain accuracy and consistency simultaneously for distant large structures and nearby fine details.
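As context for the Background above, the photometric reprojection error used by self-supervised methods is commonly the weighted SSIM + L1 discrepancy between the target frame and a source frame warped into the target view through the predicted depth and relative pose. The sketch below shows this standard form (popularized by Monodepth2); the patent does not state which variant EMCA-Depth uses, and the weight alpha = 0.85 is an assumption carried over from that literature.

```python
import torch
import torch.nn.functional as F

def ssim_map(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-pixel structural dissimilarity in [0, 1], via 3x3 local statistics."""
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    var_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    cov = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return ((1 - ssim) / 2).clamp(0, 1).mean(1, keepdim=True)

def photometric_loss(target, warped, alpha=0.85):
    """target: current frame I_t, shape (B, 3, H, W); warped: source frame I_s
    warped into the target view using the predicted depth and relative pose.
    Returns a per-pixel (B, 1, H, W) reprojection error map."""
    l1 = (target - warped).abs().mean(1, keepdim=True)
    return alpha * ssim_map(target, warped) + (1 - alpha) * l1
```

In practice, self-supervised pipelines take the per-pixel minimum of this error over several warped source frames before averaging, which makes the objective robust to occlusion.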
Disclosure of the Invention

To solve the problems in the prior art, the invention provides a monocular depth estimation method based on edge-aware multi-scale attention. In order to achieve the above object, the invention provides the following solution: a monocular depth estimation method based on edge-aware multi-scale attention, comprising: step S1, acquiring historical image data collected by a sensor, wherein the sensor comprises a stereo camera, an RGB-D camera, and a lidar; step S2, training a monocular depth estimation model based on an edge-aware multi-scale attention mechanism on the historical image data; and step S3, inputting the sensed data of a target to be measured into the trained monocular depth estimation model based on the edge-aware multi-scale attention mechanism to perform depth estimation.

Preferably, the stereo camera generates a dense depth map via a stereo matching algorithm by synchronously acquiring left and right view image pairs with a fixed baseline; the RGB-D camera directly and synchronously collects color images and the corresponding pixel-level depth maps using structured-light or time-of-flight principles; and the lidar obtains sparse three-dimensional point cloud data of the scene by emitting laser pulses and receiving the reflected signals. This multi-source heterogeneous data provides more comprehensive and complementary scene geometry and appearance information for subsequent training, and serves as supervision signals and evaluation references, improving the generalization capability and robustness of the trained depth estimation model. The monocular depth estimation model based on the edge-aware multi-scale attention mechanism comprises an edge-aware multi-scale convolutional attention decoder and a lightweight single-frame model, wherein the edge-aware multi-