CN-122023496-A - Single-stage camera-millimeter wave radar fusion depth completion method combining three-dimensional geometric information
Abstract
The invention discloses a single-stage camera-millimeter wave radar fusion depth completion method combining three-dimensional geometric information. The method is realized by a single-stage end-to-end depth completion network in which a two-dimensional branch takes an RGB image, a monocular depth map and a millimeter wave radar depth map as inputs and extracts multi-scale two-dimensional features through an image encoder, while a three-dimensional branch takes the millimeter wave radar point cloud as input and obtains multi-scale three-dimensional geometric features through graph convolution. A local-global feature cross-attention module based on three-dimensional spatial distance perception accurately injects the three-dimensional geometric prior into the two-dimensional feature channels. A dense dilated pyramid aggregation module is introduced in the decoding stage, which aggregates the multi-scale dilated convolution features through cascading and dense connections, achieving both a large receptive field and detail preservation. The method keeps a single-stage, simple structure while significantly improving depth accuracy and robustness in complex scenes, and is suitable for real-time three-dimensional perception scenarios such as autonomous driving, mobile robots and intelligent transportation.
Inventors
- ZHANG CHENGHAO
- ZHANG FUYI
- SHEN HUILIANG
- CAO SIYUAN
- YU ZHU
- ZHANG RUNMIN
Assignees
- Zhejiang University (浙江大学)
Dates
- Publication Date: 20260512
- Application Date: 20260202
Claims (6)
- 1. A single-stage camera-millimeter wave radar fusion depth completion method combining three-dimensional geometric information, characterized in that it is realized by a single-stage depth completion network comprising a two-dimensional branch and a three-dimensional branch, and the method comprises: the two-dimensional branch takes an RGB image, a monocular depth map and a millimeter wave radar depth map as inputs and extracts multi-scale two-dimensional features through an image encoder; the three-dimensional branch takes the millimeter wave radar point cloud as input and extracts multi-scale three-dimensional geometric features through a graph convolution network; the highest-scale three-dimensional geometric features and the highest-scale two-dimensional features are fused across modalities to obtain the highest-scale fusion features; the highest-scale fusion features are fed to a decoder that integrates a dense dilated pyramid aggregation module, the aggregation features are obtained through the dense dilated pyramid aggregation module, layer-by-layer upsampling and feature decoding are performed on the aggregation features, and a dense depth map with the same size as the input image is finally output.
- 2. The single-stage camera-millimeter wave radar fusion depth completion method combining three-dimensional geometric information according to claim 1, characterized by comprising the following specific steps: S1, an RGB image, a monocular depth map and a millimeter wave radar depth map are given, all sharing the height and width of the image; S2, in the two-dimensional branch, the monocular depth map and the millimeter wave radar depth map are concatenated along the channel dimension and their preliminary fusion feature is extracted by one convolution layer, and this feature is then concatenated with the feature of the RGB image extracted by another convolution layer to obtain the two-dimensional branch input feature; S3, the two-dimensional branch input feature is fed into a multi-stage image encoder based on a residual network to extract two-dimensional features at N scales, where each scale i has its own number of feature channels and downsampling ratio; S4, in the three-dimensional branch, the millimeter wave radar depth map is back-projected into a three-dimensional point cloud, and the point cloud is fed into a graph convolution network to extract three-dimensional geometric features at the N scales corresponding to the two-dimensional branch, each scale i having its own number of geometric feature channels; S5, at the highest scale, the three-dimensional geometric features and the two-dimensional features are fused to obtain the highest-scale fusion features; S6, the highest-scale fusion features are fed to the decoder, which first performs feature enhancement on them through the dense dilated pyramid aggregation module to obtain the aggregation features and, in combination with the two-dimensional features of each scale, progressively restores the resolution of the aggregation features through a series of upsampling and feature decoding operations, finally outputting a dense depth map.
- 3. The single-stage camera-millimeter wave radar fusion depth completion method combining three-dimensional geometric information according to claim 2, characterized in that the graph convolution network in step S4 is composed of a plurality of cascaded graph convolution layers, each graph convolution layer being implemented by dynamic edge convolution, and for the input point features at any scale the dynamic edge convolution operates as follows: first, the nearest neighbors of each point are dynamically found in three-dimensional space; the difference between each point feature and the features of its neighbors is computed and concatenated with the point feature to form edge features; the edge features are transformed by a multi-layer perceptron to obtain new edge features; finally, the new edge features of each point are aggregated by a max aggregation function to obtain the output point features at the corresponding scale.
- 4. The single-stage camera-millimeter wave radar fusion depth completion method combining three-dimensional geometric information according to claim 2, characterized in that step S5 is implemented by a local-global feature cross-attention module based on three-dimensional spatial distance perception, and the specific implementation is as follows: S51, the three-dimensional geometric features and the two-dimensional features at the highest scale, together with the two-dimensional projection coordinates of the three-dimensional point cloud, are given; S52, the three-dimensional geometric features at the highest scale are linearly transformed to obtain the query matrix, and, based on the two-dimensional projection coordinates of the three-dimensional point cloud, the corresponding local two-dimensional features are sampled from the two-dimensional features at the highest scale by bilinear interpolation, and the key matrix and value matrix are obtained through two independent linear transformations of these local two-dimensional features; S53, the three-dimensional Euclidean distance matrix between the points of the three-dimensional point cloud is computed, and a learnable distance modulation parameter is independently defined for each attention head; S54, multi-head attention is computed, in which the query-key similarity is modulated by the three-dimensional Euclidean distance matrix through the learnable distance modulation parameter of each head and the result is used to weight the value matrix, obtaining the updated three-dimensional geometric features; S55, the updated three-dimensional geometric features are projected using the two-dimensional projection coordinates and accumulated onto the corresponding positions of the two-dimensional features at the highest scale, forming the highest-scale fusion features.
- 5. The single-stage camera-millimeter wave radar fusion depth completion method combining three-dimensional geometric information according to claim 2, characterized in that in step S6 the dense dilated pyramid aggregation module comprises five dilated convolution branches with different dilation rates, each dilated convolution branch comprising one convolution for dimension reduction and one dilated convolution for feature extraction, and step S6 specifically comprises: S61, the highest-scale fusion features are given as input and are first upsampled once to obtain the upsampled features; S62, the upsampled features are fed into the first dilated convolution branch, which performs dimension reduction and feature extraction on them; S63, the upsampled features and the output of the first dilated convolution branch are concatenated along the channel dimension, and the concatenation result serves as the input of the second dilated convolution branch; S64, step S63 is repeated, i.e. the input of each subsequent dilated convolution branch is formed by concatenating the upsampled features with the outputs of all preceding dilated convolution branches; S65, the outputs of all the dilated convolution branches are concatenated along the channel dimension and fused through one convolution that performs feature fusion and channel adjustment, obtaining the aggregation features fused with multi-scale context information; S66, in combination with the two-dimensional features of each scale, the resolution of the aggregation features is progressively restored through a series of upsampling and feature decoding operations, and the dense depth map is finally output.
- 6. A training method for a single-stage depth completion network comprising a two-dimensional branch and a three-dimensional branch for performing the method according to any one of claims 1-5, characterized by the following specific steps: S1, a training data set is constructed, in which each sample comprises an RGB image, a monocular depth map, a millimeter wave radar depth map, a lidar depth map serving as ground truth, and a dense lidar depth map interpolated from the lidar; S2, a composite loss function is defined, formed by weighting a dense supervision loss and a sparse supervision loss with preset weights used to adjust the tendency of the loss function, the two losses being evaluated over the sets of valid pixel coordinates of the dense lidar depth map and of the lidar depth map, respectively; S3, the dense supervision loss computes the L1 distance between the dense depth map output by the network and the dense lidar depth map, and the sparse supervision loss computes the L1 distance between the dense depth map output by the network and the lidar depth map; S4, a OneCycle learning rate scheduling strategy and the Adam optimizer are adopted, and all learnable parameters of the single-stage depth completion network comprising the two-dimensional branch and the three-dimensional branch are updated end-to-end through back propagation by minimizing the composite loss function.
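The following minimal sketches illustrate, in PyTorch, the main operations recited in claims 2-6. All class, function and parameter names, tensor shapes and hyperparameter values are illustrative assumptions introduced for clarity; they are not taken from the patent and do not represent the patented implementation. This first sketch covers step S4 of claim 2, back-projecting the sparse radar depth map into a three-dimensional point cloud, assuming a pinhole camera model with a known intrinsic matrix `K`.

```python
# Back-projection of a sparse radar depth map into a 3D point cloud (claim 2, S4).
# Assumes a pinhole camera model; zero-valued pixels are treated as empty.
import torch

def backproject_radar_depth(depth: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """depth: (H, W) radar depth map with zeros at empty pixels.
    K: (3, 3) camera intrinsic matrix.
    Returns an (M, 3) point cloud for the M valid (non-zero) pixels."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    valid = depth > 0
    u, v, z = u[valid].float(), v[valid].float(), depth[valid]
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) * z / fx            # camera-frame X
    y = (v - cy) * z / fy            # camera-frame Y
    return torch.stack([x, y, z], dim=-1)

# usage: points = backproject_radar_depth(radar_depth, K)   # -> (M, 3)
```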
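Claim 3 describes the dynamic edge convolution used in the graph convolution network. A minimal sketch follows, assuming k-nearest-neighbour graph construction in 3D coordinate space and a two-layer MLP; the neighbour count and channel sizes are illustrative assumptions.

```python
# EdgeConv-style dynamic edge convolution (claim 3): for each point, find its
# nearest neighbours in 3D space, build edge features from the point feature
# and the feature differences to its neighbours, transform them with an MLP,
# and max-aggregate back to per-point output features.
import torch
import torch.nn as nn

class DynamicEdgeConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 16):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(
            nn.Linear(2 * in_ch, out_ch), nn.ReLU(inplace=True),
            nn.Linear(out_ch, out_ch))

    def forward(self, xyz: torch.Tensor, feat: torch.Tensor) -> torch.Tensor:
        # xyz: (M, 3) point coordinates, feat: (M, C) point features
        dist = torch.cdist(xyz, xyz)                     # (M, M) 3D distances
        idx = dist.topk(self.k, largest=False).indices   # (M, k) nearest neighbours
        neigh = feat[idx]                                # (M, k, C) neighbour features
        center = feat.unsqueeze(1).expand_as(neigh)      # (M, k, C) repeated centres
        edge = torch.cat([center, neigh - center], -1)   # (M, k, 2C) edge features
        return self.mlp(edge).max(dim=1).values          # max aggregation -> (M, out_ch)
```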
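Claim 4 describes the local-global cross-attention with three-dimensional distance perception. The sketch below assumes queries from the per-point geometric features, keys and values from 2D features bilinearly sampled at the point projections (projection coordinates normalized to [-1, 1] for `grid_sample`), and attention logits modulated by subtracting a learnable per-head multiple of the point-to-point Euclidean distance; the exact modulation form and all sizes are assumptions.

```python
# Distance-aware local-global cross attention (claim 4): 3D point features query
# bilinearly sampled local 2D features; logits are modulated by 3D distances.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistanceAwareCrossAttention(nn.Module):
    def __init__(self, dim_3d: int, dim_2d: int, dim: int = 64, heads: int = 4):
        super().__init__()
        self.heads, self.dh = heads, dim // heads
        self.q = nn.Linear(dim_3d, dim)
        self.k = nn.Linear(dim_2d, dim)
        self.v = nn.Linear(dim_2d, dim)
        self.out = nn.Linear(dim, dim_2d)
        self.lam = nn.Parameter(torch.zeros(heads))      # per-head distance modulation

    def forward(self, geo: torch.Tensor, feat2d: torch.Tensor,
                uv: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
        # geo: (M, C3) point features, feat2d: (1, C2, h, w) highest-scale 2D map
        # uv: (M, 2) projections normalized to [-1, 1], xyz: (M, 3) point coordinates
        local = F.grid_sample(feat2d, uv.view(1, -1, 1, 2), align_corners=True)
        local = local.squeeze(0).squeeze(-1).t()         # (M, C2) sampled local features
        M = geo.shape[0]
        q = self.q(geo).view(M, self.heads, self.dh).transpose(0, 1)    # (H, M, dh)
        k = self.k(local).view(M, self.heads, self.dh).transpose(0, 1)
        v = self.v(local).view(M, self.heads, self.dh).transpose(0, 1)
        dist = torch.cdist(xyz, xyz)                                     # (M, M)
        logits = q @ k.transpose(-1, -2) / self.dh ** 0.5                # (H, M, M)
        logits = logits - self.lam.view(-1, 1, 1) * dist                 # distance modulation
        upd = (logits.softmax(dim=-1) @ v).transpose(0, 1).reshape(M, -1)
        return self.out(upd)                              # updated per-point feature (M, C2)
```

Per step S55, the updated per-point features returned here would then be accumulated back onto the highest-scale 2D feature map at the projected pixel positions to form the fusion features.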
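Claim 5 describes the dense dilated pyramid aggregation module with five densely connected dilated convolution branches. The sketch below assumes DenseASPP-style dilation rates (1, 2, 4, 8, 16), a 1x1 reduction convolution and a 3x3 dilated convolution per branch; these kernel sizes, rates and channel counts are assumptions.

```python
# Dense dilated pyramid aggregation (claim 5): each branch takes the upsampled
# input concatenated with all previous branch outputs; all branch outputs are
# concatenated and fused by a final convolution.
import torch
import torch.nn as nn

class DenseDilatedAggregation(nn.Module):
    def __init__(self, in_ch: int, mid_ch: int = 64, rates=(1, 2, 4, 8, 16)):
        super().__init__()
        self.branches = nn.ModuleList()
        for i, r in enumerate(rates):
            branch_in = in_ch + i * mid_ch               # dense connections
            self.branches.append(nn.Sequential(
                nn.Conv2d(branch_in, mid_ch, 1),          # dimension reduction
                nn.ReLU(inplace=True),
                nn.Conv2d(mid_ch, mid_ch, 3, padding=r, dilation=r),
                nn.ReLU(inplace=True)))
        self.fuse = nn.Conv2d(len(rates) * mid_ch, in_ch, 1)  # fusion / channel adjustment

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = nn.functional.interpolate(x, scale_factor=2, mode="bilinear",
                                      align_corners=False)    # S61: upsample once
        outs = []
        for branch in self.branches:
            outs.append(branch(torch.cat([x] + outs, dim=1)))  # S62-S64: dense inputs
        return self.fuse(torch.cat(outs, dim=1))               # S65: aggregation feature
```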
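Claim 6 defines the composite training loss as weighted L1 terms against the interpolated dense lidar depth map and the sparse lidar depth map, each restricted to valid pixels. A minimal sketch, with assumed weight values and zero-marked invalid pixels:

```python
# Composite loss (claim 6): weighted L1 terms over valid pixels of the dense
# (interpolated) and sparse lidar depth maps.
import torch

def composite_loss(pred: torch.Tensor, dense_gt: torch.Tensor,
                   sparse_gt: torch.Tensor,
                   w_dense: float = 1.0, w_sparse: float = 1.0) -> torch.Tensor:
    """pred, dense_gt, sparse_gt: (B, 1, H, W); zeros mark invalid pixels."""
    m_dense = dense_gt > 0
    m_sparse = sparse_gt > 0
    l_dense = (pred[m_dense] - dense_gt[m_dense]).abs().mean()    # dense supervision
    l_sparse = (pred[m_sparse] - sparse_gt[m_sparse]).abs().mean()  # sparse supervision
    return w_dense * l_dense + w_sparse * l_sparse
```

Training as recited in claim 6 would then minimize this loss end to end, for example with torch.optim.Adam and torch.optim.lr_scheduler.OneCycleLR.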
Description
Single-stage camera-millimeter wave radar fusion depth completion method combining three-dimensional geometric information

Technical Field
The invention belongs to the fields of sensor data processing and artificial intelligence, and particularly relates to a single-stage camera-millimeter wave radar fusion depth completion method combining three-dimensional geometric information.

Background
With the rapid development of autonomous driving and robotics, accurate three-dimensional perception of the surrounding environment has become a core technological bottleneck. Accurate and dense scene depth information is important for path planning, obstacle avoidance and scene understanding. At present, lidar has become the mainstream depth sensor thanks to its high-precision, high-density ranging capability, but its large-scale application is limited by high cost, large size and performance degradation in severe weather such as rain, fog and snow. In contrast, millimeter wave radar is receiving increasingly wide attention in the field of vehicle sensing thanks to its low cost, small size and robust all-weather operation. However, the imaging mechanism of millimeter wave radar produces extremely sparse and noisy point clouds, which also lack the rich semantic information of an image, so the demand of downstream tasks for dense depth cannot be met directly.
To overcome the limitations of millimeter wave radar data, deep learning methods have been widely applied to the depth completion task. Early schemes adopted a U-Net-like encoder-decoder structure and directly interpolated and completed the sparse radar depth map, but for lack of guidance from the global structure and semantic information of the scene, the completion results often suffer from blurred object edges, loss of detail and insufficient depth accuracy. To remedy this deficiency, camera-millimeter wave radar fusion schemes have been developed and are becoming mainstream. Such methods typically employ a dual-encoder architecture: one encoder extracts rich color, texture and semantic features from the RGB image, and the other extracts depth cues from the sparse radar depth map. The information of the two modalities is fused in feature space, and a dense depth map is then recovered by a decoder. This fusion effectively combines the semantic perception capability of the camera with the ranging capability of the radar and significantly improves depth completion quality.
However, existing camera-radar fusion methods still have two key limitations. First, the feature fusion mechanism is rudimentary: most methods simply concatenate or weight-sum at the level of two-dimensional feature maps and cannot fully exploit the accurate three-dimensional geometric information contained in the millimeter wave radar point cloud. When the sparse radar point cloud is projected into a two-dimensional depth map, the three-dimensional spatial topological relations between points are greatly weakened, so the fusion process lacks perception of the real geometric structure of the scene, which degrades depth inference accuracy for distant objects and structurally complex regions. Second, some methods adopt multi-stage processing.
In some schemes, in order to improve performance, the sparse radar depth map is first preprocessed or enhanced by an independent network and then fused with the image. Such two-stage or multi-stage designs increase system complexity, computational cost and deployment difficulty, and do not meet the urgent demand for low-latency, efficient single-stage solutions in scenarios such as autonomous driving.

Disclosure of Invention
In view of the limitations of the prior art, the invention aims to provide a single-stage camera-millimeter wave radar fusion depth completion method combining three-dimensional geometric information. The method designs a parallel three-dimensional branch that operates directly on the raw radar point cloud, fully mines the three-dimensional geometric and topological information of the radar point cloud with a graph convolution network, and proposes a novel distance-aware cross-attention mechanism that injects the three-dimensional geometric prior into the two-dimensional image features in an efficient and accurate manner, so that deep interaction and alignment of cross-modal features is realized within a concise single-stage network architecture, finally generating a dense depth map with superior accuracy and structural detail. The invention is realized by adopting the following technical scheme: a single-stage camera-millimeter wave radar fusion depth completion method combining three-dimensional geometric information comprises the steps of