
CN-121982672-A - Logistics scene-oriented depth enhancement sparse perception method and system

CN121982672A

Abstract

The application relates to the technical field of autonomous-driving environment perception, and in particular to a depth-enhanced sparse perception method and system for logistics scenes. A dense depth estimation sub-network is fused into the sparse backbone early and at multiple scales, deeply binding geometric information to semantic information instead of splicing them at a late stage, so that the network gains strong geometric reasoning capability from the lowest-level features onward and perception of non-standard targets improves markedly. Within the sparse perception framework, centrality and heading degree are predicted alongside the classification confidence, and together the three form a three-dimensional "quality golden triangle". This effectively suppresses the interference of high-scoring but low-quality boxes that afflicts conventional methods and supplies clean, reliable perception input to downstream modules. Finally, through the full-link customized design of the data (category re-weighting), the loss (distance adaptation), and the matching strategy (quality guidance), the system's characteristics are closely aligned with the perception demands of urban logistics scenes, realizing the qualitative change from a general framework to a dedicated system.
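As a concrete illustration of the early, multi-scale fusion described in the abstract, below is a minimal PyTorch sketch that fuses each dense depth feature map with the visual feature map of the same scale. The fusion operator (channel concatenation followed by a 1x1 convolution) and all module names are assumptions for illustration only; the patent does not commit to a specific operator.

```python
# Hypothetical sketch of the early multi-scale depth-visual fusion: at each
# pyramid level, the dense depth feature map is concatenated with the visual
# feature map of matching resolution and mixed by a 1x1 convolution.
import torch
import torch.nn as nn

class DepthVisualFusion(nn.Module):
    def __init__(self, channels_per_level):
        super().__init__()
        # One fusion convolution per pyramid level.
        self.mixers = nn.ModuleList(
            nn.Conv2d(2 * c, c, kernel_size=1) for c in channels_per_level
        )

    def forward(self, visual_feats, depth_feats):
        # visual_feats, depth_feats: lists of tensors with matching
        # resolution and channel count at each level.
        return [
            mixer(torch.cat([v, d], dim=1))
            for mixer, v, d in zip(self.mixers, visual_feats, depth_feats)
        ]

# Example: three pyramid levels with 64/128/256 channels.
fusion = DepthVisualFusion([64, 128, 256])
vis = [torch.randn(1, c, s, s) for c, s in [(64, 64), (128, 32), (256, 16)]]
dep = [torch.randn(1, c, s, s) for c, s in [(64, 64), (128, 32), (256, 16)]]
fused = fusion(vis, dep)  # same shapes as the visual inputs
```

Because the depth features enter at every level of the pyramid rather than being appended after the backbone, the geometric signal is available to all downstream decoding stages, which is the "early and multi-scale" binding the abstract emphasizes.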

Inventors

  • WANG ZHENJIANG
  • ZHANG JIANGFENG
  • ZHU WANGWANG
  • LIU HONGYONG

Assignees

  • 蜂巢智行(上海)技术有限公司

Dates

Publication Date
2026-05-05
Application Date
2026-04-07

Claims (10)

  1. A depth-enhanced sparse perception method for logistics scenes, characterized by comprising the following steps: acquiring multi-view images of a vehicle, wherein the multi-view images are captured by vehicle-mounted surround-view cameras; extracting multi-scale features from the multi-view images with a pre-constructed backbone network to obtain multi-scale visual feature maps, and extracting dense depth feature maps from the multi-scale visual feature maps with a pre-constructed dense depth estimation sub-network; fusing the visual feature maps and the dense depth feature maps of the same scale with a pre-constructed feature fusion module to obtain fusion-enhanced feature maps; decoding the fusion-enhanced feature maps with a pre-constructed sparse decoder to obtain sparsely represented target features; mapping the target features to a target class probability distribution, a centrality score, and a heading score with a pre-constructed joint target quality estimation module, and computing a comprehensive confidence from the target class probability distribution, the centrality score, and the heading score; and regressing the target features with a pre-constructed refinement decoding module to obtain refined 3D bounding-box parameters for a plurality of targets, and completing sparse perception based on the 3D bounding-box parameters of the targets and the comprehensive confidence, wherein the 3D bounding-box parameters comprise center, size, heading angle, and velocity.
  2. The depth-enhanced sparse perception method for logistics scenes according to claim 1, wherein the dense depth estimation sub-network comprises: an input layer for receiving the multi-scale visual feature maps, the multi-scale visual feature maps comprising visual features at a plurality of resolution levels; an encoder for repeatedly downsampling the multi-scale visual feature maps to obtain a multi-level feature hierarchy from high to low resolution; and a decoder for repeatedly upsampling the final-level feature map of the hierarchy, fusing, at each upsampling stage, the visual features and feature maps of the same resolution level through skip connections, to obtain a depth feature map at the original resolution; wherein during training the dense depth estimation sub-network is supervised by multi-scale depth truth maps generated by projecting lidar point clouds acquired synchronously with the images, depth losses are computed separately at the different output levels of the decoder, realizing coarse-to-fine multi-scale depth supervision, and the depth loss comprises a pixel-level regression loss and an edge-aware smoothness loss.
  3. The depth-enhanced sparse perception method for logistics scenes according to claim 1, wherein computing a comprehensive confidence from the target class probability distribution, the centrality score, and the heading score comprises: extracting the maximum probability of the target class probability distribution as the classification confidence; and multiplying the classification confidence by the centrality score and the heading score to obtain the comprehensive confidence, wherein the centrality score characterizes how well the center of the detection box matches the true physical center of the target, and the heading score characterizes the reliability of the predicted heading angle.
  4. The depth-enhanced sparse perception method for logistics scenes according to claim 1, wherein the backbone network, the dense depth estimation sub-network, the feature fusion module, the sparse decoder, the joint target quality estimation module, and the refinement decoding module are constructed by: S1, acquiring multi-view image samples and lidar point-cloud samples of a vehicle; S2, annotating the multi-view image samples to obtain target class labels and 3D detection-box parameter labels, and generating centrality truth values from the 3D detection-box parameter labels; S3, feeding the multi-view image samples into the backbone network, and obtaining the dense depth feature maps output by the dense depth estimation sub-network, the 3D bounding-box parameters output by the refinement decoding module, and the target class probability distribution, centrality score, and heading score output by the joint target quality estimation module; S4, generating heading truth values from the 3D bounding-box parameters and the 3D detection-box parameter labels; S5, computing, with a pre-constructed loss function, the depth loss between the sparse depth truth maps and the dense depth feature maps, the regression loss between the 3D detection-box parameter labels and the 3D bounding-box parameters, the classification loss between the target class labels and the predicted classes, the centrality loss between the centrality truth values and the centrality scores, and the heading loss between the heading truth values and the heading scores, and computing the total loss from the depth loss, the regression loss, the classification loss, the centrality loss, and the heading loss; S6, back-propagating the total loss and adjusting the parameters of the backbone network, the dense depth estimation sub-network, the feature fusion module, the sparse decoder, the joint target quality estimation module, and the refinement decoding module by gradient descent; and S7, repeating steps S3 to S6 until training is complete.
  5. The depth-enhanced sparse perception method for logistics scenes according to claim 4, wherein the loss function is expressed mathematically as: $\mathcal{L}_{total} = \mathcal{L}_{cls} + \mathcal{L}_{reg} + \lambda_1 \mathcal{L}_{depth} + \lambda_2 (\mathcal{L}_{ctr} + \mathcal{L}_{hdg})$, where $\mathcal{L}_{total}$ denotes the total loss, $\mathcal{L}_{cls}$ the classification loss, $\mathcal{L}_{reg}$ the regression loss, $\mathcal{L}_{depth}$ the depth loss, $\mathcal{L}_{ctr}$ the centrality loss, $\mathcal{L}_{hdg}$ the heading loss, $\lambda_1$ the first loss balance coefficient, and $\lambda_2$ the second loss balance coefficient.
  6. The depth-enhanced sparse perception method for logistics scenes according to claim 4, wherein generating centrality truth values from the 3D detection-box parameter labels comprises: parsing the 3D detection-box labels from the 3D detection-box parameter labels; projecting the true center of the target in each 3D detection-box label onto the multi-view images using the camera intrinsic and extrinsic matrices to obtain the pixel coordinates of the true center; and querying a 2D Gaussian heat map at the projected center of the 3D detection-box label to obtain the centrality truth value.
  7. The depth-enhanced sparse perception method for logistics scenes according to claim 4, wherein generating heading truth values from the 3D bounding-box parameters and the 3D detection-box parameter labels comprises: parsing the heading-angle label $\theta_{gt}$ from the 3D detection-box parameter labels and the predicted heading angle $\theta_{pred}$ from the 3D bounding-box parameters; and computing the heading truth value from the heading-angle label $\theta_{gt}$ and the predicted heading angle $\theta_{pred}$, the heading truth value being expressed mathematically as: $q_{hdg} = \frac{1 + \cos(\theta_{gt} - \theta_{pred})}{2}$, where $q_{hdg}$ denotes the heading truth value.
  8. A depth-enhanced sparse perception system for logistics scenes, characterized by comprising: an acquisition module for acquiring multi-view images of a vehicle, wherein the multi-view images are captured by vehicle-mounted surround-view cameras; a feature extraction module for extracting multi-scale features from the multi-view images with a pre-constructed backbone network to obtain multi-scale visual feature maps, and extracting dense depth feature maps from the multi-scale visual feature maps with a pre-constructed dense depth estimation sub-network; a fusion module for fusing the visual feature maps and the dense depth feature maps of the same scale with a pre-constructed feature fusion module to obtain fusion-enhanced feature maps; a sparse decoding module for decoding the fusion-enhanced feature maps with a pre-constructed sparse decoder to obtain sparsely represented target features; a quality estimation module for mapping the target features to a target class probability distribution, a centrality score, and a heading score with a pre-constructed joint target quality estimation module, and computing a comprehensive confidence from the target class probability distribution, the centrality score, and the heading score; and a sparse perception module for regressing the target features with a pre-constructed refinement decoding module to obtain refined 3D bounding boxes for a plurality of targets, and completing sparse perception based on the 3D bounding boxes of the targets and the comprehensive confidence.
  9. An electronic device, characterized by comprising a processor and a memory, wherein the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory so as to cause the electronic device to perform the method according to any one of claims 1 to 7.
  10. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 7.
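As a concrete illustration of the joint target quality estimation of claims 3 and 7, below is a minimal PyTorch sketch. The linear heads, the sigmoid/softmax activations, and the exact heading-quality formula are assumptions for illustration; only the product form of the comprehensive confidence follows claim 3 directly.

```python
# Hypothetical sketch of the joint target quality estimation module.
import torch
import torch.nn as nn

class JointQualityHead(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.cls_head = nn.Linear(feat_dim, num_classes)  # class logits
        self.ctr_head = nn.Linear(feat_dim, 1)            # centrality score
        self.hdg_head = nn.Linear(feat_dim, 1)            # heading score

    def forward(self, target_feats: torch.Tensor):
        # target_feats: (num_queries, feat_dim) sparse target features
        # produced by the sparse decoder.
        cls_prob = self.cls_head(target_feats).softmax(dim=-1)
        ctr = torch.sigmoid(self.ctr_head(target_feats)).squeeze(-1)
        hdg = torch.sigmoid(self.hdg_head(target_feats)).squeeze(-1)
        # Claim 3: comprehensive confidence = classification confidence
        # (max class probability) x centrality score x heading score.
        conf = cls_prob.max(dim=-1).values * ctr * hdg
        return cls_prob, ctr, hdg, conf

def heading_truth(theta_gt: torch.Tensor, theta_pred: torch.Tensor) -> torch.Tensor:
    # Heading truth value as reconstructed in claim 7 (an assumption):
    # 1 for a perfect heading, 0 when the heading is off by pi.
    return 0.5 * (1.0 + torch.cos(theta_gt - theta_pred))
```

At inference, detection boxes would be ranked and filtered by conf rather than by the classification confidence alone, which is what suppresses the "high-confidence but position-drifted" predictions described in the background.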

Description

Logistics scene-oriented depth enhancement sparse perception method and system

Technical Field

The application relates to the technical field of autonomous-driving environment perception, and in particular to a depth-enhanced sparse perception method and system for logistics scenes.

Background

As autonomous-driving technology moves deeper into specific commercial scenarios (e.g., unmanned logistics), generic perception schemes face new challenges in these vertical settings. Urban logistics scenes are characterized by: 1) high target diversity, including non-standard, irregularly shaped objects such as express tricycles, hand trucks, and containers alongside standard passenger cars; 2) complex operating environments that switch frequently between semi-enclosed areas such as warehouses, residential compounds, and narrow lanes and open roads, with a large number of partially occluded obstacles at medium and long range; and 3) cost sensitivity, which demands reliable perception on a limited computing platform. Existing perception frameworks for general driving scenes (such as dense-BEV or generic sparse-query methods) have the following shortcomings: (1) insufficient depth estimation: the depth estimation module of a generic framework is mostly auxiliary, and depth errors are large for the non-standard targets that frequently appear in logistics scenes and lack texture or prior size information; (2) conventional methods rely mainly on the classification confidence to filter detection boxes, yet in 3D perception the positional certainty (centrality) and directional certainty (heading) of a target are just as critical, and ignoring them produces "high-confidence but position-drifted" predictions that mislead downstream tracking and planning modules; (3) weak scene adaptability: neither the network structure nor the loss function is optimized for the specific target-scale and distance distributions of logistics scenes, so performance degrades markedly under long-tail distributions. A dedicated perception system comprehensively optimized for the characteristics of urban logistics scenes is therefore urgently needed.

Disclosure of Invention

In view of the above, the present application aims to provide a depth-enhanced sparse perception method and system for logistics scenes, so as to solve the problems identified in the background.
To achieve the above purpose, the present application adopts the following technical scheme: a depth-enhanced sparse perception method for logistics scenes, comprising the following steps: acquiring multi-view images of a vehicle, wherein the multi-view images are captured by vehicle-mounted surround-view cameras; extracting multi-scale features from the multi-view images with a pre-constructed backbone network to obtain multi-scale visual feature maps, and extracting dense depth feature maps from the multi-scale visual feature maps with a pre-constructed dense depth estimation sub-network; fusing the visual feature maps and the dense depth feature maps of the same scale with a pre-constructed feature fusion module to obtain fusion-enhanced feature maps; decoding the fusion-enhanced feature maps with a pre-constructed sparse decoder to obtain sparsely represented target features; mapping the target features to a target class probability distribution, a centrality score, and a heading score with a pre-constructed joint target quality estimation module, and computing a comprehensive confidence from the target class probability distribution, the centrality score, and the heading score; and regressing the target features with a pre-constructed refinement decoding module to obtain refined 3D bounding-box parameters for a plurality of targets, and completing sparse perception based on the 3D bounding-box parameters of the targets and the comprehensive confidence, wherein the 3D bounding-box parameters comprise center, size, heading angle, and velocity.

In an embodiment of the present application, the dense depth estimation sub-network comprises: an input layer for receiving the multi-scale visual feature maps, the multi-scale visual feature maps comprising visual features at a plurality of resolution levels; an encoder for repeatedly downsampling the multi-scale visual feature maps to obtain a multi-level feature hierarchy from high to low resolution; and a decoder for repeatedly upsampling the final-level feature map of the hierarchy, fusing, at each upsampling stage, the visual features and feature maps of the same resolution level through skip connections, to obtain a depth feature map at the original resolution, wherein during training the dense depth estimation sub-network is supervised by multi-scale depth truth maps generated by projecting lidar point clouds acquired synchronously with the images.
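To ground the encoder-decoder description above, the following PyTorch sketch implements a skip-connected downsample/upsample sub-network with one depth head per decoder level, matching the coarse-to-fine multi-scale supervision of the embodiment. The channel widths, the single-scale input, and the L1 depth loss are assumptions for illustration; the edge-aware smoothness term is omitted for brevity.

```python
# Minimal sketch of the dense depth estimation sub-network: an encoder that
# repeatedly downsamples, a decoder that upsamples back and fuses encoder
# features of the same resolution through skip connections, and one depth
# head per decoder level for coarse-to-fine multi-scale supervision.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin: int, cout: int) -> nn.Sequential:
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

class DenseDepthSubNet(nn.Module):
    def __init__(self, in_ch: int = 64, widths: tuple = (64, 128, 256)):
        super().__init__()
        self.enc = nn.ModuleList()
        c = in_ch
        for w in widths:
            self.enc.append(conv_block(c, w))
            c = w
        # Each decoder block consumes the upsampled map concatenated with
        # the skip connection of the same resolution.
        self.dec = nn.ModuleList(
            conv_block(widths[i] + widths[i - 1], widths[i - 1])
            for i in range(len(widths) - 1, 0, -1)
        )
        # One depth prediction head per decoder level.
        self.depth_heads = nn.ModuleList(
            nn.Conv2d(w, 1, 1) for w in reversed(widths[:-1])
        )

    def forward(self, x: torch.Tensor):
        skips = []
        for i, enc in enumerate(self.enc):
            x = enc(x)
            if i < len(self.enc) - 1:
                skips.append(x)
                x = F.max_pool2d(x, 2)  # downsample between encoder levels
        depths = []
        for dec, head, skip in zip(self.dec, self.depth_heads, reversed(skips)):
            x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
            x = dec(torch.cat([x, skip], dim=1))  # skip-connection fusion
            depths.append(head(x))
        return depths  # coarse to fine; the last map is at input resolution

def multi_scale_depth_loss(depth_preds, depth_gt):
    # Resize the lidar-projected depth truth map to each prediction's
    # resolution and accumulate a pixel-level L1 loss at every scale.
    loss = depth_preds[0].new_zeros(())
    for pred in depth_preds:
        gt = F.interpolate(depth_gt, size=pred.shape[-2:], mode="nearest")
        loss = loss + F.l1_loss(pred, gt)
    return loss

# Example: a 64-channel feature map at 64x64 yields depth maps at 32x32 and
# 64x64, each supervised against the truth map resized to its resolution.
net = DenseDepthSubNet()
preds = net(torch.randn(2, 64, 64, 64))
loss = multi_scale_depth_loss(preds, torch.rand(2, 1, 64, 64))
```

Supervising every decoder level, rather than only the final output, forces the coarse levels to carry a usable depth signal, which is the coarse-to-fine behavior the embodiment attributes to the multi-scale depth supervision.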