CN-116625383-B - Road vehicle sensing method based on multi-sensor fusion
Abstract
The invention discloses a road vehicle sensing method based on multi-sensor fusion, belonging to the technical field of unmanned driving. The invention introduces an attention mechanism and a Transformer structure into a road vehicle perception model with a dual-stream network to aggregate the features of different modalities, so that the two modal features mutually enhance each other, the network learns global dependencies, and global context information is integrated during the feature extraction stage. Using the self-attention mechanism of the Transformer, the model can naturally perform intra-modal and inter-modal fusion simultaneously and robustly capture the latent interactions between the image domain and the lidar domain, thereby significantly improving vehicle detection performance and overcoming the limitations of prior fusion methods. The invention further performs RoI-level re-fusion and refines the detection results based on the re-fused features, realizing accurate perception of vehicles on the road.
Inventors
- WANG GUOQING
- WANG ZHIWEN
- WANG YUQING
- YANG YANG
- SHEN HENGTAO
Assignees
- University of Electronic Science and Technology of China
- Xuzhou Yongqiang Automation Equipment Co., Ltd.
Dates
- Publication Date: 2026-05-12
- Application Date: 2023-05-05
Claims (6)
- 1. A road vehicle sensing method based on multi-sensor fusion, characterized by comprising the following steps: Step 1, acquiring target detection data for training a road vehicle perception model, wherein the target detection data comprises image data acquired by a camera device and lidar point cloud data acquired by a laser scanner; Step 2, constructing and training the road vehicle perception model. The road vehicle perception model comprises a point-level feature fusion module and a region-of-interest (RoI) level feature fusion module. The point-level feature fusion module comprises an image branch backbone network and a lidar branch backbone network, wherein the image branch backbone network performs multi-scale feature extraction on the image data to obtain a plurality of first intermediate feature maps of different scales and a final output feature map of the image branch backbone network; the lidar branch backbone network performs multi-scale feature extraction on the lidar bird's-eye view (BEV) to obtain a plurality of second intermediate feature maps of different scales and a final output feature map of the lidar branch backbone network, wherein the first and second intermediate feature maps have the same number of scales, and same-level first and second intermediate feature maps have identical dimensions. The first and second intermediate feature maps of the first stage pass through an attention fusion module to obtain a first-stage fused intermediate feature map, which is added to the first intermediate feature map of the first stage and then continues to participate in the forward computation of the image branch backbone network. From the second stage onward, the same-level intermediate feature maps of the image branch and lidar branch backbone networks are fused through a Transformer-based fusion module to obtain the fused intermediate feature map of the current stage, which is added to the first and second intermediate feature maps of the current stage respectively and then continues to participate in the forward computation of the two backbone networks (a minimal code sketch of this dual-stream pipeline follows the claims). The RoI-level feature fusion module performs road vehicle detection on the final output feature map of the image branch backbone network through a convolution layer to obtain a plurality of 3D candidate boxes and their road vehicle identification results, and obtains a plurality of 3D detection boxes after score thresholding and non-maximum suppression; the 3D detection boxes are projected into the lidar BEV space and the two-dimensional image space of the image data respectively, the RoI features of the two modalities (the image data and the lidar BEV) are extracted respectively, the RoI features of the two modalities are concatenated and input into a refinement module based on at least two fully-connected layers to predict a refinement correction for each 3D detection box, and the refinement correction results are combined with the 3D detection boxes to obtain the final road vehicle detection results. When the road vehicle perception model is trained based on the target detection data and the corresponding label data, the total model loss function is the sum of the classification loss and the regression loss; when the preset training convergence condition is met, the road vehicle perception model based on multi-sensor fusion is obtained and used for road vehicle perception of the unmanned vehicle during driving.
- 2. The method of claim 1, wherein the color space of the image data is RGB.
- 3. The method according to claim 1, wherein the attention fusion module is specifically: the first intermediate feature map F_1^i and the second intermediate feature map F_2^i of the current stage i are concatenated along the channel dimension and input into a convolution layer to obtain a preliminary fused feature map of the current stage; an attention mechanism weights each feature of the preliminary fused feature map and outputs an attention map; the preliminary fused feature map and the attention map are then multiplied element-wise to obtain the fused intermediate feature map output by the attention fusion module (a minimal code sketch follows the claims).
- 4. The method of claim 1, wherein the Transformer-based fusion module is specifically: define C×H×W as the dimensions of the first intermediate feature map F_1^i and the second intermediate feature map F_2^i of the current stage i, where C represents the number of channels of the intermediate feature maps and H×W represents their resolution; F_1^i and F_2^i are each flattened according to a permutation matrix into sequences of discrete tokens S_1^i and S_2^i, each of dimension HW×C; S_1^i and S_2^i are concatenated to obtain the fusion sequence I^i of the current stage i, whose dimension is 2HW×C; each of the 2HW tokens in the fusion sequence I^i is represented by a feature vector of dimension C, and each feature vector representation is supplemented with a learnable position code so as to incorporate positional inductive bias and distinguish the spatial information of different tokens; the fusion sequence I^i is linearly projected onto three weight matrices to compute a set of queries Q, keys K, and values V: Q = I·M_q, K = I·M_k, V = I·M_v, where M_q, M_k, M_v are the weight matrices of Q, K, V respectively; the self-attention layer uses the scaled dot product between Q and K to compute the attention weights and aggregates the values for each query to infer the refined output Z: Z = softmax(QK^T/√C)·V; the outputs Z of the multiple heads are concatenated and projected to obtain the output Z' of the multi-head attention mechanism, and an output sequence O of the same size as the input sequence I^i is computed using a nonlinear transformation: O = MLP(Z'') + Z'', where MLP(·) denotes a multi-layer perceptron and Z'' = Z' + I^i; according to the inverses of the permutation matrices that formed S_1^i and S_2^i, the output sequence O is converted back into feature maps F_1'^i and F_2'^i corresponding to the first and second intermediate feature maps; F_1'^i and F_2'^i are added to the first intermediate feature map F_1^i and the second intermediate feature map F_2^i respectively, after which the forward computation continues on the image branch and lidar branch backbone networks respectively (a minimal code sketch follows the claims).
- 5. The method according to any one of claims 1 to 4, characterized in that the total model loss function is specifically set as: L = L_cls + λ·L_reg, where λ represents a balance parameter, and the classification loss L_cls and the regression loss L_reg are respectively: L_cls = -(1/N) Σ [l_c·log(p_c) + (1 - l_c)·log(1 - p_c)], where p_c represents the predicted classification score, l_c represents the binary label of vehicle versus background, and N represents the total number of samples; L_reg = (1/N_pos) Σ_{k∈(x,y,z,w,h,d,t)} D(p_k, l_k), where (x, y, z) are the coordinates of the three-dimensional box, (w, h, d) represent the sizes of the 3D detection box in the x, y, z directions, t represents the orientation, N_pos represents the number of positive samples, D represents the smoothed L1 norm, and p_k and l_k are the offsets of the predicted and ground-truth values, respectively (a minimal code sketch follows the claims).
- 6. The method of claim 5, wherein p_k is specifically: if k ∈ (x, y, z), then p_k = (k - a_k)/a_k, where a_k is the anchor coordinate; if k ∈ (w, h, d), then p_k = log(k/a'_k), where a'_k is the anchor size (a minimal code sketch follows the claims).
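The dual-stream, point-level fusion pipeline of claim 1 can be summarized in code. Below is a minimal PyTorch sketch; the stage modules, the single-output fusion interface, and all class and argument names are simplifying assumptions, not the patent's reference implementation.

```python
# A minimal sketch of the claim-1 dual-stream point-level fusion pipeline.
import torch
import torch.nn as nn

class DualStreamBackbone(nn.Module):
    def __init__(self, image_stages, lidar_stages, fusion_modules):
        super().__init__()
        # Both branches have the same number of stages, and same-level
        # intermediate feature maps share identical dimensions (claim 1).
        self.image_stages = nn.ModuleList(image_stages)
        self.lidar_stages = nn.ModuleList(lidar_stages)
        # fusion_modules[0] plays the role of the claim-3 attention fusion;
        # later entries play the role of the claim-4 Transformer fusion.
        self.fusion_modules = nn.ModuleList(fusion_modules)

    def forward(self, image, bev):
        f_img, f_bev = image, bev
        for i, (img_stage, bev_stage, fuse) in enumerate(
                zip(self.image_stages, self.lidar_stages, self.fusion_modules)):
            f_img = img_stage(f_img)    # first intermediate feature map of stage i
            f_bev = bev_stage(f_bev)    # second intermediate feature map of stage i
            fused = fuse(f_img, f_bev)  # fused intermediate feature map of stage i
            if i == 0:
                # Stage 1: the fused map is added back to the image branch only.
                f_img = f_img + fused
            else:
                # Stage 2 onward: the fused map re-enters both branches.
                f_img, f_bev = f_img + fused, f_bev + fused
        # Final output feature maps of the two branch backbones.
        return f_img, f_bev
```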
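The attention fusion module of claim 3, as a minimal PyTorch sketch; the claim fixes only the concatenate, convolve, weight, and multiply structure, so the 1×1 convolutions and sigmoid gating here are assumptions.

```python
# A minimal sketch of the claim-3 attention fusion module.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Channel concatenation doubles the channel count; the conv maps it back.
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.attn = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),  # per-element weights in [0, 1]
        )

    def forward(self, f1, f2):  # f1 = F_1^i, f2 = F_2^i, both B x C x H x W
        p = self.reduce(torch.cat([f1, f2], dim=1))  # preliminary fused feature map
        a = self.attn(p)                             # attention map
        return p * a             # element-wise product -> fused intermediate map
```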
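The Transformer-based fusion module of claim 4, as a minimal single-head PyTorch sketch; the head count, MLP width, and projection layout are assumptions, while the token flattening, learnable position codes, the residuals Z'' = Z' + I^i and O = MLP(Z'') + Z'', and the split back into two feature maps follow the claim.

```python
# A minimal single-head sketch of the claim-4 Transformer fusion module.
import torch
import torch.nn as nn

class TransformerFusion(nn.Module):
    def __init__(self, c, h, w, mlp_ratio=4):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, 2 * h * w, c))  # learnable position codes
        self.q = nn.Linear(c, c, bias=False)  # M_q
        self.k = nn.Linear(c, c, bias=False)  # M_k
        self.v = nn.Linear(c, c, bias=False)  # M_v
        self.proj = nn.Linear(c, c)           # projection after head concatenation
        self.mlp = nn.Sequential(
            nn.Linear(c, mlp_ratio * c), nn.GELU(), nn.Linear(mlp_ratio * c, c))

    def forward(self, f1, f2):              # each: B x C x H x W
        b, c, h, w = f1.shape
        s1 = f1.flatten(2).transpose(1, 2)  # B x HW x C (sequence S_1^i)
        s2 = f2.flatten(2).transpose(1, 2)  # B x HW x C (sequence S_2^i)
        i_seq = torch.cat([s1, s2], dim=1) + self.pos  # fusion sequence I^i, B x 2HW x C
        q, k, v = self.q(i_seq), self.k(i_seq), self.v(i_seq)
        attn = torch.softmax(q @ k.transpose(-2, -1) / c ** 0.5, dim=-1)
        z = self.proj(attn @ v)             # Z' (single head, so concat is trivial)
        z2 = z + i_seq                      # Z'' = Z' + I^i
        o = self.mlp(z2) + z2               # O = MLP(Z'') + Z''
        o1, o2 = o.split(h * w, dim=1)      # invert the concatenation
        f1p = o1.transpose(1, 2).reshape(b, c, h, w)  # F_1'^i
        f2p = o2.transpose(1, 2).reshape(b, c, h, w)  # F_2'^i
        return f1 + f1p, f2 + f2p           # residual add, back to both branches
```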
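The claim-5 total loss, as a minimal PyTorch sketch; binary cross-entropy is assumed for L_cls (the form reconstructed in the claim), and the built-in smooth-L1 loss stands in for the smoothed L1 norm D.

```python
# A minimal sketch of the claim-5 total loss L = L_cls + lambda * L_reg.
import torch
import torch.nn.functional as F

def total_loss(p_c, l_c, p_k, l_k, lam=1.0):
    # p_c: predicted classification scores in [0, 1];
    # l_c: binary vehicle/background labels (N samples).
    l_cls = F.binary_cross_entropy(p_c, l_c.float())
    # p_k, l_k: predicted and ground-truth offsets over (x, y, z, w, h, d, t),
    # restricted to the N_pos positive samples; mean reduction approximates
    # the (1 / N_pos) normalization in the claim.
    l_reg = F.smooth_l1_loss(p_k, l_k)
    return l_cls + lam * l_reg
```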
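The claim-6 offset encoding, as a minimal PyTorch sketch; the tensor layout (x, y, z, w, h, d) is an assumption.

```python
# A minimal sketch of the claim-6 target encoding: center coordinates are
# normalized by the anchor coordinates, sizes are log ratios against the
# anchor sizes.
import torch

def encode_offsets(box, anchor):
    # box, anchor: (..., 6) tensors ordered (x, y, z, w, h, d)
    xyz, whd = box[..., :3], box[..., 3:]
    a_xyz, a_whd = anchor[..., :3], anchor[..., 3:]
    p_xyz = (xyz - a_xyz) / a_xyz   # k in (x, y, z): p_k = (k - a_k) / a_k
    p_whd = torch.log(whd / a_whd)  # k in (w, h, d): p_k = log(k / a'_k)
    return torch.cat([p_xyz, p_whd], dim=-1)
```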
Description
Road vehicle sensing method based on multi-sensor fusion
Technical Field
The invention belongs to the technical field of unmanned driving, and particularly relates to a road vehicle sensing method based on multi-sensor fusion.
Background
When unmanned driving technology is implemented, multi-sensor fusion is often adopted to fully perceive the global context information of the three-dimensional scene. Multi-sensor fusion, i.e., fusing information from multiple different kinds of sensors, aims to obtain more valuable features. At the same time, fusing the complementary information of different modalities can further improve the perceptibility, reliability, and robustness of the detection algorithm. Depending on the fusion strategy, multi-sensor deep fusion architectures can be broadly divided into two major categories: early fusion and late fusion.
Early fusion fuses input representations carrying the information of multiple modalities, i.e., the input representations of different sensors, often at the pixel level, before feature extraction with a deep neural network. The most common approach to such fusion is to perform a simple join operation on the input signals, which exploits the correlation and interaction between the low-level features of each modality. Late fusion instead fuses features from different sensor branches, typically at the feature level. Late fusion uses a separate subnetwork to generate a feature representation for each modality and typically integrates the encoding results of all the separate modality models using a fully-connected layer. Because the branch networks of a late fusion method can differ, each modality's data can be modeled better, offering greater flexibility.
Existing early fusion schemes perform the fusion operation at the data input layer, but it is generally considered that the information contained in the data streams of different sensors correlates only in high-level dimensions; in particular, the RGB images and point cloud data relied on by unmanned driving correlate poorly as raw data, so directly fusing the two with an early fusion scheme is difficult. In late fusion schemes, new fusion ideas continue to appear; geometry-based fusion ideas are now common, fusing the features of the two modalities with a carefully designed feature projection mechanism. Under this fusion mechanism, information is typically gathered from a local neighborhood around each feature in the projected two- or three-dimensional space. Although these methods perform better than direct feature addition or concatenation, lidar point cloud data is very sparse compared with RGB images, so the fusion is very limited, and the limitations of their architectural design prevent good performance in real, complex dynamic scenes.
The unmanned vehicle expects to obtain the complementary information of different sensors, and fully fusing the features of the two modalities is a core problem of unmanned road vehicle detection. However, a real three-dimensional scene often contains a large number of complex objects, and since different modalities usually describe the same object differently, it is often difficult to accurately fuse the information of corresponding objects into a more effective representation. Meanwhile, the amount of information in the whole three-dimensional scene is huge, and not all the features extracted by the network are effective.
Disclosure of Invention
The invention provides a road vehicle sensing method based on multi-sensor fusion, which can improve the perception accuracy of an unmanned vehicle. The invention adopts the following technical scheme: a road vehicle perception method based on multi-sensor fusion, comprising the following steps: Step 1, acquiring target detection data for training a road vehicle perception model, wherein the target detection data comprises image data acquired by a camera device and lidar point cloud data acquired by a laser scanner; Step 2, constructing and training the road vehicle perception model. The road vehicle perception model comprises a point-level feature fusion module and a region-of-interest (RoI) level feature fusion module. The point-level feature fusion module comprises an image branch backbone network and a lidar branch backbone network, wherein the image branch backbone network performs multi-scale feature extraction on the image data to obtain a plurality of first intermediate feature maps of different scales and a final output feature map of the image branch backbone network; the lidar branch backbone network performs multi-scale feature extraction on the lidar bird's-eye view to obtain a plurality of second intermediate feature maps of different scales and a final output feature map of the lidar branch backbone network.
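The Background above characterizes early fusion as a simple pixel-level join of the raw sensor inputs before any feature extraction. A minimal sketch under that reading, assuming the lidar points have already been projected to a one-channel depth map at the image resolution (the projection itself is not shown and is not part of the patent):

```python
# A minimal sketch of pixel-level early fusion by channel concatenation.
import torch

def early_fusion(rgb, lidar_depth):
    # rgb: B x 3 x H x W; lidar_depth: B x 1 x H x W (projection assumed)
    return torch.cat([rgb, lidar_depth], dim=1)  # B x 4 x H x W joint input
```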