CN-116958922-B - Interpretable multi-modal sensing method for intelligent driving in poor-illumination scenes
Abstract
The application provides an interpretable multi-modal sensing method for intelligent driving in poor-illumination scenes, relating to the technical field of intelligent driving. The method comprises: acquiring an RGB image and 3D point cloud data of a target scene; compressing the 3D point cloud data with a pre-trained first source coding model to obtain compressed 3D point cloud data; compressing the RGB image with a pre-trained second source coding model to obtain a compressed RGB image; fusing the compressed 3D point cloud data and the compressed RGB image with a pre-trained multi-modal fusion model based on a multi-head attention mechanism to obtain a fusion feature; adding the fusion feature to the first point cloud feature of the compressed 3D point cloud data to obtain a second point cloud feature; and processing the second point cloud feature with a three-dimensional detection head to obtain a target detection result. The application improves the accuracy of target detection in special scenes such as vehicle occlusion and abrupt illumination change.
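To make the claimed data flow concrete, here is a minimal PyTorch sketch of the pipeline described in the abstract. Every module body (the two source coding models, the fusion step, the detection head) is a placeholder standing in for the patent's networks, and all dimensions are illustrative assumptions.

```python
# Minimal sketch of the abstract's pipeline. The module internals are
# placeholders (assumptions), not the patent's actual networks.
import torch
import torch.nn as nn

class PerceptionPipeline(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.pc_encoder = nn.Linear(4, dim)    # stand-in for the first source coding model
        self.rgb_encoder = nn.Linear(3, dim)   # stand-in for the second source coding model
        self.fusion = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.det_head = nn.Linear(dim, 7)      # stand-in for the 3D detection head (box parameters)

    def forward(self, points, pixels):
        pc_feat = self.pc_encoder(points)      # compressed point cloud -> first point cloud feature
        img_feat = self.rgb_encoder(pixels)    # compressed RGB image feature
        fused, _ = self.fusion(pc_feat, img_feat, img_feat)  # multi-head attention fusion
        second_pc_feat = fused + pc_feat       # fusion feature + first point cloud feature
        return self.det_head(second_pc_feat)   # target detection result

pipe = PerceptionPipeline()
boxes = pipe(torch.randn(2, 1024, 4), torch.randn(2, 4096, 3))
print(boxes.shape)                             # torch.Size([2, 1024, 7])
```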
Inventors
- Zhang Xinyu
- Shen Sitian
- Li Jun
- Zhang Shiyan
- Guo Jilong
- Wu Fan
Assignees
- Tsinghua University
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2023-06-27
Claims (10)
- 1. An interpretable multi-modal sensing method for intelligent driving in a poor-illumination scene, comprising: acquiring an RGB image and 3D point cloud data of a target scene; compressing the 3D point cloud data by using a pre-trained first source coding model to obtain compressed 3D point cloud data; compressing the RGB image by using a pre-trained second source coding model to obtain a compressed RGB image; fusing the compressed 3D point cloud data and the compressed RGB image by using a pre-trained multi-modal fusion model based on a multi-head attention mechanism to obtain a fusion feature, which comprises: processing the compressed 3D point cloud data by using a point cloud feature extraction module to obtain a first point cloud feature; processing the first point cloud feature by using a first normalization layer to obtain a normalized first point cloud feature; concatenating, by a first concatenation unit, the normalized first point cloud feature with the normalized image feature to obtain a first concatenated feature, and concatenating the normalized image feature with the normalized first point cloud feature to obtain a second concatenated feature; processing the normalized first point cloud feature, the first concatenated feature and the second concatenated feature by using a multi-head attention processing unit to obtain a first attention feature matrix A1 and a second attention feature matrix A2; processing the first attention feature matrix A1 by using a first extraction branch to obtain a first attention feature map in which semantic information of different levels is superimposed, and processing the second attention feature matrix A2 by using a second extraction branch to obtain a second attention feature map in which semantic information of different levels is superimposed; processing the two superimposed attention feature maps by using a second concatenation unit to obtain a local fusion feature map; and processing the local fusion feature map by using a fully connected layer to obtain the final fusion feature; adding the fusion feature to the first point cloud feature of the compressed 3D point cloud data to obtain a second point cloud feature; and processing the second point cloud feature by using a three-dimensional detection head to obtain a target detection result (a runnable sketch of this fusion procedure is given after the claims).
- 2. The method of claim 1, wherein the first source coding model comprises two parallel processing branches and a feature pyramid network, each processing branch being connected with the feature pyramid network; the two parallel processing branches comprise a first processing branch and a second processing branch, each comprising a first Block and an axial attention mechanism module connected to each other, with a dilated convolution module arranged between two adjacent first Blocks; the first Block is formed by stacking a plurality of groups of convolution layers and normalization layers, and the dilated convolution module is formed by connecting 8 convolution layers with different dilation rates; and compressing the 3D point cloud data by using the pre-trained first source coding model to obtain the compressed 3D point cloud data comprises: processing the 3D point cloud data by using the first Block of the first processing branch to obtain a feature map, and inputting the feature map into the axial attention mechanism module of the first processing branch and the dilated convolution module respectively; processing the input feature map by using the axial attention mechanism module of the first processing branch to obtain a first attention feature map; processing the input feature map by using the dilated convolution module to obtain local feature maps at scales different from the input feature map; processing the local feature maps by using the second processing branch to obtain a second attention feature map; and fusing the first attention feature map and the second attention feature map by using the feature pyramid network to obtain the compressed 3D point cloud data.
- 3. The method of claim 2, wherein the second source coding model comprises four parallel processing branches and a feature pyramid network, each processing branch being connected with the feature pyramid network; each processing branch comprises a second Block and an axial attention mechanism module connected to each other, with a dilated convolution module arranged between every two adjacent second Blocks; the second Block is formed by stacking a plurality of groups of convolution layers and normalization layers, and the dilated convolution module is formed by connecting 8 convolution layers with different dilation rates.
- 4. The method of claim 3, wherein the multi-modal fusion model based on the multi-head attention mechanism comprises a point cloud feature extraction module, an image feature extraction module and a fusion module; the point cloud feature extraction module comprises a convolution layer and a pooling layer, and the image feature extraction module comprises a convolution layer and a pooling layer; the fusion module comprises a first normalization layer, a second normalization layer, a first concatenation unit, a multi-head attention processing unit, a first extraction branch, a second extraction branch, a second concatenation unit and a fully connected layer; the first normalization layer is connected with the point cloud feature extraction module, the second normalization layer is connected with the image feature extraction module, the first concatenation unit is connected with the first normalization layer and the second normalization layer respectively, the multi-head attention processing unit is connected with the first normalization layer and the first concatenation unit respectively, and the first extraction branch and the second extraction branch are respectively arranged between the multi-head attention processing unit and the second concatenation unit.
- 5. The method of claim 4, wherein processing the normalized first point cloud feature, the first concatenated feature and the second concatenated feature by using the multi-head attention processing unit to obtain the first attention feature matrix A1 and the second attention feature matrix A2 comprises: multiplying the normalized first point cloud feature matrix by a weight matrix W^K and a weight matrix W^V respectively to obtain a matrix K and a matrix V; multiplying the first concatenated feature matrix by a weight matrix W^Q to obtain a matrix Q1, and multiplying the second concatenated feature matrix by the weight matrix W^Q to obtain a matrix Q2, wherein the weight matrices W^Q, W^K and W^V are parameters obtained through training; computing the first attention feature matrix A1 as: A1 = softmax(Q1·Kᵀ/√d_k)·V, wherein d_k is the number of channels, Q1 and Q2 are of size N×d_k, and K and V are of size N×d_k, N being the number of rows of the normalized first point cloud feature matrix; and computing the second attention feature matrix A2 as: A2 = softmax(Q2·Kᵀ/√d_k)·V.
- 6. The method of claim 5, wherein the first extraction branch comprises a third normalization layer, a first multi-layer perceptron and a first adder connected in sequence; and processing the first attention feature matrix A1 by using the first extraction branch to obtain the first attention feature map in which semantic information of different levels is superimposed comprises: processing the first attention feature matrix A1 by using the third normalization layer to obtain a normalized first attention feature map B1; processing the normalized first attention feature map B1 by using the first multi-layer perceptron to obtain a feature map C1 carrying semantic information of a level different from that of the first attention feature matrix A1; and adding, by the first adder, the feature map corresponding to the first attention feature matrix A1 and the feature map C1 to obtain the first attention feature map in which semantic information of different levels is superimposed.
- 7. The method of claim 6, wherein the second extraction branch comprises a fourth normalization layer, a second multi-layer perceptron and a second adder connected in sequence; and processing the second attention feature matrix A2 by using the second extraction branch to obtain the second attention feature map in which semantic information of different levels is superimposed comprises: processing the second attention feature matrix A2 by using the fourth normalization layer to obtain a normalized second attention feature map B2; processing the normalized second attention feature map B2 by using the second multi-layer perceptron to obtain a feature map C2 carrying semantic information of a level different from that of the second attention feature matrix A2; and adding, by the second adder, the feature map corresponding to the second attention feature matrix A2 and the feature map C2 to obtain the second attention feature map in which semantic information of different levels is superimposed.
- 8. The method of claim 7, further comprising: acquiring a plurality of training sample combinations, each comprising a plurality of spatio-temporally matched camera image samples and 3D point cloud data samples, wherein real boxes of a plurality of targets are annotated on the 3D point cloud data samples; compressing the 3D point cloud data samples by using the first source coding model to obtain compressed 3D point cloud data samples; compressing the RGB image samples by using the second source coding model to obtain compressed RGB image samples; fusing the compressed 3D point cloud data samples and the compressed RGB image samples by using the multi-modal fusion model based on the multi-head attention mechanism to obtain fusion feature samples; adding the fusion feature samples to the first point cloud features of the compressed 3D point cloud data samples to obtain second point cloud feature samples; processing the second point cloud feature samples by using the three-dimensional detection head to obtain target prediction boxes; calculating a first loss function based on the prediction boxes and the real boxes of the targets; calculating the information entropy changes of the two first Blocks and the four second Blocks respectively, and thereby calculating the variance of the information entropy changes as a second loss function; calculating a weighted sum of the first loss function and the second loss function as a total loss function value; and updating model parameters of the first source coding model, the second source coding model and the multi-modal fusion model based on the multi-head attention mechanism according to the total loss function value (a sketch of this training loss is given after the claims).
- 9. The method of claim 8, wherein calculating the information entropy changes of the two first Blocks and the four second Blocks respectively, and thereby calculating the variance of the information entropy changes as the second loss function, comprises: denoting the information entropy changes of the two first Blocks and the four second Blocks as ΔH1, ΔH2, …, ΔH6; the average information entropy change ΔH̄ is then: ΔH̄ = (1/6)·(ΔH1 + ΔH2 + … + ΔH6); the variance σ² of the information entropy changes is: σ² = (1/6)·((ΔH1 − ΔH̄)² + (ΔH2 − ΔH̄)² + … + (ΔH6 − ΔH̄)²); and the value of σ² is taken as the second loss function.
- 10. An interpretable multi-modal sensing device for intelligent driving in a poor-illumination scene, comprising: an acquisition unit configured to acquire an RGB image and 3D point cloud data of a target scene; an encoding unit configured to compress the 3D point cloud data by using a pre-trained first source coding model to obtain compressed 3D point cloud data, and to compress the RGB image by using a pre-trained second source coding model to obtain a compressed RGB image; a fusion unit configured to fuse the compressed 3D point cloud data and the compressed RGB image by using a pre-trained multi-modal fusion model based on a multi-head attention mechanism to obtain a fusion feature; a processing unit configured to add the fusion feature to the first point cloud feature of the compressed 3D point cloud data to obtain a second point cloud feature; and a target detection unit configured to process the second point cloud feature by using a three-dimensional detection head to obtain a target detection result; wherein the fusion unit is specifically configured to: process the compressed 3D point cloud data by using a point cloud feature extraction module to obtain a first point cloud feature; process the first point cloud feature by using a first normalization layer to obtain a normalized first point cloud feature; concatenate, by a first concatenation unit, the normalized first point cloud feature with the normalized image feature to obtain a first concatenated feature, and concatenate the normalized image feature with the normalized first point cloud feature to obtain a second concatenated feature; process the normalized first point cloud feature, the first concatenated feature and the second concatenated feature by using a multi-head attention processing unit to obtain a first attention feature matrix A1 and a second attention feature matrix A2; process the first attention feature matrix A1 by using a first extraction branch to obtain a first attention feature map in which semantic information of different levels is superimposed, and process the second attention feature matrix A2 by using a second extraction branch to obtain a second attention feature map in which semantic information of different levels is superimposed; process the two superimposed attention feature maps by using a second concatenation unit to obtain a local fusion feature map; and process the local fusion feature map by using a fully connected layer to obtain the final fusion feature.
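For claims 1 and 4-7, the following PyTorch sketch wires the fusion module end to end. It follows the reconstruction used in claim 5: "splicing" is read as channel-wise concatenation, keys and values come from the normalized point cloud feature, and the two concatenation orders supply the two queries through a shared W^Q. All of these readings, and every dimension, are assumptions rather than the patent's verified implementation.

```python
# Sketch of the fusion module of claims 1 and 4-7 (all sizes and the Q/K/V
# assignment are reconstructions/assumptions, not the patent's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionModule(nn.Module):
    def __init__(self, dim=256, hidden=512):
        super().__init__()
        self.norm_pc = nn.LayerNorm(dim)                # first normalization layer
        self.norm_img = nn.LayerNorm(dim)               # second normalization layer
        self.w_q = nn.Linear(2 * dim, dim, bias=False)  # shared W^Q over concatenated features
        self.w_k = nn.Linear(dim, dim, bias=False)      # W^K
        self.w_v = nn.Linear(dim, dim, bias=False)      # W^V
        # first/second extraction branches (claims 6-7): normalization -> MLP, plus an adder
        self.branch1 = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.branch2 = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.fc = nn.Linear(2 * dim, dim)               # fully connected layer

    def attend(self, q, k, v):
        # A = softmax(Q K^T / sqrt(d_k)) V, the claim-5 form
        return F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1) @ v

    def forward(self, pc_feat, img_feat):               # both (B, N, dim)
        p, i = self.norm_pc(pc_feat), self.norm_img(img_feat)
        splice1 = torch.cat([p, i], dim=-1)             # first concatenated feature
        splice2 = torch.cat([i, p], dim=-1)             # second concatenated feature
        k, v = self.w_k(p), self.w_v(p)
        a1 = self.attend(self.w_q(splice1), k, v)       # first attention feature matrix A1
        a2 = self.attend(self.w_q(splice2), k, v)       # second attention feature matrix A2
        f1 = a1 + self.branch1(a1)                      # superimpose different-level semantics
        f2 = a2 + self.branch2(a2)
        local = torch.cat([f1, f2], dim=-1)             # second concatenation -> local fusion map
        return self.fc(local)                           # final fusion feature (B, N, dim)

fusion = FusionModule()
pc = torch.randn(2, 1024, 256)                          # first point cloud feature (assumed shape)
img = torch.randn(2, 1024, 256)                         # image feature on the same grid (assumed)
second_pc = fusion(pc, img) + pc                        # second point cloud feature (claim 1)
```

Per claim 1, the returned final fusion feature is added element-wise to the first point cloud feature to form the second point cloud feature fed to the 3D detection head.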
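The training objective of claims 8-9 can likewise be sketched. The claims fix only the variance form of the second loss and the weighted sum with the detection loss; the entropy estimator and the weight `lam` below are assumptions introduced for illustration.

```python
# Sketch of the losses in claims 8-9; entropy estimator and `lam` are assumptions.
import torch

def entropy(feat: torch.Tensor) -> torch.Tensor:
    """Crude differentiable entropy estimate of a feature map (assumption)."""
    p = torch.softmax(feat.flatten(), dim=0)
    return -(p * torch.log(p + 1e-12)).sum()

def second_loss(block_inputs, block_outputs):
    """Variance of the per-Block information entropy changes (claim 9)."""
    # six Blocks in total: two first Blocks + four second Blocks
    deltas = torch.stack([entropy(o) - entropy(x)
                          for x, o in zip(block_inputs, block_outputs)])
    return ((deltas - deltas.mean()) ** 2).mean()   # sigma^2

def total_loss(first_loss, block_inputs, block_outputs, lam=0.1):
    """Weighted sum of the detection loss and the entropy-variance loss (claim 8)."""
    return first_loss + lam * second_loss(block_inputs, block_outputs)

# toy usage with random feature maps standing in for Block inputs/outputs
ins = [torch.randn(8, 16, 16) for _ in range(6)]
outs = [torch.randn(8, 16, 16) for _ in range(6)]
print(total_loss(torch.tensor(1.0), ins, outs))
```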
Description
Interpretable multi-modal sensing method for intelligent driving in poor-illumination scenes

Technical Field

The application relates to the technical field of intelligent driving, and in particular to an interpretable multi-modal sensing method for intelligent driving in poor-illumination scenes.

Background

At present, single-modality perception algorithms applied to intelligent driving are often limited by sensor performance and cannot meet the perception requirements of intelligent driving vehicles in poor-illumination scenes. Existing solutions mainly adopt multi-modal fusion: complementary features from different modalities compensate for the feature loss of any single modality in scenes such as poor illumination, thereby overcoming the impact of insufficient lighting on a single sensor. Current multi-modal fusion techniques have the following drawbacks: (1) Traditional fusion methods generally fuse at the result level, and it is difficult to match detections when the number or category of targets differs across modalities, causing missed and false detections and endangering perception safety. (2) Existing multi-modal perception models are generally based on deep learning algorithms and perform well on some perception tasks. However, such models are often designed from experimental results, with parameters optimized by fitting large amounts of data; they suffer from poor interpretability, their underlying perception mechanisms are difficult to explain, they risk overfitting to specific scenes, and correct perception cannot be guaranteed in special scenes such as occlusion and abrupt light changes. (3) Traditional deep learning networks make it difficult to assess the credibility of detection results and fall seriously short of the perception-safety requirements of complex dynamic environments. In addition, most multi-modal fusion models use the accuracy of the perception result as the main evaluation index; the reliability of the model during real-time perceptual interaction with the external environment cannot be guaranteed, and the generalization ability and credibility of the model's perception process are hard to evaluate.

Disclosure of Invention

In view of the above, the present application provides an interpretable multi-modal sensing method for intelligent driving in poor-illumination scenes to solve the above technical problems.
In a first aspect, an embodiment of the present application provides a multi-modal sensing method in a poor-illumination scene, including: acquiring an RGB image and 3D point cloud data of a target scene; compressing the 3D point cloud data by using a pre-trained first source coding model to obtain compressed 3D point cloud data; compressing the RGB image by using a pre-trained second source coding model to obtain a compressed RGB image; fusing the compressed 3D point cloud data and the compressed RGB image by using a pre-trained multi-modal fusion model based on a multi-head attention mechanism to obtain a fusion feature; adding the fusion feature to the first point cloud feature of the compressed 3D point cloud data to obtain a second point cloud feature; and processing the second point cloud feature by using a three-dimensional detection head to obtain a target detection result. The first source coding model comprises two parallel processing branches and a feature pyramid network, each processing branch being connected with the feature pyramid network; the parallel processing branches comprise a first processing branch and a second processing branch, each comprising a first Block and an axial attention mechanism module connected to each other, with a dilated convolution module arranged between two adjacent first Blocks. Compressing the 3D point cloud data by using the pre-trained first source coding model to obtain the compressed 3D point cloud data comprises: processing the 3D point cloud data by using the first Block of the first processing branch to obtain a feature map, and inputting the feature map into the axial attention mechanism module of the first processing branch and the dilated convolution module respectively; processing the input feature map by using the axial attention mechanism module of the first processing branch to obtain a first attention feature map; processing the input feature map by using the dilated convolution module to obtain local feature maps at scales different from the input feature map, as sketched below; processing the local feature maps by using the second processing branch to obtain a second attention feature map; and fusing the first attention feature map and the second attention feature map by using the feature pyramid network to obtain the compressed 3D point cloud data.
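As a concrete reading of the dilated convolution module above ("8 convolution layers with different expansion rates", i.e. dilation rates), here is a minimal sketch. Whether the eight layers run in sequence or in parallel is not recoverable from the text; a sequential stack is assumed, and the channel count and the specific rates are illustrative.

```python
# Sketch of the dilated convolution module: 8 stacked 3x3 convolutions with
# distinct dilation rates (sequential stacking and the rates are assumptions).
import torch
import torch.nn as nn

class DilatedConvModule(nn.Module):
    def __init__(self, channels=64, rates=(1, 2, 3, 5, 7, 9, 12, 15)):
        super().__init__()
        self.convs = nn.Sequential(*[
            # padding = dilation keeps H and W unchanged for a 3x3 kernel
            nn.Conv2d(channels, channels, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])

    def forward(self, x):
        # aggregates local context at 8 different receptive-field scales
        return self.convs(x)

m = DilatedConvModule()
print(m(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```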