CN-117351443-B - Pure visual target detection method based on image-pseudo point cloud feature fusion

CN 117351443 B

Abstract

The invention discloses a pure visual target detection method based on image-pseudo point cloud feature fusion, belonging to the technical field of automatic driving. The method comprises five steps: (1) training an image-based pure visual three-dimensional target detection model; (2) training a three-dimensional target detection model based on LiDAR point clouds; (3) designing a masked autoencoder (encoder-decoder) to generate pseudo point cloud features; (4) adaptively fusing the image features and the pseudo point cloud features; and (5) fine-tuning to obtain the pure visual target detection model based on pseudo feature fusion. Under the proposed pre-training scheme, the corresponding LiDAR point cloud data is needed only during model training; at inference, high-quality pseudo point cloud features can be generated from the camera images alone, so that cross-modal fusion of the image features and the pseudo point cloud features improves performance over the original pure vision model.
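The inference-time data flow summarized above (image features, pseudo point cloud features generated from them, fusion, detection head) can be sketched as a toy pipeline. Every network below is stood in by a random linear map; all shapes and names are hypothetical illustrations, not the patented networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the networks in the pipeline (weights are random;
# this only illustrates the data flow, not the patented architecture).
H, W, C = 8, 8, 16
W_img = rng.standard_normal((3, C))        # "image-based BEV feature generation"
W_dec = rng.standard_normal((C, C))        # "pseudo point cloud feature generation"
W_fus = rng.standard_normal((2 * C, C))    # fusion layer after concatenation
W_head = rng.standard_normal((C, 7))       # detection head: (x, y, z, l, w, h, yaw)

def detect(image):
    """image: (H, W, 3) camera input, already projected to a BEV grid."""
    f_img = image @ W_img                   # image-modality BEV feature
    f_pseudo = f_img @ W_dec                # pseudo point cloud feature
    f_fus = np.concatenate([f_img, f_pseudo], axis=-1) @ W_fus
    return f_fus @ W_head                   # per-cell box regression

boxes = detect(rng.standard_normal((H, W, 3)))
print(boxes.shape)  # (8, 8, 7)
```

Note that the point cloud branch appears nowhere at inference; only the pseudo point cloud generator, trained against it, survives into the deployed model.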

Inventors

  • DING YONG
  • HAN HAO
  • HONG YU
  • CHENG HUAYUAN
  • HE LENIAN

Assignees

  • Zhejiang University (浙江大学)

Dates

Publication Date
2026-05-12
Application Date
2023-09-20

Claims (10)

  1. A pure visual target detection method based on image-pseudo point cloud feature fusion, characterized by comprising the following steps:
Step (1): acquiring surround-view image data collected by a vehicle-mounted camera, performing feature extraction and projection with an image-based BEV feature generation network to generate a bird's-eye-view (BEV) feature of the image modality, and using an image target detection output head to perform the image-based target detection task, so as to pre-train a first model composed of the image-based BEV feature generation network and the image target detection output head;
Step (2): acquiring point cloud data collected by a LiDAR, after data preprocessing generating a BEV feature of the point cloud modality with a point-cloud-based BEV feature generation network, using a point cloud target detection output head to perform the point-cloud-based target detection task, and pre-training a second model composed of the point-cloud-based BEV feature generation network and the point cloud target detection output head;
Step (3): masking, at a set ratio with a random mask generator, the image-modality BEV feature generated by the first model pre-trained in step (1), feeding the result into an encoder that outputs the encoding of the unmasked part, designing a learnable feature vector to serve as the encoding of the masked part, and feeding both jointly into a decoder to obtain the pseudo point cloud feature;
Step (4): performing fine-tuning training on a pure visual target detection model composed of a pseudo point cloud feature generation network, a multi-modal feature fusion network, a fusion target output head, and the image-based BEV feature generation network of the pre-trained first model, and obtaining, after training, the pure visual target detection model based on pseudo feature fusion;
Step (5): taking the surround-view image collected by the vehicle-mounted camera as input, generating a target detection result with the pure visual target detection model based on pseudo feature fusion.
  2. The pure visual target detection method based on image-pseudo point cloud feature fusion according to claim 1, wherein the image-based BEV feature generation network adopts a two-dimensional neural network.
  3. The pure visual target detection method based on image-pseudo point cloud feature fusion according to claim 1, wherein in step (2) the second model further comprises a voxel feature extraction network that preliminarily encodes the irregular point cloud data into voxel features, thereby implementing the data preprocessing of the point cloud data collected by the LiDAR.
  4. The pure visual target detection method based on image-pseudo point cloud feature fusion according to claim 1, wherein the point-cloud-based BEV feature generation network adopts a three-dimensional neural network.
  5. The pure visual target detection method based on image-pseudo point cloud feature fusion according to claim 1, wherein in step (3) the mask ratio is 50%-75%.
  6. The pure visual target detection method based on image-pseudo point cloud feature fusion according to claim 1, wherein in step (3) the encoder and the decoder adopt a Transformer structure.
  7. The pure visual target detection method based on image-pseudo point cloud feature fusion according to claim 6, wherein in step (3) the encoder and the decoder follow an asymmetric lightweight design, with a 3:1 ratio of encoder layers to decoder layers.
  8. The pure visual target detection method based on image-pseudo point cloud feature fusion according to claim 1, wherein in step (3) the learnable feature vector has the same dimension as the image-modality BEV feature.
  9. The pure visual target detection method based on image-pseudo point cloud feature fusion according to claim 1, wherein step (4) comprises:
Step (4.1): combining the image-based BEV feature generation network of the pre-trained first model with the pseudo point cloud feature generation network, and designing a multi-modal feature fusion network and a fusion target output head, the multi-modal feature fusion network comprising a splicing (concatenation) layer and a convolutional neural network layer;
Step (4.2): taking the surround-view image collected by the vehicle-mounted camera as input, extracting features with the image-based BEV feature generation network of the pre-trained first model, and projecting to generate the image-modality BEV feature;
Step (4.3): taking the image-modality BEV feature obtained in step (4.2) as input, generating the pseudo point cloud feature with the pseudo point cloud feature generation network;
Step (4.4): taking the image-modality BEV feature obtained in step (4.2) and the pseudo point cloud feature obtained in step (4.3) as inputs, splicing them with the splicing layer of the multi-modal feature fusion network, and generating an adaptive fusion feature with the convolutional neural network layer;
Step (4.5): taking the adaptive fusion feature obtained in step (4.4) as input, generating a target detection result with the fusion target output head;
Step (4.6): taking the ground-truth target annotations of the surround-view image collected in step (4.2) as labels, training end-to-end the image-based BEV feature generation network of the pre-trained first model, the pseudo point cloud feature generation network, the multi-modal feature fusion network and the fusion target output head, and fine-tuning the parameters to obtain the pure visual target detection model based on pseudo feature fusion.
  10. The pure visual target detection method based on image-pseudo point cloud feature fusion according to claim 1, wherein the image target detection output head, the point cloud target detection output head and the fusion target output head are three independent multi-layer perceptrons.
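The random masking step described in claims 5-8 (a 50%-75% mask ratio, with a learnable vector standing in for the masked positions before decoding) can be sketched as follows. This is an illustrative reconstruction, not the patented implementation: the learnable mask token is shown as a zero vector, and all names are hypothetical:

```python
import numpy as np

def mask_bev_features(bev, mask_ratio=0.75, seed=0):
    """Randomly mask a fraction of BEV feature cells.

    bev: (N, C) array of flattened bird's-eye-view feature vectors.
    Returns the visible features (encoder input), the boolean keep-mask,
    and the decoder input with a shared "mask token" at masked positions.
    """
    rng = np.random.default_rng(seed)
    n = bev.shape[0]
    n_masked = int(round(n * mask_ratio))
    perm = rng.permutation(n)
    keep = np.ones(n, dtype=bool)
    keep[perm[:n_masked]] = False          # mask n_masked random cells

    mask_token = np.zeros(bev.shape[1])    # learnable vector in the real model
    visible = bev[keep]                    # only these reach the encoder
    # Decoder input: visible features re-inserted alongside the shared
    # mask token at every masked position.
    decoder_in = np.where(keep[:, None], bev, mask_token)
    return visible, keep, decoder_in

# Toy example: 200 BEV cells with 64 channels, 75% masked.
bev = np.random.default_rng(1).standard_normal((200, 64))
visible, keep, decoder_in = mask_bev_features(bev, mask_ratio=0.75)
print(visible.shape)   # (50, 64): only 25% of cells reach the encoder
```

The asymmetry of claim 7 (three encoder layers per decoder layer) pays off here: the heavier encoder sees only the 25%-50% of unmasked cells, keeping pre-training cheap.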

Description

Pure visual target detection method based on image-pseudo point cloud feature fusion

Technical Field

The invention belongs to the technical field of automatic driving, and particularly relates to a pure visual target detection method based on image-pseudo point cloud feature fusion.

Background

Current autonomous driving sensor suites mainly comprise LiDAR and cameras, and target detection algorithms fall into three families: schemes based on LiDAR point clouds, schemes based on camera images, and schemes based on multi-modal fusion. The point cloud and multi-modal fusion schemes perform excellently on three-dimensional target detection because they have access to spatial prior information. In contrast, pure-vision three-dimensional object detection models lag far behind, chiefly because images lack true depth information, which hinders reconstruction of the three-dimensional scene; pure visual three-dimensional object detection is therefore a very challenging task. Although LiDAR provides high-quality point cloud data, its high cost and susceptibility to environmental factors mean that cameras remain indispensable as general-purpose in-vehicle sensors. Compared with a LiDAR sensor, a camera has the unique advantages of low price, rich color information, dense perception and easy deployment, and holds substantial commercial value for the industrial deployment and popularization of autonomous driving; image-based three-dimensional target detection research therefore remains valued despite its relatively low performance.
Given the natural advantage of LiDAR point clouds over images in three-dimensional object detection, a LiDAR-based model can serve as a teacher to guide a pure visual three-dimensional object detection model. The usual approach to such cross-modal knowledge distillation (Knowledge Distillation) is to have the student imitate the teacher in the feature dimension, hoping to obtain a better feature map and thereby improve performance; however, because images naturally lack depth information and the gap between modalities is large, effective high-dimensional features are often hard to learn this way. How a pure visual model can, from image input alone, effectively learn the high-quality scene features provided by a LiDAR point cloud model and reasonably exploit them to improve final performance therefore remains an open technical problem.

Disclosure of the Invention

The invention provides a pure visual target detection method based on image-pseudo point cloud feature fusion. Through an effective pre-training scheme, the LiDAR point cloud model supplies a high-quality feature map only during the pre-training stage, enabling the pure-vision three-dimensional target detection model to learn pseudo point cloud features that carry the spatial information specific to point clouds; multi-modal feature fusion then combines the image-modality features with the pseudo point cloud features, yielding a marked performance improvement over the original pure vision model.
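The fusion described for the multi-modal feature fusion network (a splicing layer followed by a convolutional layer, per claim 9) can be sketched as channel-wise concatenation followed by a 1x1 convolution, written here as a per-cell linear projection. All shapes, weights and names are hypothetical illustrations:

```python
import numpy as np

def fuse_features(f_img_bev, f_pseudo, w, b):
    """Adaptive fusion: concatenate the two BEV feature maps along the
    channel axis, then mix channels with a 1x1 convolution (equivalent,
    per cell, to a linear projection).

    f_img_bev, f_pseudo: (H, W, C) BEV feature maps of the two branches.
    w: (2C, C_out) projection weights, b: (C_out,) bias.
    """
    f_cat = np.concatenate([f_img_bev, f_pseudo], axis=-1)   # splicing layer
    return f_cat @ w + b                                      # 1x1 conv layer

H, W, C, C_out = 8, 8, 16, 16
rng = np.random.default_rng(0)
f_fus = fuse_features(rng.standard_normal((H, W, C)),
                      rng.standard_normal((H, W, C)),
                      rng.standard_normal((2 * C, C_out)),
                      np.zeros(C_out))
print(f_fus.shape)  # (8, 8, 16)
```

Because the projection sees both modalities at every cell, it can learn per-channel weightings between image evidence and pseudo point cloud evidence, which is what the claims call adaptive fusion.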
The technical scheme adopted by the invention is as follows. The variable subscripts "img" and "pc" distinguish the Image and Point Cloud branches, the subscript "bev" denotes the bird's-eye-view (Bird's-Eye View) form, the subscript "pseudo" denotes a pseudo point cloud feature, and the subscript "fus" denotes a cross-modal fusion (Fusion) feature.

A pure visual target detection method based on image-pseudo point cloud feature fusion comprises the following steps:

Step (1): acquiring surround-view image data collected by a vehicle-mounted camera, performing feature extraction and projection with an image-based BEV feature generation network to generate a BEV feature of the image modality, and using an image target detection output head to perform the image-based target detection task, so as to pre-train a first model composed of the image-based BEV feature generation network and the image target detection output head;

Step (2): acquiring point cloud data collected by a LiDAR, after data preprocessing generating a BEV feature of the point cloud modality with a point-cloud-based BEV feature generation network, using a point cloud target detection output head to perform the point-cloud-based target detection task, and pre-training a second model composed of the point-cloud-based BEV feature generation network and the point cloud target detection output head; Inputt