CN-121982462-A - Heterogeneous data fusion method, device and medium for automatic driving scene perception
Abstract
The invention discloses a heterogeneous data fusion method, device and medium for automatic driving scene perception. Multi-source data are first enhanced through bidirectional prior interaction and then uniformly mapped to a bird's-eye-view (BEV) space. A spatio-temporal adaptive gating mechanism dynamically adjusts the weights of the multi-modal features, so that the model adaptively fuses geometric and semantic information across different time steps and spatial locations. A two-stage temporal diffusion model is then constructed: in the first stage, a lightweight diffusion model, conditioned on a downsampled version of the fused features, generates a coarse BEV sketch semantic representation that captures the macroscopic scene layout; in the second stage, the upsampled BEV sketch representation is concatenated with the original fused features to form enhanced condition features that guide a main diffusion model to generate a fine BEV perception result. The method effectively fuses heterogeneous sensor data, overcomes the data-mismatch problem of traditional methods, and markedly improves the accuracy and robustness of the automatic driving perception system.
Inventors
- PAN CONG
- JIN WENWEI
Assignees
- Nanjing University of Aeronautics and Astronautics (南京航空航天大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-09
Claims (8)
- 1. A heterogeneous data fusion method for automatic driving scene perception, characterized by comprising the following steps (an overall pipeline is illustrated in sketch 1 following the claims): S0, extracting image features $F_I$ and point cloud features $F_P$ with a dual-branch network, wherein Swin-T is used as the image backbone and VoxNet as the point cloud backbone; S1, applying bidirectional prior interaction (Bi-PI) to the image features $F_I$ and point cloud features $F_P$ to perform bidirectional feature enhancement; S2, mapping the enhanced image features $F_I'$ and point cloud features $F_P'$ to a shared BEV space through view conversion and flattening operations, respectively, to obtain image BEV features $B_I$ and point cloud BEV features $B_P$; S3, concatenating the image BEV features $B_I$, the point cloud BEV features $B_P$ and the time-step encoding $\tau_t$; S4, encoding the BEV map ground truth with a variational auto-encoder (VAE) encoder $\varepsilon$ and using the resulting latent $z_0$ as the original input of the diffusion model; S5, constructing a coarse-to-fine two-stage temporal diffusion model, wherein the first stage generates a coarse BEV sketch semantic representation $\hat{s}_0$ through a lightweight diffusion model conditioned on a downsampled version of the fused features so as to capture the macroscopic scene layout, and the second stage upsamples $\hat{s}_0$ and concatenates it with the original fused features to form enhanced condition features $y'$ that guide the main diffusion model to generate a fine BEV perception result $\hat{z}_0$; S6, obtaining the denoised BEV map from the diffusion model output using the VAE decoder $\mathcal{D}$; S7, jointly optimizing the network parameters with a back-propagation algorithm and stochastic gradient descent through a multi-task loss comprising the reconstruction loss $\mathcal{L}_{rec}$, the sketch generation loss $\mathcal{L}_{sketch}$ and the hidden-space cross-modal alignment loss $\mathcal{L}_{align}$; S8, in the inference stage, directly using the BEV features fused from the multi-source heterogeneous data as the conditional feature representation, feeding random noise into the two-stage diffusion model to obtain the final denoised BEV map, and performing a semantic segmentation task on the obtained BEV map to obtain the final segmentation accuracy.
- 2. The heterogeneous data fusion method for automatic driving scene perception of claim 1, wherein the Bi-PI module comprises an image prior enhancement module (IPE) and a voxel prior enhancement module (VPE), as illustrated in sketch 2 following the claims. In the IPE module, the point cloud 3D coordinates $P_{3D}$, the camera intrinsics $I$ and the camera extrinsics $E$ are used to project the point cloud features onto the image coordinate system through a mapping layer $M$, generating depth features $F_D$; the process is defined as $F_D = M(P_{3D}, I, E)$. A fully connected layer $\phi$ then converts the depth features $F_D$ into a depth embedding that is adaptively added to the image features $F_I$ to obtain the enhanced image features $F_I'$; the process is defined as $F_I' = F_I + \gamma \cdot \phi(F_D)$, wherein $\gamma$ is the adaptive weight of the depth features. In the VPE module, the 2D coordinates $P_{2D}$ of the point cloud projected onto the image coordinate system are used to obtain neighboring image features, which are sampled through a sampling layer $S$ and normalized by a sigmoid function to obtain the point cloud prior weights $w$; the specific process is $w = \sigma(S(F_I', P_{2D}))$. The prior weights are then multiplied with the point cloud features $F_P$ to obtain the enhanced point cloud features $F_P'$; the process is $F_P' = w \odot F_P$.
- 3. The heterogeneous data fusion method for automatic driving scene perception according to claim 1, wherein step S3 is implemented as follows (see sketch 3 following the claims): a spatial attention network comprising fully connected layers and sigmoid processing produces the spatially adaptive gating weights $g = \sigma(\mathrm{FC}([B_I; B_P; \tau_t]))$, and the fused conditional features are expressed as $y = g \odot B_I + (1 - g) \odot B_P$.
- 4. The heterogeneous data fusion method for automatic driving scene perception according to claim 1, wherein step S4 is implemented as follows (see sketch 4 following the claims): the noise level adjuster at time step $t$ is defined as $\alpha_t$, and the forward process of the latent diffusion model obtains the noised latent $z_t = \sqrt{\alpha_t}\, z_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon$, wherein $\epsilon$ obeys a standard normal distribution $\mathcal{N}(0, I)$, $z_{t-1}$ is the noised latent at time $t-1$, $z_t$ is the noised latent at time $t$, $\alpha_t$ is the noise level adjuster at time $t$, $t \in \{1, \dots, T\}$, and $T$ is the total number of time steps of the diffusion model.
- 5. The heterogeneous data fusion method for automatic driving scene perception according to claim 1, wherein step S5 is implemented as follows (see sketch 5 following the claims): iterative denoising reconstruction proceeds from the noise $z_T$ back to the original input $z_0$; the process is defined as $z_{t-1} = \frac{1}{\sqrt{\alpha_t}}\big(z_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(z_t, t, y)\big) + \sigma_t \epsilon$, with $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$ and $t = T, \dots, 1$, where $T$ is the total number of time steps of the diffusion model. In the first stage, the fused condition features $y$ are processed through a downsampling layer to obtain the coarse condition features $y_c = \mathrm{Down}(y)$, and a lightweight sketch diffusion model with denoising network $\epsilon_\phi$ is constructed, where $N$ is the number of time steps of the first-stage diffusion model; with the noised latent sketch representation $s_N$ obtained in the forward process as input and conditioned on $y_c$, a low-resolution coarse BEV sketch semantic representation $\hat{s}_0$ of the macroscopic scene layout is generated through $N$ denoising steps. The objective function of the first stage is the reconstruction loss of the sketch: $\mathcal{L}_{sketch} = \mathbb{E}_{s_0, \epsilon, n}\big[\lVert \epsilon - \epsilon_\phi(s_n, n, y_c) \rVert^2\big]$, wherein $\varepsilon_s$ is a dedicated sketch encoder producing $s_0$. In the second stage, the coarse sketch $\hat{s}_0$ output by the first stage is restored to the original BEV spatial resolution through an upsampling layer $\mathrm{Up}$ and concatenated with the original fused condition features $y$ along the channel dimension to form the enhanced condition features $y'$; the specific process is $y' = \mathrm{Concat}(\mathrm{Up}(\hat{s}_0), y)$. The denoising network $\epsilon_\theta$ of the main diffusion model, conditioned on the enhanced condition features $y'$, executes a complete $T$-step denoising process on the noised latent semantic representation $z_T$, with the objective function $\mathcal{L}_{main} = \mathbb{E}_{z_0, \epsilon, t}\big[\lVert \epsilon - \epsilon_\theta(z_t, t, y') \rVert^2\big]$. During denoising, a cross-attention layer maps the enhanced condition features $y'$ into the intermediate layers of the UNet as control information; the cross-attention layer is defined as $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\big(QK^\top / \sqrt{d}\big)V$, where $Q$, $K$, $V$ are the queries, keys and values of the attention mechanism.
- 6. The heterogeneous data fusion method for automatic driving scene perception according to claim 1, wherein step S7 is implemented as follows (see sketch 6 following the claims): the reconstruction loss $\mathcal{L}_{rec}$ computes the difference between the BEV map ground truth and the final denoised BEV map; the sketch generation loss $\mathcal{L}_{sketch}$ is the first-stage loss of step S5; the hidden-space cross-modal alignment loss $\mathcal{L}_{align}$ forces the model to learn modality-invariant internal representations. First, biased condition features are constructed: two condition features biased toward a single modality are built in parallel, namely the image-dominant condition $y_I$, in which the contribution of the point cloud BEV features is suppressed when computing the gating weights, and the point-cloud-dominant condition $y_P$, in which the contribution of the image BEV features is suppressed when computing the gating weights. The specific process is: $y_I = \mathrm{Gate}(B_I, \lambda_1 B_P)$; $y_P = \mathrm{Gate}(\lambda_2 B_I, B_P)$; wherein $\lambda_1$ and $\lambda_2$ are coefficients close to zero and $\mathrm{Gate}(\cdot)$ denotes the gated fusion of step S3. Within the same training batch, the noised latent representation $z_t$ and the time step $t$ are fixed, and $y_I$ and $y_P$ are respectively fed into the denoising network $\epsilon_\theta$ of the main diffusion model; the corresponding feature maps $f_I$ and $f_P$ are extracted from an intermediate layer of the UNet, mapped into the same vector space by a shared projection head $h$, and a normalized temperature-scaled cross-entropy loss is computed to maximize the similarity between the positive sample pair $(h(f_I), h(f_P))$ from the same scene while minimizing the similarity with the negative pairs formed by the other samples in the batch; this loss is denoted $\mathcal{L}_{align}$. The total training loss of the final model is $\mathcal{L} = \lambda_{rec}\mathcal{L}_{rec} + \lambda_{sketch}\mathcal{L}_{sketch} + \lambda_{align}\mathcal{L}_{align}$, wherein $\lambda_{rec}$, $\lambda_{sketch}$ and $\lambda_{align}$ are loss weights, and the final data fusion and scene perception model is obtained after repeated iterative training.
- 7. A storage medium having stored thereon a computer program which, when executed by at least one processor, implements the steps of the heterogeneous data fusion method for automatic driving scene perception according to any one of claims 1 to 6.
- 8. An electronic device comprising a memory and a processor, wherein the memory is configured to store a computer program capable of running on the processor, and the processor is configured to perform the steps of the heterogeneous data fusion method for automatic driving scene perception according to any one of claims 1 to 6 when running the computer program.
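Illustrative sketches (non-limiting)

Sketch 1: a minimal PyTorch-style skeleton of the claim 1 pipeline (S0 through S6). All module and method names (for example `view_to_bev`, `sample`, `upsample`) are hypothetical placeholders for the claimed components, assumed here for illustration; this is a sketch, not the patented implementation.

```python
import torch
import torch.nn as nn

class FusionPipeline(nn.Module):
    """Skeleton wiring of steps S0-S6; every submodule is injected."""
    def __init__(self, image_backbone, point_backbone, bi_pi, view_to_bev,
                 gate_fusion, sketch_diffusion, main_diffusion, vae):
        super().__init__()
        self.image_backbone = image_backbone      # S0: e.g. a Swin-T encoder
        self.point_backbone = point_backbone      # S0: e.g. a voxel encoder
        self.bi_pi = bi_pi                        # S1: bidirectional prior interaction
        self.view_to_bev = view_to_bev            # S2: view transform + flatten
        self.gate_fusion = gate_fusion            # S3: spatio-temporal adaptive gating
        self.sketch_diffusion = sketch_diffusion  # S5 stage 1 (lightweight model)
        self.main_diffusion = main_diffusion      # S5 stage 2 (main model)
        self.vae = vae                            # S4/S6: latent encoder/decoder

    def forward(self, images, points, t):
        f_i = self.image_backbone(images)             # F_I
        f_p = self.point_backbone(points)             # F_P
        f_i, f_p = self.bi_pi(f_i, f_p)               # F_I', F_P'
        b_i, b_p = self.view_to_bev(f_i, f_p)         # B_I, B_P
        y = self.gate_fusion(b_i, b_p, t)             # fused condition features
        s_hat = self.sketch_diffusion.sample(y)       # coarse BEV sketch
        y_prime = torch.cat([self.main_diffusion.upsample(s_hat), y], dim=1)
        z0_hat = self.main_diffusion.sample(y_prime)  # fine BEV latent
        return self.vae.decode(z0_hat)                # denoised BEV map
```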
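Sketch 2: a minimal sketch of the Bi-PI module of claim 2, assuming token-shaped image features and pre-computed point projections. The layer shapes and the use of `grid_sample` as the neighbor-lookup sampling layer are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IPE(nn.Module):
    """Image prior enhancement: F_I' = F_I + gamma * phi(F_D)."""
    def __init__(self, c: int):
        super().__init__()
        self.fc = nn.Linear(1, c)                  # phi: depth -> depth embedding
        self.gamma = nn.Parameter(torch.zeros(1))  # adaptive depth-feature weight

    def forward(self, f_i, depth):
        # f_i: (B, H*W, C) image tokens; depth: (B, H*W, 1) depth features F_D
        # obtained by projecting the points with intrinsics I and extrinsics E.
        return f_i + self.gamma * self.fc(depth)

class VPE(nn.Module):
    """Voxel prior enhancement: F_P' = w * F_P with sigmoid prior weights."""
    def __init__(self, c_img: int):
        super().__init__()
        self.sample = nn.Linear(c_img, 1)          # S: sampled image feat -> scalar

    def forward(self, f_p, f_img, uv):
        # f_p: (B, N, C_p) point features; f_img: (B, C_img, H, W) image map;
        # uv: (B, N, 2) point projections P_2D, normalized to [-1, 1].
        neigh = F.grid_sample(f_img, uv.unsqueeze(2), align_corners=False)
        neigh = neigh.squeeze(-1).transpose(1, 2)  # (B, N, C_img) neighbors
        w = torch.sigmoid(self.sample(neigh))      # point cloud prior weights
        return w * f_p
```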
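Sketch 3: the spatio-temporal adaptive gating of claim 3 as gated fusion $y = g \odot B_I + (1-g) \odot B_P$. The hidden sizes and the small time-step embedding MLP are assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveGate(nn.Module):
    """Gating weights g from an FC network + sigmoid over [B_I; B_P; tau_t]."""
    def __init__(self, c: int, t_dim: int = 32):
        super().__init__()
        self.t_embed = nn.Sequential(nn.Linear(1, t_dim), nn.SiLU(),
                                     nn.Linear(t_dim, c))
        self.gate = nn.Sequential(nn.Linear(3 * c, c), nn.SiLU(),
                                  nn.Linear(c, c), nn.Sigmoid())

    def forward(self, b_i, b_p, t):
        # b_i, b_p: (B, H*W, C) flattened BEV features; t: (B,) diffusion step.
        tau = self.t_embed(t.float().unsqueeze(-1)).unsqueeze(1)  # (B, 1, C)
        tau = tau.expand(-1, b_i.shape[1], -1)        # broadcast over BEV cells
        g = self.gate(torch.cat([b_i, b_p, tau], dim=-1))  # gating weights g
        return g * b_i + (1.0 - g) * b_p                   # fused condition y
```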
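Sketch 4: the forward process of claim 4. Only the update $z_t = \sqrt{\alpha_t}\, z_{t-1} + \sqrt{1-\alpha_t}\, \epsilon$ is fixed by the claim; the linear beta schedule and step count are assumptions.

```python
import torch

def make_alphas(T: int = 1000):
    """Per-step noise level adjusters alpha_t from an assumed linear schedule."""
    betas = torch.linspace(1e-4, 2e-2, T)
    return 1.0 - betas                              # alpha_t = 1 - beta_t

def forward_step(z_prev, alpha_t):
    """One step: z_t = sqrt(alpha_t) z_{t-1} + sqrt(1 - alpha_t) eps."""
    eps = torch.randn_like(z_prev)                  # eps ~ N(0, I)
    return torch.sqrt(alpha_t) * z_prev + torch.sqrt(1.0 - alpha_t) * eps

def forward_to_t(z0, alphas, t):
    """Jump straight to step t with the cumulative product alpha_bar_t."""
    alpha_bar = torch.cumprod(alphas, dim=0)[t]
    eps = torch.randn_like(z0)
    return torch.sqrt(alpha_bar) * z0 + torch.sqrt(1.0 - alpha_bar) * eps, eps
```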
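Sketch 5: building the enhanced condition $y'$ of claim 5 and a cross-attention layer injecting it into a UNet middle layer. The `sketch_model.sample` interface, the pooling/interpolation choices and the single-head attention are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def two_stage_condition(y, sketch_model, down_factor: int = 4):
    """y: (B, C, H, W) fused BEV condition -> y' = concat(Up(s_hat), y)."""
    y_c = F.avg_pool2d(y, kernel_size=down_factor)   # stage 1 coarse condition y_c
    s_hat = sketch_model.sample(cond=y_c)            # coarse BEV sketch s_hat
    s_up = F.interpolate(s_hat, size=y.shape[-2:], mode="bilinear",
                         align_corners=False)        # Up(s_hat) to BEV resolution
    return torch.cat([s_up, y], dim=1)               # channel-wise concatenation

class CrossAttention(nn.Module):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V, injecting y' tokens."""
    def __init__(self, d: int):
        super().__init__()
        self.q = nn.Linear(d, d)
        self.kv = nn.Linear(d, 2 * d)
        self.d = d

    def forward(self, h, y_prime):
        # h: (B, L, d) UNet tokens (queries); y_prime: (B, M, d) condition tokens.
        q = self.q(h)
        k, v = self.kv(y_prime).chunk(2, dim=-1)
        attn = torch.softmax(q @ k.transpose(1, 2) / self.d ** 0.5, dim=-1)
        return h + attn @ v                          # residual injection
```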
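Sketch 6: the hidden-space cross-modal alignment loss of claim 6 as a normalized temperature-scaled cross-entropy (NT-Xent) over a batch. The temperature value, the pooled feature shape and the symmetric form are assumptions; `proj` stands for the shared projection head $h$.

```python
import torch
import torch.nn.functional as F

def alignment_loss(f_img, f_pts, proj, tau: float = 0.07):
    """f_img, f_pts: (B, D) pooled intermediate UNet features obtained under
    the image-dominant condition y_I and the point-dominant condition y_P."""
    z_i = F.normalize(proj(f_img), dim=-1)        # h(f_I), unit norm
    z_p = F.normalize(proj(f_pts), dim=-1)        # h(f_P), unit norm
    logits = z_i @ z_p.t() / tau                  # (B, B) scaled similarities
    targets = torch.arange(z_i.size(0), device=z_i.device)
    # Positive pairs sit on the diagonal (same scene); off-diagonal entries
    # are negatives formed by the other samples in the batch.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```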
Description
Heterogeneous data fusion method, device and medium for automatic driving scene perception

Technical Field

The invention relates to the field of computer vision and pattern recognition, in particular to semantic segmentation and automatic driving methods, and more particularly to a heterogeneous data fusion method, device and medium for automatic driving scene perception.

Background

In recent years, automatic driving multi-source perception and controllable generation methods based on diffusion models have begun to attract considerable attention and have made some progress. Through the implicit representation learning and probability-driven generation mechanism of diffusion models, joint modeling and scene perception of multi-modal data (lidar, vision, semantic maps, text, etc.) have been preliminarily realized. However, sensor data such as the RGB images output by a camera and the point clouds output by a lidar differ fundamentally in dimensionality, resolution and physical meaning, and are heterogeneous across the 2D pixel space and the 3D geometric space, so a representation gap exists between the modalities. Meanwhile, differences in weather, illumination conditions, geographical regions and other environmental factors cause significant shifts in the distribution of the heterogeneous data, making it difficult for a model to establish a unified feature representation. As a result, the accuracy of the perception system in complex environments is low, threatening the safety of automatic driving. A heterogeneous data fusion method based on a diffusion model, together with an automatic driving scene perception system, can better solve this problem.

Disclosure of Invention

The invention aims to provide a heterogeneous data fusion method, device and medium for automatic driving scene perception that markedly improve the perception accuracy and robustness of an automatic driving system in complex traffic environments.
The technical scheme of the invention is a heterogeneous data fusion method for automatic driving scene perception comprising the following steps: S0, extracting image features $F_I$ and point cloud features $F_P$ with a dual-branch network, wherein Swin-T is used as the image backbone and VoxNet as the point cloud backbone; S1, applying bidirectional prior interaction (Bi-PI) to the image features $F_I$ and point cloud features $F_P$ to perform bidirectional feature enhancement; S2, mapping the enhanced image features $F_I'$ and point cloud features $F_P'$ to a shared BEV space through view conversion and flattening operations, respectively, to obtain image BEV features $B_I$ and point cloud BEV features $B_P$; S3, concatenating the image BEV features $B_I$, the point cloud BEV features $B_P$ and the time-step encoding $\tau_t$; S4, encoding the BEV map ground truth with a variational auto-encoder (VAE) encoder $\varepsilon$ and using the resulting latent $z_0$ as the original input of the diffusion model; S5, constructing a coarse-to-fine two-stage temporal diffusion model, wherein the first stage generates a coarse BEV sketch semantic representation $\hat{s}_0$ through a lightweight diffusion model conditioned on a downsampled version of the fused features so as to capture the macroscopic scene layout, and the second stage upsamples $\hat{s}_0$ and concatenates it with the original fused features to form enhanced condition features $y'$ that guide the main diffusion model to generate a fine BEV perception result $\hat{z}_0$; S6, obtaining the denoised BEV map from the diffusion model output using the VAE decoder $\mathcal{D}$; S7, jointly optimizing the network parameters with a back-propagation algorithm and stochastic gradient descent through a multi-task loss comprising the reconstruction loss $\mathcal{L}_{rec}$, the sketch generation loss $\mathcal{L}_{sketch}$ and the hidden-space cross-modal alignment loss $\mathcal{L}_{align}$; S8, in the inference stage, directly using the BEV features fused from the multi-source heterogeneous data as the conditional feature representation, feeding random noise into the two-stage diffusion model to obtain the final denoised BEV map, and performing a semantic segmentation task on the obtained BEV map to obtain the final segmentation accuracy (an illustrative inference sketch follows this description). Further, the Bi-PI module comprises an image prior enhancement module (IPE) and a voxel prior enhancement module (VPE). In the IPE module, the point cloud 3D coordinates $P_{3D}$, the camera intrinsics $I$ and the camera extrinsics $E$ are used to project the point cloud features onto the image coordinate system through a mapping layer $M$, generating depth features $F_D$; the process is defined as $F_D = M(P_{3D}, I, E)$. A fully connected layer $\phi$ then converts the depth features $F_D$ into a depth embedding that is adaptively added to the image features $F_I$ to obtain the enhanced image features $F_I'$; the process is defined as $F_I' = F_I + \gamma \cdot \phi(F_D)$, wherein $\gamma$ is the adaptive weight of the depth features. In the VPE module, the 2D coordinates $P_{2D}$ of the point cloud projected onto the image coordinate system are used to obtain neighboring image features, which are sampled through a sampling layer $S$ and normalized by a sigmoid function to obtain the point cloud prior weights $w$; the specific process is $w = \sigma(S(F_I', P_{2D}))$. The prior weights are then multiplied with the point cloud features $F_P$ to obtain the enhanced point cloud features $F_P' = w \odot F_P$.
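The following is a minimal, non-limiting sketch of the S8 inference stage: starting from random noise, the DDPM-style reverse process runs conditioned on the enhanced features $y'$, the VAE decodes the result, and a per-cell argmax stands in for the segmentation head. The `denoiser` and `vae_decode` handles and the variance choice $\sigma_t = \sqrt{1-\alpha_t}$ are assumptions.

```python
import torch

@torch.no_grad()
def infer_bev(y_prime, denoiser, alphas, vae_decode, shape):
    """Run the full T-step reverse process from pure noise z_T ~ N(0, I)."""
    alpha_bar = torch.cumprod(alphas, dim=0)
    z = torch.randn(shape)                                 # random noise input
    for t in range(len(alphas) - 1, -1, -1):
        eps_hat = denoiser(z, t, y_prime)                  # predicted noise
        coef = (1.0 - alphas[t]) / torch.sqrt(1.0 - alpha_bar[t])
        z = (z - coef * eps_hat) / torch.sqrt(alphas[t])   # posterior mean
        if t > 0:                                          # no noise at the last step
            z = z + torch.sqrt(1.0 - alphas[t]) * torch.randn_like(z)
    bev = vae_decode(z)                                    # denoised BEV map
    return bev.argmax(dim=1)                               # per-cell class labels
```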