CN-121982699-A - Physical perception and heat map guided 4D radar camera fusion 3D detection method

CN121982699A

Abstract

The invention discloses a 4D radar-camera collaborative sensing method based on multidimensional physical-perception diffusion and a heat-map-guided alignment mechanism. Initial features are extracted from the input image and the 4D radar point cloud by an image encoder and a point cloud processing module, respectively. An adaptive diffusion mechanism in a radar-assisted feature extractor densifies the radar point cloud within regions of interest, and an early-fusion network fuses the radar features with the image features to compensate for the image's shortcomings in spatial and dynamic information. A heat-map-guided feature aggregator then establishes cross-modal feature correspondences through dual-branch deformable attention, and hierarchical feature integration at the channel and spatial levels is realized by normalization-feedforward-normalization and convolution-activation-normalization blocks. The fused multi-modal features are finally fed into a detection head, which outputs accurate 3D bounding boxes and category predictions. The method enables more accurate and robust 3D detection of targets in complex autonomous driving environments.
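
The abstract describes a pipeline of modality-specific encoding, radar densification, early fusion, heat-map-guided aggregation, dual-normalization refinement, and a detection head. The following is a minimal PyTorch-style sketch of that overall forward pass, not the patent's implementation: every module name, channel size, and layer choice here is an illustrative assumption, and the adaptive diffusion and deformable-attention stages are omitted or simplified.

```python
import torch
import torch.nn as nn

class FusionDetector3D(nn.Module):
    """Skeleton of the radar-camera fusion detector described in the abstract (illustrative only)."""
    def __init__(self, img_ch=3, radar_ch=4, feat_ch=64, num_classes=3):
        super().__init__()
        # Modality-specific encoders for the image and the projected radar channels.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(img_ch, feat_ch, 3, padding=1), nn.BatchNorm2d(feat_ch), nn.ReLU())
        self.radar_encoder = nn.Sequential(
            nn.Conv2d(radar_ch, feat_ch, 3, padding=1), nn.BatchNorm2d(feat_ch), nn.ReLU())
        # Early fusion along the channel axis.
        self.early_fusion = nn.Conv2d(2 * feat_ch, feat_ch, 1)
        # Heat-map head that produces the guidance prior.
        self.heatmap_head = nn.Conv2d(feat_ch, num_classes, 1)
        # Dual-normalization feedforward refinement (pre-norm -> FFN -> residual -> post-norm).
        self.pre_norm, self.post_norm = nn.LayerNorm(feat_ch), nn.LayerNorm(feat_ch)
        self.ffn = nn.Sequential(nn.Linear(feat_ch, 4 * feat_ch), nn.GELU(),
                                 nn.Linear(4 * feat_ch, feat_ch))
        # Conv-BN-ReLU spatial aggregation and detection heads.
        self.cbr = nn.Sequential(nn.Conv2d(feat_ch, feat_ch, 3, padding=1),
                                 nn.BatchNorm2d(feat_ch), nn.ReLU())
        self.cls_head = nn.Conv2d(feat_ch, num_classes, 1)
        self.box_head = nn.Conv2d(feat_ch, 7, 1)  # x, y, z, w, l, h, yaw

    def forward(self, image, radar_channels):
        f_img = self.image_encoder(image)
        f_rad = self.radar_encoder(radar_channels)           # densified radar channels
        fused = self.early_fusion(torch.cat([f_img, f_rad], dim=1))
        heatmap = torch.sigmoid(self.heatmap_head(fused))     # guidance prior
        b, c, h, w = fused.shape
        tokens = fused.flatten(2).transpose(1, 2)              # (B, H*W, C)
        tokens = self.post_norm(tokens + self.ffn(self.pre_norm(tokens)))
        refined = self.cbr(tokens.transpose(1, 2).reshape(b, c, h, w))
        return {"heatmap": heatmap,
                "cls": self.cls_head(refined),
                "box": self.box_head(refined)}

if __name__ == "__main__":
    model = FusionDetector3D()
    img = torch.randn(1, 3, 128, 128)
    radar = torch.randn(1, 4, 128, 128)   # e.g. projected depth, vx, vy, RCS channels
    out = model(img, radar)
    print(out["heatmap"].shape, out["cls"].shape, out["box"].shape)
```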

Inventors

  • REN KEYAN
  • WANG SHIHAO
  • DU YONGPING
  • ZHU WENZHUO

Assignees

  • Beijing University of Technology (北京工业大学)

Dates

Publication Date
2026-05-05
Application Date
2026-01-26

Claims (4)

  1. A physical-perception and heat-map-guided 4D radar-camera fusion 3D detection method, characterized in that the method comprises the following steps: S1, radar-assisted image feature extraction and fusion with physical-perception adaptive diffusion: first, a classification decision tree is built on the joint reasoning of the reflection cross-section, depth, and velocity magnitude of the radar point cloud, dynamically dividing the radar points into high-confidence static and dynamic categories to establish a spatially enhanced physical prior; cross-dimension weights are then jointly modulated according to the motion vector and the radar cross-section (RCS), and a three-dimensional isotropic field evolution is performed under the guidance of the field of view; S2, a heat-map-guided dual-stream deformable interactive alignment fusion mechanism (HFA): first, a two-dimensional heat-map prior generated by full convolution is used to extract a region-of-interest mask through channel extremum pooling and adaptive thresholding, and a spatial constraint mask in the three-dimensional voxel space is constructed through implicit field projection; on this basis a saliency map of the current sample is obtained, a reverse suppression tensor is built from the saliency map, and during the back-propagation stage of model training this tensor is used to truncate or smooth the amplitude of gradients flowing into background regions; finally, bidirectional mutual-query branches between image and radar are established, a deformable sampling strategy is integrated, and learnable offsets adaptively reconstruct the sampling-point topology to dynamically correct non-rigid misalignment between heterogeneous modalities on the feature manifold, yielding geometrically calibrated and semantically enhanced complementary features; S3, dual-normalization feedforward and convolution-activated multi-modal fusion feature enhancement: first, the input distribution is reshaped to zero mean and unit variance by pre-normalization, a nonlinear feedforward network projects the features into a high-dimensional latent space through a learnable linear transformation matrix, and post-normalization bounds the variance of the residual stream to refine the output channels; the outputs of the individual modalities are then integrated into a unified tensor by dense concatenation along the channel axis, placing the complementary semantic and geometric streams in deep parallel; finally, a small-window spatial filter captures local neighborhood correlations in the feature map, a statistical distribution alignment and nonlinear response rectification mechanism dynamically calibrates the filtered feature responses layer by layer, and a semantically enhanced feature map is output for the detection head to decode the 3D perception results.
  2. The physical-perception and heat-map-guided 4D radar-camera fusion 3D detection method according to claim 1, wherein the specific flow of step S1 is as follows: S11, constructing radar point cloud adaptive projection scattering, with the following structure: a diffusion radius characterizing the geometric uncertainty of the target is dynamically computed from the physical attributes of each radar point, combining a base-radius term, a motion-aware term that adjusts the radius according to the velocity magnitude to compensate for the displacement of dynamic targets within the sensing period, and a reflection-confidence term that delimits the geometric extent of the target from its RCS intensity (a minimal code sketch of this diffusion step is given after the claims); S12, constructing a radar-assisted early fusion network, with the following structure: an enhancement strategy is applied to the original image tensor to obtain enhanced image features, which are normalized to zero mean and unit variance; the projected radar point set is encoded into a dedicated auxiliary channel tensor, echo-free regions are zero-filled, and the same geometric transformations are applied synchronously; finally, a fully convolutional encoder-decoder network converts the image context feature map into the camera frustum view, aggregates and reduces dimensionality along the height axis, performs depth-wise concatenation, and feeds the result into a ResNet to obtain encoded features; the decoder generates a target heat map through deconvolution, and a multi-modal semantic alignment loss and a physical constraint loss are introduced to realize deep coupling of visual perception and radar physical characteristics.
  3. The physical-perception and heat-map-guided 4D radar-camera fusion 3D detection method according to claim 1, wherein the heat-map-guided dual-stream deformable interactive alignment fusion mechanism in step S2 specifically comprises the following sub-steps: S21, constructing a learnable position-coding module of the heat-map-guided radar-image bimodal feature-level fusion network, with the following structure: the original 2D heat map is downsampled by lightweight spatial distillation, compressing redundant spatial information while preserving the core features of the target confidence distribution; continuous confidences are mapped to a binary matrix by confidence-driven binary mask purification, precisely anchoring high-value target regions; finally, the planar mask is expanded into a cubic tensor by a cross-dimensional lifting operation, structurally migrating the visual information into the sensing space; S22, constructing a dual-stream cross-modal alignment module, with the following structure: the mask is injected into the attention computation, suppressing the weights of invalid background regions to filter out false-alarm points; a deformable attention mechanism with learnable sampling offsets captures and corrects the spatial offsets caused by sensor mounting or synchronization deviations, achieving pixel-level alignment of radar and camera features; the multi-modal feature-level fusion feature is formulated, for each query direction of the dual-stream module, as $F = \mathrm{DeformAttn}(z_q, p_q, x) = \sum_{m=1}^{M} W_m \big[ \sum_{k=1}^{K} A_{mqk} \, W'_m \, x(p_q + \Delta p_{mqk}) \big]$, where $z_q$ is the query embedding, $p_q$ is the reference point, $x$ is the feature to be aligned, M is the number of attention heads, K is the number of sampling points per query, $A_{mqk}$ is the normalized attention weight of the q-th query on the k-th sampling point in the m-th head, $W'_m$ is the feature projection matrix of the m-th head, $W_m$ is the output projection matrix of the m-th head, $\Delta p_{mqk}$ is the learned sampling offset, and $F$ is the feature produced by the dual-stream deformable interactive alignment fusion mechanism (a minimal code sketch of this module appears after the claims).
  4. The physical-perception and heat-map-guided 4D radar-camera fusion 3D detection method according to claim 1, wherein the dual-normalization feedforward and convolution-activated multi-modal fusion feature enhancement in step S3 specifically comprises: S31, constructing a nonlinear feature evolution model based on probabilistic gating, with the following structure: the input tensor first undergoes dynamic distribution calibration, with layer normalization eliminating internal covariate shift; a feedforward network then introduces a smooth gating mechanism based on the cumulative distribution function of the standard normal distribution to perform high-dimensional nonlinear screening and enhancement of the features; finally, a residual connection fuses the original information stream and a second normalization constrains the output dynamic range, achieving deep feature reconstruction and stable output while stabilizing the gradient flow; S32, constructing a spatio-temporal feature strengthening module, with the following structure: a CBR (Convolution-BN-ReLU) module consisting of a 3x3 convolution, batch normalization, and ReLU activation further aggregates spatial neighborhood information and extracts a multi-modal fused BEV feature map rich in semantic information and geometric consistency; this BEV feature map is passed from step S3 to the subsequent detection head, where a classification head generates a heat-map probability distribution to identify the category of each target in 3D space, a regression head predicts the 3D bounding-box parameters of each candidate region through convolution layers to accurately regress target position, size, orientation, and motion velocity, and finally a non-maximum suppression post-processing step removes overlapping redundant detection boxes, yielding the final 3D object detection result in complex traffic scenes (a minimal code sketch of the S31 gating block appears after the claims).
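
Claim 2's S11 computes a per-point diffusion radius from each radar point's velocity magnitude and RCS and scatters the point into a dense auxiliary image channel, with echo-free pixels left at zero (S12). The Python sketch below illustrates one plausible reading of that step; the multiplicative combination of the base radius with the motion and RCS terms, the coefficients, and all function names are assumptions for illustration, since the exact formula is not reproduced in this record.

```python
import numpy as np

def diffusion_radius(speed, rcs, r_base=2.0, alpha=0.5, beta=0.1):
    """Per-point diffusion radius (pixels) from physical attributes.

    Assumed multiplicative form: faster targets and stronger reflectors
    receive a larger geometric footprint. Coefficients are illustrative.
    """
    motion_term = 1.0 + alpha * np.abs(speed)      # motion-aware enlargement
    rcs_term = 1.0 + beta * np.maximum(rcs, 0.0)   # reflection-confidence enlargement
    return r_base * motion_term * rcs_term

def scatter_radar_to_channel(points_uv, speed, rcs, depth, h, w):
    """Scatter projected radar points into a dense auxiliary depth channel.

    points_uv: (N, 2) pixel coordinates of projected radar points.
    Echo-free pixels stay zero-filled, as described in S12.
    """
    channel = np.zeros((h, w), dtype=np.float32)
    radii = diffusion_radius(speed, rcs)
    for (u, v), r, d in zip(points_uv, radii, depth):
        r = int(np.ceil(r))
        u0, u1 = max(0, int(u) - r), min(w, int(u) + r + 1)
        v0, v1 = max(0, int(v) - r), min(h, int(v) + r + 1)
        patch = channel[v0:v1, u0:u1]
        # Keep the nearest return where diffused footprints overlap.
        patch[(patch == 0) | (patch > d)] = d
    return channel
```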
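
Claim 3 pairs a heat-map-derived region-of-interest mask with deformable attention whose sampling locations are learned offsets around each query's reference point. The PyTorch sketch below shows a single-direction instance of the formula in claim 3 (e.g. image queries attending to radar features); the dual-stream mechanism would run it in both query directions. The shapes, the bilinear grid-sampling choice, the offset scaling, and all identifiers are assumptions, not the patent's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableCrossAttention(nn.Module):
    """Single-scale deformable cross-attention with a heat-map ROI mask.

    Computes F = sum_m W_m [ sum_k A_mqk * W'_m x(p_q + dp_mqk) ] and
    suppresses background queries with the binary region-of-interest mask.
    """
    def __init__(self, dim=64, heads=4, points=4):
        super().__init__()
        self.heads, self.points, self.head_dim = heads, points, dim // heads
        self.offset = nn.Linear(dim, heads * points * 2)   # learnable dp_mqk
        self.weight = nn.Linear(dim, heads * points)       # A_mqk (softmax over k)
        self.value_proj = nn.Linear(dim, dim)              # W'_m (all heads stacked)
        self.out_proj = nn.Linear(dim, dim)                # W_m  (all heads stacked)

    def forward(self, query, ref_points, feat, roi_mask):
        # query: (B, Q, C); ref_points: (B, Q, 2) in [-1, 1]; feat: (B, C, H, W)
        # roi_mask: (B, Q) binary mask from the purified heat map.
        B, Q, C = query.shape
        H, W = feat.shape[2:]
        M, K, D = self.heads, self.points, self.head_dim
        value = self.value_proj(feat.flatten(2).transpose(1, 2))      # (B, HW, C)
        value = value.transpose(1, 2).reshape(B * M, D, H, W)         # split per head
        offsets = self.offset(query).view(B, Q, M, K, 2)
        attn = self.weight(query).view(B, Q, M, K).softmax(-1)        # A_mqk
        locs = ref_points[:, :, None, None, :] + 0.1 * offsets.tanh() # p_q + dp_mqk
        grid = locs.permute(0, 2, 1, 3, 4).reshape(B * M, Q, K, 2)
        sampled = F.grid_sample(value, grid, mode="bilinear",
                                align_corners=False)                  # (B*M, D, Q, K)
        attn = attn.permute(0, 2, 1, 3).reshape(B * M, 1, Q, K)
        out = (sampled * attn).sum(-1)                                # weighted sum over K
        out = out.view(B, M, D, Q).permute(0, 3, 1, 2).reshape(B, Q, C)
        return self.out_proj(out) * roi_mask[..., None]               # background suppression
```

Running the module in the opposite direction with radar tokens as queries and the image feature map as `feat`, then concatenating the two outputs along the channel axis, would correspond to the dual-stream reading of S22.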
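
Claim 4's S31 describes a pre-normalization / feedforward / residual / post-normalization block whose smooth gate is the standard normal CDF, which corresponds to GELU-style gating (GELU(x) = x * Phi(x)), followed by the S32 Conv-BN-ReLU aggregation. A minimal PyTorch sketch under that interpretation follows; module names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualNormGatedFFN(nn.Module):
    """Pre-norm -> CDF-gated feedforward -> residual -> post-norm (claim 4, S31)."""
    def __init__(self, dim=64, hidden=256):
        super().__init__()
        self.pre_norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, hidden)
        self.gate = nn.GELU()            # smooth gate based on the standard normal CDF
        self.fc2 = nn.Linear(hidden, dim)
        self.post_norm = nn.LayerNorm(dim)

    def forward(self, x):                # x: (B, N, C) tokens
        h = self.fc2(self.gate(self.fc1(self.pre_norm(x))))
        return self.post_norm(x + h)     # residual fusion, bounded output variance

class CBRBlock(nn.Module):
    """3x3 Conv -> BatchNorm -> ReLU spatial aggregation (claim 4, S32)."""
    def __init__(self, ch=64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True))

    def forward(self, bev):              # bev: (B, C, H, W) fused BEV feature map
        return self.block(bev)
```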

Description

Physical perception and heat map guided 4D radar camera fusion 3D detection method

Technical Field

The invention relates to the technical field of computer vision, and in particular to a physical perception and heat map guided 4D radar camera fusion 3D detection method.

Background

With the continuous development of autonomous driving technology, the environment perception system serves as the cornerstone of vehicle decision-making and planning, and its accuracy and robustness are of great importance. Among the many environment perception tasks, vision-based 3D object detection has attracted wide attention because it provides rich semantic information. However, camera sensors are inherently limited: they are susceptible to lighting and weather, and the two-dimensional images they produce lack depth information, so ranging and localization accuracy in three-dimensional space is difficult to bring up to the requirements of high-level autonomous driving.

To overcome the drawbacks of purely visual approaches, multi-sensor fusion techniques have evolved. Among them, the 4D millimeter-wave radar (4D radar for short) is an ideal complementary sensor to the camera, because it provides a high-density point cloud containing range, azimuth, elevation, and velocity information and is highly robust to bad weather. By fusing the point cloud data of the 4D radar with the image data of the camera, the two modalities are expected to complement each other and yield more robust and accurate 3D detection performance.

However, existing radar-camera fusion methods still have significant limitations. First, many methods employ simple static fusion strategies, such as data stitching at the input stage or linear weighted fusion at the feature stage. These methods fail to fully consider the deep semantic and spatial association between the features of the two modalities, ignore the dynamic interaction and adaptive complementarity between radar and camera data, fuse poorly as a result, and struggle with complex and changeable driving scenes. Second, existing methods often process the modal features in relative isolation, fail to establish an effective cross-modal feature interaction mechanism, and cannot adaptively calibrate and enhance the valuable information from different modalities, which limits the upper bound of the model's perception capability. Therefore, a new framework that realizes deep and adaptive collaborative fusion of radar and camera features is urgently needed and of great significance for improving the 3D object detection performance of autonomous driving systems in real open environments.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a physical perception and heat map guided 4D radar camera fusion 3D detection method.
The technical scheme adopted by the invention is as follows: a physical perception and heat map guided 4D radar camera fusion 3D detection method comprises the following steps: S1, radar-assisted image feature extraction and fusion with physical-perception adaptive diffusion: first, a classification decision tree is built on the joint reasoning of the reflection cross-section, depth, and velocity of the radar point cloud, dynamically dividing the radar points into high-confidence static and dynamic categories to establish a spatially enhanced physical prior; cross-dimension weights are then jointly modulated according to the motion vector and the radar cross-section (RCS), and a three-dimensional isotropic field evolution is performed under the guidance of the field of view; S2, a heat-map-guided dual-stream deformable interactive alignment fusion mechanism (HFA): first, a two-dimensional heat-map prior generated by full convolution is used to extract a region-of-interest mask through channel extremum pooling and adaptive thresholding, and a spatial constraint mask in the three-dimensional voxel space is constructed through implicit field projection; on this basis a saliency map of the current sample is obtained, a reverse suppression tensor is built from the saliency map, and during the back-propagation stage of model training this tensor is used to truncate or smooth the amplitude of gradients flowing into background regions (a minimal sketch of this suppression step is given below); finally, bidirectional mutual-query branches between image and radar are established, a deformable sampling strategy is integrated, and learnable offsets adaptively reconstruct the sampling-point topology to dynamically correct non-rigid misalignment between heterogeneous modalities on the feature manifold, yielding geometrically calibrated and semantically enhanced complementary features; S3, dual-normalization feedforward and convolution-activated multi-modal fusion feature enhancement is
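
The reverse suppression tensor described in S2 acts only at training time, damping and truncating gradients that flow into background regions of a feature map. One plausible way to realize this in PyTorch is a tensor backward hook, as sketched below; the function name, damping factor, and clipping threshold are illustrative assumptions, not the patent's implementation.

```python
import torch

def attach_background_gradient_suppression(feature, saliency, damp=0.1, clip=1.0):
    """Damp and clip gradients flowing into low-saliency (background) regions.

    feature:  (B, C, H, W) tensor that requires grad (an intermediate feature map).
    saliency: (B, 1, H, W) saliency map in [0, 1] derived from the heat-map mask.
    """
    # Reverse suppression tensor: 1 in salient regions, `damp` in background.
    suppression = damp + (1.0 - damp) * saliency

    def hook(grad):
        grad = grad * suppression          # smooth damping of background gradients
        return grad.clamp(-clip, clip)     # amplitude truncation

    feature.register_hook(hook)
    return feature

# Illustrative usage inside a training step:
feat = torch.randn(2, 64, 32, 32, requires_grad=True)
sal = torch.rand(2, 1, 32, 32)
attach_background_gradient_suppression(feat, sal)
loss = feat.sum()
loss.backward()
print(feat.grad.abs().max())   # background gradients damped, all gradients clipped
```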