CN-122024196-A - Target detection method based on point cloud and image fusion
Abstract
The invention discloses a target detection method based on point cloud and image fusion, belonging to the technical field of multi-sensor fusion perception for automatic driving. The method constructs a reparameterized bottleneck module that performs multi-branch feature extraction in the training stage and collapses to efficient single-path computation in the inference stage, introduces a two-stage routing attention mechanism that performs fine-grained attention calculation only after region-level screening to reduce complexity, and adopts an end-to-end detection framework without non-maximum suppression, reducing latency by fifteen percent. For the point cloud data, a pillar-based processing method divides the three-dimensional point cloud into a two-dimensional pillar grid; pillar features are extracted by multi-layer perceptron encoding and max pooling and then arranged into a pseudo-image, and multi-scale fusion is carried out with a weighted bidirectional feature pyramid, so the computational complexity is reduced from three dimensions to two. Finally, the three-dimensional and two-dimensional detection boxes are matched and fused through a back-projection algorithm, and real-time performance is remarkably improved while detection accuracy is maintained.
Inventors
- HUANG LEI
- SHEN LONG
- WANG MINRAN
- XU XINCHAO
Assignees
- Nanjing Forestry University (南京林业大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-02-09
Claims (10)
- 1. A target detection method based on point cloud and image fusion, characterized by comprising the following steps: step S1, acquiring two-dimensional image data and three-dimensional point cloud data of a scene to be detected; step S2, inputting the two-dimensional image data into an image detection network, expanding the input channels through a convolution operation and splitting them into a main branch and an auxiliary branch, where the main branch is processed sequentially by bottleneck modules and the auxiliary branch passes its features through directly, then channel-concatenating and convolving the output features of the two branches to output an image feature map; step S3, dividing the image feature map into a preset number of regions, pooling each region to obtain region vectors, calculating the similarity between the region vectors, determining a target region set for each region according to the similarity, calculating attention weights within the target region set and performing a weighted sum to output an attention feature map, and detecting a two-dimensional detection box and target category from the attention feature map; step S4, calculating a pillar index for each point according to the coordinate range of the three-dimensional point cloud data and the pillar size, grouping points with the same pillar index into a pillar point set, performing multi-layer perceptron encoding and max pooling on each pillar point set, and outputting pillar feature vectors; step S5, arranging the pillar feature vectors into pseudo-image features according to the pillar indices, performing multi-scale convolution and weighted fusion on the pseudo-image features, and obtaining a three-dimensional detection box from the fused features; and step S6, projecting the vertex coordinates of the three-dimensional detection box onto the image plane according to the radar-camera extrinsic matrix and the camera intrinsic matrix, determining the correspondence according to the overlap between the projected coordinates and the two-dimensional detection boxes (an overlap-matching sketch follows the claims), and outputting the target category and the three-dimensional coordinates of the three-dimensional detection box.
- 2. The target detection method based on point cloud and image fusion according to claim 1, wherein the process of expanding the input channels by a convolution operation in step S2 is: a convolution operation expands the number of input channels of the two-dimensional image data to twice the original number, i.e. to \(2C\), where \(C\) is the number of input channels; the main branch comprises \(n\) repeated bottleneck modules, where \(n\) is the repetition number of bottleneck modules, and the forward propagation formula of the bottleneck module is \(y = x + f_2(f_1(x))\) in the short-circuit-connection state and \(y = f_2(f_1(x))\) in the non-short-circuit state (1), where \(x\) denotes the input feature map, \(y\) denotes the intermediate output of the module, \(f_1(\cdot)\) denotes the first convolution operation and \(f_2(\cdot)\) the second convolution operation; finally a convolution compresses the number of fused feature channels to \(C_{out}\), where \(C_{out}\) is the number of output channels (a NumPy sketch of this module follows the claims).
- 3. The target detection method based on point cloud and image fusion according to claim 1, wherein the process of dividing the image feature map into a preset number of regions in step S3 is: an input feature map \(X \in \mathbb{R}^{H \times W \times C}\) is divided into \(S \times S\) non-overlapping regions, where \(X\) denotes the input feature map, \(H\) is the height of the feature map, \(W\) its width, \(C\) the number of channels, \(S\) the number of regions per side, and \(HW\) the total number of feature-map pixels; each region contains \(\frac{HW}{S^2}\) feature points, and the feature map is reshaped into a series of region representations \(X^r = \mathrm{Reshape}(X) \in \mathbb{R}^{S^2 \times \frac{HW}{S^2} \times C}\) (2), where \(X^r_i\) denotes the feature representation of the \(i\)-th region, \(i\) is the region index and \(S^2\) is the total number of regions (a reshape sketch follows the claims).
- 4. The target detection method based on point cloud and image fusion according to claim 3, wherein the process of calculating the similarity between the region vectors in step S3 is: for each region, region-level vectors \(Q^r, K^r \in \mathbb{R}^{S^2 \times C}\) are obtained by average pooling, where \(q_i\) denotes the vector of the \(i\)-th region; a correlation matrix \(A^r \in \mathbb{R}^{S^2 \times S^2}\) between the regions is then calculated, where \(A^r\) denotes the region correlation matrix and \(A^r_{ij}\) the correlation score between the \(i\)-th and \(j\)-th regions, computed as \(A^r_{ij} = \frac{q_i k_j^{T}}{\sqrt{C}}\) (3), where \(q_i\) is the vector of the \(i\)-th region, \(k_j^{T}\) is the transpose of the vector of the \(j\)-th region, \(C\) is the feature dimension used for normalization, and \(j\) is the index of another region.
- 5. The target detection method based on point cloud and image fusion according to claim 4, wherein the process of calculating attention weights and performing the weighted sum within the target region set in step S3 is: for each query region \(i\), only the \(k\) key regions with the highest correlation are kept, forming a related-region set \(I_i\), where \(k\) is the number of key regions retained and \(I_i\) denotes the related-region set of the \(i\)-th query region; standard attention is then computed inside the selected set: \(O_i = \mathrm{Softmax}\!\left(\frac{Q_i K_g^{T}}{\sqrt{C}}\right) V_g\) (4), where \(Q_i\) denotes the query tokens of query region \(i\), \(K_g\) denotes the keys gathered from the related-region set \(I_i\), \(V_g\) denotes the values gathered from \(I_i\), \(K_g^{T}\) denotes the transpose of the key matrix, and \(\mathrm{Softmax}\) denotes the normalized exponential function; this step reduces the computational complexity from \(O\big((HW)^2\big)\) to \(O\big(kHW \cdot \frac{HW}{S^2}\big)\) (a routing-attention sketch follows the claims).
- 6. The target detection method based on point cloud and image fusion according to claim 1, wherein the process of calculating the pillar index of each point according to the coordinate range of the three-dimensional point cloud data and the pillar size in step S4 is: the input point cloud contains \(N\) points, where \(N\) is the total number of points and the attributes of each point contain location information \(x\), \(y\), \(z\) and intensity information \(r\), where \(x\) is the coordinate value of the point along the \(x\) axis, \(y\) along the \(y\) axis and \(z\) along the \(z\) axis; the attribute ranges of the point cloud are calculated, including the minimum \(x_{min}\) and maximum \(x_{max}\) in the \(x\) direction, the minimum \(y_{min}\) and maximum \(y_{max}\) in the \(y\) direction, and the minimum \(z_{min}\) and maximum \(z_{max}\) in the \(z\) direction; the pillar size is set by the width \(w_p\), height \(h_p\) and depth parameters, where \(w_p\) is the pillar size in the \(x\) direction, \(h_p\) is the pillar size in the \(y\) direction, and the pillar depth parameter is usually set to 1 so that the cloud is not divided along \(z\); the pillar index \((u, v)\) of each point is then calculated, where \(u\) is the index of the point in the \(x\) direction of the pillar grid and \(v\) the index in the \(y\) direction, computed as \(u = \left\lfloor \frac{x - x_{min}}{w_p} \right\rfloor,\ v = \left\lfloor \frac{y - y_{min}}{h_p} \right\rfloor\) (5) (an index sketch follows the claims).
- 7. The target detection method based on point cloud and image fusion according to claim 6, wherein the process of performing multi-layer perceptron encoding and max pooling on each pillar point set in step S4 is: the point cloud is indexed by pillar and features are extracted from the points inside each pillar; suppose a pillar contains \(N_p\) points, where \(N_p\) is the number of points in a single pillar, and each point is characterized as \(f_i = [x_i, y_i, z_i, r_i, \Delta x_i, \Delta y_i]\) (6), where \(f_i\) denotes the feature vector of the \(i\)-th point, \(i\) is the point index, \(x_i, y_i, z_i\) denote the absolute coordinates of the \(i\)-th point, \(r_i\) denotes the reflection intensity at the \(i\)-th point, \(\Delta x_i\) denotes the offset of the \(i\)-th point from the pillar center in the \(x\) direction and \(\Delta y_i\) the offset in the \(y\) direction; the pillar feature extraction formula is \(F = \mathrm{MaxPool}\big(\mathrm{MLP}(f_1), \ldots, \mathrm{MLP}(f_{N_p})\big)\) (7), where \(F\) denotes the pillar feature vector, \(\mathrm{MLP}\) is the multi-layer perceptron encoder, \(\mathrm{MLP}(f_i)\) denotes multi-layer perceptron encoding of the \(i\)-th point feature, \(\mathrm{MaxPool}\) is the max pooling operation, and \(f_1, \ldots, f_{N_p}\) denote the features of the 1st to the \(N_p\)-th point in the pillar (an encoding sketch follows the claims).
- 8. The target detection method based on point cloud and image fusion according to claim 1, wherein in step S5 the process of arranging the pillar feature vectors into pseudo-image features according to the pillar indices is: the pillar features are rearranged into a two-dimensional pseudo-image as \(P(u, v) = F_{(u, v)}\) (8), where \(P(u, v)\) denotes the feature value at position \((u, v)\) of the pseudo-image, \(F_{(u, v)}\) denotes the feature vector of the pillar located at grid coordinates \((u, v)\), and \((u, v)\) are the two-dimensional grid coordinates of the pillar (a scatter sketch follows the claims).
- 9. The target detection method based on point cloud and image fusion according to claim 1, wherein the process of performing multi-scale convolution and weighted fusion on the pseudo-image features in step S5 is: features of different scales are weighted and fused using a weighted bidirectional feature pyramid, with skip connections enhancing the information flow, and the output feature \(P_l^{out}\) is calculated as \(P_l^{out} = \mathrm{Conv}\!\left(\frac{w_1 P_l^{in} + w_2\,\mathrm{Resize}(P_{l+1}^{out}) + w_3\,\mathrm{Resize}(P_{l-1}^{in})}{w_1 + w_2 + w_3 + \epsilon}\right)\) (9), where \(P_l^{out}\) denotes the output feature of layer \(l\), \(l\) is the hierarchical index of the feature pyramid, \(P_l^{in}\) denotes the input feature of layer \(l\), \(P_{l+1}^{out}\) denotes the output feature of the upper layer, \(P_{l-1}^{in}\) denotes the input feature of the lower layer, \(w_1, w_2, w_3\) are learnable weights, \(\mathrm{Resize}\) is the resizing operation, i.e. upsampling or downsampling, \(\epsilon\) is a small constant set to 0.0001 to prevent numerical instability, and \(\mathrm{Conv}\) denotes the convolution operation; after this improvement, high-level semantic information is fused while low-level detail information is preserved; the computational complexity before optimization is \(O_{3D} = H \cdot W \cdot D \cdot C_{in} \cdot C_{out} \cdot K^3\) (10), where \(O_{3D}\) denotes the computational complexity of three-dimensional convolution, \(H\) is the height of the feature map, \(W\) its width, \(D\) the dimension in the depth direction, \(C_{in}\) the number of input channels, \(C_{out}\) the number of output channels, \(K\) the size of the convolution kernel and \(K^3\) the volume of the three-dimensional kernel; the optimized computational complexity is \(O_{2D} = H \cdot W \cdot C_{in} \cdot C_{out} \cdot K^2\) (11), where \(O_{2D}\) denotes the computational complexity of two-dimensional convolution and \(K^2\) the area of the two-dimensional kernel; the complexity reduction ratio is \(R = \frac{O_{2D}}{O_{3D}} = \frac{1}{D \cdot K}\) (12), where \(R\) denotes the reduced proportion of computational complexity (a fusion sketch follows the claims).
- 10. The target detection method based on point cloud and image fusion according to claim 1, wherein the process of projecting the vertex coordinates of the three-dimensional detection box onto the image plane according to the radar-camera extrinsic matrix and the camera intrinsic matrix in step S6 is: in the radar coordinate system, a three-dimensional bounding box is represented by 7 parameters, \(B = (x_c, y_c, z_c, w, h, l, \theta)\) (13), where \(B\) denotes the parameter set of the three-dimensional bounding box, \(x_c\) denotes the coordinate of the box center along the \(x\) axis, \(y_c\) along the \(y\) axis and \(z_c\) along the \(z\) axis, \(w\) denotes the width of the bounding box corresponding to the \(y\)-axis direction, \(h\) the height corresponding to the \(z\)-axis direction, \(l\) the length corresponding to the \(x\)-axis direction, and \(\theta\) denotes the yaw angle, i.e. the rotation about the \(z\) axis; the 8 vertices of the three-dimensional box in the local coordinate system are \(V_0 = \left[\pm\frac{l}{2},\ \pm\frac{w}{2},\ \pm\frac{h}{2}\right]^{T}\) over all eight sign combinations (14), where \(V_0\) denotes the matrix of the 8 vertex coordinates in the local coordinate system and the superscript \(T\) denotes matrix transposition; the rotation matrix is set as \(R_z(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}\) (15), where \(R_z(\theta)\) denotes the rotation matrix about the \(z\) axis, \(\cos\theta\) is the cosine of the yaw angle and \(\sin\theta\) its sine; after transformation to the radar coordinate system the vertices are \(V_{radar} = R_z(\theta)\,V_0 + T\) (16), where \(V_{radar}\) denotes the 8 vertex coordinates in the radar coordinate system, \(R_z(\theta)\,V_0\) denotes rotating the local vertices by the rotation matrix and \(T = (x_c, y_c, z_c)^{T}\) is the translation vector of the box center; the conversion from the radar coordinate system to the camera coordinate system is \(V_{cam} = T_{radar \to cam}\,V_{radar}\) (17), where \(V_{cam}\) denotes the vertex coordinates in the camera coordinate system and \(T_{radar \to cam}\) is the radar-to-camera transformation matrix containing rotation and translation parameters; the three-dimensional detection box can thereby be converted into the two-dimensional coordinate system to obtain the category and three-dimensional position of the object (a projection sketch follows the claims).
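To make the matching in step S6 of claim 1 concrete, here is a minimal Python sketch of the overlap (IoU) test used to associate projected three-dimensional boxes with two-dimensional detections; the `(x1, y1, x2, y2)` corner format and the function name are illustrative assumptions, not taken from the patent.

```python
def iou_2d(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])   # intersection top-left
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])   # intersection bottom-right
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

# A projected 3-D box is matched to the 2-D detection with the highest IoU.
print(iou_2d((0, 0, 10, 10), (5, 5, 15, 15)))   # ~0.143
```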
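A minimal NumPy sketch of the bottleneck forward pass of equation (1) in claim 2, with toy shape-preserving stand-ins for the two convolutions \(f_1\) and \(f_2\); the real module would use learned convolution kernels.

```python
import numpy as np

def bottleneck_forward(x, f1, f2, shortcut=True):
    """Equation (1): y = x + f2(f1(x)) with a short-circuit connection,
    y = f2(f1(x)) without one."""
    y = f2(f1(x))
    return x + y if shortcut else y

f1 = lambda t: np.maximum(t, 0.0)      # stand-in for the first conv + activation
f2 = lambda t: 0.5 * t                 # stand-in for the second conv
x = np.random.randn(8, 8, 64)          # H x W x C feature map
y = bottleneck_forward(x, f1, f2)      # same shape as x
```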
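Equation (2) of claim 3 is a pure reshape; the sketch below (region count `s` and channel-last layout are assumed choices) shows one way to carry it out.

```python
import numpy as np

def partition_regions(x, s):
    """Split an (H, W, C) map into s*s non-overlapping regions and flatten
    each to HW/s^2 feature points, as in equation (2)."""
    h, w, c = x.shape
    assert h % s == 0 and w % s == 0, "feature map must tile evenly"
    x = x.reshape(s, h // s, s, w // s, c)          # expose the region grid
    x = x.transpose(0, 2, 1, 3, 4)                  # (s, s, H/s, W/s, C)
    return x.reshape(s * s, (h * w) // (s * s), c)  # (S^2, HW/S^2, C)

regions = partition_regions(np.random.randn(32, 32, 64), s=4)  # (16, 64, 64)
```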
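Claims 4 and 5 together describe a two-stage routing attention: region-level similarity (equation (3)) selects the top-\(k\) key regions, and token-level attention (equation (4)) runs only inside the gathered set. The NumPy sketch below follows that reading; the shapes and the loop-based gather are illustrative choices, not the patent's reference implementation.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def routed_attention(q, k, v, top_k):
    """Region scores (eq. 3) pick top_k key regions per query region;
    token attention (eq. 4) then runs only inside the gathered set."""
    n_reg, n_tok, c = q.shape
    qr, kr = q.mean(axis=1), k.mean(axis=1)       # region vectors via pooling
    a_r = qr @ kr.T / np.sqrt(c)                  # correlation matrix (eq. 3)
    idx = np.argsort(-a_r, axis=1)[:, :top_k]     # top-k key regions per query
    out = np.empty_like(q)
    for i in range(n_reg):
        kg = k[idx[i]].reshape(-1, c)             # gathered keys
        vg = v[idx[i]].reshape(-1, c)             # gathered values
        out[i] = softmax(q[i] @ kg.T / np.sqrt(c)) @ vg   # eq. (4)
    return out

q = k = v = np.random.randn(16, 64, 32)           # S^2 = 16 regions, C = 32
o = routed_attention(q, k, v, top_k=4)            # same shape as q
```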
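A NumPy sketch of the pillar index of equation (5) in claim 6; the toy point cloud and the 0.16 m pillar sizes are assumptions for demonstration only.

```python
import numpy as np

def pillar_indices(points, w_p, h_p):
    """Equation (5): map each point (x, y, z, r) to its pillar index (u, v),
    with the grid origin at the point-cloud minimum."""
    x, y = points[:, 0], points[:, 1]
    u = np.floor((x - x.min()) / w_p).astype(int)   # index along x
    v = np.floor((y - y.min()) / h_p).astype(int)   # index along y
    return u, v

pts = np.random.uniform(-40.0, 40.0, size=(1000, 4))  # toy cloud (x, y, z, r)
u, v = pillar_indices(pts, w_p=0.16, h_p=0.16)
```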
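The per-point features of equation (6) and the encode-then-pool step of equation (7) in claim 7 reduce each pillar to one vector. The sketch below stands in a single random linear layer with ReLU for the multi-layer perceptron; the layer width is an assumption.

```python
import numpy as np

def encode_pillar(points, mlp_w):
    """Equations (6)-(7): augment each point with its offset from the pillar
    center, encode with a one-layer MLP stand-in, and max-pool over points."""
    center = points[:, :2].mean(axis=0)              # pillar center (x, y)
    feats = np.hstack([points,                       # x, y, z, r
                       points[:, :2] - center])      # delta-x, delta-y (eq. 6)
    encoded = np.maximum(feats @ mlp_w, 0.0)         # MLP encoding + ReLU
    return encoded.max(axis=0)                       # max pooling     (eq. 7)

pillar_pts = np.random.randn(32, 4)                  # N_p = 32 points
mlp_w = 0.1 * np.random.randn(6, 64)                 # toy weights, 6 -> 64
feature = encode_pillar(pillar_pts, mlp_w)           # (64,) pillar feature
```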
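Equation (8) of claim 8 scatters pillar features onto a dense two-dimensional canvas; the zero fill for empty cells is an assumption consistent with pseudo-image construction.

```python
import numpy as np

def scatter_to_pseudo_image(features, u, v, grid_w, grid_h):
    """Equation (8): place each pillar feature vector at its grid cell
    (u, v); cells holding no pillar remain zero."""
    canvas = np.zeros((grid_h, grid_w, features.shape[1]))
    canvas[v, u] = features
    return canvas

feats = np.random.randn(100, 64)                     # 100 pillar features
u = np.random.randint(0, 200, 100)                   # toy grid coordinates
v = np.random.randint(0, 200, 100)
pseudo = scatter_to_pseudo_image(feats, u, v, 200, 200)  # (200, 200, 64)
```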
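A sketch of the normalized weighted fusion of equation (9) in claim 9, assuming the three inputs are already resized to a common shape and using a stand-in for the fusion convolution; the values of \(D\) and \(K\) in the equation (12) ratio are likewise assumed for illustration.

```python
import numpy as np

def weighted_fusion(p_in, p_up, p_down, w, conv, eps=1e-4):
    """Equation (9): normalized weighted sum of the current level's input,
    the (already resized) upper-level output and lower-level input,
    followed by a convolution."""
    fused = (w[0] * p_in + w[1] * p_up + w[2] * p_down) / (w.sum() + eps)
    return conv(fused)

w = np.abs(np.random.randn(3))             # learnable weights, kept positive
conv = lambda t: t                         # stand-in for the 2-D convolution
p = weighted_fusion(np.ones((16, 16, 64)), np.ones((16, 16, 64)),
                    np.ones((16, 16, 64)), w, conv)

# Equation (12): switching from 3-D to 2-D convolution shrinks the cost by
# a factor of 1 / (D * K); with D = 40 depth cells and K = 3 this is 1/120.
D, K = 40, 3
print(1.0 / (D * K))
```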
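Equations (13)–(17) of claim 10 are standard rigid-body transforms. The NumPy sketch below builds the eight corners, applies the yaw rotation and center translation, and maps into the camera frame; the 4x4 homogeneous form of the radar-to-camera extrinsic matrix is an assumption.

```python
import numpy as np

def box_corners_radar(xc, yc, zc, w, h, l, theta):
    """Equations (14)-(16): eight local corners of a w x h x l box, rotated
    about the z axis by yaw theta and translated to the box center."""
    corners = np.array([[sx * l / 2, sy * w / 2, sz * h / 2]
                        for sx in (1, -1)
                        for sy in (1, -1)
                        for sz in (1, -1)]).T          # (3, 8), eq. (14)
    c, s = np.cos(theta), np.sin(theta)
    rz = np.array([[c, -s, 0.0],
                   [s,  c, 0.0],
                   [0.0, 0.0, 1.0]])                   # eq. (15)
    return rz @ corners + np.array([[xc], [yc], [zc]]) # eq. (16)

def radar_to_camera(v_radar, t_radar_cam):
    """Equation (17): apply the 4x4 radar-to-camera extrinsic matrix."""
    v_h = np.vstack([v_radar, np.ones((1, v_radar.shape[1]))])
    return (t_radar_cam @ v_h)[:3]

corners = box_corners_radar(10.0, 2.0, -0.5, 1.8, 1.5, 4.2, 0.3)  # (3, 8)
```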
Description
Target detection method based on point cloud and image fusion
Technical Field
The invention belongs to the technical field of multi-sensor fusion perception for automatic driving, and particularly relates to a target detection method based on point cloud and image fusion.
Background
In the context of the rapid development of automatic driving technology, the performance of the environment perception system, as the core module for safety decision-making, directly affects the driving safety and reliability of a vehicle. The current mainstream perception schemes rely mainly on two types of sensors, lidar and cameras, but single-sensor schemes have inherent defects. Camera-based visual detection can acquire rich texture and color information and excels at target classification and semantic recognition, but it essentially projects a three-dimensional scene onto a two-dimensional plane, so accurate depth information is difficult to obtain directly and three-dimensional localization accuracy is insufficient. Lidar-based point cloud detection can directly measure the three-dimensional geometric structure of a target and provide accurate distance and position information, but point cloud data is sparse and lacks texture semantics, so the ability to recognize small distant targets is limited. Existing multi-sensor fusion schemes, while attempting to combine the advantages of both sensors, still face challenges in practical applications. The traditional late-fusion method performs target detection independently on each sensor and then matches and fuses at the result level; it is simple to implement but cannot fully exploit the complementary information among the multi-modal data, so the fusion effect is limited. Bird's-eye-view fusion detection algorithms that have appeared in recent years project point clouds and images into a unified bird's-eye-view space for feature fusion; although some progress has been made, the computational complexity is high, the real-time frame rate on a conventional on-board computing platform is only 5 to 6 frames per second, and the real-time requirement is difficult to meet. In addition, existing detection networks have efficiency problems in the feature extraction and fusion stages. Although traditional convolutional neural networks obtain good feature expression capability during training, the computational cost at inference is high, and the non-maximum-suppression loop in particular causes about fifteen percent extra latency. Three-dimensional point cloud detection networks generally use three-dimensional sparse convolution for feature extraction, which requires storing and computing sparse feature maps in three-dimensional space, with a large memory footprint and low computational efficiency. Together these factors make it difficult for existing fusion detection schemes to meet real-time requirements while guaranteeing detection accuracy, and they restrict large-scale deployment in practical automatic driving systems.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a target detection method based on point cloud and image fusion, which optimizes the image detection network by constructing a lightweight reparameterized bottleneck module and a two-stage routing attention mechanism, reduces the computational complexity of three-dimensional detection by adopting a pillar-based point cloud processing method and a weighted bidirectional feature pyramid, and realizes end-to-end real-time fusion detection without non-maximum suppression, thereby markedly improving processing speed while ensuring detection accuracy and solving the problems of insufficient real-time performance and limited fusion effect in existing schemes. To achieve this technical purpose, the invention adopts the following technical scheme. A target detection method based on point cloud and image fusion comprises the following steps: step S1, acquiring two-dimensional image data and three-dimensional point cloud data of a scene to be detected; step S2, inputting the two-dimensional image data into an image detection network, expanding the input channels through a convolution operation and splitting them into a main branch and an auxiliary branch, where the main branch is processed sequentially by bottleneck modules and the auxiliary branch passes its features through directly, then channel-concatenating and convolving the output features of the two branches to output an image feature map; step S3, dividing the image feature map into a preset number of regions, pooling each region to