
CN-121982292-A - Target detection method and system based on multi-modal fusion, and vehicle

CN121982292A

Abstract

Embodiments of the present application provide a target detection method and system based on multi-modal fusion, and a vehicle. The method comprises: acquiring image data and point cloud data of a traffic scene image to be recognized; performing feature extraction on the image data using an image feature extraction branch network to obtain image feature data; performing feature extraction on the point cloud data using a point cloud feature extraction branch network to obtain point cloud feature data; performing feature fusion on the image feature data and the point cloud feature data based on an attention mechanism to obtain fused features; and detecting the fused features using a detection network to obtain a detection result, wherein the detection network handles multi-scale, multi-modal target detection tasks. The application addresses the technical problems in the related art of the limited information capacity of single-modality images and poor adaptability to complex scenes.

Inventors

  • Yuan Li
  • Li Penglong
  • Huo Hongming

Assignees

  • 奇瑞汽车股份有限公司 (Chery Automobile Co., Ltd.)

Dates

Publication Date
2026-05-05
Application Date
2026-01-29

Claims (10)

  1. A target detection method based on multi-modal fusion, characterized by comprising the following steps: acquiring image data and point cloud data of a traffic scene image to be recognized; performing feature extraction on the image data using an image feature extraction branch network to obtain image feature data; performing feature extraction on the point cloud data using a point cloud feature extraction branch network to obtain point cloud feature data; performing feature fusion on the image feature data and the point cloud feature data based on an attention mechanism to obtain fused features; and detecting the fused features using a detection network to obtain a detection result, wherein the detection network is used to handle multi-scale and multi-modal target detection tasks.
  2. The method of claim 1, wherein the image feature extraction branch network comprises a backbone network and a neck network, and wherein performing feature extraction on the image data using the image feature extraction branch network to obtain image feature data comprises: extracting features from the image data through the backbone network to obtain a multi-scale feature representation; and adjusting the multi-scale feature representation through the neck network to obtain the image feature data.
  3. The method of claim 1, wherein the point cloud feature extraction branch network comprises a voxelization network, a three-dimensional convolution network and a bird's-eye-view conversion network, and wherein performing feature extraction on the point cloud data using the point cloud feature extraction branch network to obtain point cloud feature data comprises: voxelizing the point cloud data through the voxelization network to obtain regularly distributed point cloud data; performing feature extraction on the regularly distributed point cloud data through the three-dimensional convolution network to obtain voxel features; and converting the voxel features into a two-dimensional feature map through the bird's-eye-view conversion network to obtain the point cloud feature data.
  4. The method of claim 1, wherein the attention mechanism comprises a channel attention mechanism and a spatial attention mechanism, and wherein performing feature fusion on the image feature data and the point cloud feature data based on the attention mechanism to obtain fused features comprises: performing average pooling on the image feature data and the point cloud feature data through the channel attention mechanism to obtain channel attention weights; performing convolution on the image feature data and the point cloud feature data through the spatial attention mechanism to obtain spatial attention weights; and determining the fused features from the channel attention weights, the spatial attention weights, the image feature data and the point cloud feature data.
  5. The method of claim 4, wherein determining the fused features from the channel attention weights, the spatial attention weights, the image feature data and the point cloud feature data comprises: multiplying the image feature data and the point cloud feature data channel by channel according to the channel attention weights to obtain a channel-weighted feature map; multiplying the image feature data and the point cloud feature data pixel by pixel according to the spatial attention weights to obtain a spatially weighted feature map; and performing element-level fusion of the channel-weighted feature map and the spatially weighted feature map to obtain the fused features.
  6. The method of claim 1, wherein the detection result comprises a target class, a bounding box position and a detection confidence, and wherein detecting the fused features using the detection network comprises: detecting the fused features through the detection network to obtain the target class of at least one target, the bounding box position of the at least one target, and the detection confidence of the at least one target.
  7. The method of claim 6, further comprising: screening the detection confidence corresponding to the at least one target against a confidence threshold; in response to any detection confidence being less than the confidence threshold, discarding the target corresponding to that detection confidence; and in response to any detection confidence being greater than or equal to the confidence threshold, marking the target corresponding to that detection confidence on the traffic scene image to be recognized.
  8. A target detection system based on multi-modal fusion, comprising: an acquisition module for acquiring image data and point cloud data of a traffic scene image to be recognized; a first extraction module for performing feature extraction on the image data using an image feature extraction branch network to obtain image feature data; a second extraction module for performing feature extraction on the point cloud data using a point cloud feature extraction branch network to obtain point cloud feature data; a fusion module for performing feature fusion on the image feature data and the point cloud feature data based on an attention mechanism to obtain fused features; and a detection module for detecting the fused features using a detection network to obtain a detection result, wherein the detection network is used to handle multi-scale and multi-modal target detection tasks.
  9. A vehicle, characterized by comprising: a memory storing an executable program; and a processor for executing the executable program, wherein the executable program, when run on the processor, performs the multi-modal fusion based target detection method of any one of claims 1 to 7.
  10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, wherein the computer program is arranged to perform the multi-modal fusion based target detection method of any one of claims 1 to 7 when run on a computer or processor.
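The voxelization and bird's-eye-view conversion stages of claim 3 can be illustrated with a minimal NumPy sketch. The grid resolution, range bounds, and occupancy pooling below are illustrative assumptions, not values from the patent, and the learned three-dimensional convolution stage is omitted:

```python
import numpy as np

def voxelize_bev(points, grid=(32, 32, 8), bounds=((-40, 40), (-40, 40), (-3, 1))):
    """Assign each (x, y, z) point to a voxel, then collapse the height
    axis into a 2D bird's-eye-view map.

    points: (N, 3) array of x, y, z coordinates.
    Returns a (grid_x, grid_y) BEV map counting occupied height cells per
    column; a crude stand-in for learned voxel features.
    """
    mins = np.array([b[0] for b in bounds], dtype=np.float64)
    maxs = np.array([b[1] for b in bounds], dtype=np.float64)
    size = (maxs - mins) / np.array(grid)
    # Keep only points inside the region of interest
    mask = np.all((points >= mins) & (points < maxs), axis=1)
    idx = ((points[mask] - mins) / size).astype(int)
    # 3D occupancy grid: 1 where at least one point falls in the voxel
    occ = np.zeros(grid, dtype=np.float32)
    occ[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    # Collapse the height axis to obtain a 2D BEV feature map
    return occ.sum(axis=2)
```

In the patent's pipeline, the occupancy grid would instead pass through the three-dimensional convolution network before the height axis is flattened.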

Description

Target detection method and system based on multi-modal fusion, and vehicle

Technical Field

Embodiments of the present application relate to the technical field of image recognition, and in particular to a target detection method and system based on multi-modal fusion, and a vehicle.

Background

In intelligent traffic systems and autonomous driving, target detection is a key link in achieving road safety, traffic management and vehicle navigation. Static elements in real traffic scenes, such as traffic signs and guardrails, can be highly similar to local features of vehicles. In addition, under conditions of dense pedestrians and traffic congestion, mutual occlusion among targets is common. The prior art relies mainly on single-modality data for target detection, adapts poorly to such complex scenes, and may produce missed detections or misclassifications. There is currently no good solution to these problems.

Disclosure of the Invention

Embodiments of the present application provide a target detection method and system based on multi-modal fusion, and a vehicle, which at least solve the technical problems in the related art of the limited information capacity of single-modality images and poor adaptability to complex scenes.
According to one aspect of the embodiments of the present application, a target detection method based on multi-modal fusion is provided, comprising: obtaining image data and point cloud data of a traffic scene image to be recognized; performing feature extraction on the image data using an image feature extraction branch network to obtain image feature data; performing feature extraction on the point cloud data using a point cloud feature extraction branch network to obtain point cloud feature data; performing feature fusion on the image feature data and the point cloud feature data based on an attention mechanism to obtain fused features; and detecting the fused features using a detection network to obtain a detection result, wherein the detection network is used to handle multi-scale and multi-modal target detection tasks.

The image feature extraction branch network comprises a backbone network and a neck network. Performing feature extraction on the image data to obtain image feature data comprises: extracting features from the image data through the backbone network to obtain a multi-scale feature representation, and adjusting the multi-scale feature representation through the neck network to obtain the image feature data.
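The backbone/neck split described above can be sketched as follows. The scale sizes, channel counts, and nearest-neighbor upsampling are illustrative assumptions standing in for the learned backbone outputs and neck (e.g. an FPN-style neck), not the patent's architecture:

```python
import numpy as np

def neck_align(multi_scale, out_hw=(16, 16)):
    """Align backbone feature maps of different scales to a common size.

    multi_scale: list of (C, H, W) arrays, one per backbone stage, where
    each H and W evenly divides the target size.
    Returns a single (sum(C), out_h, out_w) map: each scale is upsampled
    by nearest-neighbor repetition and the scales are stacked on the
    channel axis, a minimal stand-in for the neck's feature adjustment.
    """
    aligned = []
    for feat in multi_scale:
        c, h, w = feat.shape
        ry, rx = out_hw[0] // h, out_hw[1] // w
        # Nearest-neighbor upsample: repeat rows and columns
        aligned.append(np.repeat(np.repeat(feat, ry, axis=1), rx, axis=2))
    return np.concatenate(aligned, axis=0)
```

A real neck would also mix information across scales with learned convolutions; only the resolution alignment is shown here.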
The point cloud feature extraction branch network comprises a voxelization network, a three-dimensional convolution network and a bird's-eye-view conversion network. Performing feature extraction on the point cloud data to obtain point cloud feature data comprises: voxelizing the point cloud data through the voxelization network to obtain regularly distributed point cloud data; performing feature extraction on the regularly distributed point cloud data through the three-dimensional convolution network to obtain voxel features; and converting the voxel features into a two-dimensional feature map through the bird's-eye-view conversion network to obtain the point cloud feature data.

The attention mechanism comprises a channel attention mechanism and a spatial attention mechanism. Performing feature fusion on the image feature data and the point cloud feature data based on the attention mechanism to obtain fused features comprises: performing average pooling on the image feature data and the point cloud feature data through the channel attention mechanism to obtain channel attention weights; performing convolution on the image feature data and the point cloud feature data through the spatial attention mechanism to obtain spatial attention weights; and determining the fused features from the channel attention weights, the spatial attention weights, the image feature data and the point cloud feature data.

Determining the fused features comprises: multiplying the image feature data and the point cloud feature data channel by channel according to the channel attention weights to obtain a channel-weighted feature map; multiplying the image feature data and the point cloud feature data pixel by pixel according to the spatial attention weights to obtain a spatially weighted feature map; and performing element-level fusion of the channel-weighted feature map and the spatially weighted feature map to obtain the fused features.
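The channel and spatial attention fusion described above can be sketched in NumPy. The patent's learned convolutions are replaced here by fixed pooling, sigmoid normalization is assumed, and element-wise addition is chosen as the element-level fusion; all of these are illustrative simplifications:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_fuse(img_feat, pc_feat):
    """Fuse two (C, H, W) feature maps with channel and spatial attention.

    Channel weights come from global average pooling over space; spatial
    weights come from averaging over channels (a stand-in for the learned
    convolution). The channel-weighted and spatially weighted maps are
    then fused element-wise.
    """
    x = np.concatenate([img_feat, pc_feat], axis=0)      # (2C, H, W)
    # Channel attention: average-pool over space, squash to (0, 1)
    ch_w = sigmoid(x.mean(axis=(1, 2)))[:, None, None]   # (2C, 1, 1)
    # Spatial attention: average over channels, squash to (0, 1)
    sp_w = sigmoid(x.mean(axis=0))[None, :, :]           # (1, H, W)
    channel_weighted = x * ch_w    # channel-by-channel reweighting
    spatial_weighted = x * sp_w    # pixel-by-pixel reweighting
    return channel_weighted + spatial_weighted           # element-level fusion
```

Concatenating the two modalities before weighting lets a single set of attention weights reweight both image-derived and point-cloud-derived channels.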
Further, the detection result comprises a target class, a bounding box position and a detection confidence, and the fused features are detected using the detection network to obtain the detection result. The target detection method based on multi-modal fusion further c
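The confidence-threshold screening of claim 7 amounts to a simple filter over detections. The tuple layout and threshold value below are illustrative assumptions, not from the patent:

```python
def filter_detections(detections, conf_threshold=0.5):
    """Split detections by a confidence threshold, per claim 7.

    detections: iterable of (target_class, bbox, confidence) tuples, where
    bbox is (x1, y1, x2, y2). Targets below the threshold are discarded;
    the rest would be marked on the traffic scene image.
    """
    kept = [d for d in detections if d[2] >= conf_threshold]
    discarded = [d for d in detections if d[2] < conf_threshold]
    return kept, discarded
```

In practice this step typically follows non-maximum suppression inside the detection network; only the thresholding itself is shown here.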