CN-121982422-A - Target detection method and system integrating multi-modal progressive perception

CN 121982422 A

Abstract

The invention provides a target detection method and system integrating multi-modal progressive perception, relating to the technical field of image processing. The method comprises: acquiring 3D laser radar data and 2D image data containing a target to be detected; based on extrinsic calibration parameters of the camera and the laser radar, assigning to each point of the 3D laser radar data the semantic label at the corresponding position in the Gaussian-filtered 2D image data, to obtain 3D point cloud label data; performing feature extraction and fusion on the 3D volume map corresponding to the 2D image data and the voxel map corresponding to the 3D point cloud label data, to obtain a first fusion feature; obtaining a first bounding box of the target to be detected in the 3D volume map and a second bounding box of the target to be detected from the first fusion feature; and merging the first and second bounding boxes that satisfy a preset condition to obtain the target detection result.

Inventors

  • ZHANG YINGDA
  • LI YUANMING
  • XIAO HAOYANG
  • SU HAOTIAN
  • XU FAN

Assignees

  • Wuhan University of Technology (武汉理工大学)

Dates

Publication Date
2026-05-05
Application Date
2026-02-06

Claims (10)

  1. A target detection method integrating multi-modal progressive perception, characterized by comprising the following steps: acquiring 3D laser radar data and 2D image data containing a target to be detected; based on extrinsic calibration parameters of a camera and the laser radar, assigning to each point of the 3D laser radar data the semantic label at the corresponding position in the Gaussian-filtered 2D image data, to obtain 3D point cloud label data; acquiring a 3D volume map corresponding to the 2D image data and a voxel map corresponding to the 3D point cloud label data, and performing feature extraction on the 3D volume map and the voxel map respectively to obtain a 3D volume map feature and a voxelized feature; fusing the 3D volume map feature and the voxelized feature to obtain a first fusion feature; and performing a 2D convolution operation on the 3D volume map to obtain a first bounding box of the target to be detected, performing a 3D convolution operation on the first fusion feature to obtain a second bounding box of the target to be detected, projecting the second bounding box onto the 2D plane and determining a consistency parameter of the category labels between the second bounding box and the first bounding box, and merging the first bounding box and the second bounding box whose consistency parameter satisfies a preset condition, to obtain a target detection result.
  2. The target detection method integrating multi-modal progressive perception according to claim 1, wherein assigning to each point of the 3D laser radar data the semantic label at the corresponding position in the Gaussian-filtered 2D image data, based on the extrinsic calibration parameters of the camera and the laser radar, to obtain the 3D point cloud label data comprises: performing feature extraction on the Gaussian-filtered 2D image data using an EfficientNet model to obtain a first feature set; performing semantic segmentation on the first feature set based on the DeepLabv3+ framework, and assigning a semantic label to each pixel in the 2D image data; and projecting the 3D laser radar data onto the image plane based on the extrinsic calibration parameters of the camera and the laser radar, and obtaining, for each point, the pixel and semantic label at the corresponding position in the 2D image data.
  3. The target detection method integrating multi-modal progressive perception according to claim 1, wherein acquiring the 3D volume map corresponding to the 2D image data and the voxel map corresponding to the 3D point cloud label data, and performing feature extraction on each respectively to obtain the 3D volume map feature and the voxelized feature, comprises: extracting features of the 2D image data through a MobileNet module, and obtaining the 3D volume map corresponding to the 2D image data by combining the extrinsic calibration parameters; and converting the 3D point cloud label data into the voxel map using an adaptive voxelization method.
  4. The target detection method integrating multi-modal progressive perception according to claim 1, wherein fusing the 3D volume map feature and the voxelized feature to obtain the first fusion feature comprises: performing multi-sensor fusion of the 3D volume map feature and the voxelized feature through a Cross-Modality Attention module to obtain a preliminary fusion feature; and enhancing the preliminary fusion feature using a Multi-Modality Attention module to obtain the first fusion feature.
  5. The target detection method integrating multi-modal progressive perception according to claim 1, wherein the consistency parameter includes an IoU value, and wherein projecting the second bounding box onto the 2D plane, determining the consistency parameter of the category labels between the second bounding box and the first bounding box, and merging the first bounding box and the second bounding box whose consistency parameter satisfies the preset condition to obtain the target detection result comprises: obtaining a first confidence corresponding to the first bounding box and a second confidence corresponding to the second bounding box; and determining that the consistency parameter satisfies the preset condition when the first confidence is greater than or equal to a preset confidence threshold, the second confidence is greater than or equal to the IoU value, and the IoU value is greater than or equal to a preset IoU threshold.
  6. The target detection method integrating multi-modal progressive perception according to claim 5, wherein after merging the first bounding box and the second bounding box whose consistency parameter satisfies the preset condition to obtain the target detection result, the method further comprises: obtaining a first Euclidean distance between the 2D image data before and after Gaussian filtering, a second Euclidean distance between the 3D laser radar data before and after Gaussian filtering, the number of detection boxes whose consistency parameter exceeds the preset IoU threshold but which are geometrically inconsistent, the intensity difference between the emitted and reflected laser of the 3D laser radar, and the semantic information difference between the first fusion feature and the initial 2D image data; and performing a weighted calculation over the first Euclidean distance, the second Euclidean distance, the number of detection boxes, the intensity difference, and the semantic information difference to determine the confidence of the target detection result.
  7. The target detection method integrating multi-modal progressive perception according to claim 1, wherein the target detection result includes a semantic map, and after merging the first bounding box and the second bounding box whose consistency parameter satisfies the preset condition, the method further comprises: discretizing the semantic map into a voxel map, and determining a global path planning strategy using an A* algorithm based on semantic weight constraints; and modeling the global path planning strategy as a Markov decision process and optimizing it with proximal policy optimization to obtain a target path planning strategy.
  8. A target detection system integrating multi-modal progressive perception, characterized by comprising a data acquisition module, a semantic processing module, a feature fusion module, and a target detection module, wherein: the data acquisition module is configured to acquire 3D laser radar data and 2D image data containing a target to be detected; the semantic processing module is configured to assign to each point of the 3D laser radar data the semantic label at the corresponding position in the Gaussian-filtered 2D image data, based on extrinsic calibration parameters of the camera and the laser radar, to obtain 3D point cloud label data; the feature fusion module is configured to acquire a 3D volume map corresponding to the 2D image data and a voxel map corresponding to the 3D point cloud label data, perform feature extraction on each respectively to obtain a 3D volume map feature and a voxelized feature, and fuse them to obtain a first fusion feature; and the target detection module is configured to perform a 2D convolution operation on the 3D volume map to obtain a first bounding box of the target to be detected, perform a 3D convolution operation on the first fusion feature to obtain a second bounding box of the target to be detected, project the second bounding box onto the 2D plane and determine a consistency parameter of the category labels between the second bounding box and the first bounding box, and merge the first bounding box and the second bounding box whose consistency parameter satisfies a preset condition, to obtain a target detection result.
  9. An electronic device comprising a processor and a memory storing a computer program, wherein the computer program, when executed by the processor, implements the target detection method integrating multi-modal progressive perception according to any one of claims 1 to 7.
  10. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the target detection method integrating multi-modal progressive perception according to any one of claims 1 to 7.
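The voxel map of claim 3 can be illustrated with a minimal sketch. The patent specifies an *adaptive* voxelization method whose details are not given here; this simplified version uses a uniform grid of assumed size `voxel_size` and a majority vote over the per-point semantic labels, purely to show the data flow from labeled points to a labeled voxel map:

```python
import numpy as np

def voxelize(points, labels, voxel_size=0.2):
    """Bucket labeled LiDAR points into a sparse voxel grid.

    points: (N, 3) float array of x, y, z coordinates.
    labels: (N,) int array of per-point semantic labels.
    Returns a dict mapping voxel index (i, j, k) -> majority semantic label.
    (Uniform grid for illustration; the patented adaptive voxelization
    would instead vary voxel_size, e.g. with point density.)
    """
    idx = np.floor(points / voxel_size).astype(np.int64)
    voxels = {}
    for key, lab in zip(map(tuple, idx), labels):
        voxels.setdefault(key, []).append(lab)
    # Majority vote of semantic labels per occupied voxel
    return {k: int(np.bincount(v).argmax()) for k, v in voxels.items()}
```

The sparse dict representation keeps memory proportional to occupied voxels only, which matters for typically sparse outdoor LiDAR scans.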
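The merging condition of claim 5 can be sketched directly. The IoU helper is standard; `should_merge` encodes the claim's three-part test as literally stated (first confidence against a confidence threshold, second confidence against the IoU value, IoU against an IoU threshold), with the two threshold values being illustrative assumptions:

```python
def iou_2d(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def should_merge(conf_2d, conf_3d, iou, conf_thresh=0.5, iou_thresh=0.5):
    """Consistency test of claim 5: merge the 2D box and the projected
    3D box when the 2D confidence clears the confidence threshold, the
    3D confidence is at least the IoU value, and the IoU value clears
    the IoU threshold. Threshold defaults are assumptions."""
    return conf_2d >= conf_thresh and conf_3d >= iou and iou >= iou_thresh
```

Here `iou` would be computed between the first bounding box and the second bounding box after the latter is projected onto the 2D plane.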
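Claim 7's global planning step, A* constrained by semantic weights, can also be sketched. This is a generic grid A* in which each cell's traversal cost comes from its semantic class (e.g. road cheap, vegetation expensive); the 4-neighborhood, the Manhattan heuristic, and the cost values are illustrative assumptions, not the patent's specific weighting:

```python
import heapq

def astar_semantic(grid_cost, start, goal):
    """A* over a 2D grid whose per-cell cost encodes a semantic weight.

    grid_cost: 2D list of per-cell step costs (>= 1; higher = avoid).
    start, goal: (row, col) tuples.
    Returns the path as a list of cells, or None if unreachable.
    """
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan
    rows, cols = len(grid_cost), len(grid_cost[0])
    open_heap = [(h(start), 0.0, start, [start])]
    best = {start: 0.0}
    while open_heap:
        _, g, cur, path = heapq.heappop(open_heap)
        if cur == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            r, c = cur[0] + dr, cur[1] + dc
            if 0 <= r < rows and 0 <= c < cols:
                ng = g + grid_cost[r][c]  # semantic weight as step cost
                if ng < best.get((r, c), float("inf")):
                    best[(r, c)] = ng
                    heapq.heappush(
                        open_heap, (ng + h((r, c)), ng, (r, c), path + [(r, c)]))
    return None
```

With all step costs at least 1, the Manhattan heuristic never overestimates the remaining cost, so the first time the goal is popped the path is optimal. Refining this plan via a Markov decision process and proximal policy optimization, as the claim continues, is a learning step outside the scope of this sketch.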

Description

Target detection method and system integrating multi-modal progressive perception

Technical Field

The invention relates to the technical field of image processing, and in particular to a target detection method and system integrating multi-modal progressive perception.

Background

With the rapid development of autonomous driving and intelligent robotics, high-precision, high-robustness environment perception has become a core precondition for safe and reliable navigation. Existing navigation systems that rely on a single sensor show clear performance limits in complex dynamic environments and under extreme working conditions. Vision-based navigation systems acquire two-dimensional image information through a camera and achieve environment understanding via target detection, semantic segmentation, and depth estimation algorithms. They excel at expressing texture and semantics, but are easily affected by illumination changes, rain and fog occlusion, backlight, and precision loss in long-distance imaging; in dynamic or harsh environments in particular, degraded image quality directly reduces the stability and accuracy of target detection. Navigation systems based on laser radar, by contrast, provide accurate depth and geometric information and are more robust for obstacle detection and map construction. However, the point cloud data acquired by a laser radar is sparse and lacks semantic features, making small objects hard to identify; moreover, high-precision laser radars are costly and computationally demanding, while low-precision models struggle to meet the demands of complex scenes.
To break through the performance bottleneck of a single sensor, multi-modal fusion has become a research hotspot in recent years; the combination of laser radar and camera is naturally complementary. However, existing multi-modal fusion techniques remain at a shallow fusion stage, for example simple concatenation of detection results or late decision-stage fusion. They fail to mine the deep association between image and point cloud at the feature level, so semantic and geometric information are insufficiently coordinated, multi-modal data are not effectively fused, and further improvement of target detection accuracy and scene understanding is constrained.

Disclosure of the Invention

In view of the above, the invention provides a target detection method and system integrating multi-modal progressive perception. In a first aspect, the invention provides a target detection method integrating multi-modal progressive perception, comprising the following steps: acquiring 3D laser radar data and 2D image data containing a target to be detected; based on extrinsic calibration parameters of a camera and the laser radar, assigning to each point of the 3D laser radar data the semantic label at the corresponding position in the Gaussian-filtered 2D image data, to obtain 3D point cloud label data; acquiring a 3D volume map corresponding to the 2D image data and a voxel map corresponding to the 3D point cloud label data, and performing feature extraction on each respectively to obtain a 3D volume map feature and a voxelized feature; and performing a 2D convolution operation on the 3D volume map to obtain a first bounding box of the target to be detected, performing a 3D convolution operation on the first fusion feature to obtain a second bounding box of the target to be detected, projecting the second bounding box onto the 2D plane and determining a consistency parameter of the category labels between the second bounding box and the first bounding box, and merging the first bounding box and the second bounding box whose consistency parameter satisfies a preset condition, to obtain a target detection result.

On the basis of the above technical solution, preferably, assigning to each point of the 3D laser radar data the semantic label at the corresponding position in the Gaussian-filtered 2D image data, based on the extrinsic calibration parameters of the camera and the laser radar, to obtain the 3D point cloud label data includes: performing feature extraction on the Gaussian-filtered 2D image data using an EfficientNet model to obtain a first feature set; performing semantic segmentation on the first feature set based on the DeepLabv3+ framework, and assigning a semantic label to each pixel in the 2D image data; and projecting the 3D laser radar data onto the image plane based on the extrinsic calibration parameters of the camera and the laser radar, and obtaining, for each point, the pixel and semantic label at the corresponding position in the 2D image data.
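The projection and label-assignment step described above can be sketched with the standard pinhole model. This is a minimal illustration, assuming a rigid LiDAR-to-camera transform (R, t) and intrinsic matrix K obtained from extrinsic calibration; the function names and the nearest-pixel lookup are assumptions, not the patent's exact procedure:

```python
import numpy as np

def label_points(points, seg_mask, K, R, t):
    """Project LiDAR points into the image and copy the semantic
    label of the pixel each point lands on.

    points: (N, 3) LiDAR points in the LiDAR frame.
    seg_mask: (H, W) per-pixel semantic labels from the 2D branch.
    K: (3, 3) camera intrinsics; R (3, 3), t (3,): LiDAR-to-camera extrinsics.
    Returns an (M, 4) array of [x, y, z, label] for the points that
    project inside the image with positive depth.
    """
    cam = points @ R.T + t                     # LiDAR frame -> camera frame
    in_front = cam[:, 2] > 0                   # keep points in front of camera
    cam = cam[in_front]
    pix = cam @ K.T                            # pinhole projection
    u = (pix[:, 0] / pix[:, 2]).astype(int)    # column index
    v = (pix[:, 1] / pix[:, 2]).astype(int)    # row index
    h, w = seg_mask.shape
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    labels = seg_mask[v[ok], u[ok]]
    return np.hstack([points[in_front][ok], labels[:, None]])
```

Points behind the camera or projecting outside the image are discarded rather than labeled, mirroring the fact that only points with a valid corresponding pixel can inherit a semantic label.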