CN-121982675-A - Three-dimensional target detection method and system combined with two-dimensional auxiliary network
Abstract
The invention relates to a three-dimensional object detection method and system combined with a two-dimensional auxiliary network. The method first collects multi-view images and extracts a first feature map; a two-dimensional auxiliary network then predicts from the first feature map a second feature map containing bounding-box center-point coordinates, category, centrality, depth, and orientation-angle information; sampling weights are computed from the category and centrality information, and the second feature map is screened to obtain valid indices; position codes and semantic codes are extracted via the valid indices to construct three-dimensional adaptive queries; finally, the three-dimensional adaptive queries are concatenated with the sparse queries, fed into a Transformer decoder for feature interaction, and processed by a prediction head to output the three-dimensional object detection result. Compared with the prior art, the method improves the accuracy and robustness of object detection.
Inventors
- YU MU
- WANG JUN
- GUO YAFENG
- ZHANG CHAOJIE
- DONG YANCHAO
Assignees
- Tongji University
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2025-12-22
Claims (10)
- 1. A three-dimensional object detection method combined with a two-dimensional auxiliary network, characterized by comprising the steps of: S1, acquiring multi-view images, extracting a first feature map, and initializing a group of sparse queries in a normalized three-dimensional space through uniform distribution; S2, predicting the first feature map using a two-dimensional auxiliary network to obtain a second feature map containing bounding-box center-point coordinates, category information, centrality information, depth information, and orientation-angle information; S3, calculating sampling weights based on the category information and the centrality information, and screening the second feature map based on the sampling weights to obtain valid indices; S4, extracting a position code and a semantic code according to the valid indices, and constructing three-dimensional adaptive queries based on the position code and the semantic code; and S5, concatenating the three-dimensional adaptive queries with the sparse queries, feeding them into a Transformer decoder for feature interaction, and outputting the three-dimensional object detection result after processing by a prediction head.
- 2. The three-dimensional object detection method combined with a two-dimensional auxiliary network according to claim 1, wherein S2 specifically comprises: S21, predicting the bounding-box center-point coordinates, category information, and centrality information of each pixel in the first feature map using a two-dimensional feature extraction network; S22, predicting the depth information of each pixel in the first feature map using a convolutional network; S23, calculating the orientation-angle information of each pixel in the first feature map using a convolutional network; and S24, obtaining a second feature map containing the bounding-box center-point coordinates, category information, centrality information, depth information, and orientation-angle information.
- 3. The three-dimensional object detection method combined with a two-dimensional auxiliary network according to claim 2, wherein in S22, when predicting the depth information of each pixel in the first feature map using the convolutional network, depth prediction is modeled as a classification task over discrete depth intervals: a Softmax function determines the probability that the feature belongs to each depth interval, and an Argmax operation selects the depth interval with the highest probability as the depth information.
- 4. The three-dimensional object detection method combined with a two-dimensional auxiliary network according to claim 2, wherein in S23, when the orientation-angle information of each pixel in the first feature map is calculated using the convolutional network, the angle value is recovered from its predicted sine and cosine values to obtain the orientation-angle information.
- 5. The three-dimensional object detection method combined with a two-dimensional auxiliary network according to claim 1, wherein the sampling weight in S3 is obtained by computing the dot product of the category information and the centrality information.
- 6. The three-dimensional object detection method combined with a two-dimensional auxiliary network according to claim 1, wherein the specific process of screening to obtain the valid indices in S3 comprises: presetting a validity threshold, removing redundant responses in the second feature map by a max-pooling operation to obtain a third feature map, and marking, in the third feature map, the indices of pixels whose sampling weight exceeds the validity threshold as valid indices, the valid indices being expressed as: $I_{valid} = \{\, i \mid \mathrm{MaxPool}(c_i \cdot o_i) > \tau \,\}$, where $c_i$ denotes the category information of each pixel, $o_i$ the centrality information of each pixel, $\cdot$ the dot product, $\mathrm{MaxPool}$ the max-pooling operation, $I_{valid}$ the valid indices, and $\tau$ the preset validity threshold.
- 7. The three-dimensional object detection method combined with a two-dimensional auxiliary network according to claim 1, wherein the specific process of extracting the position code and the semantic code according to the valid indices in S4 comprises: extracting the pixel coordinates and depth values of the corresponding pixels via the valid indices, reconstructing them into three-dimensional points in the camera coordinate system by a back-projection operation, converting the three-dimensional coordinates from the camera coordinate system to the world coordinate system using a rotation matrix, and finally processing the converted three-dimensional coordinates with an MLP to generate the position code; and extracting the feature information and orientation-angle information of the corresponding pixels via the valid indices, reducing redundancy in the feature information, feeding the orientation-angle information and the redundancy-reduced feature information into separate MLPs, concatenating the MLP outputs along the channel dimension, and processing the concatenated result through a residual structure to form the semantic code.
- 8. The three-dimensional object detection method combined with a two-dimensional auxiliary network according to claim 7, wherein the position code is specifically expressed as: $p_{3d} = E^{-1} K^{-1} [u\,d,\; v\,d,\; d]^{T}$, $q_{pos} = \psi(p_{3d})$, where $K$ denotes the camera intrinsic matrix; $E$ the camera extrinsic matrix; $d$ the depth information at the coordinates $(u, v)$, the center point of the two-dimensional bounding box; $\psi$ a post-processing step consisting of a sinusoidal transformation and an MLP; $p_{3d}$ the three-dimensional coordinate; and $q_{pos}$ the position code.
- 9. The three-dimensional object detection method combined with a two-dimensional auxiliary network according to claim 7, wherein the semantic code is specifically expressed as: $f'_i = \mathrm{Reduce}(F_i)$, $q_{sem} = \mathrm{LN}\big(f'_i + \alpha \cdot \mathrm{MLP}(\mathrm{Concat}(\mathrm{MLP}(f'_i),\, \mathrm{MLP}(\theta_i)))\big)$, where $F_i$ denotes the feature information corresponding to the valid index; $\theta_i$ the orientation-angle information corresponding to the valid index; $f'_i$ the redundancy-reduced feature information of the pixel; $\mathrm{Concat}$ concatenation along the channel dimension; $\mathrm{LN}$ layer normalization; $\alpha$ a parameter that dynamically balances the contribution of the MLP-transformed features to the final semantic code; and $q_{sem}$ the semantic code.
- 10. A three-dimensional object detection system combined with a two-dimensional auxiliary network, the system operating according to the three-dimensional object detection method combined with a two-dimensional auxiliary network of any one of claims 1-9, the system comprising: an image acquisition and feature extraction module, used for acquiring multi-view images and extracting a first feature map; a two-dimensional auxiliary prediction module, used for predicting the first feature map with a two-dimensional auxiliary network to obtain a second feature map containing bounding-box center-point coordinates, category information, centrality information, depth information, and orientation-angle information; an index screening module, used for calculating sampling weights based on the category information and the centrality information and screening valid indices from the second feature map based on the sampling weights; a query construction module, used for extracting position codes and semantic codes according to the valid indices and constructing three-dimensional adaptive queries based on them; and a detection and output module, used for concatenating the three-dimensional adaptive queries with the sparse queries, feeding them into a Transformer decoder for feature interaction, and outputting the three-dimensional object detection result after processing by a prediction head.
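The screening step of claims 5 and 6 can be sketched in plain Python. This is an illustrative sketch, not the patent's implementation: the 3x3 pooling window, the threshold value `tau`, and the use of per-pixel scalar scores are assumptions introduced here for clarity.

```python
# Hedged sketch of the index-screening step (claims 5-6): the sampling weight
# is the product of per-pixel category score and centrality, a 3x3 max-pool
# suppresses redundant neighbouring responses, and pixels whose weight exceeds
# a preset validity threshold tau become the valid indices.
# The window size and tau are illustrative assumptions, not from the patent.

def screen_valid_indices(cls_score, centerness, tau=0.3):
    """cls_score, centerness: HxW nested lists of floats in [0, 1]."""
    h, w = len(cls_score), len(cls_score[0])
    # Sampling weight: elementwise (dot) product of category and centrality.
    weight = [[cls_score[y][x] * centerness[y][x] for x in range(w)]
              for y in range(h)]
    valid = []
    for y in range(h):
        for x in range(w):
            # 3x3 max-pool: keep a pixel only if it is a local maximum,
            # removing redundant responses around the same object.
            neigh = [weight[yy][xx]
                     for yy in range(max(0, y - 1), min(h, y + 2))
                     for xx in range(max(0, x - 1), min(w, x + 2))]
            if weight[y][x] >= max(neigh) and weight[y][x] > tau:
                valid.append((y, x))
    return valid
```

For example, a 2x2 map with one confident, well-centered pixel yields a single valid index at that pixel, while a map of uniformly weak responses yields none.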
Description
Three-dimensional target detection method and system combined with two-dimensional auxiliary network

Technical Field

The invention relates to the technical field of object detection, and in particular to a three-dimensional object detection method and system combined with a two-dimensional auxiliary network.

Background

In the perception task of autonomous driving, the system needs to understand the surrounding environment accurately and detect objects in the scene. For purely vision-based autonomous driving tasks, existing sparse-query methods typically generate a limited number of three-dimensional queries only at preset spatial locations. This inherent sparsity limits their ability to accurately model and detect objects in dense scenes, and they often fall short when dealing with distant or small objects. Improving detection in dense scenes is therefore important for achieving more intelligent and safer autonomous driving. While existing methods have made progress in fusing two-dimensional information to enhance three-dimensional query modeling, they generally fail to exploit the potential value of target orientation-angle information. Chinese patent publication CN121010905A discloses a small-object detection method for unmanned aerial vehicle images based on dynamic filtering and an adaptive sparse Transformer. The method achieves a certain effect in UAV small-object detection scenes by suppressing noise and feature redundancy. However, although it screens queries using classification and localization information, it focuses primarily on feature enhancement and noise suppression in the two-dimensional image plane, failing to fully exploit the geometric prior of the target in three-dimensional space, and in particular ignoring the critical geometric feature of orientation angle.
In two-dimensional images, the orientation of an object tends to provide a distinct geometric cue. Explicitly introducing the target's orientation-angle information when generating three-dimensional adaptive queries is important for reducing the uncertainty in the subsequent orientation-angle regression, and remains a technical problem to be solved. Existing methods generally fail to introduce this information explicitly when generating adaptive queries. As a result, the semantic characterization of the initial queries in three-dimensional space is insufficient, subjecting the model to greater uncertainty in the subsequent three-dimensional box regression. This uncertainty directly limits the model's understanding of complex scenes, so existing three-dimensional object detection systems remain insufficient in detection precision and robustness and struggle to meet the safety requirements of high-level autonomous driving.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a three-dimensional object detection method and system combined with a two-dimensional auxiliary network.
The aim of the invention can be achieved by the following technical scheme. According to one aspect of the present invention, there is provided a three-dimensional object detection method combined with a two-dimensional auxiliary network, the method comprising the steps of: S1, acquiring multi-view images, extracting a first feature map, and initializing a group of sparse queries in a normalized three-dimensional space through uniform distribution; S2, predicting the first feature map using a two-dimensional auxiliary network to obtain a second feature map containing bounding-box center-point coordinates, category information, centrality information, depth information, and orientation-angle information; S3, calculating sampling weights based on the category information and the centrality information, and screening the second feature map based on the sampling weights to obtain valid indices; S4, extracting a position code and a semantic code according to the valid indices, and constructing three-dimensional adaptive queries based on them; and S5, concatenating the three-dimensional adaptive queries with the sparse queries, feeding them into a Transformer decoder for feature interaction, and outputting the three-dimensional object detection result after processing by a prediction head. As a preferred technical solution, the specific steps of S2 include: S21, predicting the bounding-box center-point coordinates, category information, and centrality information of each pixel in the first feature map using a two-dimensional feature extraction network; S22, pr
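The per-pixel predictions of S22-S23 (depth as classification over discrete intervals, orientation from sine/cosine values) and the back-projection used in S4 can be sketched as follows. The depth-bin centers and the pinhole intrinsics `fx, fy, cx, cy` are illustrative assumptions; the patent does not fix their values.

```python
import math

# Hedged sketch of S22-S23 and the back-projection in S4. Bin boundaries and
# camera parameters below are invented for illustration only.

def decode_depth(logits, bin_centers):
    """S22: pick the depth-bin center with the highest Softmax probability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]   # numerically stable Softmax
    probs = [e / sum(exps) for e in exps]
    best = probs.index(max(probs))             # Argmax over the depth bins
    return bin_centers[best]

def decode_orientation(sin_pred, cos_pred):
    """S23: recover the angle (radians) from its predicted sin/cos pair."""
    return math.atan2(sin_pred, cos_pred)

def backproject(u, v, d, fx, fy, cx, cy):
    """S4: lift a pixel (u, v) with depth d into camera coordinates
    using an assumed pinhole model (inverse of the intrinsic matrix K)."""
    x = (u - cx) * d / fx
    y = (v - cy) * d / fy
    return (x, y, d)
```

The `atan2` form is quadrant-aware, which is one reason orientation heads commonly regress the (sin, cos) pair rather than the raw angle; the extrinsic (camera-to-world) transform described in claim 7 would be applied after `backproject`.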