US-20260127757-A1 - MULTI-VIEW-BASED 3D OBJECT DETECTION METHOD AND SYSTEM

Abstract

Provided are a multi-view-based 3D (three-dimensional) object detection method and system. The method is optimized based on the existing three-dimensional object detection model PETR. Specifically, 3DRoPE is introduced to replace the original 3D position encoding mode, and learnable parameters are set for the 3D position information to enhance adaptability. In addition, the pose geometric information of multi-view cameras is fused into the position embedding of each view image, further improving the model's capability to handle complex spatial relationships. After being trained on the NuScenes dataset, the model can receive real-time image input from a plurality of cameras and output accurate 3D object detection results, significantly improving perception accuracy and safety in autonomous driving environments.

Inventors

  • Xinzhong ZHU
  • Huiying XU
  • Hongbo Li
  • Ke Sheng
  • Wei Shi
  • Weifeng Su
  • Xiao Huang

Assignees

  • ZHEJIANG NORMAL UNIVERSITY
  • BEIJING GEEKPLUS TECHNOLOGY CO., LTD.
  • Hangzhou Zongheng Communication Co., Ltd

Dates

Publication Date
2026-05-07
Application Date
2025-11-05
Priority Date
2024-11-06

Claims (8)

  1. A multi-view-based 3D object detection method, comprising: constructing a multi-view 3D object detection model in an autonomous driving scene based on three-dimensional object detection PETR improvement; replacing original 3D position embedding in original three-dimensional object detection PETR by using 3D position embedding 3DRoPE based on rotation embedding RoPE; setting a learnable parameter for to-be-embedded 3D position information; and fusing pose geometric information of multi-view cameras into 3D position embedding 3DRoPE of an image under each view to obtain an improved 3D object detection model; training the improved 3D object detection model by using a NuScenes dataset in an autonomous driving scene to obtain a multi-view 3D object detection model in the autonomous driving scene; and inputting images acquired by the multi-view cameras in the autonomous driving scene into the 3D object detection model to obtain a 3D object detection result in the autonomous driving scene.
  2. The multi-view-based 3D object detection method according to claim 1, wherein the 3D position embedding 3DRoPE based on rotation embedding RoPE comprises: dividing 3D query and 3D point position information embedding of a 2D image into three parts, namely, for each 3D point position $P_n(p_x, p_y, p_z)$, applying 1-dimensional rotation embedding RoPE to the position information of the x, y, and z axes according to a sequence of x, y and z, and embedding and concatenating the positions of the three dimensions to form a complete 3D position embedding 3DRoPE, defined as: generating a rotation matrix of different dimensionality position information according to the position information of the different axes x, y, z:
     $$R_x = e^{i\theta_{tx} p_x}, \quad R_y = e^{i\theta_{ty} p_y}, \quad R_z = e^{i\theta_{tz} p_z}$$
     wherein $R_x, R_y, R_z \in \mathbb{C}^{N \times (d_{head}/6)}$ represent rotation matrices of the position information of the different axes, $d_{head}$ represents a number of channels of the image features and 3D queries, $\theta_t$ represents a frequency used in the sine and cosine encoding method to extend two-dimensional complex rotation to high-dimensional rotations corresponding to the image and 3D query vectors, $\theta_{tx}, \theta_{ty}, \theta_{tz} = 10000^{-t/(d_{head}/6)}$, $t \in \{0, 1, \ldots, d_{head}/6\}$, and $\theta_{tx}, \theta_{ty}, \theta_{tz}$ represent frequencies of rotation for the position information of the different axes, respectively; dividing related image feature vectors and 3D query vectors obtained from images captured by the multi-view cameras through a convolutional neural network into three equal parts according to the number of channels to apply rotations of position information of different dimensions:
     $$Q_x = \{Q_1, Q_2, \ldots, Q_{n-5}, Q_{n-4}\}, \quad Q_y = \{Q_3, Q_4, \ldots, Q_{n-3}, Q_{n-2}\}, \quad Q_z = \{Q_5, Q_6, \ldots, Q_{n-1}, Q_n\}$$
     $$Q'_x = Q_x e^{i\theta_{tx} p_x}, \quad Q'_y = Q_y e^{i\theta_{ty} p_y}, \quad Q'_z = Q_z e^{i\theta_{tz} p_z}, \quad Q_{ros} = \mathrm{cat}(Q'_x, Q'_y, Q'_z)$$
     wherein $Q_x, Q_y, Q_z$ represent the vectors divided into three parts, $Q'_x, Q'_y, Q'_z$ represent the vectors after rotation containing the position information of each dimension, $Q_{ros}$ represents a vector obtained by concatenating the three rotated vectors into a complete position embedding vector containing three-dimensional point position information, and cat represents concatenation; and processing embedding vectors of the image features and the 3D queries obtained from images captured by the multi-view cameras through a convolutional neural network by attention computation to represent a relative positional relationship between the image features and the 3D queries in an attention matrix:
     $$A'_{(n,m)} = \mathrm{Re}[q'_n k'^{*}_m] = \mathrm{Re}[q_n k^{*}_m e^{i(n-m)\theta}]$$
     wherein $A'_{(n,m)}$ represents the attention matrix after computation, $q'_n$ and $k'^{*}_m$ represent the rotated query and key, $e^{i(n-m)\theta}$ represents a result of the attention computation, and n and m represent two different positions.
  3. The multi-view-based 3D object detection method according to claim 1, wherein setting the learnable parameter for the to-be-embedded 3D position information comprises:
     $$P^{3d}_{\alpha} = \alpha(p_x, p_y, p_z)$$
     wherein $P^{3d}_{\alpha}$ represents adaptive 3D point position information, $\alpha$ represents the learnable parameter, and $p_x$, $p_y$ and $p_z$ represent the 3D point positions.
  4. The multi-view-based 3D object detection method according to claim 1, wherein fusing the pose geometric information of the multi-view cameras into the 3D position embedding 3DRoPE of the image under each view comprises: calculating a pose $[q_n, t_n]$ of the camera of each view using intrinsic and extrinsic parameters of that camera, wherein $q_n$ represents a quaternion vector indicating rotation, and $t_n$ represents a position vector; capturing the corresponding geometric attributes using a Fourier transform:
     $$\gamma(x \mid [f_1, \ldots, f_k]) = [\sin(f_1 \pi x), \cos(f_1 \pi x), \ldots]$$
     wherein $\gamma(\cdot)$ represents a Fourier transform function, $[f_1, \ldots, f_k]$ are k frequencies evenly sampled from $[0, f_{max}]$, and x represents the attributes of each camera pose; mapping the geometric attributes after the Fourier transform to dimensions corresponding to the image features using a multi-layer perceptron (MLP):
     $$G^e_n = \mathrm{MLP}_{enc}(\gamma([q_n, t_n]))$$
     wherein $G^e_n$ represents the pose geometric embedding of the camera of each view, $q_n$ represents a quaternion vector indicating rotation, and $t_n$ represents a position vector; and adding $G^e_n$ to the position embedding to form a complete pose-enhanced position embedding.
  5. A multi-view-based 3D object detection system using the multi-view-based 3D object detection method according to claim 1, comprising: a model construction module, configured to construct the multi-view 3D object detection model in the autonomous driving scene based on the three-dimensional object detection PETR improvement; replace the original 3D position embedding in the original three-dimensional object detection PETR by using the 3D position embedding 3DRoPE based on the rotation embedding RoPE; set the learnable parameter for the to-be-embedded 3D position information; and fuse the pose geometric information of the multi-view cameras into the 3D position embedding 3DRoPE of the image under each view to obtain the improved 3D object detection model; a model training module, configured to train the improved 3D object detection model by using the NuScenes dataset in the autonomous driving scene to obtain the multi-view 3D object detection model in the autonomous driving scene; and a model detection module, configured to input the images acquired by the multi-view cameras in the autonomous driving scene into the 3D object detection model to obtain the 3D object detection result in the autonomous driving scene.
  6. The multi-view-based 3D object detection system according to claim 5, wherein in the multi-view-based 3D object detection method, the 3D position embedding 3DRoPE based on rotation embedding RoPE comprises: dividing 3D query and 3D point position information embedding of a 2D image into three parts, namely, for each 3D point position $P_n(p_x, p_y, p_z)$, applying 1-dimensional rotation embedding RoPE to the position information of the x, y, and z axes according to a sequence of x, y and z, and embedding and concatenating the positions of the three dimensions to form a complete 3D position embedding 3DRoPE, defined as: generating a rotation matrix of different dimensionality position information according to the position information of the different axes x, y, z:
     $$R_x = e^{i\theta_{tx} p_x}, \quad R_y = e^{i\theta_{ty} p_y}, \quad R_z = e^{i\theta_{tz} p_z}$$
     wherein $R_x, R_y, R_z \in \mathbb{C}^{N \times (d_{head}/6)}$ represent rotation matrices of the position information of the different axes, $d_{head}$ represents a number of channels of the image features and 3D queries, $\theta_t$ represents a frequency used in the sine and cosine encoding method to extend two-dimensional complex rotation to high-dimensional rotations corresponding to the image and 3D query vectors, $\theta_{tx}, \theta_{ty}, \theta_{tz} = 10000^{-t/(d_{head}/6)}$, $t \in \{0, 1, \ldots, d_{head}/6\}$, and $\theta_{tx}, \theta_{ty}, \theta_{tz}$ represent frequencies of rotation for the position information of the different axes, respectively; dividing related image feature vectors and 3D query vectors obtained from images captured by the multi-view cameras through a convolutional neural network into three equal parts according to the number of channels to apply rotations of position information of different dimensions:
     $$Q_x = \{Q_1, Q_2, \ldots, Q_{n-5}, Q_{n-4}\}, \quad Q_y = \{Q_3, Q_4, \ldots, Q_{n-3}, Q_{n-2}\}, \quad Q_z = \{Q_5, Q_6, \ldots, Q_{n-1}, Q_n\}$$
     $$Q'_x = Q_x e^{i\theta_{tx} p_x}, \quad Q'_y = Q_y e^{i\theta_{ty} p_y}, \quad Q'_z = Q_z e^{i\theta_{tz} p_z}, \quad Q_{ros} = \mathrm{cat}(Q'_x, Q'_y, Q'_z)$$
     wherein $Q_x, Q_y, Q_z$ represent the vectors divided into three parts, $Q'_x, Q'_y, Q'_z$ represent the vectors after rotation containing the position information of each dimension, $Q_{ros}$ represents a vector obtained by concatenating the three rotated vectors into a complete position embedding vector containing three-dimensional point position information, and cat represents concatenation; and processing embedding vectors of the image features and the 3D queries obtained from images captured by the multi-view cameras through a convolutional neural network by attention computation to represent a relative positional relationship between the image features and the 3D queries in an attention matrix:
     $$A'_{(n,m)} = \mathrm{Re}[q'_n k'^{*}_m] = \mathrm{Re}[q_n k^{*}_m e^{i(n-m)\theta}]$$
     wherein $A'_{(n,m)}$ represents the attention matrix after computation, $q'_n$ and $k'^{*}_m$ represent the rotated query and key, $e^{i(n-m)\theta}$ represents a result of the attention computation, and n and m represent two different positions.
  7. The multi-view-based 3D object detection system according to claim 5, wherein in the multi-view-based 3D object detection method, setting the learnable parameter for the to-be-embedded 3D position information comprises:
     $$P^{3d}_{\alpha} = \alpha(p_x, p_y, p_z)$$
     wherein $P^{3d}_{\alpha}$ represents adaptive 3D point position information, $\alpha$ represents the learnable parameter, and $p_x$, $p_y$ and $p_z$ represent the 3D point positions.
  8. The multi-view-based 3D object detection system according to claim 5, wherein in the multi-view-based 3D object detection method, fusing the pose geometric information of the multi-view cameras into the 3D position embedding 3DRoPE of the image under each view comprises: calculating a pose $[q_n, t_n]$ of the camera of each view using intrinsic and extrinsic parameters of that camera, wherein $q_n$ represents a quaternion vector indicating rotation, and $t_n$ represents a position vector; capturing the corresponding geometric attributes using a Fourier transform:
     $$\gamma(x \mid [f_1, \ldots, f_k]) = [\sin(f_1 \pi x), \cos(f_1 \pi x), \ldots]$$
     wherein $\gamma(\cdot)$ represents a Fourier transform function, $[f_1, \ldots, f_k]$ are k frequencies evenly sampled from $[0, f_{max}]$, and x represents the attributes of each camera pose; mapping the geometric attributes after the Fourier transform to dimensions corresponding to the image features using a multi-layer perceptron (MLP):
     $$G^e_n = \mathrm{MLP}_{enc}(\gamma([q_n, t_n]))$$
     wherein $G^e_n$ represents the pose geometric embedding of the camera of each view, $q_n$ represents a quaternion vector indicating rotation, and $t_n$ represents a position vector; and adding $G^e_n$ to the position embedding to form a complete pose-enhanced position embedding.
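
Note on claims 2 and 6: the 3DRoPE construction they recite can be pictured with the minimal Python/NumPy sketch below. It is an illustrative reading, not the patented implementation: the claims describe an interleaved channel split, whereas the sketch uses contiguous thirds for clarity, and each axis is rotated as standard 1-D RoPE over complex channel pairs. All names (rope_1d, rope_3d, base) are hypothetical.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Apply 1-D rotary embedding (RoPE) to x (..., d) at scalar position pos.

    The d channels are treated as d/2 complex pairs; pair t is rotated by
    angle theta_t * pos with theta_t = base ** (-2t / d), mirroring the
    frequency schedule of standard RoPE (an assumption for this sketch).
    """
    d = x.shape[-1]
    assert d % 2 == 0, "channel count of each axis group must be even"
    t = np.arange(d // 2)
    theta = base ** (-2.0 * t / d)        # rotation frequencies, shape (d/2,)
    angle = theta * pos                   # per-pair rotation angles
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[..., 0::2], x[..., 1::2]   # real / imaginary parts of each pair
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin  # complex rotation, real component
    out[..., 1::2] = x1 * sin + x2 * cos  # complex rotation, imaginary component
    return out

def rope_3d(v, p):
    """3DRoPE sketch: split the channels into x/y/z thirds, rotate each third by
    the corresponding coordinate of the 3D point p = (px, py, pz), and
    concatenate the rotated parts, i.e. Q_ros = cat(Q'_x, Q'_y, Q'_z)."""
    assert v.shape[-1] % 6 == 0, "d_head should be divisible by 6 (3 axes x complex pairs)"
    vx, vy, vz = np.split(v, 3, axis=-1)  # contiguous thirds (simplification)
    px, py, pz = p
    return np.concatenate([rope_1d(vx, px), rope_1d(vy, py), rope_1d(vz, pz)], axis=-1)

# Example: rotate a 24-channel query and key by their 3D positions, then check
# that their inner product depends only on the relative offset between the
# positions -- the property used by the attention term Re[q'_n k'*_m].
q = np.random.randn(24).astype(np.float32)
k = np.random.randn(24).astype(np.float32)
d1 = rope_3d(q, (1.0, 2.0, 3.0)) @ rope_3d(k, (1.5, 2.5, 2.0))    # offset (0.5, 0.5, -1.0)
d2 = rope_3d(q, (4.0, 0.0, -1.0)) @ rope_3d(k, (4.5, 0.5, -2.0))  # same offset
print(np.allclose(d1, d2, atol=1e-3))  # True
```

The closing check illustrates why the claimed attention computation encodes a relative positional relationship: after rotation, the query-key inner product is unchanged when both 3D positions are shifted by the same amount.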

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is based upon and claims priority to Chinese Patent Application No. 202411575301.7, filed on Nov. 6, 2024, the entire contents of which are incorporated herein by reference.

TECHNICAL FIELD

The present invention relates to the technical field of computer vision, and more specifically, to a multi-view-based 3D object detection method and system.

BACKGROUND

At present, there are mainly two methods based on the bird's-eye-view (BEV) perspective. One method converts two-dimensional image features into dense BEV features and then performs subsequent tasks such as object detection using the BEV features. The other method, referred to as a sparse query-based method, performs interaction directly between global three-dimensional (3D) queries and image features through an attention mechanism, and updates the 3D queries using a decoder to complete the detection task. The first method requires conversion of image features into explicit BEV features, which demands more computational resources and is therefore less efficient. The second method utilizes 3D queries and image features for interaction more directly and efficiently, resulting in lower computational resource requirements and higher computational efficiency than the first method. However, the second method still exhibits a gap in detection accuracy and related performance metrics.

For practical applications, the second method clearly has a significant advantage. However, its current implementation exhibits insufficient detection accuracy. Although the detection speed is relatively fast, the low detection accuracy limits the applicability of this method, particularly in real-world autonomous driving scenes. Therefore, how to improve the detection accuracy of the sparse query-based method while maintaining its computational efficiency advantage is an issue urgently to be resolved by those skilled in the art.

SUMMARY

In view of this, the present invention provides a multi-view-based 3D object detection method and system to resolve the problem described in the Background section. To achieve the above objective, the present invention provides the following technical solutions.

A multi-view-based 3D object detection method includes: constructing a multi-view 3D object detection model in an autonomous driving scene based on improvement of the three-dimensional object detection model PETR; replacing the original 3D position embedding in the original three-dimensional object detection PETR by using 3D position embedding 3DRoPE based on rotation embedding RoPE; setting a learnable parameter for the to-be-embedded 3D position information; and fusing pose geometric information of multi-view cameras into the 3D position embedding 3DRoPE of the image under each view to obtain an improved 3D object detection model; training the improved 3D object detection model by using the NuScenes dataset in an autonomous driving scene to obtain a multi-view 3D object detection model in the autonomous driving scene; and inputting images acquired by the multi-view cameras in the autonomous driving scene into the 3D object detection model to obtain a 3D object detection result in the autonomous driving scene.
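
As a rough illustration of the last two modifications above (the learnable parameter applied to the 3D point positions, and the camera-pose geometric embedding built from Fourier features and an MLP; see claims 3 and 4), the following PyTorch-style sketch shows one plausible wiring. The module name, tensor shapes, the per-axis form of the learnable parameter, and the two-layer MLP are assumptions rather than the disclosed implementation.

```python
import torch
import torch.nn as nn

class PoseGeometricEmbedding(nn.Module):
    """Sketch of the pose-enhanced position embedding described in the summary.

    - alpha is a learnable scale applied to 3D point positions: P_alpha = alpha * (px, py, pz).
    - A camera pose [q_n, t_n] (quaternion + translation, 7 values) is expanded with
      Fourier features gamma(x) = [sin(f_1*pi*x), cos(f_1*pi*x), ...], mapped by an MLP
      to the image-feature dimension, and added to the position embedding.
    """

    def __init__(self, embed_dim=256, num_freqs=8, f_max=8.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(3))       # learnable scale (per axis; a scalar would also fit the claim)
        self.register_buffer("freqs", torch.linspace(0.0, f_max, num_freqs))  # k frequencies in [0, f_max]
        pose_dim = 7 * num_freqs * 2                    # 7 pose attributes, sin + cos per frequency
        self.mlp_enc = nn.Sequential(                   # stands in for MLP_enc in the description
            nn.Linear(pose_dim, embed_dim),
            nn.ReLU(inplace=True),
            nn.Linear(embed_dim, embed_dim),
        )

    def fourier(self, x):
        # x: (..., 7) pose attributes -> (..., 7 * num_freqs * 2) Fourier features
        ang = x.unsqueeze(-1) * self.freqs * torch.pi   # (..., 7, num_freqs)
        return torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1).flatten(-2)

    def forward(self, points_3d, cam_pose, pos_embed):
        """points_3d: (..., 3); cam_pose: (..., 7) = [quaternion, translation];
        pos_embed: (..., embed_dim) 3DRoPE-style position embedding."""
        p_alpha = self.alpha * points_3d                # adaptive 3D point positions P_alpha
        g_e = self.mlp_enc(self.fourier(cam_pose))      # pose geometric embedding G_n^e
        return p_alpha, pos_embed + g_e                 # pose-enhanced position embedding
```

In use, the returned pose-enhanced embedding would accompany the image features of the corresponding view before the attention computation with the 3D queries, as described above.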
Preferably, the 3D position embedding 3DRoPE based on rotation embedding RoPE includes: dividing 3D query and 3D point position information embedding of a 2D image into three parts, namely, for each 3D point position $P_n(p_x, p_y, p_z)$, applying 1-dimensional rotation embedding RoPE to the position information of the x, y, and z axes according to a sequence of x, y and z, and embedding and concatenating the positions of the three dimensions to form a complete 3D position embedding 3DRoPE, defined as: generating a rotation matrix of different dimensionality position information according to the position information of the different axes x, y, z:

$$R_x = e^{i\theta_{tx} p_x}, \quad R_y = e^{i\theta_{ty} p_y}, \quad R_z = e^{i\theta_{tz} p_z}$$

where $R_x, R_y, R_z \in \mathbb{C}^{N \times (d_{head}/6)}$ represent rotation matrices of the position information of the different axes, $d_{head}$ represents a number of channels of the image features and 3D queries, $\theta_t$ represents a frequency used in the sine and cosine encoding method to extend two-dimensional complex rotation to high-dimensional rotations corresponding to the image and 3D query vectors, $\theta_{tx}, \theta_{ty}, \theta_{tz} = 10000^{-t/(d_{head}/6)}$, $t \in \{0, 1, \ldots, d_{head}/6\}$, and $\theta_{tx}, \theta_{ty}, \theta_{tz}$ represent frequencies of rotation for the position information of the different axes, respectively; dividing related image feature vectors and 3D query vectors obtained from images captured by the multi-view cameras through a convolutional neural network into three equal parts according to the number of channels to apply rotations of position information of different dimensions:

$$Q_x = \{Q_1, Q_2, \ldots, Q_{n-5}, Q_{n-4}\}, \quad Q_y = \{Q_3, Q_4, \ldots, Q_{n-3}, Q_{n-2}\}, \quad Q_z = \{Q_5, Q_6, \ldots, Q_{n-1}, Q_n\}$$

$$Q'_x = Q_x e^{i\theta_{tx} p_x}, \quad Q'_y = Q_y e^{i\theta_{ty} p_y}, \quad Q'_z = Q_z e^{i\theta_{tz} p_z}, \quad Q_{ros} = \mathrm{cat}(Q'_x, Q'_y, Q'_z)$$

where $Q_x, Q_y, Q_z$ represent the vectors divided into three parts, $Q'_x, Q'_y, Q'_z$ represent the vectors after rotation containing the position information of each dimension, $Q_{ros}$ represents a vector obtained by concatenating the three rotated vectors into a complete position embedding vector containing three-dimensional point position information, and cat represents concatenation