CN-122024322-A - Robot pose estimation method based on 3D Gaussian splatting and a dual-attention ray scoring network
Abstract
The invention discloses a robot pose estimation method based on 3D Gaussian splatting and a dual-attention ray scoring network, aimed at two problems: methods using a single RGB image are easily affected by initial-pose dependence and rotational ambiguity, while methods that depend on depth or multi-view images incur high storage and data acquisition costs. In the invention, a 3D Gaussian scene model of an indoor environment is built with the 3D Gaussian splatting method, and a dual-attention ray scoring network (DARS-Net) is established. Through an improved geometric scoring mechanism, ray scoring is decomposed into position scoring and direction scoring: rays with high position scores are used to predict the camera position, and rays with high direction scores are used to predict the camera orientation. DARS-Net effectively overcomes rotational ambiguity and significantly improves translational and rotational accuracy. Finally, the coarse pose obtained from the 3DGS rays is further refined by efficient feature point matching. The invention belongs to the field of indoor service robots.
Inventors
- ZHAO LIJUN
- JIANG ZHIQIANG
- KONG QINGJIA
- WANG HUAQI
Assignees
- Harbin Institute of Technology (哈尔滨工业大学)
- Zhengzhou Research Institute of Harbin Institute of Technology (哈工大郑州研究院)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-02-02
Claims (10)
- 1. A robot pose estimation method based on 3D Gaussian splatting and a dual-attention ray scoring network, characterized by comprising the following steps: S1, acquiring a multi-view RGB image sequence of an indoor environment with a robot, and processing the multi-view RGB image sequence with structure-from-motion to obtain the camera poses and a sparse point cloud of the indoor environment; S2, initializing each point in the sparse point cloud into a 3D Gaussian primitive based on the 3D Gaussian splatting method, projecting each 3D Gaussian primitive onto the two-dimensional image plane of the camera according to the camera pose for rendering, computing the loss between the rendered image and the RGB image of S1, and gradually updating the Gaussian attributes by back-propagation to obtain a 3D Gaussian scene model; S3, establishing a dual-attention ray scoring network, wherein the dual-attention ray scoring network comprises an enhanced MLP network, a DINOv2 backbone network, a first attention module and a second attention module; the enhanced MLP network is arranged in parallel with the DINOv2 backbone network, the first attention module and the second attention module are arranged in parallel, the enhanced MLP network is connected to both the first attention module and the second attention module, and the DINOv2 backbone network is connected to both the first attention module and the second attention module; S4, acquiring a monocular RGB image to be detected of the indoor environment of S1, inputting the monocular RGB image to be detected into the 3D Gaussian scene model to obtain projection rays, inputting the monocular RGB image to be detected and the projection rays into the dual-attention ray scoring network, and outputting the position score and the direction score of each projection ray; S5, selecting high-confidence rays with a Top-K screening strategy based on the position scores and the direction scores, solving the coarse position and coarse orientation of the camera from the high-confidence rays with a direction-vector weighted-sum algorithm, and generating a rendered view of the indoor environment from the coarse position, the coarse orientation and the 3D Gaussian scene model; and S6, establishing correspondences between the monocular RGB image to be detected and the rendered view with a feature matching algorithm to obtain the pose residual of the robot, and minimizing the pose residual with a PnP algorithm to obtain the 6-degree-of-freedom pose of the robot.
- 2. The robot pose estimation method based on 3D Gaussian splatting and a dual-attention ray scoring network according to claim 1, wherein the probability density function $G(x)$ of a 3D Gaussian primitive in S2 is: $G(x) = e^{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)}$ (1), wherein $\mu$ is the mean vector of the 3D Gaussian primitive, $\Sigma$ is the covariance matrix of the 3D Gaussian primitive, $e$ is the base of the natural logarithm, $x$ is the three-dimensional coordinate vector of any point in the indoor environment, and $(x-\mu)^{T}$ is the transpose of $(x-\mu)$.
- 3. The robot pose estimation method based on 3D Gaussian splatting and a dual-attention ray scoring network according to claim 2, wherein the specific process of S2 is as follows: S21, initializing each point in the sparse point cloud into a 3D Gaussian primitive based on the 3D Gaussian splatting method; S22, projecting each 3D Gaussian primitive onto the two-dimensional image plane of the camera for rendering according to the camera pose, specifically: (1) according to the camera pose, using the viewing transformation matrix $W$ and the Jacobian matrix $J$ of the affine approximation of the projective transformation, computing the covariance matrix $\Sigma'$ of each 3D Gaussian primitive in the camera coordinate system: $\Sigma' = J W \Sigma W^{T} J^{T}$ (2); (2) obtaining the three-dimensional Gaussian distribution from the mean vector $\mu$ of each 3D Gaussian center and the covariance matrix $\Sigma'$ of each 3D Gaussian primitive by using formula (1); (3) sorting the 3D Gaussian primitives overlapping in the three-dimensional Gaussian distribution from shallow to deep in depth with a tile-based differentiable rasterizer, determining the weight of each 3D Gaussian primitive from the sorting result, and computing the color each 3D Gaussian primitive contributes to a pixel from the weight, to obtain the rendering result of each pixel; S23, obtaining the rendered image from the rendering result of each pixel, computing the loss between the rendered image and the RGB image of S1, and gradually updating the Gaussian attributes by back-propagation to obtain the 3D Gaussian scene model.
- 4. The robot pose estimation method based on 3D Gaussian splatting and a dual-attention ray scoring network according to claim 3, wherein the specific process of (3) in S22 is: the tile-based differentiable rasterizer adopts an $\alpha$-blending strategy; the tile-based differentiable rasterizer first sorts the overlapping 3D Gaussian primitives in the three-dimensional Gaussian distribution from shallow to deep in depth, then determines the weight of each 3D Gaussian primitive from the sorting result, computes the color each 3D Gaussian primitive contributes to the pixel from the weight, and accumulates the color attribute of each pixel with the $\alpha$-blending strategy; the accumulated color $C$ of each pixel is: $C = \sum_{i=1}^{N} c_{i}\,\alpha_{i}\,T_{i}$, with $T_{i} = \prod_{j=1}^{i-1}(1-\alpha_{j})$ (3), wherein $N$ is the number of Gaussian primitives participating in the accumulation for the pixel, $c_{i}$ is the color of the $i$-th Gaussian primitive, $\alpha_{i}$ is the opacity contribution of the $i$-th Gaussian primitive, $T_{i}$ is the accumulated transmittance, and $\alpha_{j}$ is the opacity contribution of the $j$-th Gaussian primitive; the rendering result of each pixel is thereby obtained (an illustrative sketch of this rendering step is given after the claims).
- 5. The robot pose estimation method based on 3D Gaussian splatting and a dual-attention ray scoring network according to claim 4, wherein the specific process of S23 is as follows: the accumulated color of each pixel yields a rendered image corresponding to the RGB image acquired in S1; the loss between the rendered image and the RGB image is computed continuously, and the Gaussian attributes are updated step by step through back-propagation to obtain the 3D Gaussian scene model.
- 6. The robot pose estimation method based on 3D Gaussian splatting and a dual-attention ray scoring network according to claim 5, wherein the enhanced MLP network in S3 sequentially comprises a positional encoding layer, a fully connected feature extraction layer and a feature projection layer; the fully connected feature extraction layer is formed by stacking multiple linear layers interleaved with nonlinear activation functions, and the fully connected feature extraction layer adopts residual connections or skip connections (an illustrative sketch of this encoder is given after the claims).
- 7. The robot pose estimation method based on 3D Gaussian splatting and a dual-attention ray scoring network according to claim 6, wherein the first attention module and the second attention module in S3 have the same structure and the same inputs, and each sequentially comprises a fully connected layer, a correlation calculation layer, a normalization layer and a Softmax layer.
- 8. The robot pose estimation method based on 3D Gaussian splatting and a dual-attention ray scoring network according to claim 7, wherein the specific process of S4 is as follows: S41, acquiring a monocular RGB image to be detected of the indoor environment of S1, inputting the monocular RGB image to be detected into the 3D Gaussian scene model, the 3D Gaussian scene model generating a plurality of projection rays for each Gaussian ellipsoid and automatically screening the rays corresponding to the monocular RGB image to be detected; S42, inputting the monocular RGB image to be detected and the rays corresponding to it into the dual-attention ray scoring network, and outputting the position score and the direction score of each ray, specifically: S421, inputting the monocular RGB image to be detected into the DINOv2 backbone network and outputting an image feature set; S422, inputting the rays corresponding to the monocular RGB image to be detected into the positional encoding layer of the enhanced MLP network, mapping each ray into a high-dimensional space according to its original geometric parameters using sine and cosine functions or hash encoding, and generating a high-frequency position embedding vector for each ray, wherein the original geometric parameters consist of the coordinate vectors of the ray origin and the ray direction; inputting the high-frequency position embedding vector of each ray into the fully connected feature extraction layer and outputting the fused feature of each ray; inputting the fused feature of each ray into the feature projection layer, which adjusts the fused feature dimension of each ray to a preset feature dimension through a linear transformation, and outputting the feature set of each ray; S423, inputting the feature set of each ray and the image feature set into the fully connected layer of the first attention module, taking the ray features as queries and the image features as keys, mapping the ray features and the image features into a unified attention dimension space, and generating a query vector set and a key vector set; inputting the query vector set and the key vector set into the correlation calculation layer, and performing matrix multiplication on the query vector set and the key vector set to obtain raw attention maps of size "number of rays × number of image pixels"; inputting the raw attention maps into the normalization layer to obtain normalized attention maps; inputting the normalized attention maps into the Softmax layer and performing a row-wise summation on each normalized attention map, i.e. accumulating the attention weights of a single ray over all pixels, and outputting the direction score of each ray; the expression of the direction score is: $S_{i}^{dir} = \sum_{j=1}^{W \times H} A_{ij}^{dir}$ (4), wherein $S_{i}^{dir}$ is the direction score of the $i$-th ray, $A_{ij}^{dir}$ is the normalized attention weight of the $i$-th ray on the $j$-th pixel in the first attention module, $W$ is the image width, and $H$ is the image height; S424, inputting the feature set of each ray and the image feature set into the fully connected layer of the second attention module, taking the ray features as queries and the image features as keys, mapping the ray features and the image features into a unified attention dimension space, and generating a query vector set and a key vector set; inputting the query vector set and the key vector set into the correlation calculation layer, and performing matrix multiplication on the query vector set and the key vector set to obtain raw attention maps of size "number of rays × number of image pixels"; inputting the raw attention maps into the normalization layer to obtain normalized attention maps; inputting the normalized attention maps into the Softmax layer and performing a row-wise summation on each normalized attention map, i.e. accumulating the attention weights of a single ray over all pixels, and outputting the position score of each ray; the expression of the position score is: $S_{i}^{pos} = \sum_{j=1}^{W \times H} A_{ij}^{pos}$ (5), wherein $S_{i}^{pos}$ is the position score of the $i$-th ray, and $A_{ij}^{pos}$ is the normalized attention weight of the $i$-th ray on the $j$-th pixel in the second attention module (an illustrative sketch of one such attention module is given after the claims).
- 9. The robot pose estimation method based on 3D Gaussian splatting and a dual-attention ray scoring network according to claim 8, wherein the specific process of S5 is as follows: S51, selecting $K$ high-confidence rays with a Top-K screening strategy based on the position score and the direction score of each ray; S52, solving the coarse position and coarse orientation of the camera from the $K$ high-confidence rays with the direction-vector weighted-sum algorithm combined with weighted least squares, specifically: the direction-vector weighted-sum algorithm is: $t = \arg\min_{t} \sum_{i=1}^{K} S_{i}^{pos} \left\| \left( I - d_{i} d_{i}^{T} \right) \left( t - o_{i} \right) \right\|^{2}$ (6), wherein $t$ is the coarse position of the camera, $S_{i}^{pos}$ is the position score of the $i$-th ray, $I$ is the identity matrix, $d_{i}$ is the direction vector of the $i$-th ray, $d_{i}^{T}$ is the transpose of $d_{i}$, and $o_{i}$ is the origin of the $i$-th ray; the direction-vector weighted-sum algorithm is solved with weighted least squares to obtain the coarse position of the camera; $r = \dfrac{\sum_{i=1}^{K} S_{i}^{dir} d_{i}}{\left\| \sum_{i=1}^{K} S_{i}^{dir} d_{i} \right\|}$ (7), wherein $r$ is the coarse orientation of the camera and $S_{i}^{dir}$ is the direction score of the $i$-th ray; and S53, generating a rendered view of the indoor environment from the coarse position and coarse orientation of the camera and the 3D Gaussian scene model (an illustrative sketch of this coarse pose solver is given after the claims).
- 10. The robot pose estimation method based on 3D Gaussian splatting and a dual-attention ray scoring network according to claim 9, wherein the specific process of S6 is as follows: S61, extracting and matching 2D feature points between the monocular RGB image to be detected and the rendered view using the LoFTR method; S62, projecting the 2D feature points on the rendered view obtained in S61 into the 3D Gaussian scene model using the depth information of the 3D Gaussian primitives and the camera intrinsics to obtain 2D-3D feature point correspondences, and obtaining the pose residual of the robot from the 2D-3D feature point correspondences; and S63, minimizing the pose residual of the robot using the PnP algorithm to obtain the 6-degree-of-freedom pose of the robot (an illustrative sketch of this refinement step is given after the claims).
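Illustrative sketch for claims 2-4: a minimal NumPy sketch of the per-pixel compositing described above, i.e. evaluating the Gaussian density of equation (1), projecting a covariance into camera coordinates with equation (2), and front-to-back alpha blending with equation (3). All function names, shapes and the toy inputs are illustrative assumptions, not part of the patent; a real tile-based differentiable rasterizer performs these steps on the GPU and backpropagates through them.

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Unnormalized 3D Gaussian density of equation (1)."""
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.inv(sigma) @ d)

def project_covariance(sigma, W, J):
    """Camera-space covariance of equation (2): sigma' = J W sigma W^T J^T."""
    return J @ W @ sigma @ W.T @ J.T

def alpha_blend(colors, alphas):
    """Front-to-back alpha blending of equation (3) for a single pixel.

    colors: (N, 3) colors of the depth-sorted Gaussians overlapping the pixel.
    alphas: (N,)   their opacity contributions at the pixel, nearest first.
    """
    pixel = np.zeros(3)
    transmittance = 1.0                    # accumulated transmittance T_i
    for c_i, a_i in zip(colors, alphas):
        pixel += c_i * a_i * transmittance
        transmittance *= 1.0 - a_i         # T_{i+1} = T_i * (1 - alpha_i)
    return pixel

# toy usage: density at the mean is 1, and a red Gaussian occludes a blue one
print(gaussian_density(np.zeros(3), np.zeros(3), np.eye(3)))           # 1.0
print(alpha_blend(np.array([[1.0, 0, 0], [0, 0, 1.0]]), np.array([0.6, 0.8])))
```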
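Illustrative sketch for claim 6: a minimal PyTorch sketch of the enhanced MLP ray encoder, i.e. a sinusoidal positional encoding of the 6-D ray parameters (origin and direction), a fully connected feature extractor with a residual connection, and a feature projection layer. The layer widths, frequency count and output dimension are assumptions.

```python
import torch
import torch.nn as nn

class RayEncoder(nn.Module):
    """Enhanced MLP of claim 6: positional encoding -> residual MLP -> projection."""

    def __init__(self, n_freqs: int = 6, hidden: int = 256, out_dim: int = 128):
        super().__init__()
        self.n_freqs = n_freqs
        in_dim = 6 * 2 * n_freqs           # 6 ray params, sin+cos per frequency
        self.input = nn.Linear(in_dim, hidden)
        self.block = nn.Sequential(         # one residual block of the extractor
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
        )
        self.project = nn.Linear(hidden, out_dim)   # feature projection layer

    def positional_encoding(self, rays: torch.Tensor) -> torch.Tensor:
        """Map (origin, direction) to a high-frequency embedding with sin/cos."""
        freqs = 2.0 ** torch.arange(self.n_freqs, device=rays.device) * torch.pi
        scaled = rays.unsqueeze(-1) * freqs            # (N, 6, F)
        enc = torch.cat([scaled.sin(), scaled.cos()], dim=-1)
        return enc.flatten(start_dim=1)                # (N, 6 * 2 * F)

    def forward(self, rays: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.input(self.positional_encoding(rays)))
        h = torch.relu(h + self.block(h))              # residual (skip) connection
        return self.project(h)                         # (N, out_dim) ray features

rays = torch.randn(1000, 6)      # 1000 rays: origin (3) + direction (3)
print(RayEncoder()(rays).shape)  # torch.Size([1000, 128])
```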
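Illustrative sketch for claims 7-8: a minimal PyTorch sketch of one attention module, i.e. fully connected projections of ray features (queries) and image features (keys), a ray-by-pixel attention map obtained by matrix multiplication, normalization, a softmax, and a row-wise summation that yields one score per ray as in equations (4)/(5). The scaling used as normalization and the axis of the softmax (across rays, so that the row sums differ between rays) are assumptions; the same module would be instantiated twice, once for direction scores and once for position scores.

```python
import torch
import torch.nn as nn

class RayScoringAttention(nn.Module):
    """One attention module: FC -> correlation -> normalization -> softmax -> row sum."""

    def __init__(self, ray_dim: int = 128, img_dim: int = 384, attn_dim: int = 128):
        super().__init__()
        self.to_query = nn.Linear(ray_dim, attn_dim)    # ray features as queries
        self.to_key = nn.Linear(img_dim, attn_dim)      # image features as keys
        self.scale = attn_dim ** -0.5                   # plays the role of normalization

    def forward(self, ray_feats: torch.Tensor, img_feats: torch.Tensor) -> torch.Tensor:
        """ray_feats: (N, ray_dim); img_feats: (H*W, img_dim) -> scores of shape (N,)."""
        q = self.to_query(ray_feats)                    # (N, attn_dim)
        k = self.to_key(img_feats)                      # (H*W, attn_dim)
        attn = q @ k.t() * self.scale                   # raw attention map, (N, H*W)
        attn = torch.softmax(attn, dim=0)               # softmax across rays for each pixel
        return attn.sum(dim=-1)                         # eq. (4)/(5): sum a ray's weights over pixels

ray_feats = torch.randn(1000, 128)           # e.g. output of the enhanced MLP encoder
img_feats = torch.randn(32 * 32, 384)        # e.g. DINOv2 patch features of the query image
direction_scorer = RayScoringAttention()
print(direction_scorer(ray_feats, img_feats).shape)    # torch.Size([1000])
```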
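Illustrative sketch for claim 9: a minimal NumPy sketch of the coarse pose solver, i.e. Top-K selection by score, the camera position as the weighted least-squares point closest to the selected rays (one reading of equation (6)), and the coarse orientation as the normalized score-weighted sum of ray directions (one reading of equation (7)). The sign convention of the ray directions and the toy inputs are assumptions.

```python
import numpy as np

def topk(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k highest-scoring rays."""
    return np.argsort(scores)[-k:]

def coarse_position(origins, directions, pos_scores):
    """Weighted least-squares point closest to the rays (one reading of equation (6))."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d, w in zip(origins, directions, pos_scores):
        d = d / np.linalg.norm(d)
        P = np.eye(3) - np.outer(d, d)     # projects onto the plane normal to the ray
        A += w * P
        b += w * P @ o
    return np.linalg.solve(A, b)

def coarse_orientation(directions, dir_scores):
    """Normalized score-weighted sum of ray directions (one reading of equation (7))."""
    r = (dir_scores[:, None] * directions).sum(axis=0)
    return r / np.linalg.norm(r)

# toy usage: three synthetic rays converging at the origin
origins = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
directions = -origins                      # rays pointing back toward the origin
pos_scores = np.array([1.0, 0.5, 0.8])
dir_scores = np.array([0.9, 0.4, 0.7])
idx = topk(pos_scores, 3)
print(coarse_position(origins[idx], directions[idx], pos_scores[idx]))   # ~[0 0 0]
print(coarse_orientation(directions[idx], dir_scores[idx]))
```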
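Illustrative sketch for claim 10: a minimal OpenCV sketch of the refinement step, i.e. lifting matched 2D points of the rendered view to 3D with the rendered depth and the camera intrinsics, and minimizing the reprojection residual with a RANSAC PnP solver. The matcher (e.g. LoFTR) is not reproduced; the input arrays stand in for its output, and the convention that the coarse pose is camera-to-world is an assumption.

```python
import cv2
import numpy as np

def refine_pose(query_pts, render_pts, render_depth, K, coarse_R, coarse_t):
    """Lift matched 2D points of the rendered view to 3D and solve PnP.

    query_pts, render_pts : (M, 2) matched pixel coordinates (e.g. from LoFTR).
    render_depth          : (H, W) depth rendered from the 3D Gaussian scene model.
    K                     : (3, 3) camera intrinsics.
    coarse_R, coarse_t    : coarse camera-to-world pose of the rendered view.
    """
    u, v = render_pts[:, 0], render_pts[:, 1]
    z = render_depth[v.astype(int), u.astype(int)]
    # back-project rendered pixels to camera coordinates, then to world coordinates
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z], axis=1)
    pts_world = (coarse_R @ pts_cam.T).T + coarse_t

    # minimize the reprojection (pose) residual with a RANSAC PnP solver
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_world.astype(np.float64),
        query_pts.astype(np.float64),
        K.astype(np.float64),
        distCoeffs=None,
    )
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec.ravel()      # refined 6-DoF pose (world -> query camera)
```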
Description
Robot pose estimation method based on 3D Gaussian splatting and a dual-attention ray scoring network

Technical Field

The invention relates to the field of indoor service robots, and in particular to an indoor service robot pose estimation method based on 3D Gaussian splatting and a dual-attention ray scoring network.

Background

In the field of indoor service robots and augmented reality (AR) navigation, 6-degree-of-freedom (6-DoF) pose estimation is the basis for obstacle avoidance, grasping and virtual-real fusion by indoor service robots. At present, the task mainly relies on RGB-D sensors for environmental perception. Although RGB-D sensors have become standard on many sweeping or interactive robots, they often fail when facing highly reflective or transparent materials such as floor-to-ceiling glass windows and mirrors, causing indoor service robots to misperceive obstacles or collide; at the same time, the expensive sensor cost limits the popularization of consumer products. Although the monocular RGB scheme has wider applicability, existing instance-level RGB pose estimation algorithms usually assume an accurate three-dimensional CAD model of the known target. In an actual home scene, however, furniture styles vary widely, there are many weakly textured areas (such as plain walls and cabinets), and the placement and types of objects change dynamically, so the indoor service robot is required to perform generalized perception without acquiring 3D models of all objects in advance, which current technology finds difficult to achieve. To break through this obstacle, instead of relying on object CAD models, researchers have begun to assist pose estimation with implicit representations of the scene itself, the most representative of which is scene reconstruction. Scene reconstruction mainly includes neural radiance fields (NeRF) and 3D Gaussian splatting (3DGS). However, directly applying these reconstruction techniques to monocular pose estimation remains a challenge. The implicit volume representation of NeRF-based methods requires extensive ray marching, making real-time application difficult, and training requires dense multi-view images, which contradicts the single-image inference required by many tasks; such methods also rely heavily on an initial pose and are prone to local optima. Methods based on 3DGS, such as SplatLoc and 3DGS-ReLoc, rely on dense depth and multi-view images to complete scene reconstruction and initial pose acquisition, which significantly increases storage and data acquisition costs. In contrast, methods based on a single RGB image (e.g., 6DGS) directly exploit the differentiable rendering of 3DGS through render inversion, avoiding the use of depth or multi-frame images, but the Gaussian-ellipsoid ray sampling strategy of 6DGS can produce rotational ambiguity because it favors rays with minimal perpendicular distance from the optical center while ignoring angular deviation. In summary, single RGB image methods are susceptible to initial-pose dependence and rotational ambiguity, while methods that depend on depth or multi-view images incur high storage and data acquisition costs.
Disclosure of Invention

The invention aims to solve the problems that methods using a single RGB image are easily affected by initial-pose dependence and rotational ambiguity, while methods that depend on depth or multi-view images incur high storage and data acquisition costs, and further provides a robot pose estimation method based on 3D Gaussian splatting and a dual-attention ray scoring network. The technical scheme adopted by the invention comprises the following steps: S1, acquiring a multi-view RGB image sequence of an indoor environment with a robot, and processing the multi-view RGB image sequence with structure-from-motion to obtain the camera poses and a sparse point cloud of the indoor environment; S2, initializing each point in the sparse point cloud into a 3D Gaussian primitive based on the 3D Gaussian splatting method, projecting each 3D Gaussian primitive onto the two-dimensional image plane of the camera according to the camera pose for rendering, computing the loss between the rendered image and the RGB image of S1, and gradually updating the Gaussian attributes by back-propagation to obtain a 3D Gaussian scene model; S3, establishing a dual-attention ray scoring network, wherein the dual-attention ray scoring network comprises an enhanced MLP network, a DINOv2 backbone network, a first attention module and a second attention module; the enhanced MLP network is arranged in parallel with the DINOv2 backbone network, the first attention module and the second attention module are arranged in parallel, the enhanced MLP network is connected to both attention modules, and the DINOv2 backbone network is connected to both attention modules; S4, acquiring a monocular RGB image to be dete