CN-121973181-A - Robot pose determining method, apparatus, device, storage medium and program product

CN 121973181 A

Abstract

The application discloses a method, an apparatus, a device, a storage medium, and a program product for determining the pose of a robot, relating to the technical field of machine learning. The method comprises: acquiring a two-dimensional image, a three-dimensional point cloud, and a natural-language task instruction text for the robot's working scene; determining, according to the task instruction text, a region of interest in the two-dimensional image corresponding to the task to be executed, and predicting depth information of the region of interest; acquiring key feature points in the three-dimensional point cloud based on the region of interest and its depth information; mapping each key feature point to a two-dimensional feature space to obtain a three-dimensional position indication vector for each key feature point; and determining the target pose of the robot's end effector according to the three-dimensional position indication vectors. By predicting the pose of the end effector through a combination of implicit depth perception and explicit spatial mapping, the method accounts for both semantic understanding and spatial perception, improving the accuracy of pose prediction and thereby the operating performance, robustness, and generalization capability of the robot.

Inventors

  • Zhang Shanghang
  • Jia Yueru
  • Liu Jiaming
  • Chen Sixiang
  • Gu Chenyang
  • Luo Longzan

Assignees

  • Peking University (北京大学)

Dates

Publication Date
2026-05-05
Application Date
2025-12-22

Claims (10)

  1. A robot pose determination method, characterized by comprising the following steps: acquiring multi-modal data of a robot working scene, wherein the multi-modal data comprise a two-dimensional image, a three-dimensional point cloud, and a natural-language task instruction text, the task instruction text indicating a task to be executed by the robot; determining, according to the task instruction text, a region of interest in the two-dimensional image corresponding to the task to be executed, and predicting depth information of the region of interest; acquiring key feature points in the three-dimensional point cloud based on the depth information and the region of interest; mapping each key feature point to a two-dimensional feature space to obtain a three-dimensional position indication vector for each key feature point; and determining a target pose of the robot end effector according to the three-dimensional position indication vectors.
  2. The robot pose determination method according to claim 1, wherein the step of mapping each key feature point to a two-dimensional feature space to obtain a three-dimensional position indication vector for each key feature point comprises: constructing a virtual geometric body surrounding the key feature points, and setting up virtual projection planes with a plurality of different viewing angles on the surface of the virtual geometric body; and mapping each key feature point to the two-dimensional feature space through the virtual projection planes to obtain the three-dimensional position indication vector of each key feature point.
  3. The method of claim 2, wherein the step of mapping each key feature point to the two-dimensional feature space through the virtual projection planes comprises: establishing an index mapping relation between the three-dimensional space in which the three-dimensional point cloud is located and the two-dimensional planes corresponding to the virtual projection planes; calculating, based on the index mapping relation, projection point coordinates of the key feature points on each virtual projection plane; acquiring a two-dimensional position code corresponding to each projection point coordinate; computing a validity weight from the geometric relationship between the normal vector of a target feature point in the three-dimensional space and the viewing direction of each virtual projection plane, the validity weight characterizing projection validity; and weighting and aggregating, according to the validity weights, the two-dimensional position codes corresponding to the projection point coordinates of the target feature point on the virtual projection planes to obtain the three-dimensional position indication vector of the target feature point, the target feature point being any one of the key feature points (a sketch of this mechanism follows the claims).
  4. The robot pose determination method according to any one of claims 1 to 3, wherein the steps of determining the region of interest corresponding to the task to be executed in the two-dimensional image according to the task instruction text and predicting depth information of the region of interest comprise: extracting text features of the task instruction text and visual features of the two-dimensional image, respectively; calculating a similarity matrix between the visual features and the text features, and generating a task attention heatmap from the similarity matrix, the task attention heatmap representing the semantic relevance between each region of the two-dimensional image and the task instruction text; determining the region of interest corresponding to the task to be executed in the two-dimensional image according to the task attention heatmap; and inputting the image patches corresponding to the region of interest into a visual foundation model to obtain the depth information of the region of interest output by the visual foundation model; wherein the visual foundation model comprises a backbone network and a geometric reconstruction decoder, the depth information is obtained by the geometric reconstruction decoder performing depth prediction on the patches of the region of interest, the geometric reconstruction decoder is pre-trained on a hybrid loss function with the model parameters of the backbone network frozen, the hybrid loss function comprises a depth reconstruction loss and a feature distillation loss, the depth reconstruction loss represents the difference between the depth predicted by the geometric reconstruction decoder and the true depth, and the feature distillation loss constrains the consistency of the current output of the visual foundation model with its original output (a sketch of this step follows the claims).
  5. The method according to claim 4, wherein the step of determining the region of interest corresponding to the task to be executed according to the task attention heatmap comprises: determining a dynamic threshold according to the numerical distribution of the pixels in the task attention heatmap; generating a mask matrix based on the dynamic threshold using a non-uniform masking strategy; and performing a masking operation on the two-dimensional image according to the mask matrix to divide the two-dimensional image into a foreground region and a background region, the foreground region being the region of interest corresponding to the task to be executed.
  6. The method of claim 4, wherein the step of determining the target pose of the robot end effector from the three-dimensional position indication vectors comprises: inputting the key feature points into a pre-trained three-dimensional tokenizer, aggregating local geometric features of the key feature points through the three-dimensional tokenizer, and mapping the local geometric features into high-dimensional features through a multi-layer perceptron; concatenating and fusing the high-dimensional features with the three-dimensional position indication vectors to obtain a feature sequence carrying explicit spatial position information; acquiring a body state of the robot, the body state comprising an initial pose of the robot end effector; inputting the feature sequence and the body state into the visual foundation model, wherein the visual foundation model further comprises a low-rank adapter and a policy head, the low-rank adapter being fine-tuned with the model parameters of the backbone network frozen; and predicting, through the policy head, from the output features of the low-rank adapter, the target pose of the robot end effector output by the visual foundation model (a sketch of these pieces follows the claims).
  7. A robot pose determination apparatus, characterized by comprising: a data acquisition module for acquiring multi-modal data of a robot working scene, wherein the multi-modal data comprise a two-dimensional image, a three-dimensional point cloud, and a natural-language task instruction text, the task instruction text indicating a task to be executed by the robot; a depth perception module for determining, according to the task instruction text, a region of interest in the two-dimensional image corresponding to the task to be executed, and for predicting depth information of the region of interest; a downsampling module for acquiring key feature points in the three-dimensional point cloud based on the depth information and the region of interest; a three-dimensional mapping module for mapping each key feature point to a two-dimensional feature space to obtain a three-dimensional position indication vector for each key feature point; and a pose prediction module for determining a target pose of the robot end effector according to the three-dimensional position indication vectors.
  8. A robot control device, characterized in that the device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, the computer program being configured to implement the steps of the robot pose determination method according to any one of claims 1 to 6.
  9. A storage medium, characterized in that the storage medium is a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the robot pose determination method according to any one of claims 1 to 6.
  10. A computer program product, characterized in that the computer program product comprises a computer program which, when executed by a processor, implements the steps of the robot pose determination method according to any one of claims 1 to 6.
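
The projection-and-aggregation mechanism of claims 2 and 3 can be illustrated with a short sketch. The claims do not fix the shape of the virtual geometric body, the form of the two-dimensional position code, or the exact weighting formula; the sketch below assumes an axis-aligned cube with six face-aligned projection planes, a standard sinusoidal position code, and a clipped dot product between point normal and viewing direction as the validity weight. All names are illustrative, not taken from the patent.

```python
import numpy as np

def sinusoidal_encoding_2d(uv, dim=64):
    """Map normalized 2D coordinates in [0, 1]^2 to a sinusoidal position code."""
    freqs = 2.0 ** np.arange(dim // 4)           # geometric frequency ladder
    angles = uv[:, :, None] * freqs * np.pi      # (N, 2, dim//4)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(uv.shape[0], -1)          # (N, dim)

def position_indication_vectors(points, normals, dim=64):
    """Aggregate per-plane 2D position codes into one indication vector per point."""
    # Normalize points into the unit cube surrounding the key feature points
    # (the assumed "virtual geometric body" of claim 2).
    lo, hi = points.min(0), points.max(0)
    p = (points - lo) / (hi - lo + 1e-8)         # (N, 3) in [0, 1]^3
    # Six viewing directions, one per cube face (the virtual projection planes).
    views = np.array([[1,0,0],[-1,0,0],[0,1,0],[0,-1,0],[0,0,1],[0,0,-1]], float)
    # Index mapping (claim 3): which two axes span each projection plane.
    plane_axes = [(1,2),(1,2),(0,2),(0,2),(0,1),(0,1)]
    codes, weights = [], []
    for v, (a, b) in zip(views, plane_axes):
        uv = p[:, [a, b]]                        # orthographic projection coords
        codes.append(sinusoidal_encoding_2d(uv, dim))
        # Validity weight from the geometry between normal and view direction:
        # a surface seen head-on projects most reliably onto that plane.
        weights.append(np.clip(normals @ v, 0.0, None))
    codes = np.stack(codes)                      # (6, N, dim)
    w = np.stack(weights)                        # (6, N)
    w = w / (w.sum(0, keepdims=True) + 1e-8)     # normalize over the six planes
    return (w[:, :, None] * codes).sum(0)        # (N, dim) indication vectors

# Usage: 512 key feature points with unit normals.
pts = np.random.rand(512, 3)
nrm = np.random.randn(512, 3)
nrm /= np.linalg.norm(nrm, axis=1, keepdims=True)
print(position_indication_vectors(pts, nrm).shape)   # (512, 64)
```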
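
Claims 4 and 5 describe a task attention heatmap derived from text-visual similarity, followed by dynamic-threshold masking, and a hybrid pre-training loss for the depth decoder. A minimal sketch under stated assumptions: feature extraction (e.g., a CLIP-style encoder) is stubbed with random arrays, cosine similarity stands in for the similarity matrix, the dynamic threshold is taken as mean plus one standard deviation of the heatmap (the claims do not specify it, and the non-uniform masking strategy is simplified to a single global threshold), and the loss weighting is an assumption.

```python
import numpy as np

def task_attention_heatmap(patch_feats, text_feat, grid_hw):
    """Cosine similarity between each image-patch feature and the pooled
    instruction embedding, reshaped into a spatial heatmap (claim 4)."""
    pf = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    tf = text_feat / np.linalg.norm(text_feat)
    sim = pf @ tf
    sim = (sim - sim.min()) / (np.ptp(sim) + 1e-8)   # normalize to [0, 1]
    return sim.reshape(grid_hw)

def dynamic_threshold_mask(heat):
    """Dynamic threshold from the heatmap's value distribution (claim 5);
    mean + 1 std is an assumed choice, not taken from the patent."""
    return heat >= heat.mean() + heat.std()          # True = foreground / ROI

def hybrid_loss(pred_depth, gt_depth, cur_feats, orig_feats, lam=0.5):
    """Claim-4 hybrid loss: depth reconstruction plus feature distillation
    against the frozen backbone's original output; lam is an assumption."""
    depth_recon = np.mean((pred_depth - gt_depth) ** 2)
    distill = np.mean((cur_feats - orig_feats) ** 2)
    return depth_recon + lam * distill

# Usage with stubbed features: a 14 x 14 patch grid and 512-dim embeddings.
heat = task_attention_heatmap(np.random.randn(196, 512),
                              np.random.randn(512), (14, 14))
mask = dynamic_threshold_mask(heat)
print(mask.shape, int(mask.sum()))
```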
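
Claim 6 fuses geometric tokens with the position indication vectors and decodes a pose through a low-rank adapter and a policy head. The sketch below shows only the generic shape of these pieces: a standard LoRA-style linear layer (frozen weight plus trainable low-rank update) and a linear policy head emitting a 7-dimensional action (3 translation, 3 rotation, 1 gripper), which is an assumed action parameterization. The patent's tokenizer and backbone are not reproduced, and mean pooling is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

class LoRALinear:
    """Frozen dense weight W plus a trainable low-rank update B @ A.
    Only A and B would receive gradients during fine-tuning, matching
    claim 6's low-rank adapter with the backbone frozen."""
    def __init__(self, d_in, d_out, r=8, alpha=16.0):
        self.W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)  # frozen
        self.A = rng.standard_normal((r, d_in)) * 0.01               # trainable
        self.B = np.zeros((d_out, r))                                # trainable
        self.scale = alpha / r
    def __call__(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

# Fusion (claim 6): concatenate high-dimensional geometric features with the
# 3D position indication vectors to form a sequence with explicit position info.
n_tokens, d_geo, d_pos = 512, 256, 64
geo_feats = rng.standard_normal((n_tokens, d_geo))    # from 3D tokenizer + MLP
pos_vecs = rng.standard_normal((n_tokens, d_pos))     # position indication vectors
body_state = rng.standard_normal((1, d_geo + d_pos))  # initial end-effector pose etc.
seq = np.concatenate([geo_feats, pos_vecs], axis=1)   # (512, 320)
seq = np.concatenate([seq, body_state], axis=0)       # append body-state token

adapter = LoRALinear(d_geo + d_pos, d_geo + d_pos)
policy_head = rng.standard_normal((d_geo + d_pos, 7)) * 0.01

feats = adapter(seq)            # adapted features
pooled = feats.mean(axis=0)     # simple pooling; the patent does not specify
action = pooled @ policy_head   # assumed 7-DoF target pose for the end effector
print(action.shape)             # (7,)
```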

Description

Robot pose determining method, apparatus, device, storage medium and program product

Technical Field

The present application relates to the field of machine learning technologies, and in particular to a method, an apparatus, a device, a storage medium, and a program product for determining the pose of a robot.

Background

Intelligent robot operation based on visual perception is a core research direction in robotics, aiming to enable a robot to understand an unstructured environment through visual sensors and, on that basis, to predict the action pose of its end effector when performing tasks. At present, pose prediction for a robot end effector usually uses two-dimensional images or three-dimensional point cloud data in isolation, making it difficult to account for both semantic understanding and spatial perception; as a result, pose prediction performance is poor, and the operating performance, robustness, and generalization capability of the robot are insufficient.

Pre-training-based representation methods built on two-dimensional images directly use features extracted from one or more two-dimensional images, obtain strong semantic priors from a visual model pre-trained on large-scale internet data, and predict robot actions through imitation learning and similar approaches. However, a two-dimensional image is essentially a projection of the three-dimensional world and severely lacks depth and precise spatial geometric information, which makes it difficult for such models to accurately determine the relative positions and distances of objects when tasks require precise spatial reasoning (e.g., insertion or stacking) or the handling of complex occlusion. In addition, in scenes with cluttered backgrounds, the missing depth and spatial geometric information makes it hard for the model to separate the foreground object being manipulated from background interference, so operating robustness is insufficient.

End-to-end policy learning methods based on three-dimensional point clouds directly process geometric data such as point clouds or voxels, learning features and predicting pose actions through dedicated three-dimensional neural networks, and in theory have accurate spatial perception capability. However, such models suffer from a severe scarcity of high-quality three-dimensional manipulation data and lack a large-scale pre-training basis; the semantics they learn are often statistical associations of geometric forms rather than high-level, transferable abstract semantic concepts. They therefore lack task understanding and planning capability, generalize weakly, and struggle to adapt to unseen objects or environments.

The foregoing is provided merely to facilitate understanding of the technical solutions of the present application and is not an admission that it constitutes prior art.
Disclosure of Invention

The main purpose of the application is to provide a method, an apparatus, a device, a storage medium, and a program product for determining the pose of a robot, so as to solve the technical problem that using two-dimensional images or three-dimensional point clouds in isolation makes it difficult to account for both semantic understanding and spatial perception, leading to poor pose prediction performance and insufficient operating performance, robustness, and generalization capability of the robot.

To achieve the above object, the present application provides a method for determining the pose of a robot, the method comprising: acquiring multi-modal data of a robot working scene, wherein the multi-modal data comprise a two-dimensional image, a three-dimensional point cloud, and a natural-language task instruction text, the task instruction text indicating a task to be executed by the robot; determining, according to the task instruction text, a region of interest in the two-dimensional image corresponding to the task to be executed, and predicting depth information of the region of interest; acquiring key feature points in the three-dimensional point cloud based on the depth information and the region of interest; mapping each key feature point to a two-dimensional feature space to obtain a three-dimensional position indication vector for each key feature point; and determining a target pose of the robot end effector according to the three-dimensional position indication vectors.

In an embodiment, the step of mapping each key feature point to a two-dimensional feature space to obtain a three-dimensional position indication vector for each key feature point includes: constructing a virtual geometric body surrounding each key feature point
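
The third step above selects "key feature points" from the point cloud via a downsampling module, based on the region of interest and its depth information. The patent does not name the sampler; farthest point sampling restricted to points that fall inside the projected ROI is one common choice, sketched below in NumPy as an assumption, with the ROI mask stubbed.

```python
import numpy as np

def farthest_point_sampling(points, k, seed=0):
    """Iteratively pick the point farthest from the set chosen so far."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(points)))]
    dists = np.linalg.norm(points - points[chosen[0]], axis=1)
    for _ in range(k - 1):
        idx = int(dists.argmax())
        chosen.append(idx)
        dists = np.minimum(dists, np.linalg.norm(points - points[idx], axis=1))
    return points[chosen]

def key_feature_points(cloud, roi_mask, k=512):
    """Keep only points whose image projection lies in the ROI, then downsample.
    roi_mask is assumed to already mark which cloud points project into the
    region of interest determined from the task instruction text."""
    candidates = cloud[roi_mask]
    return farthest_point_sampling(candidates, min(k, len(candidates)))

# Usage with a synthetic cloud and a stand-in ROI mask.
cloud = np.random.rand(10000, 3)
mask = np.random.rand(10000) > 0.7
print(key_feature_points(cloud, mask, 512).shape)   # (512, 3)
```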