CN-122008228-A - Robot control method and robot
Abstract
The application provides a robot control method and a robot. The method realizes pose estimation, action generation, and robot control as an end-to-end closed loop through a single conditional diffusion model. It combines a robot working image with a noisy pose vector and recovers accurate pose information through denoising; generates sparse video frames from the working image and a task instruction and splits them into consecutive frame pairs; and denoises a noisy action sequence with the conditional diffusion model, conditioned on each frame pair, to obtain a dense target action sequence, which the robot executes with the pose information as a reference. The application simplifies the system architecture, requiring no additional auxiliary modalities or separate modules, and generates continuous, smooth, physically executable action sequences. It substantially improves the accuracy and efficiency of robot control, ensures the stability of task execution, and is applicable to scenarios such as human-robot interaction and object manipulation.
Inventors
- ZHANG HAOZHUO
- LIU PEIRAN
- ZHANG ZHANG
- ZHANG QIANG
- SUN PIHAI
- SUN JINGKAI
- CUI WEI
- ZHAO WEN
- HAN GANG
- GUO YIJIE
- SU ZERAN
- SHI SHUAI
- MA JIAHAO
- WANG RENPENG
- MENG XIANG
- YONG ZHE
- WANG ZHEN
- LI YUANZHUO
Assignees
- Beijing Humanoid Robot Innovation Center Co., Ltd. (北京人形机器人创新中心有限公司)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-03-20
Claims (10)
- 1. A robot control method, comprising: acquiring pose information of a robot from a robot working image and an acquired robot pose vector, wherein the robot working image captures a working part of the robot and the state of its environment; generating sparse video frames from the robot working image and a task instruction, and splitting the sparse video frames into a plurality of consecutive video frame pairs; inputting each video frame pair into a pre-trained conditional diffusion model, and denoising a noisy action sequence conditioned on each video frame pair with the conditional diffusion model to obtain a target action sequence of the robot; and controlling the motion of the robot according to the target action sequence and the pose information.
- 2. The method of claim 1, wherein acquiring the pose information of the robot from the robot working image and the acquired robot pose vector comprises: denoising the pose vector of the robot according to the robot working image, a first diffusion time step, and the pre-trained conditional diffusion model to obtain the pose information of the robot.
- 3. The method of claim 2, wherein denoising the pose vector of the robot according to the robot working image, the first diffusion time step, and the pre-trained conditional diffusion model to obtain the pose information of the robot comprises: encoding the robot working image with the conditional diffusion model to obtain image features; encoding the first diffusion time step with the conditional diffusion model to obtain a first time-step feature; and denoising the pose vector with the conditional diffusion model according to the image features and the first time-step feature to obtain the pose information of the robot.
- 4. The method of claim 3, wherein denoising the pose vector with the conditional diffusion model according to the image features and the first time-step feature to obtain the pose information of the robot comprises: concatenating the image features and the first time-step feature to obtain a first condition feature; performing feature-wise linear modulation on the first condition feature to obtain a first modulation feature, wherein the first modulation feature comprises a scale modulation feature vector and a bias modulation feature vector; encoding the pose vector to obtain a pose feature; linearly modulating the pose feature according to the first modulation feature to obtain a processed feature; and combining the processed feature with the pose feature to obtain the pose information.
- 5. The method of claim 1, wherein splitting the sparse video frames into a plurality of consecutive video frame pairs comprises: traversing the sparse video frames with a sliding window starting from the first video frame, combining the video frames inside the sliding window into a video frame pair; and moving the sliding window by a preset step size in a preset direction to obtain the plurality of video frame pairs.
- 6. The method of claim 1, wherein denoising the noisy action sequence conditioned on each video frame pair with the conditional diffusion model to obtain the target action sequence of the robot comprises: encoding the video frames in the video frame pair with the conditional diffusion model to obtain video frame features; randomly sampling a second diffusion time step, and encoding the second diffusion time step with the conditional diffusion model to obtain a second time-step feature; and denoising the noisy action sequence with the conditional diffusion model according to the video frame features and the second time-step feature to obtain the target action sequence.
- 7. The method of claim 6, wherein denoising the noisy action sequence with the conditional diffusion model according to the video frame features and the second time-step feature to obtain the target action sequence comprises: concatenating the video frame features and the second time-step feature to obtain a second condition feature; performing feature-wise linear modulation on the second condition feature to obtain a second modulation feature, wherein the second modulation feature comprises a scale modulation feature vector and a bias modulation feature vector; encoding the noisy action sequence to obtain action features; linearly modulating the action features according to the second modulation feature to obtain modulated features; combining the modulated features with the action features to obtain an initial action sequence; and performing de-duplication on the initial action sequence to obtain the target action sequence.
- 8. The method of claim 7, wherein performing de-duplication on the initial action sequence to obtain the target action sequence comprises: determining at least one overlapping action group in the initial action sequence; and traversing each overlapping action group, computing the average of the actions in each group, and using that average as the overlapping action for the group, thereby obtaining the target action sequence.
- 9. The method of claim 1, wherein controlling the motion of the robot according to the target action sequence and the pose information comprises: controlling the robot, with the pose information as a reference, to execute each action in the target action sequence in order, so as to reach the target pose of the target action sequence.
- 10. A robot configured to perform the robot control method according to any one of claims 1 to 9.
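The sliding-window splitting of claim 5 and the overlap-averaging de-duplication of claim 8 can be sketched as follows. This is a minimal illustration, not the patented implementation: the window size, step size, and the assumption that adjacent per-pair action chunks overlap by a fixed number of actions are hypothetical choices, since the patent leaves them as preset parameters.

```python
def split_into_frame_pairs(frames, window=2, step=1):
    """Traverse the sparse frames with a sliding window from the first
    frame, emitting each window's contents as one frame pair (claim 5)."""
    pairs = []
    for start in range(0, len(frames) - window + 1, step):
        pairs.append(tuple(frames[start:start + window]))
    return pairs

def merge_overlapping_actions(chunks, overlap):
    """Concatenate the per-pair action chunks into one sequence; actions
    predicted twice by adjacent chunks (an "overlapping action group")
    are replaced by their average (claim 8)."""
    if not chunks:
        return []
    merged = list(chunks[0])
    for chunk in chunks[1:]:
        for i in range(overlap):
            # average the trailing actions of the sequence so far with
            # the leading actions of the next chunk
            merged[-overlap + i] = (merged[-overlap + i] + chunk[i]) / 2.0
        merged.extend(chunk[overlap:])
    return merged
```

With `window=2, step=1`, four sparse frames yield three consecutive frame pairs, and scalar stand-ins for actions show how each overlapping pair of predictions collapses to its mean.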
Description
Robot control method and robot
Technical Field
The application relates to the technical field of robot control, and in particular to a robot control method and a robot.
Background
Robot pose estimation and inverse dynamics modeling are core components of robot control and are widely applied in scenarios such as human-robot interaction, multi-robot cooperation, and object manipulation. The former must accurately extract the robot's 3D key points or joint angles from visual observation; the latter must convert task-related videos into continuous, executable action sequences. With the successful application of diffusion models in robotics, a unified framework is needed to meet both the high-precision requirement of single-image pose estimation and the high-efficiency requirement of sparse video-to-action mapping.
Existing robot pose estimation techniques fall into two categories: the first relies on markers in images or real-time joint feedback, combined with camera intrinsics, to compute the pose; the second predicts depth maps or 2D key points and converts them into 3D poses, enabling estimation without joint information. For inverse dynamics modeling, most existing methods are stand-alone modules that output actions through iterative inference over video frames generated by a world model, with some adopting diffusion policies to optimize action generation. However, the multi-stage pipeline of pose estimation limits both accuracy and speed and relies on auxiliary modalities or prior information, while most existing inverse dynamics models operate in a mixed offline-online mode, producing sparse, discontinuous action sequences with high latency. The two tasks lack a unified framework, system complexity is high, and the stability requirements of real-time control and long-horizon task execution are difficult to meet.
Disclosure of Invention
The application aims to overcome the above deficiencies of the prior art by providing a robot control method and a robot, so as to solve the problems that existing pose estimation and inverse dynamics modeling tasks lack a unified framework, that robot system complexity is high, and that the stability requirements of real-time control and long-horizon task execution are difficult to meet.
To this end, the application adopts the following technical scheme. In a first aspect, the application provides a robot control method, the method comprising: acquiring pose information of a robot from a robot working image and an acquired robot pose vector, wherein the robot working image captures a working part of the robot and the state of its environment; generating sparse video frames from the robot working image and a task instruction, and splitting the sparse video frames into a plurality of consecutive video frame pairs; inputting each video frame pair into a pre-trained conditional diffusion model, and denoising a noisy action sequence conditioned on each video frame pair with the conditional diffusion model to obtain a target action sequence of the robot; and controlling the motion of the robot according to the target action sequence and the pose information.
Optionally, acquiring the pose information of the robot from the robot working image and the acquired robot pose vector comprises: denoising the pose vector of the robot according to the robot working image, a first diffusion time step, and the pre-trained conditional diffusion model to obtain the pose information of the robot.
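The conditioning path described above (image features concatenated with a time-step embedding, then feature-wise linear modulation of the encoded pose vector) can be sketched in NumPy. This is a hypothetical illustration only: the dimensions `D_POSE`/`D_COND`, the sinusoidal time-step embedding, the random placeholder weights standing in for the learned encoders, and the residual combination are all assumptions, not details taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
D_POSE, D_COND = 8, 16  # assumed dimensions, not from the patent

# random placeholder weights; in the real model these are learned networks
W_pose = rng.standard_normal((D_POSE, D_POSE)) * 0.1
W_film = rng.standard_normal((2 * D_POSE, 2 * D_COND)) * 0.1

def timestep_embedding(t, dim=D_COND):
    """Sinusoidal embedding of the diffusion time step (an assumption;
    the patent only says the time step is encoded)."""
    freqs = np.exp(-np.linspace(0.0, 4.0, dim // 2))
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)])

def denoise_pose_step(noisy_pose, image_feat, t):
    # 1. concatenate image features and time-step feature -> condition feature
    cond = np.concatenate([image_feat, timestep_embedding(t)])
    # 2. project the condition into scale and bias modulation vectors (FiLM)
    scale, bias = np.split(W_film @ cond, 2)
    # 3. encode the noisy pose vector into a pose feature
    pose_feat = W_pose @ noisy_pose
    # 4. linearly modulate the pose feature with scale and bias
    modulated = (1.0 + scale) * pose_feat + bias
    # 5. combine modulated and original pose features (residual combination)
    return pose_feat + modulated
```

One call performs a single conditional denoising step; the full sampler would iterate this over the diffusion time steps, which is omitted here for brevity.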
Optionally, denoising the pose vector of the robot according to the robot working image, the first diffusion time step, and the pre-trained conditional diffusion model to obtain the pose information of the robot comprises: encoding the robot working image with the conditional diffusion model to obtain image features; encoding the first diffusion time step with the conditional diffusion model to obtain a first time-step feature; and denoising the pose vector with the conditional diffusion model according to the image features and the first time-step feature to obtain the pose information of the robot. Optionally, denoising the pose vector with the conditional diffusion model according to the image features and the first time-step feature to obtain the pose information of the robot comprises: concatenating the image features and the first time-step feature to obtain a first condition feature; performing feature-wise linear modulation on the first condition feature to obtain a first modulation feature, wherein the first modulation feature comprises a scale modulation feature vector and a bias