CN-121989228-A - Robot control method, device and equipment based on video driving, and robot

CN121989228A

Abstract

The invention provides a video-driven robot control method, device, and equipment, and a robot, relating to the technical field of robot motion control. The method comprises: acquiring video data and extracting a visual semantic latent variable through a pre-trained vision-language model; inputting the visual semantic latent variable into a motion latent variable reconstruction module for processing and outputting a motion latent variable; concatenating the motion latent variable with the robot's current and historical proprioceptive states to form a policy observation vector; inputting the policy observation vector into a pre-trained diffusion student policy network and generating joint action commands for the robot through a denoising diffusion process; and sending the joint action commands to the robot's joint drivers to drive the robot to execute actions matching the video semantics. Embodiments of the invention realize end-to-end direct mapping from video semantics to robot joint actions, effectively avoiding the error accumulation and latency of the traditional multi-stage pipeline.

Inventors

  • Chi Cheng
  • Li Zhe
  • Zhu Boan
  • Wei Yangyang
  • Wang Pengwei
  • Wang Zhongyuan
  • Zhang Shanghang

Assignees

  • Beijing Academy of Artificial Intelligence (北京智源人工智能研究院)

Dates

Publication Date
2026-05-08
Application Date
2025-12-29

Claims (10)

  1. A video-driven robot control method, comprising: acquiring video data, and extracting a visual semantic latent variable of the video data through a pre-trained vision-language model; inputting the visual semantic latent variable into a pre-trained motion latent variable reconstruction module for processing, and outputting a kinematics-related motion latent variable; concatenating the motion latent variable, the robot's current proprioceptive state, and the robot's historical proprioceptive state to form a policy observation vector, inputting the policy observation vector into a pre-trained diffusion student policy network, and generating a joint action command for the robot through a denoising diffusion process; and sending the joint action command to a joint driver of the robot so as to drive the robot to execute an action matching the video semantics.
  2. The video-driven robot control method according to claim 1, wherein extracting the visual semantic latent variable of the video data through the pre-trained vision-language model comprises: determining a corresponding video text prompt according to the video type of the video data, wherein each type of video data corresponds to its own video text prompt; and inputting the video data and the corresponding text prompt into a vision-language model based on a Transformer architecture, extracting the feature vector output by the last layer of the vision-language model, and taking that feature vector as the visual semantic latent variable.
  3. The video-driven robot control method according to claim 2, wherein the motion latent variable reconstruction module is a flow-matching-based diffusion transformer network, the diffusion transformer network comprises a backbone formed by a plurality of sequentially connected diffusion transformer blocks, and each diffusion transformer block comprises: a cross-attention sub-layer for aligning the visual semantic latent variable, as a conditioning signal, with intermediate features; and a self-attention sub-layer for modeling the time-step dependencies within the motion latent variable.
  4. The method of claim 3, wherein the training of the motion latent variable reconstruction module comprises: acquiring a paired dataset formed from video data and corresponding reference human motion sequence data; encoding the reference human motion sequence data into a reference motion latent variable through a pre-trained variational autoencoder; adding noise to the reference motion latent variable at a flow-matching time step to construct a noisy motion latent variable; inputting the noisy motion latent variable and the time step into the diffusion transformer network, conditioned on the visual semantic latent variable corresponding to the video data; and training the diffusion transformer network to predict a velocity vector field, minimizing a velocity prediction loss between the predicted velocity vector field and the true velocity vector field.
  5. The video-driven robot control method of any one of claims 1 to 4, wherein the diffusion student policy network is distilled from a mixture-of-experts teacher policy network; the mixture-of-experts teacher policy network is a policy model based on a gating network and a plurality of expert networks, and its training and distillation comprise: in a physical simulation environment, taking privileged information, the robot's proprioceptive state, and a reference motion target as joint inputs, and training the mixture-of-experts teacher policy network with a proximal policy optimization algorithm; and taking the motion latent variable generated from video data and the robot's proprioceptive state as conditional inputs, querying the mixture-of-experts teacher policy network with observation state data collected in the physical simulation environment to obtain the optimal action labels corresponding to the observation state data, so as to train the diffusion student policy network in a supervised manner with the optimal action labels.
  6. The video-driven robot control method according to claim 5, wherein the gating network dynamically outputs normalized weights for all expert networks based on the input proprioceptive state of the robot, and the final output action of the mixture-of-experts teacher policy network is computed as $a = q_{\text{ref}} + \sum_{i=1}^{K} w_i \, a_i$, where $a$ is the final output action of the mixture-of-experts teacher policy network, $q_{\text{ref}}$ is the reference joint position, $w_i$ is the normalized weight the gating network assigns to the i-th expert network, $a_i$ is the output action of the i-th expert network, and $K$ is the number of expert networks (a code sketch of this gating computation follows the claims).
  7. The video-driven robot control method according to claim 5, wherein the training of the diffusion student policy network comprises: constructing a denoising network with a multi-layer perceptron backbone; in a physical simulation environment, letting the current version of the diffusion student policy network interact with the environment and collecting observation state data containing the robot's proprioceptive state; inputting the collected observation state data into the trained mixture-of-experts teacher policy network to obtain the corresponding optimal action labels; and inputting the optimal action labels, the diffusion time-step information, and the policy observation vector into the denoising network, adding noise to the optimal action labels in a forward diffusion process, and training the denoising network to denoise them, the optimization objective being to minimize the mean squared error between the network's predicted action and the optimal action label (a training-step sketch also follows the claims).
  8. A video-driven robot control device, comprising: an extraction module for acquiring video data and extracting a visual semantic latent variable of the video data through a pre-trained vision-language model; a reconstruction module for inputting the visual semantic latent variable into a pre-trained motion latent variable reconstruction module for processing and outputting a kinematics-related motion latent variable; a generation module for concatenating the motion latent variable, the robot's current proprioceptive state, and the robot's historical proprioceptive state to form a policy observation vector, inputting the policy observation vector into a pre-trained diffusion student policy network, and generating a joint action command for the robot through a denoising diffusion process; and a control module for sending the joint action command to a joint driver of the robot so as to drive the robot to execute an action matching the video semantics.
  9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the video-driven robot control method according to any one of claims 1 to 7.
  10. A robot, comprising: one or more processors; a memory storing a computer program; and a robot body including a plurality of joint drivers controllable by the one or more processors; wherein the computer program, when executed by the one or more processors, causes the robot to implement the video-driven robot control method according to any one of claims 1 to 7.
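
The gating formula of claim 6 can be made concrete with a short sketch. Below is a minimal, hypothetical PyTorch rendering of the mixture-of-experts teacher head; the module layout, hidden sizes, and the residual-over-reference form a = q_ref + Σ w_i a_i are inferred from the claim text, not taken from the patent's actual implementation.

```python
import torch
import torch.nn as nn

class MixtureOfExpertsTeacher(nn.Module):
    """Hypothetical teacher policy head for the claim-6 gating formula."""

    def __init__(self, obs_dim: int, act_dim: int, num_experts: int = 4):
        super().__init__()
        # Each expert maps the proprioceptive state to a joint action.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim, 256), nn.ELU(),
                          nn.Linear(256, act_dim))
            for _ in range(num_experts)
        )
        # Gating network: normalized weights over the K experts.
        self.gate = nn.Sequential(nn.Linear(obs_dim, num_experts),
                                  nn.Softmax(dim=-1))

    def forward(self, body_state: torch.Tensor,
                q_ref: torch.Tensor) -> torch.Tensor:
        # w: (batch, K) normalized expert weights from the gating network.
        w = self.gate(body_state)
        # a_i: (batch, K, act_dim) stacked expert actions.
        a_i = torch.stack([e(body_state) for e in self.experts], dim=1)
        # a = q_ref + sum_i w_i * a_i  (claim-6 formula, assumed residual form).
        return q_ref + (w.unsqueeze(-1) * a_i).sum(dim=1)
```

Treating the expert outputs as residuals over the reference joint position is a common design in humanoid reinforcement learning and is one plausible reading of the claim; the patent drawings may specify a different composition.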
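The distillation step of claim 7 can likewise be sketched as follows, assuming a denoiser that directly predicts the clean action (which matches the claim's MSE objective between the predicted action and the optimal action label). The scheduler constants, timestep embedding, and helper names are illustrative assumptions only.

```python
import torch
import torch.nn as nn

T = 100  # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class MLPDenoiser(nn.Module):
    """MLP-backbone denoising network, conditioned on the policy
    observation vector and the diffusion timestep (claim 7)."""

    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 1, 512), nn.Mish(),
            nn.Linear(512, 512), nn.Mish(),
            nn.Linear(512, act_dim))

    def forward(self, noisy_action, obs, t):
        t_emb = t.float().unsqueeze(-1) / T  # scalar timestep embedding
        return self.net(torch.cat([noisy_action, obs, t_emb], dim=-1))

def distill_step(denoiser, optimizer, obs, teacher_action):
    # Forward diffusion: corrupt the teacher's optimal action label.
    t = torch.randint(0, T, (obs.shape[0],))
    ab = alphas_bar[t].unsqueeze(-1)
    noise = torch.randn_like(teacher_action)
    noisy = ab.sqrt() * teacher_action + (1.0 - ab).sqrt() * noise
    # Train the denoiser to recover the clean action (MSE objective).
    loss = nn.functional.mse_loss(denoiser(noisy, obs, t), teacher_action)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```

In an actual training loop, obs would be the policy observation vector of claim 1 and teacher_action the optimal action label queried from the trained mixture-of-experts teacher.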

Description

Robot control method, device and equipment based on video driving, and robot

Technical Field

The present invention relates to the field of robot motion control technologies, and in particular to a video-driven robot control method, apparatus, device, and robot.

Background

Currently, in the field of video-guided motion control for humanoid robots, the mainstream technical route follows a multi-stage serial processing pipeline: first, a video analysis model estimates an explicit human pose or motion sequence from the input video; next, motion retargeting converts the estimated motion data, which fits the human morphology, into a reference trajectory conforming to the morphological and kinematic constraints of the target robot; finally, a low-level physical controller tracks and executes the reference trajectory. However, this "explicit parsing, retargeting, tracking" paradigm has inherent drawbacks. Because the motion analysis, retargeting, and control stages are designed and optimized in isolation, errors produced by upstream stages propagate and accumulate step by step, noticeably degrading the semantic fidelity and physical plausibility of the final robot motion. Meanwhile, the multi-stage serial processing introduces high computation and deployment latency, making it difficult to meet real-time human-robot interaction requirements. In addition, the strong coupling between modules limits the system's ability to generalize to unseen video scenes, motion patterns, and different robot platforms.

Disclosure of Invention

The invention provides a video-driven robot control method, device, and equipment, and a robot, which realize end-to-end direct mapping from video semantics to robot joint actions and effectively avoid the error accumulation and latency of the traditional multi-stage pipeline. In a first aspect, the present invention provides a video-driven robot control method, comprising the steps of: acquiring video data, and extracting a visual semantic latent variable of the video data through a pre-trained vision-language model; inputting the visual semantic latent variable into a pre-trained motion latent variable reconstruction module for processing, and outputting a kinematics-related motion latent variable; concatenating the motion latent variable, the robot's current proprioceptive state, and the robot's historical proprioceptive state to form a policy observation vector, inputting the policy observation vector into a pre-trained diffusion student policy network, and generating a joint action command for the robot through a denoising diffusion process; and sending the joint action command to a joint driver of the robot so as to drive the robot to execute an action matching the video semantics (a minimal inference sketch of this pipeline is given below).
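
To make the data flow of the first aspect concrete, the following is an illustrative end-to-end inference loop; the module classes and the interfaces extract_last_layer_features and sample are hypothetical stand-ins for the pre-trained models described above, not the patent's code.

```python
import torch

@torch.no_grad()
def control_step(video_frames, body_state, body_history,
                 vlm, motion_reconstructor, student_policy):
    # 1. Visual semantic latent from the pre-trained vision-language model
    #    (last-layer feature vector, per claim 2).
    z_sem = vlm.extract_last_layer_features(video_frames)
    # 2. Kinematics-related motion latent from the reconstruction module.
    z_motion = motion_reconstructor(z_sem)
    # 3. Policy observation vector: motion latent concatenated with the
    #    current and historical proprioceptive states.
    obs = torch.cat([z_motion, body_state, body_history], dim=-1)
    # 4. Joint action command generated via the denoising diffusion process.
    action = student_policy.sample(obs)
    return action  # sent on to the robot's joint drivers
```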
Preferably, according to the video-driven robot control method provided by the invention, extracting the visual semantic latent variable of the video data through the pre-trained vision-language model comprises: determining a corresponding video text prompt according to the video type of the video data, wherein each type of video data corresponds to its own video text prompt; and inputting the video data and the corresponding text prompt into a vision-language model based on a Transformer architecture, extracting the feature vector output by the last layer of the vision-language model, and taking that feature vector as the visual semantic latent variable. Preferably, according to the video-driven robot control method provided by the invention, the motion latent variable reconstruction module is a flow-matching-based diffusion transformer network; the diffusion transformer network comprises a backbone formed by a plurality of sequentially connected diffusion transformer blocks, and each diffusion transformer block comprises: a cross-attention sub-layer for aligning the visual semantic latent variable, as a conditioning signal, with intermediate features; and a self-attention sub-layer for modeling the time-step dependencies within the motion latent variable. Preferably, according to the video-driven robot control method provided by the invention, the training of the motion latent variable reconstruction module comprises: acquiring a paired dataset formed from video data and corresponding reference human motion sequence data; encoding the reference human motion sequence data into a reference motion latent variable through a pre-trained variational autoencoder; adding noise to the reference motion latent variable at a flow-matching time step to construct a noisy motion latent variable; and inputting the noisy motion latent variable and the time step into the diffusion transformer network, conditioned on the visual semantic latent variable corresponding to the video data, training it to predict a velocity vector field (a minimal training-step sketch follows).
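
A minimal flow-matching training step for the motion latent variable reconstruction module might look as follows, assuming linear interpolation paths between noise and the reference motion latent; the DiT and VAE interfaces are placeholders for exposition, not the patent's implementation.

```python
import torch

def flow_matching_step(dit, optimizer, vae, human_motion, z_sem):
    # Encode the reference human motion into a reference motion latent
    # with the (frozen) pre-trained variational autoencoder.
    with torch.no_grad():
        z_ref = vae.encode(human_motion)
    # Sample a flow-matching time step and construct the noisy latent
    # on a linear path between Gaussian noise and the reference latent.
    t = torch.rand(z_ref.shape[0], device=z_ref.device)
    noise = torch.randn_like(z_ref)
    t_ = t.view(-1, *([1] * (z_ref.dim() - 1)))
    z_t = (1.0 - t_) * noise + t_ * z_ref   # noisy motion latent
    v_true = z_ref - noise                  # true velocity field
    # The DiT predicts the velocity, conditioned on the visual semantic
    # latent via its cross-attention sub-layers.
    v_pred = dit(z_t, t, cond=z_sem)
    loss = torch.nn.functional.mse_loss(v_pred, v_true)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```

At inference time, the same network would be integrated from pure noise toward a motion latent by following the predicted velocity field, with the visual semantic latent fixed as the conditioning signal.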