
CN-121989233-A - Robot control method and device, electronic equipment, storage medium and product

CN 121989233 A

Abstract

The invention discloses a robot control method and device, an electronic device, a storage medium, and a program product, in the technical field of artificial intelligence. The method comprises: obtaining visual data of the operation scene of a target robot, text instruction data of a target task, and body state features of the target robot; respectively determining visual features of the visual data and text features of the text instruction data; determining multi-modal features according to the visual features, the text features, the body state features, and a preset diffusion step; sampling initial motion noise from a Gaussian distribution and determining target multi-modal features from the initial motion noise and the multi-modal features; determining, based on a preset diffusion model, the robot action sequence corresponding to the target multi-modal features, wherein the preset diffusion model is constructed from a noise prediction network and a denoising diffusion implicit model, the noise prediction network being formed by a diffusion transformer; and controlling the target robot to execute the target task based on the robot action sequence, so that the action sequence is determined accurately.
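The "preset diffusion step" fed into the multi-modal features can be made concrete with a small sketch. The following is an illustrative assumption, not the patent's actual implementation: it maps a scalar diffusion step to a sinusoidal code (of the kind used in transformers) and embeds it into a feature vector via an untrained, randomly initialized linear projection; all function names and dimensions are hypothetical.

```python
import numpy as np

def diffusion_step_encoding(t, dim=16):
    """Map a scalar diffusion step t to a sinusoidal embedding
    (sin half followed by cos half), transformer-style."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def embed_step(multimodal_feat, t, dim=16):
    """Hypothetical fusion: project the step code to the feature
    dimension with a (random, untrained) linear layer and add it."""
    rng = np.random.default_rng(0)
    W = rng.standard_normal((dim, multimodal_feat.shape[-1]))
    return multimodal_feat + diffusion_step_encoding(t, dim) @ W
```

In a real system the projection would be the trained fully connected network of the patent; the random matrix here only demonstrates the shapes involved.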

Inventors

  • Request for anonymity
  • Request for anonymity
  • Request for anonymity
  • Request for anonymity

Assignees

  • 星海图(北京)人工智能科技股份有限公司

Dates

Publication Date
2026-05-08
Application Date
2026-01-16

Claims (10)

  1. A robot control method, comprising: acquiring visual data of an operation scene of a target robot, text instruction data of a target task, and body state features of the target robot, and respectively determining visual features of the visual data and text features of the text instruction data; determining multi-modal features according to the visual features, the text features, the body state features, and a preset diffusion step; sampling initial motion noise from a Gaussian distribution, determining target multi-modal features according to the initial motion noise and the multi-modal features, and determining a robot action sequence corresponding to the target multi-modal features based on a preset diffusion model, wherein the preset diffusion model is constructed from a noise prediction network and a denoising diffusion implicit model, and the noise prediction network is formed by a diffusion transformer; and controlling the target robot to execute the target task based on the robot action sequence.
  2. The method according to claim 1, wherein the acquiring of visual data of the operation scene of the target robot, text instruction data of the target task, and body state features of the target robot, and the respective determining of visual features of the visual data and text features of the text instruction data, comprises: reading continuous frame images of the operation scene acquired by a preset vision sensor as the visual data, inputting the visual data into a pre-trained vision encoder, and extracting the classification-token output vector of the vision encoder as the visual features; identifying the text instruction data of the target task, inputting the text instruction data into a pre-trained text encoder, and extracting the feature vector output by the text encoder as the text features of the text instruction data, wherein the visual features and the text features have a target feature dimension; and acquiring body sensor data of the target robot, mapping the body sensor data to the target feature dimension through a linear projection layer to obtain numerical features, and taking the mapped numerical features as the body state features.
  3. The method of claim 1, wherein the determining of multi-modal features according to the visual features, the text features, the body state features, and a preset diffusion step comprises: expanding the visual features, the text features, and the body state features according to preset time steps and aligning them in time sequence to serve as initial multi-modal features; and mapping the preset diffusion step into a high-dimensional vector through sinusoidal position encoding and a fully connected network to serve as a diffusion step code, and embedding the diffusion step code into the initial multi-modal features to obtain the multi-modal features.
  4. The method of claim 1, wherein the sampling of initial motion noise from a Gaussian distribution, the determining of target multi-modal features according to the initial motion noise and the multi-modal features, and the determining of a robot action sequence corresponding to the target multi-modal features based on a preset diffusion model comprises: sampling from a Gaussian distribution a random vector consistent with the multi-modal feature dimension as the initial motion noise, and mapping the initial motion noise to the same feature dimension as the multi-modal features to obtain projected motion noise; fusing the projected motion noise with the multi-modal features to obtain the target multi-modal features, and inputting the target multi-modal features into the preset diffusion model; determining a noise estimate corresponding to the target multi-modal features based on the noise prediction network in the preset diffusion model; and performing iterative denoising according to the noise estimate based on the denoising diffusion implicit model in the preset diffusion model to obtain the robot action sequence.
  5. The method of claim 1, wherein the controlling of the target robot to execute the target task based on the robot action sequence comprises: determining a sequence segment to be executed in the robot action sequence, and controlling the target robot to execute the target task through the sequence segment to be executed.
  6. The method of claim 1, wherein the training process of the noise prediction network comprises: acquiring a training data set, wherein each historical data sample in the training data set comprises continuous visual images, a historical task text instruction, a historical robot body state, and a corresponding real action sequence; determining visual features of the continuous visual images as historical visual features, and determining text features of the historical task text instruction as historical text features; randomly sampling a diffusion step from a preset total number of diffusion steps, and adding target Gaussian noise to the real action sequence according to the noise intensity corresponding to the diffusion step to generate a noisy action sequence; encoding the diffusion step through sinusoidal position encoding combined with a two-layer fully connected network to obtain diffusion step encoding features, and fusing the historical visual features, the historical text features, the historical robot body state, and the diffusion step encoding features to obtain a training condition input; and inputting the training condition input and the noisy action sequence into the noise prediction network to generate predicted noise, taking the mean square error between the predicted noise and the target Gaussian noise as the loss function, and updating the model parameters through a gradient descent algorithm until the loss converges to a preset stable interval, completing the training.
  7. A robot control device, comprising: a feature acquisition module, used for acquiring visual data of an operation scene of a target robot, text instruction data of a target task, and body state features of the target robot, and respectively determining visual features of the visual data and text features of the text instruction data; a feature fusion module, used for determining multi-modal features according to the visual features, the text features, the body state features, and a preset diffusion step; an action determination module, used for sampling initial motion noise from a Gaussian distribution, determining target multi-modal features according to the initial motion noise and the multi-modal features, and determining a robot action sequence corresponding to the target multi-modal features based on a preset diffusion model, wherein the preset diffusion model is constructed from a noise prediction network and a denoising diffusion implicit model, and the noise prediction network is formed by a diffusion transformer; and a robot control module, used for controlling the target robot to execute the target task based on the robot action sequence.
  8. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the robot control method of any one of claims 1-6.
  9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions for causing a processor, when executing the instructions, to implement the robot control method of any one of claims 1-6.
  10. A computer program product, characterized in that the computer program product comprises a computer program which, when executed by a processor, implements the robot control method of any one of claims 1-6.
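The training procedure of claim 6 — sampling a diffusion step, corrupting the real action sequence according to a noise schedule, and regressing the added Gaussian noise with an MSE loss — can be sketched as a toy NumPy iteration. This is an assumption-laden illustration, not the patent's implementation: a plain linear map stands in for the diffusion transformer, and the schedule values (`betas`, step count, learning rate) are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
T = 100                                  # assumed total number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)       # assumed linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)     # cumulative signal-retention factors

def noising_step(x0, t):
    """Forward process: corrupt a clean action sequence x0 at step t,
    returning the noisy sequence and the target Gaussian noise."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return xt, eps

# one toy training iteration
x0 = rng.standard_normal((8, 4))         # action sequence: 8 steps x 4 DoF (illustrative)
t = int(rng.integers(0, T))              # randomly sampled diffusion step
xt, eps = noising_step(x0, t)
W = np.zeros((4, 4))                     # stand-in for the noise prediction network
pred = xt @ W                            # "predicted noise"
loss = np.mean((pred - eps) ** 2)        # MSE between predicted and target noise
grad = 2 * xt.T @ (xt @ W - eps) / xt.size
W -= 0.01 * grad                         # one gradient-descent parameter update
```

In the patent, `pred` would come from the diffusion transformer conditioned on the fused visual, text, body-state, and diffusion-step features; only the loss-and-update skeleton is shown here.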

Description

Technical Field

The present invention relates to the field of artificial intelligence, and in particular to a robot control method, apparatus, electronic device, storage medium, and program product.

Background

Existing robot control systems generally adopt imitation learning methods such as behavior cloning, regressing control commands directly from sensor observations (images, force feedback, positions, and the like) fed into a model. Such single-task policies can accomplish specific actions with limited demonstration data, but are generally fragile under task changes, environmental disturbances, or complex long-horizon action sequences. In recent years, researchers have proposed the concept of "Large Behavior Models (LBMs)", which enhance generalization by pre-training on large-scale multi-task datasets followed by fine-tuning on specific tasks. Numerous experiments have demonstrated that a multi-task pre-trained model requires fewer demonstrations on a new task to achieve a similar or even better success rate than a single-task model trained from scratch. Within the large-behavior-model framework, the Diffusion Policy has attracted attention because it can generate long-horizon, multi-modal actions. Diffusion models have made breakthrough progress in the field of image generation by progressively denoising from Gaussian noise to obtain outputs of complex structure. Recent work applies the diffusion idea to robot motion generation, so that a model can not only output multi-modal, continuous action sequences but also share model structure and parameters across different tasks.
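The progressive denoising described above can be made concrete with the deterministic update of a denoising diffusion implicit model (DDIM). The following minimal NumPy sketch is illustrative only: the noise schedule and step indices are assumed values, and the exact predicted noise is supplied by hand, whereas in the patent it would come from the diffusion-transformer noise prediction network.

```python
import numpy as np

def ddim_step(xt, eps_pred, t, t_prev, alphas_bar):
    """One deterministic DDIM update: estimate the clean sample x0 from
    the predicted noise, then re-project it to the earlier step t_prev."""
    a_t, a_prev = alphas_bar[t], alphas_bar[t_prev]
    x0_hat = (xt - np.sqrt(1.0 - a_t) * eps_pred) / np.sqrt(a_t)
    return np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_pred

# demo with an assumed linear schedule and an oracle noise prediction
rng = np.random.default_rng(0)
alphas_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 50))
x0 = rng.standard_normal(4)                        # "clean" action vector
eps = rng.standard_normal(4)                       # true corrupting noise
t = 40
xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
x_prev = ddim_step(xt, eps, t, 10, alphas_bar)     # jump from step 40 to step 10
```

With the true noise as the prediction, the implied x0 estimate inside `ddim_step` is exact; with a learned predictor, iterating this update over a decreasing step sequence yields the denoised action sequence.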
However, existing multi-task diffusion policies often use only visual perception and lack high-level natural-language instruction guidance, or their model structure suits only a single task and cannot handle multi-object operation, so the accuracy of the determined robot action sequence is poor. How to overcome the difficulty of task discrimination caused by the lack of language-instruction guidance in conventional multi-task policies, and how to improve the accuracy of determining the robot action sequence, have therefore become urgent problems to be solved.

Disclosure of Invention

The invention provides a robot control method and device, an electronic device, a storage medium, and a program product, to solve the prior-art problems that the model structure suits only a single task and the accuracy of the robot action sequence is low. According to one aspect of the present invention, there is provided a robot control method, the method comprising: acquiring visual data of an operation scene of a target robot, text instruction data of a target task, and body state features of the target robot, and respectively determining visual features of the visual data and text features of the text instruction data; determining multi-modal features according to the visual features, the text features, the body state features, and a preset diffusion step; sampling initial motion noise from a Gaussian distribution, determining target multi-modal features according to the initial motion noise and the multi-modal features, and determining a robot action sequence corresponding to the target multi-modal features based on a preset diffusion model, wherein the preset diffusion model is constructed from a noise prediction network and a denoising diffusion implicit model, and the noise prediction network is formed by a diffusion transformer; and controlling the target robot to execute the target task based on the robot action sequence. According to another aspect of the present invention, there is provided a robot control device, the device comprising: a feature acquisition module, used for acquiring visual data of an operation scene of a target robot, text instruction data of a target task, and body state features of the target robot, and respectively determining visual features of the visual data and text features of the text instruction data; a feature fusion module, used for determining multi-modal features according to the visual features, the text features, the body state features, and a preset diffusion step; an action determination module, used for sampling initial motion noise from a Gaussian distribution, determining target multi-modal features according to the initial motion noise and the multi-modal features, and determining a robot action sequence corresponding to the target multi-modal features based on a preset diffusion model, wherein the preset diffusion model is constructed from a noise prediction network and