
CN-122008241-A - Object grabbing action track generation method and device

CN 122008241 A

Abstract

The invention discloses a method and a device for generating an object-grasping action trajectory. The method comprises the following steps: obtaining multi-modal input information; processing the multi-modal input information to obtain language features, visual features, and a quantity condition parameter; and inputting the language features, the visual features, the quantity condition parameter, and a noisy action trajectory into a pre-trained conditional diffusion model to generate an object-grasping action trajectory matched to the quantity condition parameter. According to the invention, the target grasp quantity is injected directly, as an independent and strong condition, into the end-to-end continuous trajectory generation process, so that the robot can truly understand quantity and generate complex, continuous object-grasping action trajectories suited to any number of objects.

Inventors

  • Jiang Chenchen
  • Zhang Jian
  • Chen Linjia
  • Zhang Ping
  • Feng Jun
  • Wang Shaobao

Assignees

  • 立昂(深圳)机器人科技有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-04-02

Claims (10)

  1. A method for generating an object-grasping action trajectory, characterized by comprising the following steps: acquiring multi-modal input information, wherein the multi-modal input information comprises a natural language instruction and visual data of the robot's working scene; processing the multi-modal input information to obtain language features, visual features, and a quantity condition parameter; and inputting the language features, the visual features, the quantity condition parameter, and random noise into a pre-trained conditional diffusion model to generate an object-grasping action trajectory matched to the quantity condition parameter.
  2. The method according to claim 1, wherein the step of processing the multi-modal input information to obtain language features, visual features, and a quantity condition parameter comprises the sub-steps of: encoding the natural language instruction and the visual data in the multi-modal information respectively to obtain the language features and the visual features; and parsing the target grasp quantity from the natural language instruction as the quantity condition parameter.
  3. The method according to claim 2, wherein parsing the target grasp quantity from the natural language instruction as the quantity condition parameter comprises: identifying grasp-quantity information in the natural language instruction; if the grasp-quantity information is an exact number word, directly extracting the exact number word as the quantity condition parameter; and if the grasp-quantity information is a fuzzy number word, inputting the natural language instruction into a pre-trained quantity prediction model, which outputs a specific numerical value as the quantity condition parameter.
  4. The method according to claim 1, wherein the step of inputting the language features, the visual features, the quantity condition parameter, and random noise into a pre-trained conditional diffusion model to generate an object-grasping action trajectory matched to the quantity condition parameter comprises the sub-steps of: sampling random noise from a standard normal distribution; and inputting the language features, the visual features, the quantity condition parameter, and the random noise into the pre-trained conditional diffusion model, and iteratively reverse-denoising the random noise until a preset number of reverse denoising steps is completed, so as to generate the object-grasping action trajectory matched to the quantity condition parameter (a minimal sampling sketch follows the claims).
  5. The method according to claim 1, wherein the language features and the visual features are injected into the conditional diffusion model through a cross-attention mechanism, and the quantity condition parameter is injected into the conditional diffusion model through a global injection mechanism; in the global injection mechanism, the quantity condition parameter is converted into a condition vector, and the condition vector is fused by addition with the features of each layer of the conditional diffusion model during denoising (see the conditioning sketch after the claims).
  6. The method according to claim 5, wherein the generated grasping action trajectory adaptively adjusts its shape according to the quantity condition parameter and the visual features; when the quantity condition parameter is greater than a preset threshold and the target objects indicated by the visual features are spatially dispersed, the grasping action trajectory automatically incorporates a complex gather-then-grasp strategy.
  7. A method of training the conditional diffusion model used in the method according to any one of claims 1 to 6, comprising the steps of: constructing a training data set in which each sample comprises visual data, a language instruction, an expert action trajectory, and the corresponding target grasp quantity; encoding the visual data and the language instruction to obtain visual features and language features; forward-diffusing the expert action trajectory by gradually adding noise to obtain a noisy trajectory at a set diffusion step, wherein the noise added at each step is determined by a preset noise-variance sequence; taking the noisy trajectory, the diffusion step, the visual features, the language features, and the target grasp quantity as inputs to the conditional diffusion model and predicting the noise added to the expert action trajectory, to obtain a predicted noise; and constructing a loss function based on the predicted noise and the actually added noise, optimizing the model parameters by minimizing the loss function, and iterating until convergence to obtain the pre-trained conditional diffusion model (a minimal training sketch follows the claims).
  8. A robot grasping method, characterized by comprising the method for generating an object-grasping action trajectory according to any one of claims 1 to 6, and further comprising the step of transmitting the generated object-grasping action trajectory to a motion controller of a robot to drive an actuator to execute the grasping action.
  9. An apparatus for generating an object-grasping action trajectory, comprising: an information acquisition module for acquiring multi-modal input information, wherein the multi-modal input information comprises a natural language instruction and visual data of the robot's working scene; an information processing module for processing the multi-modal input information to obtain language features, visual features, and a quantity condition parameter; and a trajectory generation module for inputting the language features, the visual features, the quantity condition parameter, and random noise into a pre-trained conditional diffusion model to generate an object-grasping action trajectory matched to the quantity condition parameter.
  10. A robot comprising the apparatus for generating an object-grasping action trajectory according to claim 9, further comprising: a motion controller for receiving the generated object-grasping action trajectory and driving an actuator based on the trajectory; and the actuator, for executing the grasping task under the drive of the motion controller.
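
The following PyTorch sketch illustrates claims 4 and 5. It is an illustrative reconstruction under assumed dimensions and module names (D_MODEL, HORIZON, ACTION_DIM, QuantityConditionedDenoiser), not the patent's disclosed implementation: each layer injects the language and visual features as cross-attention context, while the quantity condition vector is fused into every layer's features by addition (the global injection mechanism of claim 5); generate_trajectory then performs the iterative reverse denoising of claim 4, starting from standard-normal noise.

    # Illustrative sketch of claims 4-5; all sizes, names, and the noise
    # schedule are assumptions, not taken from the patent.
    import torch
    import torch.nn as nn

    D_MODEL, ACTION_DIM, HORIZON, N_STEPS = 256, 7, 64, 100

    class DenoiserLayer(nn.Module):
        def __init__(self):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(D_MODEL, 4, batch_first=True)
            # Cross-attention route: language + visual features (claim 5).
            self.cross_attn = nn.MultiheadAttention(D_MODEL, 4, batch_first=True)
            self.ff = nn.Sequential(nn.Linear(D_MODEL, 4 * D_MODEL), nn.GELU(),
                                    nn.Linear(4 * D_MODEL, D_MODEL))

        def forward(self, x, context, qty_vec):
            x = x + self.self_attn(x, x, x)[0]
            x = x + self.cross_attn(x, context, context)[0]
            # Global injection route (claim 5): the quantity condition vector
            # is fused with this layer's features by simple addition.
            x = x + qty_vec.unsqueeze(1)
            return x + self.ff(x)

    class QuantityConditionedDenoiser(nn.Module):
        def __init__(self, n_layers=4):
            super().__init__()
            self.act_in = nn.Linear(ACTION_DIM, D_MODEL)
            self.qty_embed = nn.Sequential(nn.Linear(1, D_MODEL), nn.SiLU(),
                                           nn.Linear(D_MODEL, D_MODEL))
            self.step_embed = nn.Embedding(N_STEPS, D_MODEL)
            self.layers = nn.ModuleList([DenoiserLayer() for _ in range(n_layers)])
            self.act_out = nn.Linear(D_MODEL, ACTION_DIM)

        def forward(self, noisy_traj, t, lang_feat, vis_feat, quantity):
            context = torch.cat([lang_feat, vis_feat], dim=1)         # (B, L, D)
            qty_vec = self.qty_embed(quantity.float().unsqueeze(-1))  # (B, D)
            x = self.act_in(noisy_traj) + self.step_embed(t).unsqueeze(1)
            for layer in self.layers:
                x = layer(x, context, qty_vec)
            return self.act_out(x)  # predicted noise, same shape as trajectory

    @torch.no_grad()
    def generate_trajectory(model, lang_feat, vis_feat, quantity, betas):
        # Claim 4: start from noise sampled from a standard normal distribution,
        # then iteratively reverse-denoise for a preset number of steps.
        alphas = 1.0 - betas                     # betas: preset variance sequence
        alpha_bar = torch.cumprod(alphas, dim=0)
        x = torch.randn(lang_feat.size(0), HORIZON, ACTION_DIM)
        for t in reversed(range(N_STEPS)):
            t_batch = torch.full((x.size(0),), t, dtype=torch.long)
            eps = model(x, t_batch, lang_feat, vis_feat, quantity)
            x = (x - betas[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps) \
                / torch.sqrt(alphas[t])
            if t > 0:                            # no noise at the final step
                x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
        return x  # grasping trajectory matched to the quantity condition

Note how the two injection routes of claim 5 differ in scope: language and vision enter locally as attention context, while the quantity vector enters globally, added to every layer during denoising.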
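
For claim 7, a matching training-step sketch follows, reusing the denoiser above; the batch layout and optimizer interface are assumptions for illustration. The expert trajectory is forward-diffused with the preset noise-variance (beta) sequence, the model predicts the injected noise, and the mean-squared error between predicted and true noise is minimized.

    # Illustrative training step for claim 7; batch fields are assumed.
    import torch
    import torch.nn.functional as F

    def train_step(model, optimizer, batch, betas):
        # batch holds pre-encoded visual/language features, expert trajectories,
        # and the target grasp quantity parsed from each instruction.
        lang_feat, vis_feat = batch["lang_feat"], batch["vis_feat"]
        expert_traj, quantity = batch["expert_traj"], batch["quantity"]
        alpha_bar = torch.cumprod(1.0 - betas, dim=0)

        # Forward diffusion: q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I),
        # with the per-step noise fixed by the preset variance sequence betas.
        t = torch.randint(0, len(betas), (expert_traj.size(0),))
        noise = torch.randn_like(expert_traj)
        ab = alpha_bar[t].view(-1, 1, 1)
        noisy_traj = torch.sqrt(ab) * expert_traj + torch.sqrt(1.0 - ab) * noise

        # Predict the injected noise and minimize the MSE against the true noise.
        pred_noise = model(noisy_traj, t, lang_feat, vis_feat, quantity)
        loss = F.mse_loss(pred_noise, noise)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()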

Description

Object grabbing action track generation method and device

Technical Field

The invention relates to the technical field of artificial intelligence, and in particular to a method and a device for generating an object-grasping action trajectory.

Background

Vision-language-action (VLA) models are a leading direction for general-purpose robot intelligence, allowing robots to interact with the physical world through natural language. The most advanced VLA models in the industry (e.g., Google's RT series, the models of Physical Intelligence, and OpenVLA of UC Berkeley) have demonstrated powerful capabilities: they understand complex language instructions and visual scenes and directly generate continuous, multi-step robot motion trajectories for end-to-end control. Directly outputting high-frequency, dense action sequences abandons the earlier rigid paradigm of decomposing high-level instructions into predefined, discrete action primitives, so robot motion is smoother and more natural and adapts better to dynamic, unstructured environments.

However, despite these revolutionary breakthroughs in action representation and generation, advanced VLA models still have a fundamental weakness when understanding and executing a critical class of instructions: those containing the concept of quantity. For instructions that are very common in industry and daily life, such as "grasp three biscuits from the box", "pick up a screwdriver", or "grasp some apples", existing VLA models cannot accurately understand the meaning of number words such as "three", "one", and "some". When faced with a multi-object grasping task, existing VLA models typically exhibit one of the following behaviors: (1) degraded behavior, erroneously grasping only one object and ignoring the quantity demanded by the instruction; (2) no response, because the training data lacks a corresponding multi-object grasping paradigm and the model cannot generate any meaningful action; or (3) repeated single grasps, in which an external script or high-level planner calls the VLA model multiple times, grasping one object per call. The last approach is inefficient, does not match the intuitive human behavior of grasping several objects at once, and is not true multi-object grasping. At present, academia has no effective method for seamlessly integrating the high-level semantic concept of quantity, end to end, into the continuous trajectory generation of a VLA model. There is therefore a need for a new method and system that enables robots to truly understand quantity and directly generate complex, continuous grasping trajectories accommodating any number of objects.

Disclosure of Invention

Aiming at the defects of the prior art, the technical problem to be solved by the invention is to provide a method and a device for generating an object-grasping action trajectory that enable a robot to truly understand quantity and directly generate trajectories adapted to any quantity.
In order to solve the above technical problems, the invention adopts the following technical scheme. A method for generating an object-grasping action trajectory comprises the following steps: acquiring multi-modal input information, wherein the multi-modal input information comprises a natural language instruction and visual data of the robot's working scene; processing the multi-modal input information to obtain language features, visual features, and a quantity condition parameter; and inputting the language features, the visual features, the quantity condition parameter, and random noise into a pre-trained conditional diffusion model to generate an object-grasping action trajectory matched to the quantity condition parameter.

Further, the step of processing the multi-modal input information to obtain language features, visual features, and a quantity condition parameter includes the following sub-steps: encoding the natural language instruction and the visual data in the multi-modal information respectively to obtain the language features and the visual features; and parsing the target grasp quantity from the natural language instruction as the quantity condition parameter.

Further, parsing the target grasp quantity from the natural language instruction as the quantity condition parameter comprises: identifying grasp-quantity information in the natural language instruction; if the grasp-quantity information is an exact number word, directly extracting the exact number word as the quantity condition parameter; and if the grasp-quantity information is a fuzzy number word, inputting the natural language instruction into a pre-trained quantity prediction model, which outputs a specific numerical value as the quantity condition parameter, as sketched below.
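
As a concrete illustration of this parsing sub-step, the sketch below distinguishes exact number words from fuzzy ones; the word lists, the default of 1, and the quantity_predictor interface are assumptions, with the predictor standing in for the pre-trained quantity prediction model.

    # Illustrative parser for the quantity condition parameter; word lists,
    # the default of 1, and the predictor interface are assumptions.
    import re

    NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
    FUZZY_WORDS = ("some", "a few", "several", "a couple of")

    def parse_quantity(instruction, quantity_predictor=None):
        text = instruction.lower()
        match = re.search(r"\b(\d+)\b", text)       # explicit digits: "grab 3 apples"
        if match:
            return int(match.group(1))
        for word, value in NUMBER_WORDS.items():    # exact number words
            if re.search(rf"\b{word}\b", text):
                return value
        for fuzzy in FUZZY_WORDS:                   # fuzzy words -> the pre-trained
            if fuzzy in text:                       # quantity prediction model
                return quantity_predictor(instruction)
        return 1                                    # assumed default: single grasp

    # e.g. parse_quantity("grasp three biscuits from the box") -> 3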