
CN-121981152-A - Robot operation control method and device based on explicit intermediate visual representations

CN 121981152 A

Abstract

The invention provides a robot operation control method and device based on explicit intermediate visual representations. The method comprises: receiving a current scene image from the robot's viewpoint and a task instruction; generating a control token based on the current task-execution state and determining, from the control token, whether to trigger a reasoning mode or an action mode; in the reasoning mode, performing temporal reasoning over the current scene image, the task instruction, and the history of completed subtasks to obtain the next subtask, performing spatial reasoning on that subtask to obtain its spatial operation intent, and converting the intent into an explicit intermediate visual representation of the next subtask; and, if the control token triggers the action mode, generating a robot action sequence from the explicit intermediate visual representation and executing it to control the robot. The invention significantly improves the success rate on benchmark tasks and the robustness to complex dynamic scenes.

Inventors

  • ZHAO ZHONGXIA
  • TAN HUAJIE
  • XU XIANGQI
  • XU YIJIE
  • ZHANG SHANGHANG

Assignees

  • Beijing Academy of Artificial Intelligence (北京智源人工智能研究院)

Dates

Publication Date
2026-05-05
Application Date
2025-12-22

Claims (10)

  1. A robot operation control method based on explicit intermediate visual representations, comprising: receiving a current scene image from the robot's viewpoint and a task instruction; generating a control token based on the current task-execution state, and determining from the control token whether to trigger a reasoning mode or an action mode; when the control token triggers the reasoning mode, performing temporal reasoning over the current scene image, the task instruction, and the history of completed subtasks to obtain a next subtask, performing spatial reasoning on the next subtask to obtain its spatial operation intent, and converting the spatial operation intent into an explicit intermediate visual representation of the next subtask, wherein the explicit intermediate visual representation is a set of geometric primitives rendered on the current scene image to express the spatial operation intent of the next subtask; and when the control token triggers the action mode, generating a robot action sequence through a flow-matching model based on the explicit intermediate visual representation of the next subtask and the existing explicit intermediate visual representation, and executing the robot action sequence to control the robot.
  2. The robot operation control method based on explicit intermediate visual representations of claim 1, wherein the current task-execution state comprises a task-suspension state or a normal subtask-execution state, the task-suspension state comprising at least one of subtask completion, an operation error, and external intervention; and wherein generating the control token based on the current task-execution state and determining from the control token whether to trigger the reasoning mode or the action mode comprises: generating a start-reasoning token to trigger the reasoning mode when the current task-execution state is the task-suspension state; and generating a start-action token to trigger the action mode when the current task-execution state is the normal subtask-execution state.
  3. The robot operation control method based on explicit intermediate visual representations of claim 1, wherein the set of geometric primitives comprises: a bounding box for framing a target object; a key point for specifying a precise interaction location; and an arrow for indicating a movement direction, the arrow including a translation arrow defined by start and stop key points and a rotation arrow defined by a rotation center, a rotation axis, and a rotation direction.
  4. The robot operation control method based on explicit intermediate visual representations according to claim 1, wherein converting the spatial operation intent of the next subtask into the explicit intermediate visual representation of the next subtask comprises: encoding the current scene image and the rendered explicit intermediate visual representation with a pre-trained vision encoder to obtain a visual token sequence, the visual token sequence being the encoded features; feeding the visual token sequence, the tokenized task instruction, and the historical subtask text to a Transformer-based large language model, which autoregressively generates, from the spatial operation intent of the next subtask, a text description of the explicit intermediate visual representation corresponding to the next subtask; and determining the explicit intermediate visual representation of the next subtask from that text description.
  5. The method according to claim 4, wherein the flow-matching model is a conditional generative model based on flow matching, and wherein generating a robot action sequence through the flow-matching model based on the explicit intermediate visual representation of the next subtask and the existing explicit intermediate visual representation comprises: using the robot's proprioceptive state and the latent representation output by the Transformer-based large language model as conditions, solving an ordinary differential equation with the flow-matching conditional generative model to generate a continuous robot action sequence, wherein the latent representation is obtained by encoding the explicit intermediate visual representation of the next subtask, the existing explicit intermediate visual representation, the current scene image, the task instruction, and the history of completed subtasks.
  6. The robot operation control method based on explicit intermediate visual representations of any one of claims 1 to 5, further comprising, before executing the robot action sequence to control the robot: presenting the currently generated explicit intermediate visual representation to a target user; receiving the target user's editing input on the currently generated explicit intermediate visual representation to obtain an edited explicit intermediate visual representation; and updating the robot action sequence according to the edited explicit intermediate visual representation.
  7. The robot operation control method based on explicit intermediate visual representations according to any one of claims 1 to 5, wherein the method is performed by a target large language model trained by multi-stage curriculum learning, the multi-stage curriculum learning comprising: basic pre-training of an initial large model on a spatiotemporal-understanding dataset to obtain a first large language model, the spatiotemporal-understanding dataset comprising visual grounding, spatial orientation, and scene-understanding data; training the first large language model on a paired dataset to obtain a second large language model, the paired dataset comprising task instructions, subtask sequences, and the explicit intermediate visual representations corresponding to the subtask sequences; and jointly training the mode-switching step and the action-sequence-generation step, training the second large language model with a mode-balanced sampling strategy to obtain the target large language model.
  8. A robot operation control device based on explicit intermediate visual representations, comprising: a receiving module for receiving a current scene image from the robot's viewpoint and a task instruction; a mode-control module for generating a control token based on the current task-execution state and determining from the control token whether to trigger a reasoning mode or an action mode; a reasoning module for, when the control token triggers the reasoning mode, performing temporal reasoning over the current scene image, the task instruction, and the history of completed subtasks to obtain a next subtask, performing spatial reasoning on the next subtask to obtain its spatial operation intent, and converting the spatial operation intent into an explicit intermediate visual representation of the next subtask; and an action generation and execution module for, when the control token is determined to trigger the action mode, generating a robot action sequence through a flow-matching model based on the explicit intermediate visual representation of the next subtask and the existing explicit intermediate visual representation, and executing the robot action sequence to control the robot.
  9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the robot operation control method based on explicit intermediate visual representations according to any one of claims 1 to 7.
  10. A non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the robot operation control method based on explicit intermediate visual representations according to any one of claims 1 to 7.
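
The control-token mode switching of claims 1 and 2 can be sketched as follows. This is an illustrative Python sketch: the literal token strings and the state encoding are assumptions, as the claims do not specify them.

```python
from enum import Enum, auto

class TaskState(Enum):
    """Task-execution states from claim 2."""
    SUBTASK_RUNNING = auto()  # subtask executing normally
    SUSPENDED = auto()        # subtask completed, operation error, or external intervention

# Hypothetical literal forms of the control tokens (not specified in the claims).
START_REASONING_TOKEN = "<reason>"
START_ACTION_TOKEN = "<act>"

def generate_control_token(state: TaskState) -> str:
    """Map the current task-execution state to a control token."""
    if state is TaskState.SUSPENDED:
        return START_REASONING_TOKEN  # trigger temporal/spatial reasoning for the next subtask
    return START_ACTION_TOKEN         # keep generating low-level actions for the current subtask
```

The key design point is that the same model both plans and acts, and the control token is what routes a given step to one branch or the other.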
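The geometric primitive set of claim 3 admits a direct data-structure sketch. Field names and coordinate conventions below are illustrative assumptions; the claim fixes only the primitive types (bounding box, key point, translation arrow, rotation arrow).

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class BoundingBox:
    """Frames the target object (image-pixel coordinates assumed)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

@dataclass
class KeyPoint:
    """A precise interaction location."""
    x: float
    y: float

@dataclass
class TranslationArrow:
    """Movement direction defined by start and stop key points."""
    start: KeyPoint
    stop: KeyPoint

@dataclass
class RotationArrow:
    """Rotation defined by a center, an axis, and a direction."""
    center: KeyPoint
    axis: Tuple[float, float, float]  # rotation axis as a 3-vector
    clockwise: bool                   # rotation direction

# An explicit intermediate visual representation is then a set of such
# primitives rendered as an overlay on the current scene image.
push = TranslationArrow(start=KeyPoint(150, 110), stop=KeyPoint(300, 110))
```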
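Claim 5 generates the action sequence by solving an ordinary differential equation with a conditional flow-matching model. A minimal sketch of that sampling step, assuming a learned velocity field `v(x, t, c)` and simple Euler integration (the patent specifies neither the network nor the solver):

```python
import numpy as np

def sample_action_sequence(velocity_field, condition, horizon=8, action_dim=7,
                           steps=10, seed=0):
    """Euler-integrate dx/dt = v(x, t, c) from t=0 (Gaussian noise) to t=1,
    yielding a continuous action chunk of shape (horizon, action_dim)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, action_dim))  # noise sample at t=0
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_field(x, t, condition)  # one Euler step
    return x

# Toy stand-in for the trained conditional velocity field: flow toward the
# conditioning vector (illustrative only; the real field is a neural network
# conditioned on proprioception and the LLM's latent representation).
c = np.ones((8, 7))
actions = sample_action_sequence(lambda x, t, cond: cond - x, c)
```

With the toy field, samples contract toward the conditioning target as t goes from 0 to 1, which is the qualitative behavior the trained model would exhibit with respect to the action distribution.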

Description

Robot operation control method and device based on explicit intermediate visual representations

Technical Field

The invention relates to the technical field of robots, and in particular to a robot operation control method and device based on explicit intermediate visual representations.

Background

Long-horizon robotic manipulation is increasingly important for real-world deployment; it requires spatial disambiguation in complex layouts and temporal flexibility under dynamic interactions. However, existing end-to-end and hierarchical Vision-Language-Action (VLA) policies typically rely on plain-text cues alone, hiding planning intent in latent representations. This weakens instruction grounding in cluttered or under-specified scenes, hampers efficient task decomposition for long-horizon goals with closed-loop interaction, and limits causal interpretability by obscuring the rationale behind action selection. Prior frameworks therefore suffer from spatial ambiguity and temporal fragility on long-horizon robot manipulation tasks, and an effective solution to these problems is needed.

Disclosure of Invention

The invention provides a robot operation control method and device based on explicit intermediate visual representations, which address the spatial ambiguity and temporal fragility of the prior art and achieve marked improvements in success rate on benchmark tasks and strong performance on spatially complex tasks.
In a first aspect, the invention provides a robot operation control method based on explicit intermediate visual representations, the method comprising: receiving a current scene image from the robot's viewpoint and a task instruction; generating a control token based on the current task-execution state, and determining from the control token whether to trigger a reasoning mode or an action mode; when the control token triggers the reasoning mode, performing temporal reasoning over the current scene image, the task instruction, and the history of completed subtasks to obtain a next subtask, performing spatial reasoning on the next subtask to obtain its spatial operation intent, and converting the spatial operation intent into an explicit intermediate visual representation of the next subtask, wherein the explicit intermediate visual representation is a set of geometric primitives rendered on the current scene image to express the spatial operation intent of the next subtask; and when the control token triggers the action mode, generating a robot action sequence through a flow-matching model based on the explicit intermediate visual representation of the next subtask and the existing explicit intermediate visual representation, and executing the robot action sequence to control the robot.
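
In claim 4, the conversion step has the large language model autoregressively emit a text description of the explicit intermediate visual representation, which is then decoded into renderable primitives. The patent does not specify the description syntax; the `box(...)`/`point(...)`/`arrow(...)` grammar below is a hypothetical illustration of such a decoding step.

```python
import re

def parse_representation(text: str) -> dict:
    """Decode a textual primitive description into parameter tuples
    (hypothetical grammar: box(x1,y1,x2,y2), point(x,y), arrow(x1,y1->x2,y2))."""
    boxes = [tuple(map(float, m)) for m in re.findall(
        r"box\(([\d.]+),([\d.]+),([\d.]+),([\d.]+)\)", text)]
    points = [tuple(map(float, m)) for m in re.findall(
        r"point\(([\d.]+),([\d.]+)\)", text)]
    arrows = [tuple(map(float, m)) for m in re.findall(
        r"arrow\(([\d.]+),([\d.]+)->([\d.]+),([\d.]+)\)", text)]
    return {"boxes": boxes, "points": points, "arrows": arrows}

# Example model output for a "push the cup to the right" style subtask.
desc = "box(120,80,200,160) point(150,110) arrow(150,110->300,110)"
rep = parse_representation(desc)
```

The decoded parameters would then be rendered as an overlay on the current scene image to form the explicit intermediate visual representation.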
According to the robot operation control method based on explicit intermediate visual representations, the current task-execution state comprises a task-suspension state or a normal subtask-execution state, the task-suspension state comprising at least one of subtask completion, an operation error, and external intervention; generating the control token based on the current task-execution state and determining from the control token whether to trigger the reasoning mode or the action mode comprises: generating a start-reasoning token to trigger the reasoning mode when the current task-execution state is the task-suspension state; and generating a start-action token to trigger the action mode when the current task-execution state is the normal subtask-execution state. According to the robot operation control method based on explicit intermediate visual representations, the set of geometric primitives comprises: a bounding box for framing a target object; a key point for specifying a precise interaction location; and an arrow for indicating a movement direction, the arrow including a translation arrow defined by start and stop key points and a rotation arrow defined by a rotation center, a rotation axis, and a rotation direction. According to the robot operation control method based on explicit intermediate visual representations, converting the spatial operation intent of the next subtask into the explicit intermediate visual representation of the next subtask comprises: encoding the current scene image and the rendered explicit intermediate visual representation with a pre-trained vision encoder to obtain a visual token sequence, the visual token sequence being the encoded features; feeding the visual token sequence, the tokenized task instruction, and the historical subtask text to a Transformer-based large language model