CN-122008209-A - Robot control method and system based on visual chain-of-thought
Abstract
The invention discloses a robot control method based on visual chain-of-thought, comprising the following steps: S1, acquiring the robot's current visual observation and a natural language instruction; S2, generating, via a unified vision-language model and in an autoregressive manner, a sub-goal image representing an intermediate state of the task, based on the current visual observation and the natural language instruction; S3, generating an action sequence for controlling the robot based on the current visual observation, the task instruction, and the sub-goal image; and S4, executing the action sequence to control the robot to transition to the state represented by the sub-goal image. The invention also discloses a robot control system based on visual chain-of-thought for implementing this method, comprising a perception module, an instruction receiving module, a visual chain-of-thought reasoning module, and a control output module.
Inventors
- WANG MAOLIN
- ZHANG PENG
Assignees
- 深圳市金大智能创新科技有限公司 (Shenzhen Jinda Intelligent Innovation Technology Co., Ltd.)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-02-03
Claims (10)
- 1. A robot control method based on visual chain-of-thought, characterized by comprising the following steps: S1, acquiring the robot's current visual observation and a natural language instruction; S2, generating, via a unified vision-language model and in an autoregressive manner, a sub-goal image representing an intermediate state of the task, based on the current visual observation and the natural language instruction; S3, generating an action sequence for controlling the robot based on the current visual observation, the task instruction, and the sub-goal image; and S4, executing the action sequence to control the robot to transition to the state represented by the sub-goal image.
- 2. The visual chain-of-thought-based robot control method of claim 1, characterized in that, after the action sequence is generated and executed, the following steps are also performed: S5, acquiring a new visual observation of the robot; and S6, taking the new visual observation as the current visual observation and repeating the steps of claim 1, thereby performing closed-loop control until the task corresponding to the natural language instruction is completed (a minimal sketch of this closed loop is given after the claims).
- 3. The visual chain-of-thought-based robot control method of claim 1, wherein the action sequence in step S3 comprises action chunks, each consisting of actions over multiple consecutive time steps.
- 4. The visual chain-of-thought-based robot control method of claim 1, wherein the action sequence in step S3 is generated using a full-attention mechanism for action token prediction.
- 5. The visual chain-of-thought-based robot control method of claim 1, characterized in that the vision-language model is trained in a two-stage process comprising a pre-training stage and an adaptation stage: the pre-training stage trains the model on a combined dataset comprising robot demonstration data and action-free video data, with a training objective equal to the sum of the visual generation loss for sub-goal image generation and the action generation loss for action sequence generation; the adaptation stage fine-tunes the pre-trained model on task-specific data from the target robot scenario.
- 6. The visual chain-of-thought-based robot control method of claim 5, characterized in that the visual generation loss for sub-goal image generation is computed as $L_{\text{visual}} = -\sum_{j} \log P_\delta(k_{j,d} \mid k_{j,<d})$, where $P_\delta$ denotes the predicted probability under model parameters $\delta$, $k_{j,d}$ denotes the $d$-th discrete token of the $j$-th visual input, and $k_{j,<d}$ denotes all tokens preceding the $d$-th token; and the action generation loss for action sequence generation is computed as $L_{\text{action}} = -\log P_\theta(a_t, \dots, a_{t+m} \mid l, s_t, s_{t+n})$, where $P_\theta$ denotes the predicted probability under model parameters $\theta$, $a_t, \dots, a_{t+m}$ denotes the sequence of actions from time $t$ to $t+m$, $l$ denotes the natural language instruction, $s_t$ denotes the visual observation at the current time, and $s_{t+n}$ denotes the predicted sub-goal image.
- 7. The visual chain-of-thought-based robot control method of claim 5, characterized in that, during the pre-training stage, sub-goal image generation employs a causal attention mechanism while action sequence generation employs a full attention mechanism (see the attention-mask sketch after the claims).
- 8. The visual chain-of-thought-based robot control method of claim 5, characterized in that said robot demonstration data is a selected subset of the Open X-Embodiment dataset, and said action-free video data comprises the EPIC-KITCHENS-100 dataset and/or the Something-Something V2 dataset.
- 9. The visual chain-of-thought-based robot control method of claim 1, characterized in that said unified vision-language model is the VILA-U model.
- 10. A robot control system based on visual chain-of-thought, for implementing the visual chain-of-thought-based robot control method of any one of claims 1 to 8, comprising: a perception module for acquiring the current visual observation of the environment; an instruction receiving module for receiving natural language task instructions; a visual chain-of-thought reasoning module, connected to the perception module and the instruction receiving module, which generates a sub-goal image representing an intermediate state of the task based on the current visual observation and the task instruction, and generates an action sequence based on the current visual observation, the task instruction, and the sub-goal image; and a control output module, connected to the visual chain-of-thought reasoning module, for sending the action sequence to the robot for execution; wherein the perception module acquires a new visual observation after the robot performs the actions and provides it, as the subsequent current visual observation, to the visual chain-of-thought reasoning module, thereby forming closed-loop control (a module-level sketch also follows the claims).
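As an illustration of claims 1-4, the following is a minimal sketch of the claimed control loop. Every interface here (`VisionLanguageModel`, `generate_subgoal_image`, `generate_action_chunk`, and the `robot` object's methods) is a hypothetical placeholder; the patent specifies only the abstract steps S1-S6, not an API.

```python
# Minimal sketch of the closed-loop control method (claims 1-4).
# All names are hypothetical placeholders; the patent specifies only
# the abstract steps S1-S6, not a concrete API.

class VisionLanguageModel:
    """Stand-in for the unified vision-language model (e.g. a VILA-U-style model)."""

    def generate_subgoal_image(self, observation, instruction):
        # S2: autoregressively decode discrete visual tokens into a
        # sub-goal image depicting an intermediate state of the task.
        raise NotImplementedError

    def generate_action_chunk(self, observation, instruction, subgoal_image):
        # S3: predict an action chunk (actions for several consecutive
        # time steps) from observation, instruction, and sub-goal image.
        raise NotImplementedError


def control_loop(robot, model, instruction, max_steps=100):
    observation = robot.get_observation()                  # S1: current observation
    for _ in range(max_steps):
        subgoal = model.generate_subgoal_image(observation, instruction)          # S2
        actions = model.generate_action_chunk(observation, instruction, subgoal)  # S3
        for action in actions:                             # S4: execute the chunk
            robot.execute(action)
        observation = robot.get_observation()              # S5: new observation
        if robot.task_done(instruction):                   # S6: closed loop until done
            break
```

Note that the sub-goal image is produced before any action is selected; this ordering is the essence of the visual chain-of-thought reasoning step.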
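Claims 4 and 7 contrast two attention patterns: causal attention over the autoregressively generated sub-goal image tokens, and full (bidirectional) attention among the action tokens. Below is a minimal sketch of how such a combined attention mask could be built; the token layout and the PyTorch realization are illustrative assumptions, not taken from the patent.

```python
import torch

def build_attention_mask(n_prefix, n_image, n_action):
    """Sketch of a combined attention mask (claims 4 and 7).

    Assumed token layout (illustrative only):
      [prefix: instruction + observation | sub-goal image tokens | action tokens]
    True = attention allowed.
    """
    n = n_prefix + n_image + n_action
    # Start from a causal (lower-triangular) mask: each token attends to
    # itself and all earlier tokens, covering the prefix and the
    # autoregressive sub-goal image tokens (claim 7).
    mask = torch.tril(torch.ones(n, n)).bool()
    # Action tokens additionally attend to one another bidirectionally,
    # i.e. full attention within the action block (claim 4).
    a0 = n_prefix + n_image
    mask[a0:, a0:] = True
    return mask

# Example: 4 prefix tokens, 3 image tokens, 2 action tokens.
print(build_attention_mask(4, 3, 2).int())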
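Claim 10's system can likewise be sketched as four cooperating modules. The class names and wiring below are hypothetical illustrations of the claimed decomposition, not an implementation disclosed by the patent.

```python
# Sketch of the claimed system architecture (claim 10).
# All names are illustrative placeholders.

class PerceptionModule:
    def __init__(self, camera):
        self.camera = camera

    def observe(self):
        return self.camera.capture()        # current visual observation

class InstructionModule:
    def receive(self):
        return input("task instruction: ")  # natural language task instruction

class ReasoningModule:
    """Visual chain-of-thought reasoning module."""
    def __init__(self, model):
        self.model = model

    def plan(self, observation, instruction):
        subgoal = self.model.generate_subgoal_image(observation, instruction)
        return self.model.generate_action_chunk(observation, instruction, subgoal)

class ControlOutputModule:
    def __init__(self, robot):
        self.robot = robot

    def dispatch(self, actions):
        for action in actions:
            self.robot.execute(action)
```

A supervisory loop chaining observe, plan, dispatch, and observe again reproduces the closed-loop control described in the claim.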
Description
Robot control method and system based on visual chain-of-thought

Technical Field
The invention relates to the technical fields of robot learning, computer vision, and natural language processing, and in particular to a robot control method and system based on visual chain-of-thought.

Background
In recent years, vision-language-action (VLA) models have demonstrated great potential for learning generalizable sensorimotor control policies by leveraging large-scale pre-trained vision-language models (VLMs) and diverse robot demonstration data. However, current mainstream VLA models mainly build a direct mapping from observations to actions, without explicitly modeling the key intermediate reasoning steps of complex tasks. As a result, when facing tasks that require multi-step planning, complex environment understanding, or prediction of future states, existing models often show deficiencies in temporal planning and high-level reasoning, which limits their utility in complex real-world scenarios. In natural language processing, chain-of-thought prompting has become an effective paradigm for improving the reasoning capacity of large language models. Applying this style of reasoning to robot control in a visual form offers a new way to integrate high-level reasoning into the perception-action loop. Although prior work has attempted to introduce language descriptions, keypoints, or intermediate representations such as bounding boxes to enhance robot reasoning, these methods often require additional preprocessing pipelines, capture only abstract states, and cannot directly guide a robot to plan actions accurately in pixel space. How to perform intermediate reasoning in a more natural and direct visual form, while effectively exploiting massive unlabeled video data, therefore remains an urgent technical problem. The present invention has been made in view of the above drawbacks.

Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a robot control method based on visual chain-of-thought, which introduces explicit visual chain-of-thought reasoning before the robot executes actions: it predicts a future sub-goal image as an intermediate reasoning result and generates an accurate action sequence from that result. The invention is realized by the following technical scheme. A robot control method based on visual chain-of-thought comprises the following steps: S1, acquiring the robot's current visual observation and a natural language instruction; S2, generating, via a unified vision-language model and in an autoregressive manner, a sub-goal image representing an intermediate state of the task, based on the current visual observation and the natural language instruction; S3, generating an action sequence for controlling the robot based on the current visual observation, the task instruction, and the sub-goal image; and S4, executing the action sequence to control the robot to transition to the state represented by the sub-goal image.
In the robot control method based on visual chain-of-thought as described above, after the action sequence is generated and executed, the following steps are also performed: S5, acquiring a new visual observation of the robot; and S6, taking the new visual observation as the current visual observation and repeating the above steps, thereby performing closed-loop control until the task corresponding to the natural language instruction is completed. In the robot control method based on visual chain-of-thought, the action sequence in step S3 comprises action chunks, each consisting of actions over multiple consecutive time steps. In the robot control method based on visual chain-of-thought, the action sequence in step S3 is generated using a full-attention mechanism for action token prediction. In the robot control method based on visual chain-of-thought, the vision-language model is trained in a two-stage process comprising a pre-training stage and an adaptation stage: the pre-training stage trains the model on a combined dataset containing robot demonstration data and action-free video data, with a training objective equal to the sum of the visual generation loss for sub-goal image generation and the action generation loss for action sequence generation; the adaptation stage fine-tunes the pre-trained model on task-specific data from the target robot scenario. In the robot control method based on visual chain-of-thought, the visual generation loss for sub-goal image generation is computed as $L_{\text{visual}} = -\sum_{j} \log P_\delta(k_{j,d} \mid k_{j,<d})$, where $P_\delta$ denotes the predicted probability under model parameters $\delta$, $k_{j,d}$ denotes the $d$-th discrete token of the $j$-th visual input, and $k_{j,<d}$ denotes all tokens preceding the $d$-th token.
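To make the combined pre-training objective concrete, the sketch below computes $L_{\text{visual}}$ as a summed cross-entropy over the discrete sub-goal image tokens and adds the action loss $L_{\text{action}}$. The tensor shapes, and the assumption that actions are discretized into tokens (suggested by the "action token prediction" of the method), are illustrative only; this is not the patent's implementation.

```python
import torch
import torch.nn.functional as F

def pretraining_loss(image_logits, image_tokens, action_logits, action_tokens):
    """Sketch of the combined pre-training objective: L = L_visual + L_action.

    Assumed shapes (illustrative only):
      image_logits:  (num_image_tokens, vocab_size)   - next-token predictions
      image_tokens:  (num_image_tokens,)              - ground-truth tokens k_{j,d}
      action_logits: (num_actions, action_vocab_size) - predictions for a_t ... a_{t+m}
      action_tokens: (num_actions,)                   - ground-truth action tokens
    """
    # L_visual = -sum_j log P_delta(k_{j,d} | k_{j,<d}):
    # cross-entropy summed over the autoregressive visual token stream.
    l_visual = F.cross_entropy(image_logits, image_tokens, reduction="sum")
    # L_action = -log P_theta(a_t ... a_{t+m} | l, s_t, s_{t+n}):
    # the joint log-likelihood factorizes into per-action-token terms.
    l_action = F.cross_entropy(action_logits, action_tokens, reduction="sum")
    return l_visual + l_action
```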