CN-121696992-B - Robot control method, system, terminal and storage medium based on semantic embedding and attention double consistency

Abstract

The invention relates to the technical field of intelligent control and discloses a robot control method, system, terminal and storage medium based on semantic embedding and attention double consistency. The method comprises: obtaining an action sequence of a target robot; deriving a real multi-modal semantic characterization, a real attention distribution, a simulated multi-modal semantic characterization and a simulated attention distribution from the action sequence; constructing a loss function from these four quantities; training the student branch of a vision-language-action model with the loss function; and inputting test perturbed images and test language instructions into the trained student branch, which outputs control instructions for controlling the target robot. By explicitly constraining attention consistency and suppressing the patch, the invention shifts cross-modal attention from the patch region to the real, task-relevant target region and reduces the interference of adversarial patches with the model's attention logic. A lightweight fine-tuning strategy markedly improves model robustness and raises the task execution success rate.

Inventors

  • YIN DONGFU
  • ZHANG JINQUAN
  • LENG JIAN
  • YANG RUN
  • DONG JUNXIN
  • LIN ZHAO

Assignees

  • Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen) [人工智能与数字经济广东省实验室(深圳)]

Dates

Publication Date
2026-05-08
Application Date
2026-02-13

Claims (9)

  1. A robot control method based on semantic embedding and attention double consistency, characterized by comprising the following steps: acquiring a natural language instruction input by a user, and acquiring a plurality of observation images collected by a target robot based on the natural language instruction; constructing a vision-language-action model, inputting all the observation images and the natural language instruction into the vision-language-action model, and outputting a real multi-modal high-dimensional characterization, a real attention probability matrix, a simulated multi-modal high-dimensional characterization and a simulated attention probability matrix, which comprises: constructing a vision-language-action model formed by a teacher branch and a student branch in parallel; inputting all the observation images and the natural language instruction into the teacher branch, and outputting the real multi-modal high-dimensional characterization and the real attention probability matrix; and patching all the observation images, inputting the patched images into the student branch, and outputting the simulated multi-modal high-dimensional characterization and the simulated attention probability matrix; constructing a semantic consistency loss function from the real multi-modal high-dimensional characterization and the simulated multi-modal high-dimensional characterization, and constructing an attention consistency loss function from the real attention probability matrix and the simulated attention probability matrix; and training the student branch in the vision-language-action model according to the semantic consistency loss function and the attention consistency loss function, inputting test perturbed images and test language instructions into the trained student branch, and outputting control instructions to control the target robot.
  2. The robot control method based on semantic embedding and attention double consistency according to claim 1, wherein acquiring the natural language instruction input by the user and acquiring the plurality of observation images collected by the target robot based on the natural language instruction specifically comprises: acquiring the natural language instruction input by the user, and inputting the natural language instruction into the target robot to control the target robot to move; collecting, by the target robot using an observation camera, observation images at a plurality of time points during the movement; and modeling the natural language instruction and each observation image to obtain an action sequence of the target robot, wherein the action sequence from time point 1 to T is the output of the vision-language-action model, with its model parameters, given the observation images at time points 1 to T and the natural language instruction.
  3. The robot control method based on semantic embedding and attention double consistency according to claim 1, wherein inputting all the observation images and the natural language instruction into the teacher branch and outputting the real multi-modal high-dimensional characterization and the real attention probability matrix specifically comprises: freezing all parameters of the teacher branch, inputting all the observation images and the natural language instruction into the teacher branch, and outputting the real multi-modal high-dimensional characterization; extracting a set of key layers in the vision-language-action model, wherein the set of key layers consists of the layers in which the natural language instruction is aligned with all observation image regions; and defining the natural language instruction as a real query vector, defining all the observation image regions as real key-value vectors, extracting a real attention probability distribution in each key layer from the real query vector and the real key-value vectors, and constructing the real attention probability matrix from all the real attention probability distributions.
  4. The robot control method based on semantic embedding and attention double consistency according to claim 1, wherein patching all the observation images, inputting the patched images into the student branch, and outputting the simulated multi-modal high-dimensional characterization and the simulated attention probability matrix comprises the following steps: superimposing a randomly transformed universal patch on the observation image corresponding to each time point, so that a corresponding patch image is obtained after a random geometric transformation has been applied to the universal patch, wherein the t-th patch image is formed from the t-th observation image, a binary mask representing the patch location on the patch image, and the geometric transformation operator applied to the patch P, combined by element-wise multiplication, the transformation being a random geometric transformation operation; inputting all the patch images and the natural language instruction into the student branch, and outputting the simulated multi-modal high-dimensional characterization; and defining the natural language instruction as a simulated query vector, defining all patch image regions as simulated key-value vectors, extracting a simulated attention probability distribution in each key layer from the simulated query vector and the simulated key-value vectors, and constructing the simulated attention probability matrix from all the simulated attention probability distributions.
  5. The robot control method based on semantic embedding and attention double consistency according to claim 1, wherein constructing the semantic consistency loss function from the real multi-modal high-dimensional characterization and the simulated multi-modal high-dimensional characterization, and constructing the attention consistency loss function from the real attention probability matrix and the simulated attention probability matrix, specifically comprises: calculating the consistency between the real multi-modal high-dimensional characterization and the simulated multi-modal high-dimensional characterization to construct the semantic consistency loss function, wherein the semantic consistency loss function is the L2 norm between the real multi-modal high-dimensional characterization and the simulated multi-modal high-dimensional characterization; and calculating the distribution difference between the real attention probability matrix and the simulated attention probability matrix to construct the attention consistency loss function, wherein the attention consistency loss function is defined over the set of key layers using the KL divergence, which describes the difference between two probability distributions, between the real attention probability distribution and the simulated attention probability distribution of each key layer, and a norm taken between these two distributions.
  6. The robot control method based on semantic embedding and attention double consistency according to claim 1, wherein training the student branch in the vision-language-action model according to the semantic consistency loss function and the attention consistency loss function, inputting test perturbed images and test language instructions into the trained student branch, and outputting control instructions to control the target robot specifically comprises: constructing a balance coefficient for the semantic consistency loss function and a balance coefficient for the attention consistency loss function, and fusing the two loss functions by a weighted sum according to these balance coefficients to obtain a total loss function; setting the visual encoder in the student branch to a trainable state and freezing all language model parameters and action decoder parameters in the student branch; back-propagating the total loss function through the student branch to update the visual encoder of the student branch; acquiring a test language instruction and a test perturbed image input by a user, inputting the test language instruction and the test perturbed image into the trained student branch, and outputting a robot control instruction; and controlling the target robot according to the robot control instruction, and collecting the current action sequence of the target robot.
  7. A robot control system based on semantic embedding and attention double consistency, for implementing the robot control method based on semantic embedding and attention double consistency according to any one of claims 1 to 6, the system comprising: a data acquisition module for acquiring a natural language instruction input by a user and acquiring a plurality of observation images collected by the target robot based on the natural language instruction; a teacher model output module for constructing a vision-language-action model, inputting all the observation images and the natural language instruction into the vision-language-action model, and outputting a real multi-modal high-dimensional characterization, a real attention probability matrix, a simulated multi-modal high-dimensional characterization and a simulated attention probability matrix; a data preprocessing module for constructing a semantic consistency loss function from the real multi-modal high-dimensional characterization and the simulated multi-modal high-dimensional characterization, and constructing an attention consistency loss function from the real attention probability matrix and the simulated attention probability matrix; and a model training module for training the student branch in the vision-language-action model according to the semantic consistency loss function and the attention consistency loss function, inputting test perturbed images and test language instructions into the trained student branch, and outputting control instructions to control the target robot.
  8. A terminal comprising a memory, a processor, and a robot control program based on semantic embedding and attention double consistency that is stored on the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the robot control method based on semantic embedding and attention double consistency according to any one of claims 1 to 6.
  9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a robot control program based on semantic embedding and attention double consistency, wherein the program, when executed by a processor, implements the steps of the robot control method based on semantic embedding and attention double consistency according to any one of claims 1 to 6.
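
The patching step of claim 4 and the loss construction of claims 5 and 6 can be illustrated with a minimal NumPy sketch. It is not the patented implementation, and every name in it is hypothetical. It assumes that the patch image is composed as (1 - M) * o_t + M * T(P), that a random 90-degree rotation plus a random translation stands in for the random geometric transformation, that the semantic loss is an unsquared L2 norm, that the KL divergence is taken from the real distribution to the simulated one and summed over key layers, and that both balance coefficients default to 1.0. The additional norm term mentioned in claim 5 is left out because its form is not recoverable from the text.

```python
import numpy as np


def random_patch_image(image, patch, rng):
    """Paste a randomly rotated and translated patch onto a copy of `image`.

    Assumed compositing rule: patched = (1 - M) * image + M * T(P),
    where M is a binary mask marking the patch location.
    """
    # Stand-in for the random geometric transformation: rotate by k * 90 degrees.
    rotated = np.rot90(patch, k=int(rng.integers(0, 4)), axes=(0, 1))
    h, w = rotated.shape[:2]
    # Random top-left corner such that the patch stays inside the image.
    r = int(rng.integers(0, image.shape[0] - h + 1))
    c = int(rng.integers(0, image.shape[1] - w + 1))
    mask = np.zeros(image.shape, dtype=float)
    mask[r:r + h, c:c + w] = 1.0
    canvas = np.zeros(image.shape, dtype=float)
    canvas[r:r + h, c:c + w] = rotated
    return (1.0 - mask) * image + mask * canvas, mask


def semantic_consistency_loss(z_real, z_sim):
    """L2 norm between the teacher (real) and student (simulated) characterizations."""
    return float(np.linalg.norm((z_real - z_sim).ravel()))


def attention_consistency_loss(attn_real, attn_sim, eps=1e-8):
    """Sum over key layers of KL(real || simulated) for attention distributions.

    `attn_real` and `attn_sim` are lists with one probability array per key layer.
    """
    total = 0.0
    for p, q in zip(attn_real, attn_sim):
        p = np.clip(p, eps, 1.0)  # avoid log(0)
        q = np.clip(q, eps, 1.0)
        total += float(np.sum(p * (np.log(p) - np.log(q))))
    return total


def total_loss(z_real, z_sim, attn_real, attn_sim, lam_sem=1.0, lam_att=1.0):
    """Weighted fusion of the two consistency losses (balance coefficients assumed)."""
    return (lam_sem * semantic_consistency_loss(z_real, z_sim)
            + lam_att * attention_consistency_loss(attn_real, attn_sim))
```

In a training loop along the lines of claim 6, the teacher outputs would be computed on the clean images with frozen parameters, the student outputs on the patched images, and only the student's visual encoder would receive gradients from the total loss. Gradient computation is outside the scope of this NumPy sketch.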

Description

Robot control method, system, terminal and storage medium based on semantic embedding and attention double consistency

Technical Field

The invention relates to the technical field of intelligent control, in particular to a robot control method, a system, a terminal and a computer-readable storage medium based on semantic embedding and attention double consistency.

Background

The embodied-intelligence VLA model (Vision-Language-Action Model) jointly models visual input and natural language instructions and directly outputs a robot action sequence. Such end-to-end models are increasingly used for grasping, carrying, manipulation and similar tasks in complex scenarios such as industrial production and warehouse logistics. However, because a VLA model is a black box whose output can be executed directly, small deviations in perception and alignment can be amplified along the "vision-language-action" chain into deviations of the action trajectory and task failure. Existing classification and recognition tasks introduce teacher-student distillation and maintain attention consistency, but mainly for cross-modal (text-to-visual) information complementation, which improves group behavior recognition accuracy in a clean environment; they cannot solve the problem of VLA models facing adversarial attack scenarios. Accordingly, the prior art still needs improvement and development.

Disclosure of Invention

The main purpose of the invention is to provide a robot control method, a system, a terminal and a computer-readable storage medium based on semantic embedding and attention double consistency, aiming to solve the problem that, in the prior art, fine-tuning of teacher-student distillation models mostly focuses on efficiency and decoding structure, cannot provide robust defense for VLA models under attack, and is therefore prone to action execution errors.

In order to achieve the above object, the present invention provides a robot control method based on semantic embedding and attention double consistency, comprising the steps of: acquiring a natural language instruction input by a user, and acquiring a plurality of observation images collected by a target robot based on the natural language instruction; constructing a vision-language-action model, inputting all the observation images and the natural language instruction into the vision-language-action model, and outputting a real multi-modal semantic characterization, a real attention distribution, a simulated multi-modal semantic characterization and a simulated attention distribution; constructing a semantic consistency loss function from the real multi-modal semantic characterization and the simulated multi-modal semantic characterization, and constructing an attention consistency loss function from the real attention distribution and the simulated attention distribution; and training the student branch in the vision-language-action model according to the semantic consistency loss function and the attention consistency loss function, inputting test perturbed images and test language instructions into the trained student branch, and outputting control instructions to control the target robot.
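
As a reading aid, the two consistency losses and their weighted combination described in this method can be written as follows. The notation is assumed for illustration and is not taken from the source: z denotes a multi-modal semantic characterization, A_l an attention distribution at key layer l, K the set of key layers, and the lambda terms the balance coefficients. The unsquared L2 norm and the direction of the KL divergence are likewise assumptions.

```latex
\mathcal{L}_{\mathrm{sem}} = \left\lVert z^{\mathrm{real}} - z^{\mathrm{sim}} \right\rVert_2
\qquad
\mathcal{L}_{\mathrm{att}} = \sum_{l \in \mathcal{K}} D_{\mathrm{KL}}\!\left( A_l^{\mathrm{real}} \,\middle\|\, A_l^{\mathrm{sim}} \right)
\qquad
\mathcal{L}_{\mathrm{total}} = \lambda_{\mathrm{sem}}\, \mathcal{L}_{\mathrm{sem}} + \lambda_{\mathrm{att}}\, \mathcal{L}_{\mathrm{att}}
```

Under this reading, the semantic term pulls the student's representation of a patched image toward the teacher's representation of the clean image, while the attention term penalizes attention mass that moves from the task-relevant region to the patch.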

Optionally, in the robot control method based on semantic embedding and attention double consistency, acquiring the natural language instruction input by the user and acquiring the plurality of observation images collected by the target robot based on the natural language instruction specifically includes: acquiring the natural language instruction input by the user, and inputting the natural language instruction into the target robot to control the target robot to move; collecting, by the target robot using an observation camera, observation images at a plurality of time points during the movement; and modeling the natural language instruction and each observation image to obtain an action sequence of the target robot, wherein the action sequence from time point 1 to T is the output of the vision-language-action model, with its model parameters, given the observation images at time points 1 to T and the natural language instruction.

Optionally, in the robot control method based on semantic embedding and attention double consistency, constructing the vision-language-action model, inputting all the observation images and the natural language instruction into the vision-language-action model, and outputting the real multi-modal semantic characterization, the real attention distribution, the simulated multi-modal semantic characterization and the simulated attention distribution specifically includes: constructing a vision-language-action model formed by a teacher branch and a student branch in parallel, inputting all the observation images and the natural language instruction into the teacher branch,