CN-121696998-B - Robot control method, system, terminal and storage medium based on visual language action model
Abstract
The invention discloses a robot control method, system, terminal and storage medium based on a visual language action model. The method comprises the steps of: acquiring a visual image and a task instruction of a robot, and performing risk assessment with a pre-trained visual language model according to the visual image and the task instruction to obtain an environmental risk identification result; acquiring a depth image and a camera intrinsic matrix and, if the environmental risk identification result is a high-risk scene, performing three-dimensional reconstruction and geometric fusion on the candidate obstacle set according to the depth image and the camera intrinsic matrix to obtain a target safety geometric boundary; and generating a nominal action according to the visual image and the task instruction, correcting the nominal action according to the target safety geometric boundary to obtain a control instruction if the environmental risk identification result is a high-risk scene, and sending the control instruction to the robot controller for execution. By assessing risk from vision and the instruction and correcting the robot's action accordingly, the invention achieves safe and efficient execution and control of robot actions.
Inventors
- Yin Dongfu
- Zhang Jinquan
- Tian Zhen
- Yang Run
- Yan Yufeng
Assignees
- Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen) (人工智能与数字经济广东省实验室(深圳))
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-02-24
Claims (8)
- 1. A robot control method based on a visual language action model, characterized in that the method comprises: acquiring a visual image and a task instruction of a robot, and performing risk assessment with a pre-trained visual language model according to the visual image and the task instruction to obtain an environmental risk identification result; acquiring a depth image and a camera intrinsic matrix and, if the environmental risk identification result is a high-risk scene, performing three-dimensional reconstruction and geometric fusion on a candidate obstacle set according to the depth image and the camera intrinsic matrix to obtain a target safety geometric boundary; generating a nominal action according to the visual image and the task instruction, correcting the nominal action according to the target safety geometric boundary to obtain a control instruction if the environmental risk identification result is a high-risk scene, determining the control instruction according to the nominal action if the environmental risk identification result is a low-risk scene, and transmitting the control instruction to a robot controller for execution; wherein performing three-dimensional reconstruction and geometric fusion on the candidate obstacle set according to the depth image and the camera intrinsic matrix to obtain the target safety geometric boundary specifically comprises: if the environmental risk identification result is a high-risk scene, performing position extraction on the candidate obstacle set to obtain a plurality of spatial localization points; acquiring the depth image and the camera intrinsic matrix for the candidate obstacle set, and converting all the spatial localization points according to the depth image and the camera intrinsic matrix to obtain a three-dimensional point cloud set; performing outlier removal and spatial pass-through filtering on the three-dimensional point cloud set to obtain an observation point cloud; and performing three-dimensional reconstruction and geometric fusion on the candidate obstacle set according to the observation point cloud to obtain the target safety geometric boundary; wherein performing three-dimensional reconstruction and geometric fusion on the candidate obstacle set according to the observation point cloud to obtain the target safety geometric boundary specifically comprises: obtaining standard geometric shape parameters from a pre-constructed semantic geometric common-sense library; performing geometric constraint on the candidate obstacle set according to the observation point cloud and the standard geometric shape parameters to obtain an initial safety geometric boundary:
  $$A_{\mathrm{obs}},\, c_{\mathrm{obs}} \;=\; \arg\min_{A,\,c}\; -\log\det(A)$$
  $$\text{s.t.}\;\; (p - c)^{T} A\, (p - c) \le 1, \quad \forall\, p \in \mathcal{P}_{\mathrm{obs}},$$
  wherein $A_{\mathrm{obs}}$ represents the observed shape matrix, $c_{\mathrm{obs}}$ represents the center of the observed shape, $\mathcal{P}_{\mathrm{obs}}$ represents the observation point cloud set, $p$ represents a point in the observation point cloud, $\det(\cdot)$ represents the determinant of a matrix, and $(\cdot)^{T}$ represents the transpose; and obtaining a confidence factor, and performing weighted fusion on the initial safety geometric boundary according to the confidence factor and the observation point cloud to obtain the target safety geometric boundary (numerical sketches of the point-cloud construction and ellipsoid fitting appear after the claims).
- 2. The robot control method based on a visual language action model according to claim 1, wherein after acquiring the visual image and the task instruction of the robot, performing risk assessment with the pre-trained visual language model according to the visual image and the task instruction, and obtaining the environmental risk identification result, the method further comprises: inputting the visual image and the task instruction into the pre-trained visual language model for prediction to obtain a probability distribution; and screening prompt words of the environmental risk identification result according to the probability distribution to obtain a candidate obstacle set.
- 3. The robot control method based on a visual language action model according to claim 1, wherein the target safety geometric boundary comprises a target safety center and a safety ellipsoid shape matrix; and obtaining the confidence factor and performing weighted fusion on the initial safety geometric boundary according to the confidence factor and the observation point cloud to obtain the target safety geometric boundary specifically comprises: obtaining the confidence factor, the number of points in the observation point cloud, the observation center of the standard object, and the prior center; registering the initial safety geometric boundary according to the observation center and the prior center to obtain the target safety center; and performing weighted fusion on the initial safety geometric boundary according to the standard geometric shape parameters to obtain the safety ellipsoid shape matrix:
  $$c_{\mathrm{safe}} = \alpha\, c_{\mathrm{obs}} + (1-\alpha)\, c_{\mathrm{prior}}$$
  $$A_{\mathrm{safe}} = \alpha\, A_{\mathrm{obs}} + (1-\alpha)\, A_{\mathrm{prior}},$$
  wherein $\alpha$ represents the confidence factor, $c_{\mathrm{prior}}$ represents the prior center, $c_{\mathrm{safe}}$ represents the target safety center, $A_{\mathrm{safe}}$ represents the safety ellipsoid shape matrix, and $A_{\mathrm{prior}}$ represents the prior shape matrix (see the fusion sketch after the claims).
- 4. The robot control method based on a visual language action model according to claim 3, wherein generating the nominal action according to the visual image and the task instruction, correcting the nominal action according to the target safety geometric boundary to obtain the control instruction if the environmental risk identification result is a high-risk scene, determining the control instruction according to the nominal action if the environmental risk identification result is a low-risk scene, and transmitting the control instruction to the robot controller for execution specifically comprises: inputting the visual image and the task instruction into the visual language action model for solving to generate the nominal action; and, if the environmental risk identification result is a high-risk scene, constructing a control barrier constraint function according to the target safety geometric boundary and correcting the nominal action according to the control barrier constraint function to obtain the control instruction, or, if the environmental risk identification result is a low-risk scene, determining the control instruction according to the nominal action; and transmitting the control instruction to the robot controller for execution.
- 5. The robot control method based on a visual language action model according to claim 4, wherein, if the environmental risk identification result is a high-risk scene, constructing the control barrier constraint function according to the target safety geometric boundary and correcting the nominal action according to the control barrier constraint function to obtain the control instruction specifically comprises: if the environmental risk identification result is a high-risk scene, obtaining current environment data, and constraining the safety ellipsoid shape matrix according to the current environment data to obtain a plurality of obstacle constraints; constructing the control barrier constraint function according to the target safety geometric boundary and the plurality of obstacle constraints:
  $$h_{i}(x) = d_{\min}\!\left(x,\; \mathcal{E}_{i}\right) - \delta,$$
  wherein $x$ represents the current state of the robot, $\delta$ represents the safety margin, $d_{\min}(\cdot)$ represents the minimum distance function, $\mathcal{E}_{i}$ represents the $i$-th obstacle ellipsoid, and $h_{i}(x)$ represents the control barrier constraint function; determining an optimal safe control quantity according to the control barrier constraint function; and correcting the nominal action according to the optimal safe control quantity to obtain the control instruction:
  $$u^{*} = \arg\min_{u}\; \lVert u - a_{\mathrm{nom}} \rVert^{2} \quad \text{subject to the control barrier constraints},$$
  wherein $u^{*}$ represents the optimal safe control quantity, $u$ represents the optimization variable, and $a_{\mathrm{nom}}$ represents the nominal action (see the barrier-correction sketch after the claims).
- 6. A robot control system based on a visual language action model, wherein the robot control system is applied to the robot control method based on a visual language action model according to any one of claims 1 to 5, the system comprising: an obstacle recognition module, configured to acquire a visual image and a task instruction of the robot, and perform risk assessment with a pre-trained visual language model according to the visual image and the task instruction to obtain an environmental risk identification result; a safety boundary constraint module, configured to acquire a depth image and a camera intrinsic matrix and, if the environmental risk identification result is a high-risk scene, perform three-dimensional reconstruction and geometric fusion on a candidate obstacle set according to the depth image and the camera intrinsic matrix to obtain a target safety geometric boundary; and a robot execution module, configured to generate a nominal action according to the visual image and the task instruction, correct the nominal action according to the target safety geometric boundary to obtain a control instruction if the environmental risk identification result is a high-risk scene, determine the control instruction according to the nominal action if the environmental risk identification result is a low-risk scene, and transmit the control instruction to a robot controller for execution.
- 7. A terminal comprising a memory, a processor and a visual language action model based robot control program stored on the memory and executable on the processor, the visual language action model based robot control program when executed by the processor implementing the steps of the visual language action model based robot control method according to any one of claims 1-5.
- 8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a robot control program based on a visual language action model, which when executed by a processor, implements the steps of the robot control method based on a visual language action model according to any one of claims 1-5.
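The claims above first convert the 2D localization points of the candidate obstacles into a filtered 3D observation point cloud using the depth image and the camera intrinsic matrix. The following Python sketch shows one plausible realisation of that step via standard pinhole back-projection; the function names, depth range, and the simple statistical outlier test are illustrative assumptions, not taken from the patent.

```python
# Hedged sketch of claim 1's point-cloud construction: back-project 2D localization
# points through a pinhole camera model, then apply pass-through filtering and a
# simple statistical outlier rejection. Thresholds are illustrative assumptions.
import numpy as np

def backproject(points_uv, depth, K):
    """Convert integer pixel coordinates (u, v) plus a depth map into camera-frame 3D points.

    points_uv : (N, 2) integer pixel locations of the candidate-obstacle points
    depth     : (H, W) depth image in metres
    K         : (3, 3) camera intrinsic matrix [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    u, v = points_uv[:, 0], points_uv[:, 1]
    z = depth[v, u]                   # depth at each sampled pixel
    x = (u - cx) * z / fx             # pinhole back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def filter_cloud(cloud, z_range=(0.1, 2.0), k=2.0):
    """Pass-through filter on depth plus a simple statistical outlier rejection."""
    mask = (cloud[:, 2] > z_range[0]) & (cloud[:, 2] < z_range[1])
    cloud = cloud[mask]
    centroid = cloud.mean(axis=0)
    dists = np.linalg.norm(cloud - centroid, axis=1)
    return cloud[dists < dists.mean() + k * dists.std()]
```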
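Claim 1 then constrains the observation point cloud with an ellipsoid defined by the observed shape matrix $A_{\mathrm{obs}}$ and center $c_{\mathrm{obs}}$. A textbook Khachiyan-style iteration for the minimum-volume enclosing ellipsoid, as sketched below, is one standard way to obtain such a boundary; the patent does not specify the solver, so this routine should be read as an assumption.

```python
# Generic minimum-volume enclosing ellipsoid (Khachiyan iteration): find A_obs, c_obs
# such that (p - c_obs)^T A_obs (p - c_obs) <= 1 for every observed point while
# minimising the ellipsoid volume. Textbook routine, not code from the patent.
import numpy as np

def min_volume_ellipsoid(P, tol=1e-4, max_iter=1000):
    """P: (N, 3) observation point cloud. Returns (A_obs, c_obs)."""
    N, d = P.shape
    Q = np.vstack([P.T, np.ones(N)])      # (d+1, N) lifted points
    u = np.ones(N) / N                    # uniform initial weights
    for _ in range(max_iter):
        X = Q @ np.diag(u) @ Q.T
        M = np.einsum('ij,ji->i', Q.T @ np.linalg.inv(X), Q)  # per-point scores
        j = np.argmax(M)
        step = (M[j] - d - 1.0) / ((d + 1) * (M[j] - 1.0))
        new_u = (1 - step) * u
        new_u[j] += step
        converged = np.linalg.norm(new_u - u) < tol
        u = new_u
        if converged:
            break
    c_obs = P.T @ u                                        # ellipsoid centre
    cov = (P.T @ np.diag(u) @ P) - np.outer(c_obs, c_obs)
    A_obs = np.linalg.inv(cov) / d                         # (p-c)^T A (p-c) <= 1
    return A_obs, c_obs
```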
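Claim 3 fuses this observed boundary with the prior boundary from the semantic geometric common-sense library using a confidence factor. A minimal sketch follows, assuming the confidence factor grows with the number of observed points; the form $\alpha = n/(n+n_0)$ and the constant $n_0$ are assumptions for illustration only.

```python
# Hedged sketch of claim 3's confidence-weighted fusion of the observed and prior
# boundaries. The alpha = n / (n + n0) form is an assumed confidence model.
import numpy as np

def fuse_boundary(A_obs, c_obs, A_prior, c_prior, n_points, n0=50):
    alpha = n_points / (n_points + n0)                 # confidence factor in [0, 1)
    c_safe = alpha * c_obs + (1.0 - alpha) * c_prior   # target safety centre
    A_safe = alpha * A_obs + (1.0 - alpha) * A_prior   # safety ellipsoid shape matrix
    return A_safe, c_safe
```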
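Claim 5 corrects the nominal VLA action by solving a small constrained optimisation: keep the executed control as close as possible to the nominal action while satisfying the barrier constraints derived from the safety ellipsoids. The sketch below uses SciPy's SLSQP solver and a single-integrator state update as stand-ins; the ellipsoid distance surrogate, time step, and safety margin are assumptions rather than the patent's exact formulation.

```python
# Hedged sketch of claim 5's barrier-constrained correction of the nominal action.
import numpy as np
from scipy.optimize import minimize

def ellipsoid_distance(x, A, c):
    """Signed surrogate: > 0 when x lies outside the unit-level set of the ellipsoid."""
    d = x - c
    return float(d @ A @ d) - 1.0

def safe_action(x, a_nom, obstacles, delta=0.1, dt=0.1):
    """obstacles: list of (A_safe, c_safe) pairs. Returns the corrected control u*."""
    def cost(u):
        return float(np.sum((u - a_nom) ** 2))   # stay as close as possible to a_nom
    cons = [{
        'type': 'ineq',
        # barrier constraint on the predicted next state: h(x + u*dt) - delta >= 0
        'fun': (lambda u, A=A, c=c: ellipsoid_distance(x + u * dt, A, c) - delta),
    } for A, c in obstacles]
    res = minimize(cost, a_nom, constraints=cons, method='SLSQP')
    return res.x
```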
Description
Robot control method, system, terminal and storage medium based on visual language action model

Technical Field

The invention relates to the technical field of intelligent robot control, and in particular to a robot control method, system, terminal and computer-readable storage medium based on a visual language action model.

Background

With the rapid development of embodied intelligence technology, the Visual-Language-Action (VLA) model has become a dominant paradigm in the field of robot control. Through end-to-end training, a VLA model can directly map unstructured visual images and natural language instructions to the low-level control actions of a robot, demonstrating strong semantic understanding and generalization capabilities. However, current VLA models are typically "black box" policies whose output is probabilistic rather than grounded in explicit physical constraints. To address the resulting safety problem, the existing mainstream technical route integrates safety constraints into the training process with reinforcement learning (RL), yet existing VLA models mainly focus on the task completion rate while neglecting strict safety guarantees. Reinforcement learning methods typically treat safety as a "soft objective" (i.e., a reward penalty) rather than a "hard constraint", and therefore cannot provide a strict, deterministic safety boundary during the inference phase. When facing scenes outside the training data distribution, a VLA model is prone to producing unstable, jittery, or colliding trajectories. Retraining-based methods are computationally expensive and difficult to apply directly to large-scale VLA models that have already been trained; once the safety policy needs to be adjusted, the model often has to be retrained, lacking plug-and-play flexibility. A purely end-to-end model has semantic understanding capability but lacks geometric prior knowledge of the physical world, so it cannot correctly estimate the physical volume of an obstacle when the view is occluded or the data are sparse. In summary, conventional approaches that combine a VLA model with reinforcement-learning-based safety constraints suffer from insufficient safety (soft rather than hard constraints), high computational cost, inflexible policy adjustment, and a lack of physical geometric priors that prevents correct estimation of obstacle volume under visual occlusion or sparse data, making strict safety control and efficient execution of robot actions difficult to achieve; this is an urgent problem to be solved. Accordingly, the prior art is still in need of improvement and development.

Disclosure of Invention

The invention mainly aims to provide a robot control method, system, terminal and computer-readable storage medium based on a visual language action model, so as to solve the problem in the prior art that strict safety control and efficient execution of robot actions are difficult to achieve when the volume of an obstacle cannot be correctly estimated under visual occlusion or sparse data.
In order to achieve the above object, the invention provides a robot control method based on a visual language action model, comprising the steps of: acquiring a visual image and a task instruction of a robot, and performing risk assessment with a pre-trained visual language model according to the visual image and the task instruction to obtain an environmental risk identification result; acquiring a depth image and a camera intrinsic matrix and, if the environmental risk identification result is a high-risk scene, performing three-dimensional reconstruction and geometric fusion on a candidate obstacle set according to the depth image and the camera intrinsic matrix to obtain a target safety geometric boundary; and generating a nominal action according to the visual image and the task instruction, correcting the nominal action according to the target safety geometric boundary to obtain a control instruction if the environmental risk identification result is a high-risk scene, determining the control instruction according to the nominal action if the environmental risk identification result is a low-risk scene, and transmitting the control instruction to a robot controller for execution.

Optionally, in the robot control method based on a visual language action model, after acquiring the visual image and the task instruction of the robot, performing risk assessment with the pre-trained visual language model according to the visual image and the task instruction, and obtaining the environmental risk identification result, the method further comprises: inputting the visual image and the task instruction into the pre-trained visual language model for prediction to obtain a probability distribution
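Putting the steps of this summary together, a high-level control loop might look like the sketch below. The `vlm`, `vla`, and `controller` objects are hypothetical interfaces (not defined in the patent), `prior` is assumed to be the (prior shape matrix, prior center) pair from the semantic geometric common-sense library, and the helper functions are the ones sketched after the claims above.

```python
# High-level sketch of one control step, assuming hypothetical vlm/vla/controller
# interfaces and the backproject / filter_cloud / min_volume_ellipsoid /
# fuse_boundary / safe_action helpers sketched earlier.
def control_step(vlm, vla, controller, rgb, depth, K, instruction, prior):
    risk, candidates = vlm.assess_risk(rgb, instruction)        # risk assessment via the VLM
    a_nom = vla.predict(rgb, instruction)                        # nominal action from the VLA model
    if risk == "high":
        cloud = filter_cloud(backproject(candidates, depth, K))  # 3D reconstruction of obstacles
        A_obs, c_obs = min_volume_ellipsoid(cloud)               # initial safety geometric boundary
        A_safe, c_safe = fuse_boundary(A_obs, c_obs, *prior, len(cloud))
        u = safe_action(controller.state(), a_nom, [(A_safe, c_safe)])
    else:
        u = a_nom                                                # low risk: execute the nominal action
    controller.execute(u)
```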