CN-122008191-A - Robot grasping learning closed-loop optimization method and system based on vision-language feedback
Abstract
The invention belongs to the field of robotic-arm control and discloses a robot grasping learning closed-loop optimization method and system based on vision-language feedback. The method comprises: performing three-dimensional reconstruction of the target operation scene and objects; in a simulation environment, generating candidate grasp trajectories comprising diverse grasping actions through active exploration; performing high-fidelity rendering of the candidate grasp trajectories to generate corresponding visual demonstration data; training a grasping policy using the visual demonstration data; deploying the trained grasping policy in a real environment to execute grasping tasks; analyzing the execution results of the real grasping tasks using a vision-language model to generate an evaluation feedback signal; optimizing the active-exploration process in the simulation environment according to the evaluation feedback signal; and generating new visual demonstration data based on the optimized result for the next round of policy training, thereby forming a closed-loop optimization flow. The invention can automatically generate diverse grasp demonstrations and perform closed-loop optimization based on real-deployment feedback.
Inventors
- JIANG QI
- ZHAO GAOLIN
- FAN JINQIU
- ZHANG WEI
Assignees
- Shandong University (山东大学)
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-01-09
Claims (10)
- 1. A robot grasping learning closed-loop optimization method based on vision-language feedback, characterized by comprising the following steps: performing high-precision three-dimensional reconstruction of a target operation scene and objects; in a simulation environment, generating candidate grasp trajectories containing diverse grasping actions through active exploration; performing high-fidelity rendering of the candidate grasp trajectories executed in the simulation environment, based on 3D Gaussian Splatting, to generate corresponding visual demonstration data; training a grasping policy using the visual demonstration data; deploying the trained grasping policy in a real environment to execute grasping tasks; and analyzing the execution results of the real grasping tasks using a vision-language model to generate an evaluation feedback signal, optimizing the active-exploration process in the simulation environment according to the evaluation feedback signal, and generating new visual demonstration data based on the optimized result for the next round of policy training, thereby forming a closed-loop optimization flow.
- 2. The robot grasping learning closed-loop optimization method based on vision-language feedback according to claim 1, wherein generating candidate grasp trajectories containing diverse grasping actions through active exploration specifically comprises: running a reinforcement-learning policy in the simulation environment, the action space of which is designed to cover multiple end-effector poses, so as to explore and generate structurally distinct feasible grasping modes.
- 3. The robot grasping learning closed-loop optimization method based on vision-language feedback according to claim 1, wherein analyzing the execution results of the real grasping task with the vision-language model to generate an evaluation feedback signal specifically comprises: inputting visual observations of the real grasping process into the vision-language model together with a preset text prompt, and obtaining from the model a score or reward signal characterizing grasp quality.
- 4. The robot grasping learning closed-loop optimization method based on vision-language feedback according to claim 3, wherein optimizing the active-exploration process in the simulation environment according to the evaluation feedback signal specifically comprises: using the evaluation feedback signal as part of the reinforcement-learning reward function to update the policy that generates candidate grasp trajectories, so that it tends to produce grasping actions more consistent with the evaluation feedback signal.
- 5. The method according to claim 1, wherein the closed-loop optimization flow is performed over a plurality of iterations, and the visual demonstration data generated in each iteration are accumulated to construct a demonstration database that is continuously expanded and refined in both the grasping-action space and the visual-observation space.
- 6. The robot grasping learning closed-loop optimization method based on vision-language feedback according to claim 3, wherein the score or reward signal comprehensively evaluates grasp quality by combining grasp success rate, object stability, user preference, or task-specific constraints.
- 7. A robot grasping learning closed-loop optimization system based on vision-language feedback, characterized by comprising: a three-dimensional reconstruction module configured to perform high-precision three-dimensional reconstruction of the target operation scene and objects; a simulation exploration module configured to generate candidate grasp trajectories containing diverse grasping actions through active exploration in a simulation environment; a high-fidelity rendering module configured to perform high-fidelity rendering of the scenes in which candidate grasp trajectories are executed in the simulation environment, based on 3D Gaussian Splatting, to generate corresponding visual demonstration data; a policy training module configured to train the grasping policy using the visual demonstration data; a real deployment module configured to deploy the trained grasping policy in a real environment to execute grasping tasks; and a closed-loop control module configured to analyze the execution results of the real grasping tasks using a vision-language model, generate an evaluation feedback signal, optimize the active-exploration process in the simulation environment according to the evaluation feedback signal, and generate new visual demonstration data based on the optimized result for the next round of policy training, thereby forming a closed-loop optimization flow.
- 8. An electronic device comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, the instructions, when executed by the processor, performing the method of any one of claims 1-6.
- 9. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1-6.
- 10. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-6.
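The closed-loop flow recited in claims 1 and 7 can be illustrated with a minimal, self-contained sketch. Every component below (`explore_grasps`, `render_demos`, `train_policy`, `vlm_evaluate`) is a hypothetical stand-in, not the patent's actual implementation: a real system would use a physics simulator, a 3D Gaussian Splatting renderer, a learned grasping policy, and an actual vision-language model.

```python
import random

def explore_grasps(bias, n=8, seed=0):
    """Active-exploration stand-in: sample n end-effector approach angles
    (radians) around the current exploration bias."""
    rng = random.Random(seed)
    return [bias + rng.uniform(-1.0, 1.0) for _ in range(n)]

def render_demos(trajectories):
    """High-fidelity-rendering stand-in: package each trajectory with a
    placeholder frame sequence."""
    return [{"traj": t, "frames": f"render({t:.2f})"} for t in trajectories]

def train_policy(demos):
    """Policy-training stand-in: the 'policy' is the mean demo grasp angle."""
    return sum(d["traj"] for d in demos) / len(demos)

def vlm_evaluate(executed_grasp, preferred=0.6):
    """Vision-language-evaluator stand-in: score in [0, 1], higher when the
    executed grasp is closer to a preferred angle unknown to the learner."""
    return max(0.0, 1.0 - abs(executed_grasp - preferred))

def closed_loop(iterations=5, step=0.3):
    """Claims 1/7: explore -> render -> train -> deploy -> evaluate -> re-bias."""
    best_bias, best_score = 0.0, -1.0
    demo_db, scores = [], []
    for k in range(iterations):
        bias = best_bias + step * ((-1) ** k)      # crude exploration move
        demos = render_demos(explore_grasps(bias))
        demo_db.extend(demos)                      # claim 5: accumulate demos
        score = vlm_evaluate(train_policy(demos))  # claim 3: VLM feedback
        if score > best_score:                     # claim 4: steer exploration
            best_bias, best_score = bias, score
        scores.append(score)
    return scores, demo_db

scores, demo_db = closed_loop()
```

The key structural point the sketch captures is that the evaluator's score flows back into where exploration samples are drawn, so each round's demonstration data is conditioned on the previous round's deployment feedback rather than being generated open-loop.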
Description
Robot grasping learning closed-loop optimization method and system based on vision-language feedback
Technical Field
The invention relates to the technical field of robotic-arm control, and in particular to a robot grasping learning closed-loop optimization method and system based on vision-language feedback.
Background
In the field of robot manipulation, the acquisition of high-quality demonstration data is key to improving the performance of imitation-learning or reinforcement-learning algorithms. Traditional methods rely on manual teaching or repeated trials on a real robot, which is costly and makes it difficult to cover complex and diverse task scenarios. For this reason, generating demonstration data in a simulation environment has become a mainstream research direction: through simulation, labeled action trajectories can be generated at large scale and low cost. To bridge the visual gap between simulation and the real environment (the sim-to-real problem), researchers have introduced high-fidelity neural-rendering techniques such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). These techniques reconstruct three-dimensional scenes with high realism and illumination consistency from multi-view images and render photorealistic image sequences, thereby producing training data whose visual distribution is closer to the real world. However, existing data-generation methods based on high-fidelity rendering have significant shortcomings: 1. Lack of diversity in grasping actions: data augmentation in existing methods focuses mainly on the visual level, such as applying random perturbations to object pose, illumination, background, or camera viewpoint, while ignoring the structural diversity of the grasping actions themselves.
In the generated demonstration data, the relative grasp pose between the gripper and the target object is typically fixed or varies only slightly. This yields a narrow trained grasping-policy space, with inadequate generalization and robustness in the face of small changes in object pose or shape, or different operational constraints, in the real environment. 2. Lack of a feedback-driven closed-loop optimization mechanism: most existing data-generation pipelines follow a unidirectional open-loop structure of 'demonstration acquisition, high-fidelity rendering, policy training'. The execution outcomes of the deployed policy in the real environment (such as failure cases, success patterns, and human preferences) cannot be effectively fed back to the data-generation stage. Consequently, the system cannot purposefully supplement data for weak links or optimize the distribution of grasping strategies according to real-world feedback, and continuous adaptive learning and performance improvement are difficult to achieve.
Disclosure of Invention
To solve these problems, the invention provides a robot grasping learning closed-loop optimization method and system based on vision-language feedback, which can automatically generate diverse grasp demonstrations and perform closed-loop optimization based on real-deployment feedback.
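The evaluation feedback signal described above, a vision-language-model score combining criteria such as success, stability, and preference, can be sketched as follows. The prompt text, criterion names, weights, and the `query_vlm` placeholder are all illustrative assumptions, not the patent's actual interface; a real system would send the execution frames and the text prompt to an actual vision-language model.

```python
# Hypothetical sketch: turning per-criterion VLM judgments into a single
# scalar feedback signal usable as a reinforcement-learning reward term.

PROMPT = (
    "You are shown frames of a robot grasp attempt. "
    "Rate each criterion from 0 to 1: success, stability, preference."
)

def query_vlm(frames, prompt):
    # Placeholder: pretend the model judged a mostly-successful grasp.
    # A real call would pass `frames` and `prompt` to a VLM endpoint.
    return {"success": 1.0, "stability": 0.7, "preference": 0.5}

def feedback_signal(frames, weights=None):
    """Weighted composite of the VLM's per-criterion scores."""
    w = weights or {"success": 0.6, "stability": 0.3, "preference": 0.1}
    judged = query_vlm(frames, PROMPT)
    return sum(w[k] * judged[k] for k in w)

reward = feedback_signal(frames=["frame_0.png", "frame_1.png"])
```

Exposing the weights as a parameter is one way to let task-specific constraints or user preference reshape what "good grasp" means without retraining the evaluator itself.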
To achieve the above purpose, the invention adopts the following technical scheme. In a first aspect, the invention provides a robot grasping learning closed-loop optimization method based on vision-language feedback, comprising the following steps: performing high-precision three-dimensional reconstruction of a target operation scene and objects; in a simulation environment, generating candidate grasp trajectories containing diverse grasping actions through active exploration; performing high-fidelity rendering of the candidate grasp trajectories executed in the simulation environment, based on 3D Gaussian Splatting, to generate corresponding visual demonstration data; training a grasping policy using the visual demonstration data; deploying the trained grasping policy in a real environment to execute grasping tasks; and analyzing the execution results of the real grasping tasks using a vision-language model to generate an evaluation feedback signal, optimizing the active-exploration process in the simulation environment according to the evaluation feedback signal, and generating new visual demonstration data based on the optimized result for the next round of policy training, thereby forming a closed-loop optimization flow. As an alternative implementation, candidate grasp trajectories containing diverse grasping actions are generated through active exploration; specifically, a reinforcement-learning policy is run in the simulation environment, and the action space of the reinforcement-learning policy is designed to cover various end-effector poses so as to explore and generate structurally d