CN-121981155-A - Robot general strategy distillation method for reinforcement learning closed-loop evolution
Abstract
The invention relates to the technical field of robot general policy distillation, and specifically discloses a robot general policy distillation method with reinforcement-learning closed-loop evolution. The system comprises a simulation environment module, a reinforcement learning training module, a policy distillation module, and a real-world deployment and data collection module. The simulation environment module constructs a highly parameterized, randomizable virtual physical environment; the reinforcement learning training module autonomously generates expert policy data in the simulation environment; and the policy distillation module compresses and transfers the capability of the reinforcement learning training module into a lightweight general policy. By distilling the super-expert data generated by reinforcement learning, the method quantitatively improves the average task success rate of the resulting general policy across multiple tasks, reaching a level comparable to that of a carefully calibrated task-specific controller.
Inventors
- YANG YUDONG
- LIU ZHANZHU
- LIANG XUESONG
- ZHAO SHUANG
- HAN YUWEI
- LU SIBO
Assignees
- 长春市万易科技有限公司
Dates
- Publication Date: 2026-05-05
- Application Date: 2025-12-25
Claims (6)
- 1. A robot general policy distillation method with reinforcement-learning closed-loop evolution, comprising: a simulation environment module for constructing a highly parameterized, randomizable virtual physical environment; a reinforcement learning training module for autonomously generating expert policy data in the simulation environment; a policy distillation module responsible for compressing and transferring the capability of the reinforcement learning training module into a lightweight general policy; a real-world deployment and data collection module responsible for applying the trained policy in reality and collecting feedback data, forming the starting point of an evolutionary closed loop; and a closed-loop optimization module that uses real-world feedback to drive continued evolution of the system. The specific working process is as follows: first, expert data is generated through RL in simulation; then a general policy is obtained through distillation and deployed in reality; finally, data is collected from reality and fed back to simulation to start a new round of optimization training, forming an autonomous evolution closed loop integrating simulation and reality.
- 2. The robot general policy distillation method with reinforcement-learning closed-loop evolution according to claim 1, wherein the specific configuration of the simulation environment module comprises: task scene modeling, in which the robot body, manipulated objects, and environment elements are accurately modeled in a simulator according to the robot's target task; a domain randomization unit for dynamically and randomly modifying simulation environment parameters during training, including but not limited to: visual properties (object texture, color, illumination angle and intensity, and camera noise); physical properties (object mass, friction coefficient, damping coefficient, joint torque limits); geometric properties (object shape, size, and initial pose); environment layout (obstacle positions and workbench height); and a simulation-reality interface providing a data exchange channel with the real world, capable of receiving real data and reconstructing simulation scenes from it.
- 3. The robot general policy distillation method with reinforcement-learning closed-loop evolution according to claim 1, wherein the specific workflow of the reinforcement learning training module is as follows: agent initialization, in which an RL agent and its policy network π_RL are constructed, π_RL being a deep neural network whose input is the environment state observation s_t and whose output is the action a_t; reward function design, in which a composite reward function R is designed, comprising: a sparse reward, i.e., a large reward given on task success; a dense reward, i.e., intermediate rewards guiding the agent toward the goal; and a regularization penalty encouraging safe, efficient, and energy-saving behavior; and interaction and data generation, in which the agent optimizes its policy via a reinforcement learning algorithm through massive interaction with the simulation environment so as to maximize the cumulative reward, during which the system screens and saves high-performance trajectory data to form an expert dataset D_expert; screening criteria include successful trajectories, high-return trajectories, and trajectories that effectively handle rare conditions or recover from errors.
- 4. The robot general policy distillation method with reinforcement-learning closed-loop evolution according to claim 1, wherein the specific steps of the policy distillation module are as follows: student network construction, in which a general policy network π_general is built for deployment, its structure generally simpler than the reinforcement learning policy network to ensure inference efficiency; distillation training, in which the expert dataset D_expert serves as the supervision signal and training minimizes the negative log-likelihood loss L_distill = -E_{(s,a)~D_expert}[log π_general(a|s)]; and optional multi-task distillation, in which expert data from multiple tasks are merged to jointly train a multi-task general policy, giving it the ability to perform a range of related skills.
- 5. The robot general policy distillation method with reinforcement-learning closed-loop evolution according to claim 1, wherein the real-world deployment and data collection module works as follows: policy deployment, in which the distilled general policy π_general is deployed to a real robot system; and run monitoring and data collection, in which the robot's operating data are recorded synchronously while it executes tasks, including: the state-action sequence {(s_t, a_t)}; the task completion result; and an uncertainty measure; all data are aggregated into a real-world dataset D_real, within which failure cases and high-uncertainty states receive particular attention during collection.
- 6. The robot general policy distillation method with reinforcement-learning closed-loop evolution according to claim 1, wherein the closed-loop optimization module works as follows: scene reconstruction and targeted training, in which key data such as failure cases are transmitted back to the simulation environment module; based on these data, simulation environment parameters are adjusted or specific challenging scenes are constructed, the reinforcement learning training module is then restarted, and the RL agent is guided to learn to solve the new problems in a targeted manner, generating new-generation expert data D'_expert; and iterative fine-tuning, in which D'_expert is used to incrementally fine-tune the deployed general policy π_general, yielding a performance-enhanced new policy version π_general' and completing one evolution cycle.
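The five-module loop recited in claim 1 can be sketched end to end. The following is a minimal illustration under stated assumptions, not the patent's implementation: every function (`train_rl_expert`, `distill`, `deploy_and_collect`) and every numeric value is a hypothetical stand-in for the corresponding module.

```python
# Minimal sketch of the simulation-to-reality closed loop of claim 1.
# All names and values are hypothetical placeholders.
import random

def train_rl_expert(env_params):
    """Stand-in for the RL training module: returns mock expert trajectories."""
    return [{"obs": random.random(), "act": random.random()} for _ in range(10)]

def distill(expert_data, policy=None):
    """Stand-in for the distillation module: compress expert data into a policy."""
    mean_act = sum(d["act"] for d in expert_data) / len(expert_data)
    return {"mean_action": mean_act}

def deploy_and_collect(policy):
    """Stand-in for real-world deployment: returns feedback data and failures."""
    return {"failures": [{"obs": 0.9}], "successes": 9}

def closed_loop(n_rounds=3):
    env_params, policy, history = {"friction": 0.5}, None, []
    for _ in range(n_rounds):
        expert_data = train_rl_expert(env_params)   # step 1: RL in simulation
        policy = distill(expert_data, policy)       # step 2: distill a general policy
        feedback = deploy_and_collect(policy)       # step 3: deploy and collect data
        env_params["friction"] *= 1.05              # step 4: adjust sim from feedback
        history.append(feedback["successes"])
    return policy, history

policy, history = closed_loop()
```

Each loop iteration corresponds to one pass through claim 1's working process: RL in simulation, distillation, real-world deployment, and feedback into the next round of training.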
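Claim 2's domain randomization unit varies four parameter groups on every training episode. A minimal sampler is sketched below; the parameter ranges are illustrative assumptions, not values disclosed in the patent.

```python
# Domain-randomization sampler for the parameter groups of claim 2:
# visual, physical, geometric, and layout. All ranges are assumed.
import random

def sample_domain(rng=random):
    return {
        "visual": {
            "texture_id": rng.randrange(100),             # object texture
            "light_intensity": rng.uniform(0.3, 1.5),     # illumination
            "camera_noise_std": rng.uniform(0.0, 0.05),   # sensor noise
        },
        "physical": {
            "mass_kg": rng.uniform(0.1, 2.0),
            "friction": rng.uniform(0.2, 1.0),
            "damping": rng.uniform(0.0, 0.3),
            "joint_torque_limit": rng.uniform(5.0, 50.0),
        },
        "geometric": {
            "scale": rng.uniform(0.8, 1.2),               # object size
            "init_pose_xy": (rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1)),
        },
        "layout": {
            "obstacle_x": rng.uniform(-0.5, 0.5),
            "table_height_m": rng.uniform(0.7, 0.9),
        },
    }

params = sample_domain()
```

In practice such a dictionary would be passed to the simulator's reset routine, so that each episode exposes the agent to a different draw from the randomized environment distribution.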
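Claim 3 combines a sparse success reward, a dense shaping reward, and a regularization penalty, then screens trajectories into the expert dataset. A sketch under illustrative assumptions (the weights, threshold, and distance-based shaping term are not from the patent):

```python
# Composite reward of claim 3 (sparse + dense + regularization penalty)
# and the expert-trajectory screening rule. Weights are assumed values.
def composite_reward(success, dist_to_goal, energy):
    sparse = 10.0 if success else 0.0   # large reward on task success
    dense = -dist_to_goal               # shaping: progress toward the goal
    penalty = -0.01 * energy            # encourage efficient, low-energy motion
    return sparse + dense + penalty

def screen_trajectory(traj, return_threshold=5.0):
    """Keep successful or high-return trajectories for the expert dataset."""
    total = sum(step["reward"] for step in traj)
    return traj[-1]["success"] or total >= return_threshold

traj = [{"reward": composite_reward(False, 0.5, 1.0), "success": False},
        {"reward": composite_reward(True, 0.0, 1.0), "success": True}]
keep = screen_trajectory(traj)
```

The screening rule implements the first two criteria of claim 3 (success, high return); detecting recovery-from-error trajectories would need additional episode metadata.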
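Claim 4's distillation objective is a negative log-likelihood over the expert dataset. As a minimal worked instance, the sketch below fits a one-dimensional Gaussian student policy a ~ N(w·s, σ²) by gradient descent on that NLL; the linear policy form and all hyperparameters are assumptions made for illustration only.

```python
# Distillation by NLL minimization (claim 4), reduced to a 1-D Gaussian
# student policy a ~ N(w*s, sigma^2). Policy form and hyperparameters assumed.
def nll(w, data, sigma=1.0):
    # L = -sum log N(a | w*s, sigma^2), with the constant term dropped
    return sum(0.5 * ((a - w * s) / sigma) ** 2 for s, a in data)

def distill(data, lr=0.01, steps=500):
    w = 0.0
    for _ in range(steps):
        grad = sum(-(a - w * s) * s for s, a in data)  # dL/dw
        w -= lr * grad
    return w

# Toy expert dataset D_expert: the expert acts as a = 2*s (noise-free)
expert = [(s / 10.0, 2.0 * s / 10.0) for s in range(1, 11)]
w = distill(expert)  # converges toward w = 2, recovering the expert mapping
```

For a Gaussian student with fixed variance, minimizing this NLL is equivalent to mean-squared-error behavior cloning on the expert state-action pairs, which is why the recovered weight matches the expert's mapping.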
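Claim 5 records state-action sequences together with an uncertainty measure and gives failure cases and high-uncertainty states priority in D_real. The sketch below uses the entropy of the policy's action distribution as the uncertainty proxy; that choice and the threshold are assumptions, since the patent does not specify the measure.

```python
# Run monitoring of claim 5: log (state, action) pairs with an uncertainty
# measure and flag failures / high-uncertainty states. Entropy proxy assumed.
import math

def action_entropy(probs):
    """Entropy of the policy's action distribution, used as uncertainty."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def collect_episode(steps, success, entropy_threshold=1.0):
    flagged = [s for s in steps if s["uncertainty"] > entropy_threshold]
    return {
        "trajectory": [(s["state"], s["action"]) for s in steps],
        "success": success,
        "priority": (not success) or bool(flagged),  # failures get attention
        "high_uncertainty_states": flagged,
    }

steps = [
    {"state": 0.1, "action": 0.2,
     "uncertainty": action_entropy([0.97, 0.01, 0.01, 0.01])},  # confident
    {"state": 0.5, "action": 0.1,
     "uncertainty": action_entropy([0.25, 0.25, 0.25, 0.25])},  # uncertain
]
episode = collect_episode(steps, success=False)
```

A near-uniform action distribution yields high entropy, so the second step is flagged, and the failed episode as a whole is marked for priority feedback into simulation.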
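Claim 6's evolution cycle reconstructs challenging simulation scenes from failure cases and then incrementally fine-tunes the deployed policy on the new expert data D'_expert. The sketch below uses an averaged-parameter scene reconstruction and an exponential-moving-average update as illustrative stand-ins for those two steps; neither is specified in the patent.

```python
# Evolution cycle of claim 6: scene reconstruction from failures, then
# incremental fine-tuning of the deployed policy. Both update rules assumed.
def reconstruct_scene(base_params, failure_cases):
    """Bias simulation parameters toward conditions that caused failures."""
    params = dict(base_params)
    if failure_cases:
        params["friction"] = (sum(f["friction"] for f in failure_cases)
                              / len(failure_cases))
    return params

def finetune(policy_w, new_expert_w, alpha=0.3):
    """Incremental update: move the deployed policy toward the new expert."""
    return (1 - alpha) * policy_w + alpha * new_expert_w

# One cycle: failures at high friction reshape the simulation, and the
# deployed policy weight is nudged toward the newly trained expert.
scene = reconstruct_scene({"friction": 0.5}, [{"friction": 0.9}, {"friction": 0.7}])
w_new = finetune(policy_w=1.0, new_expert_w=2.0)
```

The small interpolation factor reflects the incremental nature of the fine-tuning in claim 6: each cycle nudges the general policy rather than retraining it from scratch.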
Description
Robot general policy distillation method for reinforcement-learning closed-loop evolution

Technical Field

The invention relates to the technical field of robot general policy distillation, and in particular to a robot general policy distillation method with reinforcement-learning closed-loop evolution.

Background

How a robot learns and masters general, robust, high-performance behavior policies is a core challenge in artificial intelligence and robotics. In recent years, imitation learning based on human demonstration data, particularly behavior cloning, has become an important paradigm for training general robot policies. By learning directly from demonstration data provided by human experts, the robot quickly acquires an initial task execution capability, reducing dependence on complex reward function design. However, the learning of general robot policies has long relied on human demonstration data, as in behavior cloning and imitation learning methods. While such approaches lower the threshold for policy learning, they have significant bottlenecks. First, policy performance is bounded by the demonstrator's skill level, making it difficult to reach the accuracy and robustness of a task-specific controller. Second, the coverage of human data is limited, so policy generalization is weak and policies easily fail when facing environmental disturbances, novel objects, or dynamic scenes. In addition, the acquisition cost of high-quality demonstration data is extremely high, and the dependence on expert operation and complex equipment severely restricts scaling policies to large numbers of tasks. To alleviate these problems, prior studies have attempted to combine imitation learning with reinforcement learning, or to process static data with offline reinforcement learning.
However, these methods still treat human data as the core; they fail to fundamentally break through the limitations of data quality, diversity, and coverage. Reinforcement learning serves only as an auxiliary optimizer, and final policy performance remains bounded by the initial dataset.

Disclosure of Invention

To solve the above technical problems, the invention is realized by the following technical scheme. A robot general policy distillation method with reinforcement-learning closed-loop evolution comprises: a simulation environment module for constructing a highly parameterized, randomizable virtual physical environment; a reinforcement learning training module for autonomously generating expert policy data in the simulation environment; a policy distillation module responsible for compressing and transferring the capability of the reinforcement learning training module into a lightweight general policy; a real-world deployment and data collection module responsible for applying the trained policy in reality and collecting feedback data, forming the starting point of an evolutionary closed loop; and a closed-loop optimization module that uses real-world feedback to drive continued evolution of the system. The specific working process is as follows: first, expert data is generated through RL in simulation; then a general policy is obtained through distillation and deployed in reality; finally, data is collected from reality and fed back to simulation to start a new round of optimization training, forming an autonomous evolution closed loop integrating simulation and reality.
Preferably, the specific configuration of the simulation environment module includes: task scene modeling, in which the robot body, manipulated objects, and environment elements are accurately modeled in a simulator according to the robot's target task; a domain randomization unit for dynamically and randomly modifying simulation environment parameters during training, including but not limited to: visual properties (object texture, color, illumination angle and intensity, and camera noise); physical properties (object mass, friction coefficient, damping coefficient, joint torque limits); geometric properties (object shape, size, and initial pose); environment layout (obstacle positions and workbench height); and a simulation-reality interface providing a data exchange channel with the real world, capable of receiving real data and reconstructing simulation scenes from it. Preferably, the specific workflow of the reinforcement learning training module is as follows: agent initialization, in which an RL agent and its policy network π_RL are constructed, π_RL being a deep neural network whose input is the environment state observation s_t and whose output is the action a_t; reward function design, in which a composite reward function R is designed, comprising