
CN-121998022-A - Robot reinforcement learning training method and device based on general process reward modeling, and electronic device

CN 121998022 A

Abstract

The application provides a robot reinforcement learning training method and device based on general process reward modeling, and an electronic device, relating to the technical field of artificial intelligence and robot control. The method comprises the following steps: acquiring a robot operation demonstration video related to a reinforcement learning training task, and fine-tuning a pre-trained general process reward model based on the robot operation demonstration video to obtain a target reward model; and executing the reinforcement learning training task based on the target reward model until the reinforcement learning training task is completed. By adopting a multi-view-fusion reward model based on relative progress-jump prediction, combined with a potential-based, policy-invariant reward-shaping mechanism, the method can estimate task progress accurately under heavy occlusion and guide the robot to learn tasks quickly from few samples without deviating from the goal.

Inventors

  • ZHAO ZHONGXIA
  • TAN HUAJIE
  • CHEN SIXIANG
  • XU YIJIE
  • ZHANG SHANGHANG

Assignees

  • Beijing Academy of Artificial Intelligence (北京智源人工智能研究院)

Dates

Publication Date
2026-05-08
Application Date
2025-12-19

Claims (10)

  1. A robot reinforcement learning training method based on a general process reward model, characterized by comprising the following steps: acquiring a robot operation demonstration video related to a reinforcement learning training task, and fine-tuning a pre-trained general process reward model based on the robot operation demonstration video to obtain a target reward model; and executing the reinforcement learning training task based on the target reward model until the reinforcement learning training task is completed; wherein the general process reward model is a visual language model used for predicting relative jump values between video frames, and a relative jump value represents the relative change in the robot's task completion progress before and after a single-step action.
  2. The method of claim 1, wherein the general process reward model is trained as follows: acquiring a multi-source data set, dividing each operation video in the multi-source data set into a plurality of subtask segments, and determining the initial state and the task target state of each subtask segment, wherein the multi-source data set comprises operation videos from different sources, including at least one of robot operation videos, simulated operation videos from a simulation environment, and real human operation videos; for the plurality of image frames contained in each subtask segment, calculating the relative change amount of each image pair as the label of that image pair, and constructing a training sample set based on the labels of the image pairs to obtain a target sample set; and training a visual language model with the target sample set to obtain the general process reward model; wherein the input of the visual language model comprises a text instruction describing the current task and multi-view images, the multi-view images comprise a task initial state image set, a task target state image set, a pre-operation state image set at the current moment, and a post-operation state image set at the current moment, and each state image set comprises images from multiple views.
  3. The method of claim 2, wherein calculating the relative change amount of each image pair as the label of that image pair comprises: when the change amplitude between the two image frames of a target image pair is greater than or equal to a preset change threshold, determining the change type of the two image frames and calculating their relative change amount based on the change type; and when the change amplitude between the two image frames of the target image pair is smaller than the preset change threshold, setting the relative change amount of the two image frames to 0; wherein the target image pair is any one of the image pairs contained in the plurality of subtask segments.
  4. A method according to claim 3, wherein determining the change type of the two image frames and calculating their relative change amount based on the change type comprises: when the task completion progress increases after the operation is completed, determining the ratio of the progress increment to the remaining task amount as the relative change amount of the two image frames; and when the task completion progress decreases after the operation is completed, determining the ratio of the progress decrease to the completed task amount as the relative change amount of the two image frames.
  5. The method of claim 1, wherein executing the reinforcement learning training task based on the target reward model until the reinforcement learning training task is completed comprises: determining, from an action executed by the robot under the current policy network, a first state before the action is executed and a second state after the action is executed; predicting the global progress at the current moment with the target reward model based on the first state and the second state, and calculating a global progress value from the prediction result; and calculating a final reward for the current moment based on the global progress value, and iteratively updating the parameters of the policy network based on the final reward; wherein the final reward is calculated from a sparse success reward and a potential-based shaping reward, the potential-based shaping reward being obtained by applying a potential-based difference formula to the task completion progress.
  6. The method of claim 5, wherein predicting the global progress at the current moment with the target reward model based on the first state and the second state comprises: inputting the multi-view images corresponding to the first state, the second state, the task initial state, and the task target state, together with the task instruction text, into the target reward model, and predicting the global progress at the current moment to obtain the relative jump value at the current moment predicted by the target reward model.
  7. The method according to claim 5 or 6, wherein calculating a global progress value from the prediction result comprises: adding the global progress at the previous moment and the relative jump value predicted by the target reward model to obtain a first progress value; inputting the second state and the task initial state into the target reward model to predict a second progress value; inputting the task target state and the second state into the target reward model to predict a third progress value; and taking the average of the first, second, and third progress values as the global progress value.
  8. The method according to claim 5 or 6, wherein calculating a final reward for the current moment based on the global progress value comprises: if the global progress value exceeds a preset progress threshold, taking a first reward value as the sparse reward, and otherwise taking a second reward value as the sparse reward, wherein the first reward value indicates that the robot task is completed and the second reward value indicates that it is not; calculating the target product of the discount factor and the estimated global progress value at the next moment, and subtracting the target product from the global progress value to obtain a target difference; and adding the target difference and the sparse reward to obtain the final reward for the current moment (see the sketch following the claims).
  9. A robot reinforcement learning training device based on general process reward modeling, the device comprising: a data acquisition module for acquiring robot operation demonstration videos related to the reinforcement learning training task; a model adaptation module for fine-tuning a pre-trained general process reward model based on the robot operation demonstration videos to obtain a target reward model; and a task execution module for executing the reinforcement learning training task based on the target reward model until the reinforcement learning training task is completed; wherein the general process reward model is a visual language model used for predicting relative jump values between video frames, and a relative jump value represents the relative change in the robot's task completion progress before and after a single-step action.
  10. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the robot reinforcement learning training method based on general process reward modeling according to any one of claims 1 to 8.
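Claims 7 and 8 together define the reward used during training; the sketch below (referenced from claim 8) reconstructs that computation. It is illustrative only, not the patent's code: the constants are assumed values, and the shaping term is written in the standard potential-based form gamma * Phi(s') - Phi(s) of Ng et al. (1999), whereas the translated wording of claim 8 subtracts the discounted next-moment estimate from the current progress value, i.e., the opposite sign order (either order remains policy-invariant, since negating a potential preserves the potential-based form).

```python
# Illustrative reconstruction of claims 7-8; constants are assumed,
# and the shaping term uses the standard potential-based form
# gamma * Phi(s') - Phi(s).

GAMMA = 0.99                   # discount factor (assumed value)
SUCCESS_THRESHOLD = 0.95       # preset progress threshold (assumed value)
R_SUCCESS, R_FAIL = 1.0, 0.0   # first / second sparse reward values (assumed)

def global_progress(prev_progress, jump, from_init, to_goal):
    """Claim 7: average three progress estimates into one global value.

    prev_progress + jump -- first value: accumulate the predicted jump
    from_init            -- second value: model(second state, initial state)
    to_goal              -- third value: model(target state, second state)
    """
    return ((prev_progress + jump) + from_init + to_goal) / 3.0

def final_reward(progress_before, progress_after):
    """Claim 8: sparse success reward plus potential-based shaping,
    with the global progress value acting as the potential."""
    sparse = R_SUCCESS if progress_after > SUCCESS_THRESHOLD else R_FAIL
    shaping = GAMMA * progress_after - progress_before
    return sparse + shaping
```

Because the shaping term is a discounted difference of a potential (here, the global progress value), it densifies the learning signal without changing the optimal policy of the underlying sparse-reward task, which is the policy-invariant property named in the abstract.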

Description

Robot reinforcement learning training method and device based on general process reward modeling, and electronic device

Technical Field

The application relates to the technical field of artificial intelligence, and in particular to a robot reinforcement learning training method and device based on general process reward modeling, and an electronic device.

Background

Reinforcement learning (RL) is a machine learning method in which an agent learns an optimal policy by trial and error while interacting with an environment, and it has become an important means of accomplishing complex robot manipulation tasks. In reinforcement learning, the design of the reward function is critical. The related art mainly adopts sparse rewards, i.e., a reward is given only when the task succeeds (for example, +1 on success, 0 on failure). In long-horizon, fine-grained contact tasks, however, this approach leads to an excessively large exploration space and low training efficiency.

Disclosure of the Invention

The application aims to provide a robot reinforcement learning training method and device based on general process reward modeling, and an electronic device, which adopt a multi-view-fusion reward model based on relative progress-jump prediction combined with a potential-based, policy-invariant reward-shaping mechanism, so that task progress can be estimated accurately under heavy occlusion and the robot can be guided to learn tasks quickly from few samples without deviating from the goal.

The application provides a robot reinforcement learning training method based on general process reward modeling, comprising the following steps: acquiring a robot operation demonstration video related to a reinforcement learning training task; fine-tuning a pre-trained general process reward model based on the robot operation demonstration video to obtain a target reward model; and executing the reinforcement learning training task based on the target reward model until the reinforcement learning training task is completed, wherein the general process reward model is a visual language model used for predicting relative jump values between video frames, and a relative jump value represents the relative change in the robot's task completion progress before and after a single-step action. A skeleton of this two-stage procedure is sketched below.
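The following sketch, which reuses the global_progress and final_reward helpers from the sketch after the claims, shows how the two stages might fit together once the target reward model has been fine-tuned. It is a minimal reading of the method, not the patent's implementation: the reward model, environment, and policy objects, along with every method on them (predict_jump, predict_progress, act, update, and the initial_state and goal_state attributes), are hypothetical interfaces assumed for illustration.

```python
# Minimal sketch of the RL stage; all object interfaces are hypothetical.
# Assumes the fine-tuned target reward model is passed in as `reward_model`,
# and reuses global_progress / final_reward from the earlier sketch.

def run_training(reward_model, env, policy, steps=10_000):
    state, progress = env.reset(), 0.0
    for _ in range(steps):
        action = policy.act(state)            # act under the current policy
        next_state, done = env.step(action)   # first state -> second state
        # Claim 1: predict the relative progress jump for this single step.
        jump = reward_model.predict_jump(state, next_state)
        # Claim 7: fuse three estimates into a global progress value,
        # with the input orders matching the claim's wording.
        next_progress = global_progress(
            progress, jump,
            reward_model.predict_progress(next_state, env.initial_state),
            reward_model.predict_progress(env.goal_state, next_state),
        )
        # Claim 8: sparse success reward plus potential-based shaping.
        reward = final_reward(progress, next_progress)
        policy.update(state, action, reward, next_state)
        state, progress = next_state, next_progress
        if done:
            state, progress = env.reset(), 0.0
```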
Optionally, the general process reward model is trained based on the following steps: acquiring a multi-source data set, dividing each operation video in the multi-source data set into a plurality of subtask segments, and determining the initial state and the task target state of each subtask segment, wherein the multi-source data set comprises operation videos from different sources, including at least one of robot operation videos, simulated operation videos from a simulation environment, and real human operation videos; for the plurality of image frames contained in each subtask segment, calculating the relative change amount of each image pair as the label of that image pair, and constructing a training sample set based on the labels of the image pairs to obtain a target sample set, wherein each image pair comprises a pre-operation image frame and a post-operation image frame; and training a visual language model with the target sample set to obtain the general process reward model, wherein the input of the visual language model comprises a text instruction describing the current task and multi-view images, the multi-view images comprise a task initial state image set, a task target state image set, a pre-operation state image set at the current moment, and a post-operation state image set at the current moment, and each state image set comprises images from multiple views.

Optionally, calculating the relative change amount of each image pair as the label of that image pair comprises: when the change amplitude between the two image frames of a target image pair is greater than or equal to a preset change threshold, determining the change type of the two image frames and calculating their relative change amount based on the change type; and when the change amplitude is smaller than the preset change threshold, setting the relative change amount of the two image frames to 0, wherein the target image pair is any one of the image pairs contained in the plurality of subtask segments.

Optionally, determining the change type of the two image frames and calculating their relative change amount based on the change type comprises: when the task completion progress increases after the operation is completed, determining the ratio of the progress increment to the remaining task amount as the relative change amount of the two image frames; and when the task completion progress decreases after the operation is completed, determining the ratio of the progress decrease to the completed task amount as the relative change amount of the two image frames. A sketch of this labeling rule follows.
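A minimal sketch of the labeling rule, under stated assumptions: progress values are taken to lie in [0, 1], change_amplitude stands in for the patent's unspecified frame-difference measure, the threshold and epsilon are placeholder values, and the sign convention (negative labels for progress regressions) is an interpretation.

```python
# Minimal sketch of the image-pair labeling rule in claims 3-4 and the
# passage above. Assumptions: progress lies in [0, 1]; `change_amplitude`
# is a stand-in for the unspecified frame-difference measure.

EPS = 1e-8               # guards the divisions (assumed)
CHANGE_THRESHOLD = 0.05  # preset change threshold (assumed value)

def pair_label(progress_before, progress_after, change_amplitude):
    """Relative change amount used as the training label for one image pair."""
    # Below-threshold visual change: label the pair 0 (claim 3).
    if change_amplitude < CHANGE_THRESHOLD:
        return 0.0
    delta = progress_after - progress_before
    if delta >= 0:
        # Progress increased: increment over the remaining task amount.
        return delta / max(1.0 - progress_before, EPS)
    # Progress decreased: decrease over the completed task amount
    # (negative label, an assumed sign convention).
    return delta / max(progress_before, EPS)
```

Normalizing by the remaining (or completed) amount makes the label a relative jump rather than an absolute progress delta, matching the relative jump value terminology of claim 1.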