CN-121997975-A - Adaptive-reward human-in-the-loop real-robot reinforcement learning method
Abstract
The invention belongs to the technical field of robot reinforcement learning and human-robot interaction intelligent control, and particularly relates to a human-in-the-loop real-robot reinforcement learning method with adaptive rewards. The method comprises: constructing a reward model for supervising the execution of the robot's internal policy network, training it on a pre-training data set annotated with stage progress labels, and deploying it on the robot; when the judgment output by the reward model is inconsistent with reality, an operator inputs a discrete feedback signal, and the discrete feedback signal and its corresponding timestamp are recorded as a reward correction data set; backtracking labeling is then performed according to the reward correction data set to expand the sparse feedback into dense labels over the trajectory, which are used to update the reward model.
Inventors
- YUE YUFENG
- ZHOU TIANXING
- Ao Haojia
- LU HAOYANG
- ZHOU ZICHEN
- Cui Te
- CHEN GUANGYAN
- YANG YI
Assignees
- Beijing Institute of Technology (北京理工大学)
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-01-15
Claims (10)
- 1. An adaptive-reward human-in-the-loop real-robot reinforcement learning method, characterized by comprising the following specific steps: constructing a reward model for supervising the execution of the robot's internal policy network, training it on a pre-training data set annotated with stage progress labels, and deploying the trained model on the robot; the robot executes tasks in the real environment using the current policy network; when the robot's actions under the current policy network are unsafe or deviate from the target trajectory, operator input takes control to generate intervention actions that replace the actions of the current policy network, and these are written into a policy intervention data set as correct behavior examples used to update the policy network; when the judgment output by the reward model is inconsistent with reality, the operator inputs a discrete feedback signal, and the discrete feedback signal and its corresponding timestamp are recorded as a reward correction data set; backtracking labeling is performed according to the reward correction data set to expand the sparse feedback into dense labels over the trajectory, which are used to update the reward model.
- 2. The adaptive-reward human-in-the-loop real-robot reinforcement learning method of claim 1, wherein the stage progress label takes the form y_t = t / T, where t is the frame index of the video and T is the video length.
- 3. The adaptive-reward human-in-the-loop real-robot reinforcement learning method of claim 2, wherein the specific process of performing backtracking labeling according to the reward correction data set to expand sparse feedback into dense rewards comprises: for successful trajectories without human intervention, forward rewards are assigned by linear progress interpolation; for trajectories with human intervention, negative penalties are assigned: when a human action intervention occurs at a given time, the stage labels within a preceding range are assigned a decaying penalty (a sketch of this labeling follows the claims).
- 4. The adaptive-reward human-in-the-loop real-robot reinforcement learning method of claim 3, wherein the feedback signal input by the operator includes a continue signal and a stop signal, the continue signal indicating a false positive in which the reward model prematurely judged the task as successful, and the stop signal indicating that the task succeeded but the reward model did not detect the success.
- 5. The method of claim 3, wherein, upon updating the reward model, the loss function is designed as a weighting of a stage regression loss and a success/failure calibration loss; the stage regression loss measures the deviation between the stage value output by the reward model and the supervision value generated from the backtracking-labeled data set; the success/failure calibration loss is parameterized by set parameters, a label that equals 1 when the trajectory is marked as truly successful and 0 when marked as unsuccessful, and the probability that the reward model predicts success.
- 6. The adaptive-reward human-in-the-loop real-robot reinforcement learning method of claim 1, wherein the reward model adopts a structure consisting of a frozen visual language model and a lightweight value head.
- 7. The adaptive-reward human-in-the-loop real-robot reinforcement learning method of claim 1, wherein the pre-training data set consists of videos from the internet or an open-source robot data set.
- 8. The adaptive-reward human-in-the-loop real-robot reinforcement learning method of claim 1, wherein, before the robot executes tasks in the real environment using the current policy network, the method further comprises: acquiring robot motion demonstration data and using this data set to train the robot's policy network; and acquiring a reward alignment demonstration data set and using this data set to train the reward model.
- 9. The adaptive-reward human-in-the-loop real-robot reinforcement learning method of claim 1, wherein the policy network update adopts a soft actor-critic (SAC) architecture in which a human behavior regularization term is introduced into the actor loss function, the regularization term being updated based on behavior cloning (BC) (see the actor-loss sketch following the claims).
- 10. The adaptive-reward human-in-the-loop real-robot reinforcement learning method of claim 9, wherein the reward model output is rewritten into an instant reward, which is stored in a unified replay buffer and used as a supervision signal for updating the critic network in the SAC architecture; the critic learns to evaluate the current action by minimizing the Bellman error, and the task progress thereby guides the policy network to evolve towards higher-value actions.
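The backtracking labeling of claims 2-3 and the two-term loss of claim 5 can be made concrete with a short sketch. The following Python fragment is an illustrative reconstruction rather than the patent's reference implementation: the linear progress labels y_t = t/T follow claim 2, while the exponential form of the decaying penalty, the window length, the loss weights, and the use of mean-squared error and binary cross-entropy for the two loss terms are assumptions introduced for illustration.

```python
import numpy as np

def backtrack_labels(num_frames, intervention_step=None,
                     window=20,        # assumed penalty window length
                     decay=0.8,        # assumed decay rate
                     penalty=-1.0):    # assumed base penalty
    """Expand sparse operator feedback into dense per-frame stage labels."""
    T = num_frames
    # Successful, intervention-free trajectory: linear progress y_t = t / T (claim 2).
    labels = np.arange(T, dtype=np.float32) / max(T - 1, 1)
    if intervention_step is not None:
        # Trajectory with human intervention: decaying penalty over the steps
        # preceding the intervention (claim 3); the exponential form is an
        # assumption, not stated in the patent.
        end = min(intervention_step, T - 1)
        start = max(0, end - window)
        for t in range(start, end + 1):
            labels[t] = penalty * decay ** (end - t)
    return labels


def reward_model_loss(pred_stage, target_stage, pred_success_prob, success_label,
                      w_reg=1.0, w_cal=1.0):
    """Weighted stage-regression + success/failure calibration loss (claim 5).

    MSE and binary cross-entropy are illustrative choices; the patent only
    states the two-term weighting and the meaning of the quantities.
    """
    reg = np.mean((np.asarray(pred_stage) - np.asarray(target_stage)) ** 2)
    p = np.clip(np.asarray(pred_success_prob, dtype=np.float64), 1e-6, 1 - 1e-6)
    z = np.asarray(success_label, dtype=np.float64)
    cal = -np.mean(z * np.log(p) + (1 - z) * np.log(1 - p))
    return w_reg * reg + w_cal * cal
```

For example, backtrack_labels(100, intervention_step=60) keeps the linear progress ramp outside the window and assigns frames 40-60 an increasingly negative label as the intervention step approaches.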
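Claims 9 and 10 describe the policy-side update: a SAC actor loss with a behavior-cloning regularizer driven by the human intervention data, and a critic trained on the instant reward produced by the reward model by minimizing the Bellman error. The PyTorch-style sketch below is again illustrative: the `actor`, `critic`, and `target_critic` objects and their `sample`/call interfaces are placeholders, and the regularization weight and the MSE form of the BC term are assumptions.

```python
import torch
import torch.nn.functional as F

def actor_loss_with_bc(actor, critic, log_alpha, obs,
                       intervention_obs, intervention_act,
                       bc_weight=0.5):   # assumed regularization weight
    """SAC actor loss with a human-behavior regularization term (claim 9)."""
    # Standard soft actor-critic objective on replay-buffer observations.
    action, log_prob = actor.sample(obs)        # reparameterized action sample
    q_value = critic(obs, action)
    sac_term = (log_alpha.exp() * log_prob - q_value).mean()
    # Behavior-cloning term that pulls the policy towards the operator's
    # intervention actions; the MSE form is an illustrative choice.
    bc_action, _ = actor.sample(intervention_obs)
    bc_term = F.mse_loss(bc_action, intervention_act)
    return sac_term + bc_weight * bc_term


def critic_loss(critic, target_critic, actor, log_alpha,
                obs, action, next_obs, done, instant_reward,
                gamma=0.99):
    """Bellman-error critic update (claim 10); `instant_reward` is the
    reward-model output stored in the unified replay buffer."""
    with torch.no_grad():
        next_action, next_log_prob = actor.sample(next_obs)
        target_q = target_critic(next_obs, next_action) - log_alpha.exp() * next_log_prob
        target = instant_reward + gamma * (1.0 - done) * target_q
    return F.mse_loss(critic(obs, action), target)
```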
Description
Adaptive-reward human-in-the-loop real-robot reinforcement learning method

Technical Field

The invention belongs to the technical field of robot reinforcement learning and human-robot interaction intelligent control, and particularly relates to a human-in-the-loop real-robot reinforcement learning method with adaptive rewards.

Background

In the prior art, researchers have tried a number of methods to address sparse rewards in reinforcement learning, the difficulty of reward design, and reward failure caused by distribution shift in real environments, but each still has notable limitations. The following are several typical existing schemes.

(1) Direct reward inference based on static vision models. Many studies have attempted to use pre-trained Visual Language Models (VLMs) or visual classifiers to infer directly from images or video clips whether a task has succeeded, or to output staged scores in place of hand-designed rewards. For example, RoboCLIP learns a reward signal from a small number of demonstrations and generalizes to some extent to new tasks, GCR uses contrastive video learning to infer rewards from target demonstrations, and Adapt2Reward further improves reward semantic consistency through a visual language model. However, all of these methods are trained once in an offline stage, remain static after deployment, and cannot be adjusted in real time as the robot's exploration of the real environment changes. When visual conditions change or an unseen situation occurs, the reward estimate is prone to drift, resulting in unstable policy training.

(2) Reward model learning based on demonstration videos or preference learning. Another type of study uses video demonstrations, preference comparisons, or timeline annotations to construct a reward model. For example, Open-X-Embodiment-based methods learn generic semantic rewards from massive video demonstrations, ReWiND infers task phase progress through language-conditioned video comparison, and inverse reinforcement learning (IRL) and event-based inverse control methods attempt to extract reward structures from offline demonstrations. However, because such methods rely on extensive offline-collected demonstrations, it is difficult to cover the diverse variations of the real environment; and because the models lack online updating capability, the reward estimates also suffer from drift when the robot explores unseen states.

(3) Human action correction based on human-in-the-loop learning. Some methods ensure the safety of real-robot reinforcement learning by introducing real-time human intervention. For example, HIL-SERL allows the operator to take over control when the policy deviates from the correct trajectory, thereby improving policy stability and sample efficiency. With the aid of human intervention data, policy learning can avoid dangerous behavior and converge more efficiently. However, such methods focus mainly on Action Alignment, i.e. humans provide "correct actions" to adjust the policy itself rather than to update the reward model. Since the reward remains static and may produce false positives, even if actions are continuously corrected, the direction of policy learning may still be affected by erroneous rewards.
(4) Policy stabilization based on auxiliary or structured rewards. Some research focuses on improving training stability by designing potential-based reward functions, building structured reward modules, or using model-assisted signals. For example, potential-function reward shaping can improve training efficiency in sparse-reward environments, event-triggered reward structures can enhance task phase identification, and auxiliary rewards can mitigate the effects of reward noise or delay. However, these rewards all rely on manual design or static training and lack the ability to adjust dynamically. When the environment's distribution changes, the rewards are difficult to keep effective, which makes long-term deployment on real robots insufficiently reliable.

(5) Reward assessment based on value learning or preference ranking. Still other studies use a value function, preference ranking, or reinforcement learning to evaluate plans or behaviors. For example, preference-learning-based systems use human labels indicating which behavior is better to train the reward model, while RLHF-related methods gradually calibrate the policy or reward model from human preferences. However, these methods often require a large number of environment interactions or preference samples, which are difficult to collect efficiently on real robots. Furthermore, these models lack deep grounding in visual semantics, resulting in insufficient reward robustness under visual disturbances.

Through systematic analysis of the prior art, although various reward modeling, human-robot interaction learning and visu