CN-116663651-B - Model-based robot learning method for generative adversarial interactive imitation learning
Abstract
The invention relates to a model-based robot learning method for generative adversarial interactive imitation learning, which combines the advantages of model-based reinforcement learning, generative adversarial imitation learning, and interactive reinforcement learning into a model-based generative adversarial interactive imitation learning method, MAILDH, thereby addressing the problem of slow robot learning. First, a forward dynamics model is learned simultaneously during generative adversarial imitation learning, and simulated data generated by this model is used to train the generator and discriminator parts of the adversarial imitation learning framework, improving sample utilization. In addition, unlike traditional generative adversarial imitation learning, the invention also incorporates human judgments of robot behavior, weakening the constraint that demonstration quality places on robot learning performance, reaching or surpassing the performance of the demonstrations, learning a better control policy, improving policy stability, and remaining applicable to large-scale complex task control.
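As a concrete illustration of the forward dynamics model mentioned above, the sketch below shows one plausible realization: a feedforward network fit to real transitions and later used to generate simulated data. This is a minimal sketch under assumed continuous state and action spaces; the class and function names are illustrative, not taken from the patent.

```python
# Illustrative forward dynamics model: a feedforward network trained on
# real transitions (s, a, s'), as in Dyna-style model-based RL.
import torch
import torch.nn as nn

class ForwardDynamicsModel(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),  # predicts the next state
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def update_dynamics(model, optimizer, states, actions, next_states):
    """One supervised step on a batch of real environment transitions."""
    loss = nn.functional.mse_loss(model(states, actions), next_states)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```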
Inventors
- LI GUANGLIANG
- HAO JIANGSHAN
- HUANG JIE
Assignees
- Ocean University of China (中国海洋大学)
Dates
- Publication Date
- 20260508
- Application Date
- 20230420
Claims (9)
- 1. A model-based robot learning method for generative adversarial interactive imitation learning, characterized in that the robot learning method is applied to the field of social robot task learning and comprises the following steps: sampling from an expert policy to obtain expert trajectories, and initializing the network parameters of the robot policy, the discriminator, the human reward model, and the forward dynamics model; the robot pre-training the human reward model using demonstrated samples and random samples extracted from the task; the robot sampling from the current policy to obtain generated trajectories and sampling from the forward dynamics model to obtain simulated trajectories, judging whether human feedback is received, and if so updating the human reward model, otherwise updating the discriminator network and the forward dynamics model; the robot performing interactive reinforcement learning by combining the loss function extracted from the discriminator network with the human feedback output by the human reward model, judging whether a stable policy has been learned, and if so ending, otherwise updating the robot policy (a schematic sketch of this loop is given after the claims); wherein the method further combines model-based reinforcement learning, generative adversarial imitation learning, and an interactive reinforcement learning framework to form model-based generative adversarial interactive imitation learning, MAILDH, which learns a forward dynamics model during training to generate simulated data, trains the generator and the discriminator on data from real interaction with the environment, and further trains a human reward model so that the robot can learn by imitation either from demonstrations or from human evaluative feedback.
- 2. The model-based robot learning method for generative adversarial interactive imitation learning of claim 1, wherein the forward dynamics model is approximated by a feedforward neural network following a Dyna-style reinforcement learning scheme: at each new iteration of the training process, the robot collects the samples obtained by real interaction of the current policy with the environment and uses these samples to update the forward dynamics model.
- 3. The model-based robot learning method for generative adversarial interactive imitation learning of claim 2, wherein, before the training process begins, the human reward model assigns a human reward $r^h$ to all state-action pairs $(s_E, a_E)$ in the demonstration; the current state $s_t$ and a randomly output action $a_t$ are extracted from the task and a reward $r^h$ is manually assigned to them; the pairs $(s_E, a_E)$ and $(s_t, a_t)$ are used as inputs to the human reward model, and the reward $r^h$ given by the human trainer is used as the label to pre-train the human reward model, where $s_E$ is a state in the demonstration, $a_E$ is an action in the demonstration, $s_t$ is the state at the current time, $a_t$ is the action selected at the current time, and $r^h$ is the manually assigned reward.
- 4. The model-based robot learning method for generative adversarial interactive imitation learning of claim 3, wherein, after training begins, the pre-trained human reward model is loaded and all tuples $(s, a, r^h)$ are stored in a replay buffer $B$, from which samples are obtained for continued training of the human reward model.
- 5. The model-based robot learning method for generative adversarial interactive imitation learning of claim 4, wherein MAILDH uses the forward dynamics model to obtain simulated trajectories $\tau_m$ with which to update the generator and the discriminator; the demonstration trajectories $\tau_E$, the real trajectories $\tau_g$ of robot-environment interaction under the current policy, and the simulated trajectories $\tau_m$ are used to update the discriminator $D_\omega$ by reducing the loss $L_D = -\mathbb{E}_{\tau_g \cup \tau_m}[\log D_\omega(s,a)] - \mathbb{E}_{\tau_E}[\log(1 - D_\omega(s,a))]$, where $L_D$ represents the loss function of the discriminator, $\mathbb{E}_\tau[\cdot]$ represents the empirical expectation over the trajectory sample $\tau$, i.e. $\mathbb{E}_\tau[f(s,a)] \triangleq \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t f(s_t, a_t)]$, and $\gamma$ is the discount factor; the cost function derived from the latest discriminator $D_\omega$ as in GAIL and the trained human reward model $\hat{R}_H$ are integrated into a new reward function $c_{MAILDH}(s,a) = c_{GAIL}(s,a) - \hat{R}_H(s,a)$, where $c_{MAILDH}$ represents the cost function of MAILDH, $c_{GAIL}$ represents the cost function of GAIL, $\hat{R}_H(s,a)$ represents the reward predicted by the human reward model, and $s$ and $a$ represent the input state and action respectively; the reward function is used to update the policy $\pi_\theta$ with PPO, and the above adversarial training process is repeated until the desired policy is obtained.
- 6. The model-based robot learning method for generative adversarial interactive imitation learning of claim 5, wherein MAILDH seeks the saddle point $(\pi, D)$ of the expression $\mathbb{E}_\pi[\log D(s,a)] + \mathbb{E}_{\pi_E}[\log(1 - D(s,a))]$, where $\mathbb{E}_\pi[\cdot]$ represents the discounted return expectation of trajectories generated by the policy $\pi$ and $\mathbb{E}_{\pi_E}[\cdot]$ represents the discounted return expectation of the demonstration trajectories.
- 7. The model-based robot learning method for generative adversarial interactive imitation learning of claim 1, wherein MAILDH further comprises pre-training a human reward model before training begins, the samples used to update the human reward model being derived partly from state-action pairs in the expert demonstrations and partly from randomly sampled state-action pairs.
- 8. The model-based robot learning method for generative adversarial interactive imitation learning of claim 1, wherein the method further comprises a generative adversarial imitation learning part, a forward dynamics model part, and a human reward model part; wherein the generative adversarial imitation learning part seeks the saddle point $(\pi, D)$ of the expression $\mathbb{E}_\pi[\log D(s,a)] + \mathbb{E}_{\pi_E}[\log(1 - D(s,a))] - \lambda H(\pi)$, in which $H(\pi) \triangleq \mathbb{E}_\pi[-\log \pi(a \mid s)]$ is the $\gamma$-discounted causal entropy of the policy $\pi$, $\lambda$ is the weight of the entropy $H(\pi)$, $\mathbb{E}_\pi[\cdot]$ represents the discounted return expectation of trajectories generated by the policy $\pi$, and $\mathbb{E}_{\pi_E}[\cdot]$ represents the discounted return expectation of trajectories generated by the expert policy $\pi_E$; the forward dynamics model part learns a forward dynamics model from real samples of the robot interacting with the environment and, through virtual interaction between the robot and the forward dynamics model, generates simulated samples used to update the discriminator and the generator.
- 9. The model-based robot learning method for generative adversarial interactive imitation learning of claim 8, wherein the human reward model part updates the human reward model $\hat{R}_H$ by minimizing the loss $L = \frac{1}{N} \sum_{i=1}^{N} (\hat{R}_H(s_i, a_i) - r_i^h)^2$, where $L$ represents the loss function, $\hat{R}_H$ is the estimated human reward model, $N$ is the number of samples, and $r^h$ is the manually assigned reward used as the label.
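The following Python sketch illustrates the discriminator update, combined reward, and human reward loss described in claims 1, 5, and 9. It is an interpretation only: the patent publishes no code, the stripped formulas above were reconstructed from context, and all class and function names (Discriminator, maildh_reward, human_reward_loss) are hypothetical.

```python
# Hedged sketch of the MAILDH components in claims 1, 5 and 9 (PyTorch).
# Sign conventions and the cost/reward combination rule are assumptions.
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """GAIL-style discriminator D_w(s, a) -> probability in (0, 1)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def discriminator_loss(D, expert_sa, generated_sa):
    """Claim 5: binary logistic loss separating expert pairs from
    generated/simulated pairs (which side D pushes toward 1 is assumed)."""
    (s_e, a_e), (s_g, a_g) = expert_sa, generated_sa
    eps = 1e-8
    return -(torch.log(D(s_g, a_g) + eps).mean()
             + torch.log(1.0 - D(s_e, a_e) + eps).mean())

def maildh_reward(D, r_hat, s, a):
    """Claim 5: fold the GAIL cost c(s,a) = log D(s,a) and the predicted
    human reward into one signal for the PPO update (combination assumed)."""
    with torch.no_grad():
        return -torch.log(D(s, a) + 1e-8) + r_hat(s, a)

def human_reward_loss(r_hat, s, a, r_h):
    """Claim 9: mean squared error against the human-assigned labels r^h."""
    return nn.functional.mse_loss(r_hat(s, a), r_h)
```

In each iteration the robot would collect a real trajectory with the current policy and a simulated one from the dynamics model, route human feedback (when present) to human_reward_loss, otherwise update the discriminator and dynamics model, and finally run a PPO step on maildh_reward, matching the branch structure of claim 1.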
Description
Model-based robot learning method for generative adversarial interactive imitation learning
Technical Field
The invention relates to the technical field of artificial intelligence and robot learning, in particular to model-based reinforcement learning, generative adversarial imitation learning, and interactive reinforcement learning, which improve sample utilization and robot performance by building a model and exploiting the evaluations of human trainers, so that a better policy is learned more quickly; the invention particularly relates to a model-based robot learning method for generative adversarial interactive imitation learning.
Background
Reinforcement learning (RL) lets a robot learn an optimal policy through trial-and-error interaction with the real world. With the development of deep neural networks, deep reinforcement learning (DRL) has achieved great success in many simulated tasks, such as games and mechanical manipulation. However, it is difficult, or even impractical, to design an effective reward function for each task, especially in large and high-dimensional environments, which makes conventional DRL methods impractical in the real world. In the real world, demonstrations are easier to provide than well-designed reward functions; with this idea, imitation learning was proposed. The robot may learn to perform tasks from expert demonstrations consisting of state-action pairs. Behavior cloning (BC) is the simplest imitation learning method: it maps states to optimal actions by supervised learning. However, BC requires a large amount of data, cannot generalize to unseen states, and is typically used only for policy initialization in reinforcement learning. Another imitation learning method is inverse reinforcement learning, which learns the policy by reinforcement learning using a cost function extracted from expert trajectories. However, many proposed inverse reinforcement learning algorithms require solving a series of planning or reinforcement learning problems in an inner loop, which consumes a large amount of run time and limits their use in large and complex tasks; furthermore, if the planning problem is not solved optimally, the robot's performance may degrade significantly. To address this, generative adversarial imitation learning (GAIL), a general model-free imitation learning method, was proposed; it allows robots to learn policies directly from expert trajectories in large environments and extends inverse reinforcement learning to such environments. However, as a model-free approach, GAIL has low sample efficiency with respect to environment interactions, although it requires no pre-specified reward function. To improve sample efficiency, model-based reinforcement learning was proposed. Model-based reinforcement learning is better suited to real life than model-free methods: it can use an existing or learned dynamics model to generate simulated data that accelerates policy learning. However, model bias exists in model-based reinforcement learning, which can adversely affect the learning process, slowing it down or even degrading the learning effect. Furthermore, GAIL's performance is constrained by demonstration quality, while interactive reinforcement learning (Interactive RL) has been shown to be generally capable of exceeding the performance of the trainer.
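The background above notes that model-based reinforcement learning can use a learned dynamics model to generate simulated data that accelerates policy learning. The short sketch below shows such virtual interaction, reusing the illustrative ForwardDynamicsModel from the earlier sketch; the horizon parameter and policy interface are assumptions, and short rollouts are used because the model bias discussed above compounds over long horizons.

```python
# Illustrative Dyna-style rollout: generate simulated transitions by
# virtual interaction between the policy and the learned dynamics model.
import torch

def rollout_model(dyn_model, policy, start_states, horizon: int = 5):
    """Return simulated (s, a, s') triples; keep the horizon short to
    limit the effect of one-step model error accumulating."""
    transitions = []
    s = start_states
    for _ in range(horizon):
        with torch.no_grad():
            a = policy(s)             # assumed: policy maps a state batch to actions
            s_next = dyn_model(s, a)  # one-step model prediction
        transitions.append((s, a, s_next))
        s = s_next
    return transitions
```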
Building on the reward-shaping methods of traditional reinforcement learning, interactive reinforcement learning can solve the problem of low sampling efficiency in traditional and deep reinforcement learning, and even allows non-experts to train the robot by evaluating its behavior and providing feedback, improving the robot's learning effect. Therefore, if model-based reinforcement learning, interactive reinforcement learning, and GAIL are combined, sample utilization can be improved and the method can scale to large, complex tasks. Moreover, by exploiting the assumed complementary effect of demonstrations and human evaluative feedback, policy stability is improved and a better policy, close to or even surpassing the demonstrations, is obtained. From the above analysis, the problems and defects of the prior art are as follows: (1) Most existing inverse reinforcement learning algorithms need to use model information in an inner loop, consuming a large amount of run time and limiting their application in large-scale complex tasks; if the planning problem does not yield an optimal solution, the algorithm's performance degrades greatly. (2) Traditional reinforcement learning, deep reinforcement learning, and GAIL have low sampling efficiency; a more desirable policy can be learned only through many interactions with the environment.