CN-116341777-B - Multi-robot collaborative trapping method based on maximum entropy reinforcement learning
Abstract
The invention provides a multi-robot collaborative trapping method based on maximum entropy reinforcement learning. The method comprises the steps of establishing a pursuit-evasion countermeasure scene; designing a MASAC algorithm suitable for multiple robots, a multi-agent reinforcement learning algorithm based on SAC and the established scene; designing a reward function for the multi-robot trapping strategy based on curriculum learning; and combining the designed reward function with the MASAC algorithm to obtain the MASAC-based multi-robot collaborative trapping strategy algorithm. The method adopts a multi-agent reinforcement learning algorithm based on maximum entropy to realize collaborative trapping of a single target by multiple mobile robots in a two-dimensional scene, divides the trapping process into four stages of searching, trapping, transition and capturing, adopts the concept of curriculum learning to design a reward function and a transition condition for each stage, verifies the effectiveness of the deep reinforcement learning method for trapping a single, rapidly moving escape robot with multiple pursuit robots, and improves trapping efficiency.
Inventors
- LIU ZHONGCHANG
- DAI BING
- LIU TIANHE
- YUE WEI
Assignees
- 大连海事大学 (Dalian Maritime University)
Dates
- Publication Date: 2026-05-08
- Application Date: 2023-02-16
Claims (5)
- 1. A multi-robot collaborative trapping method based on maximum entropy reinforcement learning, characterized by comprising the following steps: establishing a pursuit-evasion countermeasure scene; designing a MASAC algorithm suitable for multiple robots based on the established scene and the SAC multi-agent reinforcement learning algorithm, comprising: extending the reinforcement learning SAC algorithm to a MASAC algorithm suitable for multiple robots using the framework of centralized training and decentralized execution, specifically: the experience pool of MASAC is designed as $D = \{(o_t, a_t, r_t, o_{t+1})\}$, wherein $o_t$ represents the set of observations of all robots at time $t$, $a_t$ represents the set of actions of all robots at time $t$, $r_t$ represents the reward obtained after all robots have performed their respective actions at time $t$, and $o_{t+1}$ represents the set of observations of all robots at time $t+1$; designing one Actor network and two Critic networks for each robot using the basic Actor-Critic network framework, wherein the Actor network is used for learning a motion strategy, namely determining the motion direction and acceleration of the next step according to the current position and velocity of the robot, and the Critic networks are used for judging the quality of the learned strategy, namely judging, from the current state (comprising position and velocity) and the strategy adopted by the robot, the quality of the motion strategy adopted in the current state; in SAC, entropy regularization is introduced to maximize the expectation of the objective function, whose value function is
$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[\gamma^t\left(r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t))\right)\right],$$
wherein $\gamma$ is the discount factor, $\alpha$ is the temperature parameter, and $\mathcal{H}(\pi(\cdot \mid s_t))$ is the entropy of the strategy $\pi$; in MASAC, assuming that the policies of the robots are mutually independent, the entropy of the joint policy is
$$\mathcal{H}(\pi(\cdot \mid o_t)) = \sum_{i=1}^{N} \mathcal{H}(\pi^i(\cdot \mid o_t^i));$$
in the strategy evaluation stage, the Q value function is updated based on the Bellman optimality equation, and the objective function to be learned is
$$J_Q(\theta) = \mathbb{E}_{(o_t, a_t, r_t, o_{t+1}) \sim D}\left[\tfrac{1}{2}\left(Q_\theta(o_t, a_t) - \left(r_t + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi_\phi}\left[Q_{\bar{\theta}}(o_{t+1}, a_{t+1}) - \alpha \log \pi_\phi(a_{t+1} \mid o_{t+1})\right]\right)\right)^2\right],$$
wherein $\theta$ is a parameter of the Critic network, $\bar{\theta}$ is a parameter of the target Critic network, $\phi$ is an Actor network parameter, and $D$ is the empirical data, namely the data in the experience replay pool; according to this objective function, the Critic network parameters are updated by stochastic gradient descent:
$$\theta \leftarrow \theta - \lambda_Q \nabla_\theta J_Q(\theta);$$
in the strategy improvement stage, the Actor network parameters are updated by stochastic gradient ascent, with the objective function
$$J_\pi(\phi) = \mathbb{E}_{o_t \sim D,\, a_t \sim \pi_\phi}\left[\alpha \log \pi_\phi(a_t \mid o_t) - Q_\theta(o_t, a_t)\right];$$
according to the objective function defined above, its gradient is
$$\nabla_\phi J_\pi(\phi) = \nabla_\phi\, \alpha \log \pi_\phi(a_t \mid o_t) + \left(\nabla_{a_t} \alpha \log \pi_\phi(a_t \mid o_t) - \nabla_{a_t} Q_\theta(o_t, a_t)\right)\nabla_\phi f_\phi(\epsilon_t; o_t),$$
wherein $a_t = f_\phi(\epsilon_t; o_t)$ is the reparameterized action; the temperature parameter is updated as
$$\alpha \leftarrow \alpha - \lambda_\alpha \nabla_\alpha\, \mathbb{E}_{a_t \sim \pi_\phi}\left[-\alpha \log \pi_\phi(a_t \mid o_t) - \alpha \bar{\mathcal{H}}\right],$$
wherein $\bar{\mathcal{H}}$ is the target entropy; and finally, the target network parameters are updated by the moving average method, which is used to ensure the stability of the algorithm:
$$\bar{\theta} \leftarrow \tau \theta + (1 - \tau)\bar{\theta};$$
designing the reward function of the multi-robot trapping strategy based on curriculum learning, comprising: designing the overall reward function: the overall reward function of each pursuit robot $i$ at the $t$-th time step is expressed as
$$r_t^i = r_{\mathrm{cap}}^i + r_{\mathrm{col}}^i + r_{\mathrm{bound}}^i,$$
wherein $r_{\mathrm{cap}}^i$ is the trapping reward, $r_{\mathrm{col}}^i$ is the inter-robot collision avoidance reward, and $r_{\mathrm{bound}}^i$ is the scene-boundary collision avoidance reward used to guide the robot to avoid coming too close to the boundary of the motion scene; designing the trapping reward function $r_{\mathrm{cap}}^i$ as follows: using the idea of curriculum learning, the cooperative trapping process of the multiple robots is divided into four states, namely searching, trapping, transition and capturing, each state corresponding to a subtask that is easier to complete, the subtasks being completed in sequence until the final capture task is completed, and the four subtasks corresponding to four reward functions $r_1$, $r_2$, $r_3$ and $r_4$ respectively; first let $r_{\mathrm{cap}}^i$ equal $r_1$ so that the pursuit robots learn how to form a formation surrounding the escape robot, then let $r_{\mathrm{cap}}^i$ equal $r_2$ so that the pursuit robots surround the escape robot and satisfy the trapping condition, then let $r_{\mathrm{cap}}^i$ equal $r_3$ so that the pursuit robots learn to shrink the encirclement, and finally let $r_{\mathrm{cap}}^i$ equal $r_4$ so that the pursuit robots move toward the escape robot until the capture task is completed; designing the inter-robot collision avoidance reward function $r_{\mathrm{col}}^i$, defined as
$$r_{\mathrm{col}}^i = \begin{cases} c, & d_{ij} < d_{\mathrm{safe}} \\ 0, & \text{otherwise,} \end{cases}$$
wherein $d_{ij}$ is the distance between robots $i$ and $j$, $d_{\mathrm{safe}}$ is the safe distance, and $c$ is a negative constant; designing the scene-boundary collision avoidance reward function $r_{\mathrm{bound}}^i$, defined analogously as a negative penalty whenever the distance from robot $i$ to the scene boundary falls below the safe distance, and 0 otherwise; and combining the designed reward function with the MASAC algorithm to obtain the MASAC-based multi-robot collaborative trapping strategy algorithm (an illustrative sketch of these updates follows the claims).
- 2. The multi-robot collaborative trapping method based on maximum entropy reinforcement learning according to claim 1, wherein the established pursuit-evasion countermeasure scene comprises N pursuit robots P and a single escape robot E.
- 3. The multi-robot collaborative trapping method based on maximum entropy reinforcement learning according to claim 1, wherein the input of the Actor network of each robot is its own state information, the state information comprising position and velocity, and the output is the strategy adopted according to the current state; the input of the Critic network of each robot is the state information of all robots and the actions executed, and the output is the state-action value function, namely the Q value; the Actor network and the Critic network are each composed of three fully connected layers, the number of neurons in the hidden layers is 64, the activation function of the first two layers is the Rectified Linear Unit (ReLU) function, and the last layer uses no activation function.
- 4. The multi-robot collaborative trapping method based on maximum entropy reinforcement learning according to claim 1, wherein, for the designed trapping reward function $r_{\mathrm{cap}}^i$, the conditions that each specific state should satisfy and the corresponding reward functions are defined as follows (see the sketch after the claims):
Searching state: when the escape robot is not surrounded by the pursuit robots, it is in the searching state, satisfying
$$\sum_{i=1}^{N} S_{\triangle P_i P_{i+1} E} > S_{P_1 P_2 \cdots P_N},$$
wherein $S$ with a subscript represents the area of the region enclosed by the points in the subscript and $P_{N+1} = P_1$; the reward function corresponding to the searching state is
$$r_1 = S_{P_1 P_2 \cdots P_N} - \rho \sum_{i=1}^{N} d_i,$$
wherein $d_i$ represents the distance from the $i$-th pursuit robot to the escape robot and $\rho \sum_i d_i$ is a regularization term: to trap the escape robot in the searching stage, the pursuit robots are encouraged to enlarge the area enclosing the escape robot, while the regularization term penalizes letting the escape robot move far away from the pursuit robots;
Trapping state: the trapping state is entered when the escape robot is surrounded, i.e.
$$\sum_{i=1}^{N} S_{\triangle P_i P_{i+1} E} = S_{P_1 P_2 \cdots P_N},$$
and the distance $d_{i,i+1}$ between each pair of adjacent pursuit robots $P_i$ and $P_{i+1}$ is small enough, relative to the maximum speeds $v_{P\max}$ and $v_{E\max}$ of the pursuit robots and the escape robot respectively, that the escape robot cannot slip through the gap; during trapping the pursuit robots can take a further strategy so as to enter the fourth state, namely the capturing state, after at most K steps; the reward $r_2$ corresponding to the trapping state is positive only when the trapping condition is met, and the earlier the condition is met the larger the reward, leaving the robots more time to shrink the encirclement;
Transition state: the transition state describes the transition from the trapping state to the capturing state, satisfying
$$d_i(t + \Delta t) < d_i(t),$$
wherein $\Delta t$ is one time step; this subtask aims at adjusting the distances between any two adjacent pursuit robots to be equal while shrinking the encirclement until the capturing state is reached, with the corresponding reward function
$$r_3 = -\omega_1 \sum_{i=1}^{N} d_i - \omega_2 \sum_{i=1}^{N} \left| d_{i,i+1} - \bar{d} \right|,$$
wherein the term $-\omega_1 \sum_i d_i$ is used to narrow the encirclement, and the term $-\omega_2 \sum_i |d_{i,i+1} - \bar{d}|$, with $\bar{d}$ the mean adjacent distance, drives the adjacent pursuit robots toward equidistance;
Capturing state: the capturing state is the final state in which all pursuit robots have successfully enclosed the escape robot; no matter what action the escape robot selects, as long as the pursuit robots move toward the escape robot while the capturing condition holds, they can capture it; the capturing state satisfies
$$d_i \le d_{\mathrm{cap}}, \quad i = 1, \ldots, N,$$
wherein $d_{\mathrm{cap}}$ is the capture distance, and the corresponding capturing reward function is
$$r_4 = -\sum_{i=1}^{N} d_i,$$
which means that the closer a pursuit robot approaches the escape robot, the larger the reward it obtains.
- 5. The multi-robot collaborative trapping method based on maximum entropy reinforcement learning according to claim 1, wherein combining the designed reward function with the MASAC algorithm to obtain the MASAC-based multi-robot collaborative trapping strategy algorithm specifically comprises: initializing the 2 Critic networks $Q_{\theta_1}$ and $Q_{\theta_2}$ and the corresponding Critic network parameters $\theta_1$ and $\theta_2$, and the Actor network $\pi_\phi$ and the corresponding Actor network parameter $\phi$; initializing the 2 target Critic network parameters $\bar{\theta}_1 \leftarrow \theta_1$ and $\bar{\theta}_2 \leftarrow \theta_2$; initializing the experience replay pool D; initializing the states of the robots; selecting an action $a_t^i$ for each robot $i$ based on exploration noise and obtaining the reward $r_t$; storing the sample $(o_t, a_t, r_t, o_{t+1})$ in the experience pool D and randomly sampling M samples from the experience pool; updating the Q value function; updating the parameters of the Critic networks, $\theta_j \leftarrow \theta_j - \lambda_Q \nabla_{\theta_j} J_Q(\theta_j)$; updating the parameters of the Actor network, $\phi \leftarrow \phi - \lambda_\pi \nabla_\phi J_\pi(\phi)$; updating the temperature coefficient, $\alpha \leftarrow \alpha - \lambda_\alpha \nabla_\alpha J(\alpha)$; and updating the target network parameters by the moving average method, $\bar{\theta}_j \leftarrow \tau \theta_j + (1 - \tau)\bar{\theta}_j$.
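To make the update rules in claims 1 and 5 and the network shapes in claim 3 concrete, the following is a minimal PyTorch sketch, not the patented implementation: it feeds a single robot's observation where MASAC's centralized Critics would take the joint observation-action of all robots, and every dimension, learning rate, and class name is an illustrative assumption.

```python
# Minimal sketch of one MASAC update for a single robot, assuming a Gaussian
# policy over 2-D accelerations. Names and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_DIM, ACT_DIM, HIDDEN = 8, 2, 64   # assumed sizes; claim 3 fixes HIDDEN = 64

def mlp(in_dim, out_dim):
    # Three fully connected layers, ReLU on the first two, no final activation
    # (claim 3).
    return nn.Sequential(nn.Linear(in_dim, HIDDEN), nn.ReLU(),
                         nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
                         nn.Linear(HIDDEN, out_dim))

class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = mlp(OBS_DIM, 2 * ACT_DIM)     # outputs mean and log-std
    def forward(self, obs):
        mean, log_std = self.net(obs).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mean, log_std.clamp(-20, 2).exp())
        raw = dist.rsample()                      # reparameterized sample
        act = torch.tanh(raw)                     # squash into action bounds
        logp = dist.log_prob(raw).sum(-1) \
             - torch.log(1 - act.pow(2) + 1e-6).sum(-1)  # tanh correction
        return act, logp

# Two Critics and two target Critics per robot (claim 1). Under centralized
# training the Critic input would be the joint (obs, act) of all robots; a
# single robot's pair is used here for brevity.
actor = Actor()
critics = [mlp(OBS_DIM + ACT_DIM, 1) for _ in range(2)]
targets = [mlp(OBS_DIM + ACT_DIM, 1) for _ in range(2)]
for t, c in zip(targets, critics):
    t.load_state_dict(c.state_dict())

log_alpha = torch.zeros(1, requires_grad=True)    # temperature alpha
target_entropy = -float(ACT_DIM)                  # common SAC heuristic
opt_q = torch.optim.Adam([p for c in critics for p in c.parameters()], lr=3e-4)
opt_pi = torch.optim.Adam(actor.parameters(), lr=3e-4)
opt_a = torch.optim.Adam([log_alpha], lr=3e-4)
GAMMA, TAU = 0.99, 0.005

def update(obs, act, rew, obs2):
    alpha = log_alpha.exp().detach()
    # --- Critic update (strategy evaluation: Bellman target with entropy) ---
    with torch.no_grad():
        a2, logp2 = actor(obs2)
        q2 = torch.min(*[t(torch.cat([obs2, a2], -1)) for t in targets]).squeeze(-1)
        y = rew + GAMMA * (q2 - alpha * logp2)
    q_loss = sum(F.mse_loss(c(torch.cat([obs, act], -1)).squeeze(-1), y)
                 for c in critics)
    opt_q.zero_grad(); q_loss.backward(); opt_q.step()
    # --- Actor update (strategy improvement: gradient ascent on J_pi) ---
    a, logp = actor(obs)
    q = torch.min(*[c(torch.cat([obs, a], -1)) for c in critics]).squeeze(-1)
    pi_loss = (alpha * logp - q).mean()
    opt_pi.zero_grad(); pi_loss.backward(); opt_pi.step()
    # --- Temperature update toward the target entropy ---
    a_loss = -(log_alpha * (logp.detach() + target_entropy)).mean()
    opt_a.zero_grad(); a_loss.backward(); opt_a.step()
    # --- Soft (moving-average) target update for stability ---
    with torch.no_grad():
        for t, c in zip(targets, critics):
            for tp, cp in zip(t.parameters(), c.parameters()):
                tp.mul_(1 - TAU).add_(TAU * cp)
```

In use, `update` would be called once per robot on minibatches of M transitions drawn from the shared experience pool, matching the loop in claim 5.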
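Claim 4's staged reward admits a similar sketch. Where the claim's exact formulas could not be recovered from the text (the gap threshold, stage weights, capture radius), the constants and stage tests below are stated assumptions; only the encirclement test follows the area condition given in claim 4.

```python
# Sketch of the four-stage curriculum reward of claim 4 plus the collision
# penalty of claim 1. All thresholds and weights are illustrative guesses.
import numpy as np

D_SAFE, D_CAP, GAP_MAX = 0.15, 0.2, 1.0   # assumed safe/capture/gap distances
C_COLLIDE = -5.0                          # the negative constant c of claim 1

def tri_area(a, b, c):
    u, v = b - a, c - a
    return 0.5 * abs(u[0] * v[1] - u[1] * v[0])

def poly_area(pts):
    # Shoelace formula over pursuer positions taken in order.
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def surrounded(pursuers, evader):
    # Claim 4's area test, assuming pursuers are indexed in angular order:
    # E lies inside the pursuer polygon iff the triangle areas P_i P_{i+1} E
    # sum exactly to the polygon area.
    s = sum(tri_area(pursuers[i], pursuers[(i + 1) % len(pursuers)], evader)
            for i in range(len(pursuers)))
    return np.isclose(s, poly_area(pursuers))

def capture_reward(pursuers, evader):
    d = np.linalg.norm(pursuers - evader, axis=1)
    gaps = np.linalg.norm(pursuers - np.roll(pursuers, -1, axis=0), axis=1)
    if not surrounded(pursuers, evader):                 # 1) searching
        return poly_area(np.vstack([pursuers, evader])) - 0.1 * d.sum()
    if np.all(d <= D_CAP):                               # 4) capturing
        return -d.sum()
    if np.all(gaps <= GAP_MAX):                          # 3) transition
        return -d.sum() - 0.5 * np.abs(gaps - gaps.mean()).sum()
    return 1.0   # 2) trapping; claim 4 scales this with remaining time

def collision_penalty(pursuers):
    # Claim 1: a negative constant whenever two robots come within D_SAFE.
    r = np.zeros(len(pursuers))
    for i in range(len(pursuers)):
        for j in range(len(pursuers)):
            if i != j and np.linalg.norm(pursuers[i] - pursuers[j]) < D_SAFE:
                r[i] += C_COLLIDE
    return r
```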
Description
Multi-robot collaborative trapping method based on maximum entropy reinforcement learning
Technical Field
The invention relates to the technical field of multi-robot collaborative path planning, and in particular to a multi-robot collaborative trapping method based on maximum entropy reinforcement learning.
Background
The multi-robot cooperative trapping strategy has important application value in scenes such as military countermeasures and autonomous search and rescue, and has long been a research hotspot. Most current research on the multi-robot trapping problem is based on classical control theory, designing the trapping strategy by manual tuning or optimization according to a mathematical model of the robot; this approach ignores the difficulty of establishing an accurate mathematical model of a robot in real life and therefore has certain limitations. In recent years, deep reinforcement learning methods from the artificial intelligence field, which combine the perception capability of deep learning with the decision capability of reinforcement learning, can learn a control strategy directly from high-dimensional raw data and have strong generality, so they have been widely studied for the problem of cooperative capture by multiple robots. The maximum-entropy-based reinforcement learning method SAC has stronger exploration capability and faster training speed than the deterministic policy gradient method DDPG, and is more advantageous for complex tasks. In addition, existing deep reinforcement learning methods for the multi-robot trapping problem lack a detailed design of the trapping reward, so the convergence speed of the algorithm is low and the final trapping success rate is not high enough.
Disclosure of Invention
In view of the above technical problems, a multi-robot collaborative trapping algorithm based on deep reinforcement learning is provided: a multi-agent reinforcement learning algorithm based on the maximum entropy method SAC is designed, together with a reward function for the multi-robot trapping strategy based on a curriculum learning mechanism, realizing effective trapping of the escape robot by multiple pursuit robots and improving the convergence speed and trapping success rate of the algorithm. The invention adopts the following technical means: A multi-robot collaborative trapping method based on maximum entropy reinforcement learning comprises the following steps: establishing a pursuit-evasion countermeasure scene; designing a MASAC algorithm suitable for multiple robots based on the established scene and the SAC multi-agent reinforcement learning algorithm; designing a reward function for the multi-robot trapping strategy based on curriculum learning; and combining the designed reward function with the MASAC algorithm to obtain the MASAC-based multi-robot collaborative trapping strategy algorithm. Further, the established pursuit-evasion countermeasure scene includes N pursuit robots P and a single escape robot E; a minimal sketch of such a scene follows this paragraph.
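As a concrete reading of the scene just described (N pursuit robots P and a single escape robot E on a bounded two-dimensional plane), the following sketch shows one possible setup; the double-integrator dynamics, scene size, and speed caps are assumptions, since the description does not fix them here.

```python
# Minimal 2-D pursuit-evasion scene sketch: N pursuers P and one evader E.
# Dynamics, bounds, and speed limits are illustrative assumptions.
import numpy as np

class PursuitScene:
    def __init__(self, n_pursuers=3, size=2.0, dt=0.1,
                 v_max_p=1.0, v_max_e=1.3, rng=None):
        self.n, self.size, self.dt = n_pursuers, size, dt
        self.v_max = np.array([v_max_p] * n_pursuers + [v_max_e])
        self.rng = rng or np.random.default_rng()
        # rows 0..n-1: pursuers, row n: evader; columns: x, y
        self.pos = self.rng.uniform(-size, size, (n_pursuers + 1, 2))
        self.vel = np.zeros((n_pursuers + 1, 2))

    def step(self, accel):
        # accel: (n+1, 2) accelerations chosen by the Actor networks.
        self.vel += self.dt * accel
        speed = np.maximum(np.linalg.norm(self.vel, axis=1, keepdims=True), 1e-9)
        cap = self.v_max[:, None]
        self.vel = np.where(speed > cap, self.vel / speed * cap, self.vel)
        self.pos = np.clip(self.pos + self.dt * self.vel, -self.size, self.size)

    def observe(self, i):
        # Each robot observes its own position and velocity (claim 3).
        return np.concatenate([self.pos[i], self.vel[i]])
```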
Further, designing the MASAC algorithm suitable for multiple robots based on the established pursuit-evasion scene and the SAC multi-agent reinforcement learning algorithm comprises the following steps: the reinforcement learning SAC algorithm is extended to a MASAC algorithm suitable for multiple robots by using the framework of centralized training and decentralized execution, specifically: the experience pool of MASAC is designed as $D = \{(o_t, a_t, r_t, o_{t+1})\}$, wherein $o_t$ represents the set of observations of all robots at time $t$, $a_t$ represents the set of actions of all robots at time $t$, $r_t$ represents the reward obtained after all robots have performed their respective actions at time $t$, and $o_{t+1}$ represents the set of observations of all robots at time $t+1$; designing one Actor network and two Critic networks for each robot using the basic Actor-Critic network framework, wherein the Actor network is used for learning a motion strategy, namely determining the motion direction and acceleration of the next step according to the current position and velocity of the robot, and the Critic networks are used for judging the quality of the learned strategy, namely judging, from the current state (including position and velocity) and the strategy adopted by the robot, the quality of the adopted motion strategy; in SAC, entropy regularization is introduced to maximize the expectation of the objective function, whose value function is
$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[\gamma^t\left(r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t))\right)\right],$$
wherein $\gamma$ is the discount factor, $\alpha$ is the temperature parameter, and $\mathcal{H}(\pi(\cdot \mid s_t))$ is the entropy of the strategy $\pi$; in MASAC, the strategies of the robots are assumed to be mutually independent, and the entropy of the joint strategy is
$$\mathcal{H}(\pi(\cdot \mid o_t)) = \sum_{i=1}^{N} \mathcal{H}(\pi^i(\cdot \mid o_t^i));$$
in the strategy evaluation stage, the Q value function is updated based on the Bellman optimality equation, and the objective function to be learned is
$$J_Q(\theta) = \mathbb{E}_{(o_t, a_t, r_t, o_{t+1}) \sim D}\left[\tfrac{1}{2}\left(Q_\theta(o_t, a_t) - \left(r_t + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi_\phi}\left[Q_{\bar{\theta}}(o_{t+1}, a_{t+1}) - \alpha \log \pi_\phi(a_{t+1} \mid o_{t+1})\right]\right)\right)^2\right],$$
wherein $\theta$ is a Critic network parameter
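The experience pool $D = \{(o_t, a_t, r_t, o_{t+1})\}$ described above can be sketched as a fixed-capacity ring buffer over the joint observations and actions of all robots; the capacity and array layout below are assumptions.

```python
# Sketch of the MASAC experience pool D = {(o_t, a_t, r_t, o_{t+1})}, storing
# the joint observations and actions of all robots. Capacity is an assumption.
import numpy as np

class ExperiencePool:
    def __init__(self, obs_dim, act_dim, n_robots, capacity=100_000):
        self.o = np.zeros((capacity, n_robots, obs_dim), dtype=np.float32)
        self.a = np.zeros((capacity, n_robots, act_dim), dtype=np.float32)
        self.r = np.zeros((capacity, n_robots), dtype=np.float32)
        self.o2 = np.zeros((capacity, n_robots, obs_dim), dtype=np.float32)
        self.idx, self.full, self.cap = 0, False, capacity

    def add(self, o, a, r, o2):
        self.o[self.idx], self.a[self.idx] = o, a
        self.r[self.idx], self.o2[self.idx] = r, o2
        self.idx = (self.idx + 1) % self.cap
        self.full = self.full or self.idx == 0

    def sample(self, m, rng=np.random):
        # Randomly draw M transitions for the centralized Critic update.
        hi = self.cap if self.full else self.idx
        j = rng.choice(hi, size=m, replace=False)
        return self.o[j], self.a[j], self.r[j], self.o2[j]
```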