CN-115755593-B - Cooperative multi-agent control method and device based on value function supervision
Abstract
The invention discloses a cooperative multi-agent control method based on value function supervision, comprising the steps of: initializing a simulation environment; loading the model parameters of a formation-level controller; loading the model parameters of a platform-level controller, with each platform sharing the same platform-level controller model parameters; selecting a formation-level decision action instruction using the formation-level controller; selecting a platform-level decision action instruction using the platform-level controller under the supervision of the formation-level decision action instruction; and executing the joint action instruction of the multiple agents in the simulation environment, repeating these steps to complete the control of the multiple agents. By means of the coordinated control of the formation level and the upper- and lower-level instructions of the formation level and the platform level, the invention guides the decision of each platform-level agent and improves the stability of performance.
Inventors
- WANG Zhenjie
- LIU Juntao
- GAO Ziwen
- WANG Yuanbin
- JIN Lican
Assignees
- The 709th Research Institute of China State Shipbuilding Corporation Limited (中国船舶集团有限公司第七〇九研究所)
Dates
- Publication Date: 2026-05-12
- Application Date: 2022-09-28
Claims (8)
- 1. A cooperative multi-agent control method based on value function supervision, characterized by comprising the following steps: S1, initializing a simulation environment, loading the model parameters of a formation-level controller, loading the model parameters of a platform-level controller, and having each platform share the same platform-level controller model parameters; S2, selecting a formation-level decision action instruction using the formation-level controller; S3, selecting a platform-level decision action instruction using the platform-level controller under the supervision of the formation-level decision action instruction; wherein, after the platform-level controller has been trained, the policy network of the formation-level controller in step S2 is trained with an AC (Actor-Critic) algorithm and its network parameters are updated; the training method for the policy network of the formation-level controller comprises the following steps (an illustrative code sketch follows the claims): step 2.1, initializing the simulation environment, initializing the parameters θ of the policy network π of the formation-level controller, initializing the evaluation network and its parameters w, and loading the parameters φ of the trained platform-level controller value function network; step 2.2, at each time step t, obtaining the formation-level state vector s_t, inputting it into the policy network of the formation-level controller, and outputting the formation-level decision action instruction u_t; step 2.3, at each time step, for each agent i, obtaining the local observation state o_t^i observed by agent i and inputting it into the platform-level controller value function network to obtain the local state-action value function Q_i of the platform-level agent; step 2.4, calculating the supervised value vector Q'_i according to the formation-level decision result u_t, and selecting, according to a greedy strategy, the action with the maximum value in Q'_i as the action instruction a_t^i of platform-level agent i; step 2.5, combining the actions generated by all the agents into the joint action a_t = (a_t^1, …, a_t^n), inputting it into the environment for execution, and returning the reward r; repeating steps 2.2 to 2.5 at each time step and updating the parameters w of the evaluation network until the current simulation episode ends, where s_{t+1} is the formation-level state vector at the next time; step 2.6, updating the parameters θ of the policy network of the formation-level controller, where α is the learning rate; step 2.7, ending the learning process after convergence or when the maximum number of iterations is reached, otherwise returning to step 2.2; S4, the simulation environment executing the joint action instruction a_t = (a_t^1, …, a_t^n) of the multiple agents, where n is the number of platform-level agents, and repeating steps S2 to S4 to complete the control of the multiple agents.
- 2. The cooperative multi-agent control method based on value function supervision according to claim 1, wherein step S2 specifically comprises: S21, at each time step t, obtaining the formation-level state vector s_t; S22, inputting the formation-level state vector s_t into the formation-level controller and outputting the formation-level decision action instruction, namely the platform-level agent value function supervision vector u_t = π(s_t; θ), where π is the policy function of the formation-level controller, θ is its parameter, and K is the number of actions executable by the platform-level agent.
- 3. The cooperative multi-agent control method based on value function supervision according to claim 1 or 2, wherein step S3 specifically comprises: S31, at each time step, for each platform-level agent i, obtaining the local observation state o_t^i observed by agent i; S32, inputting the local observation state o_t^i of each platform-level agent i into the platform-level controller and outputting the local state-action value vector Q_i = [Q(o_t^i, a_1), …, Q(o_t^i, a_K)] of the platform-level agent, where a_k is an action executable by the platform-level agent, K is the number of actions executable by the platform-level agent, Q(o_t^i, a_k) is the state-action value of selecting action a_k in state o_t^i, Q is the value function of the platform-level controller, and φ is its parameter; S33, using the formation-level decision action instruction u_t to update the local state-action value vector of the platform-level agent, obtaining Q'_i; S34, selecting, according to a greedy strategy, the action with the maximum value in Q'_i as the action instruction a_t^i of platform-level agent i (see the first sketch after the claims).
- 4. The cooperative multi-agent control method based on value function supervision according to claim 1 or 2, wherein the value function network of the platform-level controller in step S3 is trained with the existing multi-agent reinforcement learning method VDN (Value Decomposition Networks) to update the network parameters.
- 5. The cooperative multi-agent control method based on value function supervision according to claim 4, wherein the training method of the value function network of the platform-level controller specifically comprises: step 1.1, initializing the simulation environment and initializing the parameters φ of the platform-level controller value function network; step 1.2, at each time step, for each platform-level agent i, obtaining the local observation state o_t^i observed by agent i; step 1.3, inputting the local observation state o_t^i of the agent into the platform-level controller and outputting the local state-action value function Q_i of the platform-level agent, where K is the number of actions executable by the platform-level agent; step 1.4, using the formation-level decision action instruction u_t to update the local state-action value vector of the platform-level agent, obtaining Q'_i; step 1.5, selecting, according to a greedy strategy, the action with the maximum value in Q'_i as the action instruction a_t^i of platform-level agent i; step 1.6, the simulation environment executing the joint action a_t = (a_t^1, …, a_t^n) of the multiple agents, where n is the number of platform-level agents, and returning the global reward r_t and the observation state o_{t+1}^i of each agent at the next moment; storing the tuple (o_t, a_t, r_t, o_{t+1}) into a training data buffer D; repeating steps 1.2 to 1.6, putting the collected data into D, and stopping after the termination condition or the maximum number of steps is reached; step 1.7, after the training data buffer D has stored a certain amount of data, randomly taking M samples from D and updating the parameters φ of the value function network using the VDN method; step 1.8, ending the learning process after convergence or when the maximum number of iterations is reached, otherwise returning to step 1.2.
- 6. The cooperative multi-agent control method based on value function supervision according to claim 5, wherein updating the parameters φ of the value function network using the VDN method in step 1.7 specifically comprises: calculating the global value function according to the formula Q_tot(s, a) = Σ_{i=1}^{n} Q_i(o^i, a^i), where Q_tot is the global value function, s is the global environment state vector, a = (a^1, …, a^n) is the joint action, and Q_i is the local state-action value function of agent i; constructing the loss function as L(φ) = (y − Q_tot(s, a))², where y = r + γ max_{a'} Q_tot(s', a'), s' is the global state at the next moment, and a' is the joint action at the next moment; and updating the parameters according to φ ← φ − η∇_φ L(φ), where η is the learning rate (see the VDN sketch after the claims).
- 7. The cooperative multi-agent control method based on value function supervision according to claim 1 or 2, wherein the simulation environment is the 5sV6z scenario in StarCraft II.
- 8. A cooperative multi-agent control device based on value function supervision, characterized by comprising at least one processor and a memory connected by a data bus, the memory storing instructions for execution by the at least one processor, the instructions, when executed by the processor, performing the value function supervision-based cooperative multi-agent control method of any one of claims 1-7.
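A minimal Python sketch of the supervised greedy action selection of claims 2 and 3 (steps S33-S34). The translated claims do not preserve the exact rule by which u_t produces Q'_i; treating u_t as a 0/1 mask over the K executable actions is one plausible reading, and the array shapes and the helper name `supervised_greedy` are assumptions for illustration:

```python
import numpy as np

def supervised_greedy(q_values, supervision, big_neg=-1e9):
    """Mask the platform-level Q-vector with the formation-level
    supervision vector, then pick greedily (steps S33-S34).

    q_values:    (K,) local state-action values Q_i(o_t^i, .) of one agent
    supervision: (K,) assumed 0/1 supervision vector u_t from the
                 formation-level controller
    """
    q_sup = np.where(supervision > 0, q_values, big_neg)  # Q'_i
    return int(np.argmax(q_sup))                          # greedy action a_t^i

# Example: K = 4 executable actions; the formation level permits only
# actions 1 and 3, so the agent's own best action 0 is overridden.
q = np.array([2.0, 0.5, 1.7, 0.4])
u = np.array([0, 1, 0, 1])
assert supervised_greedy(q, u) == 1
```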
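A sketch of the VDN objective of claims 5 and 6. The additive decomposition Q_tot = Σ_i Q_i and the TD loss are the published VDN formulas; the network sizes, tensor layout, the separate target network, and the smoke test are assumptions in keeping with standard VDN practice, not details taken from the claims:

```python
import torch
import torch.nn as nn

class AgentQNet(nn.Module):
    """Platform-level value network Q(o, .; phi); per step S1 of claim 1,
    all platforms share the same parameters. Layer sizes are illustrative."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs):          # obs: (batch, n_agents, obs_dim)
        return self.net(obs)         # -> (batch, n_agents, n_actions)

def vdn_loss(qnet, target_qnet, batch, gamma=0.99):
    """Claim 6: Q_tot(s, a) = sum_i Q_i(o^i, a^i) and
    L(phi) = (y - Q_tot)^2 with y = r + gamma * max_a' Q_tot(s', a')."""
    obs, acts, rew, next_obs, done = batch
    q_taken = qnet(obs).gather(-1, acts.unsqueeze(-1)).squeeze(-1)
    q_tot = q_taken.sum(dim=1)                    # sum over the n agents
    with torch.no_grad():                         # frozen target network
        next_max = target_qnet(next_obs).max(dim=-1).values.sum(dim=1)
        y = rew + gamma * (1.0 - done) * next_max
    return ((y - q_tot) ** 2).mean()

# Smoke test with random tensors standing in for the M samples drawn from
# buffer D in step 1.7: B transitions, n agents, K actions.
B, n, K, obs_dim = 8, 3, 4, 5
qnet, tgt = AgentQNet(obs_dim, K), AgentQNet(obs_dim, K)
batch = (torch.randn(B, n, obs_dim), torch.randint(0, K, (B, n)),
         torch.randn(B), torch.randn(B, n, obs_dim), torch.zeros(B))
vdn_loss(qnet, tgt, batch).backward()   # gradient step updates phi
```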
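A sketch of the formation-level training loop of claim 1, steps 2.1-2.7, reusing `supervised_greedy` from the first sketch. The environment interface, the `masks` mapping from a sampled formation instruction to a supervision vector, and the one-step TD advantage in the update are all assumptions; the claim does not preserve the exact critic and policy-gradient formulas:

```python
import torch
import torch.nn as nn

def train_formation_policy(env, platform_q, masks, state_dim,
                           episodes=1000, gamma=0.99, alpha=1e-3):
    """Steps 2.1-2.7: train the formation-level policy pi(.; theta) with an
    actor-critic method while the already-trained platform-level value
    function platform_q (parameters phi) stays frozen.

    env:        assumed interface; reset() -> (s, obs list),
                step(actions) -> ((s', obs list), r, done)
    platform_q: assumed callable, obs -> (K,) numpy Q-vector
    masks:      assumed (n_instructions, K) 0/1 array mapping each
                formation-level instruction u_t to a supervision vector
    """
    policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                           nn.Linear(64, len(masks)))      # pi(s; theta)
    critic = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                           nn.Linear(64, 1))               # evaluation network, params w
    opt = torch.optim.Adam([*policy.parameters(),
                            *critic.parameters()], lr=alpha)
    for _ in range(episodes):                              # step 2.7: until convergence
        s, obs = env.reset()                               # step 2.1 (per episode)
        done = False
        while not done:
            st = torch.as_tensor(s, dtype=torch.float32)
            dist = torch.distributions.Categorical(logits=policy(st))
            u = dist.sample()                              # step 2.2: instruction u_t
            # steps 2.3-2.4: platforms act greedily under supervision of u_t
            acts = [supervised_greedy(platform_q(o), masks[u.item()])
                    for o in obs]
            (s, obs), r, done = env.step(acts)             # step 2.5
            with torch.no_grad():                          # one-step TD target (assumed)
                s2t = torch.as_tensor(s, dtype=torch.float32)
                target = r + gamma * (0.0 if done else critic(s2t).item())
            delta = target - critic(st).squeeze()          # TD error
            # step 2.6: critic regression plus policy-gradient term, rate alpha
            loss = delta.pow(2) - dist.log_prob(u) * delta.detach()
            opt.zero_grad(); loss.backward(); opt.step()
```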
Description
Cooperative multi-agent control method and device based on value function supervision

Technical Field
The invention belongs to the technical field of multi-agent reinforcement learning, and particularly relates to a cooperative multi-agent control method and device based on value function supervision.

Background
In recent years, breakthrough progress in single-agent deep reinforcement learning has driven the development and study of multi-agent reinforcement learning. In practice, multiple independently deciding agents often coexist, so multi-agent reinforcement learning has very important application value. The interaction process between fully cooperative multi-agents and the environment is shown in figure 1. The environment contains n independently deciding agents: (1) at time t, agent i perceives the current environment state s_t and obtains its local observation o_t^i; (2) based on the current local observation o_t^i and the currently adopted policy, agent i selects an action a_t^i from the action space A, and the actions of all agents form the joint action a_t = (a_t^1, …, a_t^n); (3) when the joint action of the multiple agents acts on the environment, the environment transitions to a new state s_{t+1} and gives a global reward value r_t, and the cycle repeats (see the sketch below). Here the reward is the evaluative feedback signal that the agent obtains from the environment during their interaction. Through reinforcement learning, the agent determines how to take a series of actions in the environment so as to maximize the long-term cumulative return.

Multi-agent reinforcement learning can be divided, according to the training architecture, into three modes: centralized learning, independent learning, and centralized training with distributed execution, which combines the two. Independent learning applies a reinforcement learning algorithm to each agent separately and treats the other agents as part of the environment; a representative method is IQL (Independent Q-Learning), in which each agent independently runs a Q-learning algorithm. IQL is easy to implement and has some effect on small-scale problems with discrete state-action spaces, but because it does not consider the interactions between agents it cannot handle large-scale complex problems. Centralized learning integrates the states and actions of all agents into a global state space and a joint action space and applies a single-agent reinforcement learning method to the whole. Because it considers the problem globally, it avoids the problem of environmental non-stationarity, is easy to train, and enables multiple agents to cooperate better. However, it requires unobstructed information interaction between the agents, and during execution the global situation must first be collected before a decision is made and distributed to each agent for execution, so there is an efficiency and latency problem.
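For concreteness, the fully cooperative interaction loop described at the start of this Background section can be written as the following schematic sketch; the `env` and agent interfaces are assumptions, and only the notation (o_t^i, a_t^i, r_t) comes from the text above:

```python
def run_episode(env, agents):
    """One episode of the fully cooperative multi-agent loop of figure 1."""
    observations = env.reset()            # (1) agent i obtains o_0^i from s_0
    done, total_return = False, 0.0
    while not done:
        # (2) each agent selects a_t^i from its local observation and policy
        joint_action = [agent.act(o) for agent, o in zip(agents, observations)]
        # (3) the environment moves to s_{t+1} and returns a global reward r_t
        observations, reward, done = env.step(joint_action)
        total_return += reward            # long-term cumulative return
    return total_return
```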
Centralized training with distributed execution is currently the most commonly used training architecture for multi-agent reinforcement learning algorithms and combines the advantages of centralized learning and independent learning: during training, all agents are trained under a centralized architecture, and each agent can acquire the information of other agents through an unrestricted open channel; after training, during the execution stage, each agent makes its action decisions based on its own local observation information and limited communication. However, because global information is introduced during training, such methods suffer from unstable performance during distributed execution.

Disclosure of Invention
Aiming at the above problems, the invention provides a cooperative multi-agent control method based on value function supervision, which introduces a formation-level agent in the distributed execution stage and guides the agents' action selection through the formation-level agent so as to improve stability. Meanwhile, since a formation-level macroscopic decision action is introduced in the execution stage, the adaptability of the multiple agents can be improved and the dependence on the accuracy of the platform-level agent value function during centralized training is reduced.

To achieve the above object, according to one aspect of the present invention, there is provided a cooperative multi-agent control method based on value function supervision, comprising the following steps: S1, initializing a simulation environment, loading the model parameters of a formation-level controller, loading the model parameters of a platform-level controller, and having each platform share the same platform-level controller model parameters; S2, selecting a formation-level decision action instruction using the formation-level controller; S3, selecting a platform-level decision action instruction u