CN-117252252-B - Multi-agent reinforcement learning intelligent decision method and device
Abstract
The invention provides a multi-agent reinforcement learning intelligent decision-making method and device. The method comprises: determining state vectors of the units where a plurality of agents are located in a target problem at the current time step; inputting the state vectors of the adjacent agents into a graph attention network contained in the algorithm model of the target agent to obtain corresponding influence weights; carrying out weighted average processing on the state vectors of the adjacent agents based on the influence weights to obtain a corresponding mean-field vector; and inputting the state vector and the mean-field vector of the target agent into an actor network contained in the algorithm model of the target agent to obtain a processing decision corresponding to the target agent, so as to control the target agent to execute corresponding actions according to the processing decision at the current time step. The method provided by the invention is applicable to large-scale intelligent decision-making by agents, greatly improves the efficiency and accuracy of multi-agent reinforcement learning intelligent decision-making, and effectively improves the decision-making level of the agents.
Inventors
- Li Yong
- Hao Qianyue
- Huang Wenzhen
Assignees
- Tsinghua University (清华大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2023-08-25
Claims (9)
- 1. A multi-agent reinforcement learning intelligent decision-making method, characterized by comprising the following steps: determining state vectors of the units where a plurality of agents are located in a target problem at the current time step; traversing each agent, inputting the state vectors of the adjacent agents associated with the currently traversed target agent into a graph attention network contained in the algorithm model of the target agent, obtaining the influence weights between the unit of the target agent and the units of the adjacent agents at the current time step, and carrying out weighted average processing on the state vectors of the adjacent agents based on the influence weights to obtain a mean-field vector corresponding to the target agent; and inputting the state vector and the mean-field vector of the target agent at the current time step into an actor network contained in the algorithm model of the target agent, and obtaining a processing decision corresponding to the target agent at the current time step, so as to control the target agent to execute corresponding actions according to the processing decision at the current time step; wherein the algorithm model comprises the actor network and the graph attention network, which are obtained by training based on sample states and sample decision results corresponding to the sample states; the influence weights comprise weight values representing the degrees of influence of the units where the plurality of adjacent agents are located on the unit where the target agent is located; and the carrying out weighted average processing on the state vectors of the adjacent agents based on the influence weights to obtain the mean-field vector corresponding to the target agent specifically comprises: carrying out weighted average processing based on the state vectors of the units where the plurality of adjacent agents are located and the corresponding weight values thereof, to obtain the mean-field vector corresponding to the target agent.
- 2. The multi-agent reinforcement learning intelligent decision-making method of claim 1, wherein the graph attention network comprises an actor key network and an actor query network; and the traversing each agent, inputting the state vectors of the adjacent agents associated with the target agent into the graph attention network contained in the algorithm model of the target agent, and obtaining the influence weights between the unit of the traversed target agent and the units of the adjacent agents at the current time step specifically comprises: traversing each agent, inputting the state vector of the currently traversed target agent into the actor key network contained in the algorithm model of the target agent to obtain an actor key vector of the target agent, and inputting the state vectors of the units where the adjacent agents are located into the actor query network contained in the algorithm model of the target agent to obtain actor query vectors of the adjacent agents; and calculating based on the actor key vector of the target agent and the actor query vectors of the adjacent agents to obtain the influence weights of the plurality of adjacent agents on the target agent respectively; wherein the agents comprise the target agent and the adjacent agents, the target agent is any one of the plurality of agents, and the magnitude of an influence weight represents the strength of the association relationship between the traversed target agent and the corresponding adjacent agent.
- 3. The multi-agent reinforcement learning intelligent decision-making method of claim 1, further comprising, prior to determining the state vectors of the units where the plurality of agents are located in the target problem at the current time step: determining in advance the association relationships of the plurality of agents in the target problem to be processed, and constructing a corresponding agent relationship graph based on the association relationships in a graph-structured manner; wherein each agent corresponds to one unit in the target problem and to one node in the agent relationship graph, and the association relationships comprise spatial distribution relationships and logical association relationships of the agents.
- 4. The multi-agent reinforcement learning intelligent decision-making method of claim 3, wherein determining the state vectors of the units where the plurality of agents are located in the target problem at the current time step specifically comprises: obtaining multidimensional state information corresponding to the unit where each agent is located at the current time step based on the agent relationship graph, and generating a corresponding state vector based on the multidimensional state information.
- 5. The multi-agent reinforcement learning intelligent decision-making method of claim 1, further comprising, prior to traversing each agent: obtaining sample states for training the algorithm model and sample decision results thereof; and based on the sample states and the sample decision results thereof, training the actor network and the graph attention network contained in the algorithm model by utilizing a critic network, a critic query network and a critic key network preset in the agent, so as to adjust network parameters and obtain an algorithm model meeting preset conditions.
- 6. The multi-agent reinforcement learning intelligent decision-making method of claim 2, wherein the calculating based on the actor key vector of the target agent and the actor query vectors of the adjacent agents to obtain the influence weights of the plurality of adjacent agents on the target agent respectively specifically comprises: calculating inner products of the actor key vector of the target agent and the actor query vectors of the adjacent agents to obtain inner product calculation results; and determining the influence weights of the plurality of adjacent agents on the target agent respectively according to the inner product calculation results.
- 7. A multi-agent reinforcement learning intelligent decision-making device, characterized by comprising: a state vector determining module, configured to determine state vectors of the units where a plurality of agents are located in a target problem at the current time step; a mean-field vector determining module, configured to traverse each agent, input the state vectors of the adjacent agents associated with the currently traversed target agent into a graph attention network contained in the algorithm model of the target agent, obtain the influence weights between the unit of the target agent and the units of the adjacent agents at the current time step, and carry out weighted average processing on the state vectors of the adjacent agents based on the influence weights to obtain a mean-field vector corresponding to the target agent; and a processing decision obtaining module, configured to input the state vector and the mean-field vector of the target agent at the current time step into an actor network contained in the algorithm model of the target agent to obtain a processing decision corresponding to the target agent at the current time step, so as to control the target agent to execute corresponding actions according to the processing decision at the current time step; wherein the algorithm model comprises the actor network and the graph attention network, which are obtained by training based on sample states and sample decision results corresponding to the sample states; the influence weights comprise weight values representing the degrees of influence of the units where the plurality of adjacent agents are located on the unit where the target agent is located; and the mean-field vector determining module is specifically configured to: carry out weighted average processing based on the state vectors of the units where the plurality of adjacent agents are located and the corresponding weight values thereof, to obtain the mean-field vector corresponding to the target agent.
- 8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the multi-agent reinforcement learning intelligent decision-making method of any one of claims 1 to 6.
- 9. A processor-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the multi-agent reinforcement learning intelligent decision-making method of any one of claims 1 to 6.
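As an illustrative sketch of the decision step in claims 1 and 7 (the weighted-average aggregation followed by the actor network), the following minimal Python code computes the mean-field vector from given influence weights and feeds it, together with the agent's own state, to a stand-in policy. The `actor_decision` function and its linear scoring are hypothetical placeholders; the patent does not specify the actor network's architecture.

```python
def mean_field_vector(neighbor_states, weights):
    """Weighted average of the adjacent agents' state vectors using the
    influence weights, as in claims 1 and 7."""
    dim = len(neighbor_states[0])
    return [sum(w * s[i] for w, s in zip(weights, neighbor_states))
            for i in range(dim)]

def actor_decision(own_state, mf_vector, action_params):
    """Hypothetical stand-in for the actor network: scores each candidate
    action with a linear function of the concatenated [own state,
    mean-field vector] and returns the index of the highest-scoring one."""
    features = own_state + mf_vector
    scores = [sum(p * f for p, f in zip(params, features))
              for params in action_params]
    return max(range(len(scores)), key=scores.__getitem__)

# Two adjacent agents with influence weights 0.75 and 0.25:
mf = mean_field_vector([[1.0, 0.0], [0.0, 1.0]], [0.75, 0.25])
print(mf)  # [0.75, 0.25]
```

In a real system the linear scorer would be replaced by the trained actor network, but the data flow (per-agent state plus attention-weighted neighbor aggregate in, action out) matches the claimed pipeline.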
Description
Multi-agent reinforcement learning intelligent decision method and device

Technical Field

The invention relates to the technical field of intelligent decision-making, and in particular to a multi-agent reinforcement learning intelligent decision-making method and device. In addition, the invention also relates to an electronic device and a processor-readable storage medium.

Background

Multi-agent reinforcement learning (Multi-Agent Reinforcement Learning) is a class of machine learning algorithms for making intelligent decisions in scenarios containing multiple units, where each agent controls one unit. In different problems, complex influence relationships such as cooperation, competition, and resource sharing may exist among the agents. The main difficulty of such problems is that when the number of agents is large, decision-making becomes extremely complex, resulting in poor efficiency and accuracy of multi-agent reinforcement learning intelligent decisions.

At present, the prior art mainly includes two types of approaches to this difficulty. One is the mean field (MEAN FIELD) method, which averages the characteristic behaviors of all agents and then converts the pairwise interactions between one agent and every other agent into an interaction between that agent and the average, thereby reducing computational complexity. Although this method simplifies the computation, treating all agents without distinction when computing the average loses part of the information in the problem to be solved: the interaction strengths between different agents differ; for example, the interactions between spatially adjacent agents are usually stronger. In particular, the performance of this method becomes limited as the scale of the agents continues to grow.
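The classical mean-field simplification described above can be sketched as follows (the state vectors are made up for illustration; the patent does not prescribe this exact form). Each agent interacts once with the uniform average of its neighbors' states rather than with every neighbor in turn, which is precisely what discards the differing interaction strengths:

```python
def uniform_mean_field(neighbor_states):
    """Undifferentiated average of all neighbors' state vectors, as in the
    classical mean-field method: every neighbor is weighted equally, so
    differences in interaction strength are lost."""
    n = len(neighbor_states)
    dim = len(neighbor_states[0])
    return [sum(s[i] for s in neighbor_states) / n for i in range(dim)]

# With two neighbors in states [2.0, 0.0] and [0.0, 4.0], the agent
# interacts with their average [1.0, 2.0] instead of with each in turn.
print(uniform_mean_field([[2.0, 0.0], [0.0, 4.0]]))  # [1.0, 2.0]
```

The invention's contribution, as described in the claims, is to replace this uniform average with an attention-weighted one, so that (for example) spatially closer neighbors can contribute more.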
Therefore, how to design a multi-agent reinforcement learning scheme that effectively reduces computational complexity, so as to be applicable to scenarios with many agents while minimizing the information loss in the problem to be solved, has become a pressing problem.

Disclosure of Invention

In view of this, the invention provides a multi-agent reinforcement learning intelligent decision-making method and device, which are used for overcoming the defect in the prior art that the high limitations of existing multi-agent reinforcement learning intelligent decision schemes lead to poor decision efficiency and accuracy.

In a first aspect, the present invention provides a multi-agent reinforcement learning intelligent decision-making method, comprising: determining state vectors of the units where a plurality of agents are located in a target problem at the current time step; traversing each agent, inputting the state vectors of the adjacent agents associated with the currently traversed target agent into a graph attention network contained in the algorithm model of the target agent, obtaining the influence weights between the unit of the target agent and the units of the adjacent agents at the current time step, and carrying out weighted average processing on the state vectors of the adjacent agents based on the influence weights to obtain a mean-field vector corresponding to the target agent; and inputting the state vector and the mean-field vector of the target agent at the current time step into an actor network contained in the algorithm model of the target agent, and obtaining a processing decision corresponding to the target agent at the current time step, so as to control the target agent to execute corresponding actions according to the processing decision at the current time step. The actor network and the graph attention network contained in the algorithm model are obtained by training based on sample states and sample decision results corresponding to the sample states.

The traversing each agent, inputting the state vectors of the adjacent agents associated with the target agent into the graph attention network contained in the algorithm model of the target agent, and obtaining the influence weights between the unit of the traversed target agent and the units of the adjacent agents at the current time step specifically comprises: traversing each agent, inputting the state vector of the currently traversed target agent into an actor key network contained in the algorithm model of the target agent to obtain an actor key vector of the target agent, and inputting the state vectors of the units where the adjacent agents are located into an actor query network contained in the algorithm model of the target agent to obtain actor query vectors of the adjacent agents; and calculating based on the actor key vector of the target agent and the actor query vectors of
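The key/query computation described above (and in claim 6) can be sketched minimally as follows. The patent specifies inner products between the target agent's actor key vector and the adjacent agents' actor query vectors; how the inner product results are turned into weights is not spelled out, so the softmax normalization here is an illustrative assumption:

```python
import math

def influence_weights(key, neighbor_queries):
    """Inner product of the target agent's actor key vector with each
    adjacent agent's actor query vector (claim 6), turned into weights.
    Softmax normalization is an assumption for illustration; the patent
    only states that weights are determined from the inner products."""
    scores = [sum(k * q for k, q in zip(key, q_vec))
              for q_vec in neighbor_queries]
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# A neighbor whose query vector aligns with the key gets a larger weight.
w = influence_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(w[0] > w[1])  # True
```

Larger inner products thus translate into stronger influence of that neighbor's state on the target agent's mean-field vector, which is the sense in which the weight magnitude encodes the strength of the association relationship.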