CN-121981149-A - Training method, policy generation method, device, equipment and medium for multi-agent policy network

CN121981149A

Abstract

The application provides a training method, a policy generation method, a device, equipment and a medium for a multi-agent policy network. While the agents execute action policies generated by the currently updated policy network, the method collects each agent's local observation states, global observation states and the rewards earned for executing those policies. Target barrier function values for the safety evaluation networks are determined from the local observation states and a pre-constructed stochastic discrete graph control barrier function, and each agent's safety constraint condition is determined from the barrier function approximations output by the safety evaluation networks. Advantage values for the agents' actions are determined from the state value predictions output by a task value evaluation network together with the collected rewards, and, under the safety constraint conditions, the advantage values serve as update direction signals guiding the policy network's parameter updates. The application reduces the training oscillation and divergence caused by distorted safety signals and achieves a smoother, more stable convergence process.

Inventors

  • LIU JUNWEI
  • XU JIANGTAO
  • ZHANG WEI

Assignees

  • Southern University of Science and Technology (南方科技大学)

Dates

Publication Date
2026-05-05
Application Date
2026-04-03

Claims (10)

  1. A training method for a multi-agent policy network, characterized in that iterative training is performed with a pre-built reinforcement learning framework until a preset stopping condition is met, wherein the framework comprises one policy network, one task value evaluation network and a plurality of safety evaluation networks, and each training iteration comprises: generating action policies for all agents with the currently updated policy network, controlling the agents to execute the target task according to those policies, and collecting each agent's local observation states, global observation states and the rewards earned for executing the policies; determining the target barrier function value corresponding to each safety evaluation network based on the local observation states and a pre-constructed stochastic discrete graph control barrier function, wherein the target barrier function value represents the maximum constraint function value that may occur in the future starting from the current state; inputting the local observation states into the plurality of safety evaluation networks to obtain the barrier function approximation output by each safety evaluation network, and updating the parameters of each safety evaluation network based on the loss between its barrier function approximation and the target barrier function value; determining the safety constraint condition of each agent based on the barrier function approximations output by the safety evaluation networks; inputting the global observation states into the task value evaluation network to obtain the predicted state value under the current policy, and determining the advantage value of each agent's action from the predicted state value and the collected rewards (the advantage computation is sketched in Python after the claims); and, under the safety constraint condition, guiding the policy network to update its parameters with the advantage value as the update direction signal.
  2. The method of claim 1, wherein determining the target barrier function value corresponding to each safety evaluation network based on the local observation states and the pre-constructed stochastic discrete graph control barrier function comprises: for each safety evaluation network, in each training iteration, obtaining the constraint function values actually observed over a number of historical time steps up to the current moment, and taking their maximum as the actually observed maximum constraint function value under the pre-constructed stochastic discrete graph control barrier function; inputting the local observation state collected at the current moment into the corresponding safety evaluation network to obtain the network's barrier function approximation for the next moment; fusing the actually observed maximum constraint function value with the barrier function approximation, weighted by a preset barrier function discount factor, to obtain a discounted estimate; and taking the larger of the actually observed maximum constraint function value and the discounted estimate as the target barrier function value for that safety evaluation network (sketched after the claims); in the first training iteration the safety evaluation network is untrained, and the actually observed maximum constraint function value is used directly as the target barrier function value.
  3. The method of claim 1, wherein determining the safety constraint condition of each agent based on the barrier function approximations output by the safety evaluation networks comprises: for each agent, obtaining the barrier function approximations output by its corresponding safety evaluation networks; determining the corresponding safety constraint terms from those approximations, wherein each safety constraint term reflects the degree of constraint violation along the corresponding safety dimension; and selecting the term with the largest value among all safety constraint terms as the agent's safety constraint condition for the current training iteration, the safety constraint condition representing the currently most urgent safety risk (claims 3-6 are sketched together after the claims).
  4. The method of claim 1, wherein guiding the policy network to update its parameters under the safety constraint condition, with the advantage value as the update direction signal, comprises: determining a target advantage function from the safety constraint condition and a preset soft boundary, the soft boundary being a safety tolerance threshold that separates the safe region from the risk region; computing the advantage value of each agent's action from the target advantage function; and performing gradient updates on the policy network parameters based on the advantage value and a preset loss objective.
  5. The method of claim 4, wherein the soft boundary comprises an upper boundary and a lower boundary, and determining the target advantage function from the safety constraint condition and the preset soft boundary comprises: dividing the agent's safety state into three contiguous intervals according to the lower and upper boundaries, namely a safe interval, a transition interval and a risk interval; when the safety constraint condition is less than or equal to the lower boundary, judging the agent to be in the safe interval, in which the target advantage function adopts the generalized advantage function so that the policy network concentrates on maximizing task performance; when the safety constraint condition is greater than the upper boundary, judging the agent to be in the risk interval, in which the safety constraint condition is adopted as the target advantage function so that the policy network concentrates on the safety constraint; and, in the transition interval, taking the target advantage function as a weighted combination of the generalized advantage function and a negative penalty term so that the policy network balances task performance optimization against constraint satisfaction.
  6. The method of claim 4 or 5, further comprising: after each training iteration, updating the soft boundary with a decay strategy that gradually reduces the value of the soft boundary as the number of training rounds increases.
  7. A multi-agent execution policy generation method, applied to a multi-agent cooperation system in which each agent deploys a policy network trained by the method of any one of claims 1-6, the method comprising: for each agent, collecting the local observation states of the agent and its neighboring agents; and inputting those local observation states into the policy network deployed on the agent, the policy network generating the execution policy that controls the agent's actions.
  8. A multi-agent execution policy generation device, applied to a multi-agent cooperation system in which each agent deploys a policy network trained by the method of any one of claims 1-6, the device comprising: an acquisition unit configured to collect, for each agent, the local observation states of the agent and its neighboring agents; and a generation unit configured to input those local observation states into the policy network deployed on the agent, the policy network generating the execution policy that controls the agent's actions.
  9. An electronic device comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus; the memory is configured to store a computer program; and the processor is configured to implement the method of any one of claims 1-7 when executing the program stored in the memory.
  10. A computer-readable storage medium having stored therein a computer program which, when executed by a processor, implements the method of any one of claims 1-7.
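
The claims describe the advantage computation only in prose. Claim 5 names the "generalized advantage function", which suggests standard generalized advantage estimation (GAE) over the task value network's state-value predictions and the collected rewards. The sketch below is a minimal, self-contained illustration of that step; it is not taken from the patent, and the function name, array layout and default coefficients are our own assumptions.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation: turns the task value network's
    state-value predictions and the collected rewards into per-step
    advantage values (the update-direction signal of claim 1).

    rewards: rewards collected over one rollout, length T.
    values:  state-value predictions, length T + 1 (the last entry
             bootstraps the value of the state after the rollout).
    """
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error under the current policy.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```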
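Claim 2 combines an observed maximum constraint value with the safety critic's prediction through a "weighted fusion" governed by a barrier function discount factor. The patent does not spell the formula out, so the convex combination below is one plausible reading, and all names and the default value of `gamma_h` are assumptions.

```python
import numpy as np

def target_barrier_value(observed_constraints, h_next_pred,
                         gamma_h=0.9, first_iteration=False):
    """Target barrier-function value per claim 2 (one plausible reading).

    observed_constraints: constraint function values actually observed over
        the recent history window.
    h_next_pred: the safety evaluation network's barrier-function
        approximation for the next moment.
    gamma_h: the preset barrier-function discount factor (assumed value).
    """
    c_max = float(np.max(observed_constraints))  # actually observed maximum
    if first_iteration:
        # Claim 2: the critic is untrained on the first iteration, so the
        # observed maximum is used directly as the target.
        return c_max
    # Weighted fusion of observation and prediction (assumed convex), then
    # the larger of the two candidates becomes the target.
    discounted = (1.0 - gamma_h) * c_max + gamma_h * float(h_next_pred)
    return max(c_max, discounted)
```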
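Claims 3 through 6 select the binding safety constraint, switch the target advantage function across the three safety intervals, and decay the soft boundary. The sketch below strings those rules together; the linear interpolation weight in the transition interval and the sign convention in the risk interval are assumptions, since the patent leaves both unspecified.

```python
def safety_constraint(constraint_terms):
    """Claim 3: the agent's safety constraint condition is the largest
    safety-constraint term, i.e. the most urgent risk across dimensions."""
    return max(constraint_terms)

def target_advantage(gae_advantage, constraint, lower, upper,
                     penalty_weight=1.0):
    """Claims 4-5: pick the update signal by safety interval."""
    if constraint <= lower:
        # Safe interval: pure task optimization via the generalized
        # advantage function.
        return gae_advantage
    if constraint > upper:
        # Risk interval: the safety constraint condition drives the update.
        # The negative sign is our convention so that gradient ascent on the
        # advantage pushes the constraint value down.
        return -constraint
    # Transition interval: weighted combination of the task advantage and a
    # negative penalty term; linear interpolation is an assumption.
    w = (constraint - lower) / (upper - lower)
    return (1.0 - w) * gae_advantage - w * penalty_weight * constraint

def decay_soft_boundary(lower, upper, decay=0.995):
    """Claim 6: gradually shrink the soft boundary after each iteration."""
    return lower * decay, upper * decay
```

Shrinking the boundary tightens the safety requirement over training, so early exploration is tolerated while the converged policy is held to a stricter standard.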

Description

Training method, policy generation method, device, equipment and medium for multi-agent policy network

Technical Field

The invention relates to the technical field of multi-agent cooperation, and in particular to a training method, a policy generation method, a device, equipment and a medium for a multi-agent policy network.

Background

In multi-agent collaboration scenarios (e.g., unmanned aerial vehicle swarms, autonomous driving fleets, robot teams), agents must collaboratively complete complex tasks in dynamic, uncertain environments. In recent years, deep reinforcement learning has been widely applied to multi-agent policy generation owing to its strong policy learning ability. However, such application scenarios usually place extremely high demands on safety, which is a basic precondition for reliable deployment and stable operation of the system. To improve safety, researchers have therefore proposed a number of safe reinforcement learning methods, among which control barrier functions (Control Barrier Function, CBF) have attracted attention for their ability to provide formal safety guarantees. Conventional control barrier function methods, however, generally perform safety verification along a single, deterministic evolution path of the trajectory state, whereas a policy in deep reinforcement learning is essentially stochastic: its action selection carries inherent uncertainty, and training relies on trajectory data randomly sampled from the environment. Because of this mismatch, using the gradient signal of a deterministic CBF directly to update a stochastic policy yields biased or even erroneous safety gradient estimates, causing oscillation or divergence during policy training.

Disclosure of Invention

In view of the above, the present invention aims to provide a training method, a policy generation method, a device, equipment and a medium for a multi-agent policy network, so as to improve the effectiveness of the safety constraints during policy training and reduce the risk of unstable training.
In a first aspect, a training method for a multi-agent policy network is provided, in which iterative training is performed with a pre-built reinforcement learning framework until a preset stopping condition is met, the framework comprising a policy network, a task value evaluation network and a plurality of safety evaluation networks, and each training iteration comprising: generating action policies for all agents with the currently updated policy network, controlling the agents to execute the target task according to those policies, and collecting each agent's local observation states, global observation states and the rewards earned for executing the policies; determining the target barrier function value corresponding to each safety evaluation network based on the local observation states and the pre-constructed stochastic discrete graph control barrier function; inputting the local observation states into the plurality of safety evaluation networks to obtain the barrier function approximation output by each safety evaluation network, and updating the parameters of each safety evaluation network based on the loss between its barrier function approximation and the target barrier function value; inputting the global observation states into the task value evaluation network to obtain the predicted state value under the current policy, and determining the advantage value of each agent's action from the predicted state value and the collected rewards; and, under the safety constraint condition, guiding the policy network to update its parameters with the advantage value as the update direction signal.

Optionally, determining the target barrier function value corresponding to each safety evaluation network based on the local observation states and the pre-constructed stochastic discrete graph control barrier function includes: for each safety evaluation network, in each training iteration, obtaining the constraint function values actually observed over a number of historical time steps up to the current moment, and taking their maximum as the actually observed maximum constraint function value under the pre-constructed stochastic discrete graph control barrier function; inputting the local observation state collected at the current moment into the corresponding safety evaluation network to obtain the network's barrier function approximation for the next moment; and fusing the actually observed maximum constraint function value with the barrier function approximation, weighted by a preset barrier function discount factor, to obtain a discounted estimate.
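
Read literally, the fusion and maximization just described admit a compact formalization. The notation below is ours, not the patent's: c_max is the maximum constraint function value actually observed over a history window of K steps, \hat{h} is the safety evaluation network's next-step barrier function approximation, and \gamma_h is the preset barrier function discount factor:

```latex
h^{\mathrm{target}}
  = \max\!\Big( c_{\max},\; (1-\gamma_h)\,c_{\max} + \gamma_h\,\hat{h} \Big),
\qquad
c_{\max} = \max_{0 \le k \le K} c_{t-k}
```

Taking the outer maximum keeps the target from falling below any violation actually observed, so the safety critic cannot learn to be more optimistic than the data.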