CN-122021783-A - Multi-agent learning method, device and product for a dynamic game environment
Abstract
The invention discloses a multi-agent learning method, device and product for a dynamic game environment, relating to the field of artificial intelligence and used to improve the accuracy and reliability of decision making. The method comprises: obtaining multi-agent historical trajectory data from a dynamic game environment and constructing a historical trajectory sequence over a fixed time window; constructing a Transformer decoder adopting a spatio-temporal hierarchical attention mechanism; and training the Transformer decoder in two stages with the aim of minimizing the joint-action prediction loss until a training stopping condition is reached. The invention avoids the non-stationarity problem of training in a multi-agent dynamic game environment, improves the long-term decision capability of the model, approaches the Pareto optimum of the decision result, requires no change to the model architecture for different tasks, and has strong generalization capability.
Inventors
- HUANG JIANQIANG
- ZHANG NANXIN
- WU HUAIGU
- LI XIN
Assignees
- 天府绛溪实验室
Dates
- Publication Date
- 20260512
- Application Date
- 20260413
Claims (10)
- 1. A multi-agent learning method for a dynamic game environment, comprising: S1, acquiring multi-agent historical trajectory data from a dynamic game environment, and constructing a historical trajectory sequence over a fixed time window, wherein the input representation of the historical trajectory sequence at each time step $t$ contains the joint observation state $o_t$ of the multiple agents, the joint action $a_t$, and the target reward $\hat{R}_t$; S2, constructing a Transformer decoder adopting a spatio-temporal hierarchical attention mechanism, wherein the Transformer decoder takes the historical trajectory sequence as input and the joint action at the next moment as output; S3, training the Transformer decoder with the aim of minimizing the joint-action prediction loss until a training stopping condition is reached (the sequence construction is illustrated in the first sketch after the claims).
- 2. The multi-agent learning method for a dynamic game environment of claim 1, wherein said constructing a Transformer decoder employing a spatio-temporal hierarchical attention mechanism comprises: constructing a Transformer decoder that includes a temporal attention sub-layer and a spatial attention sub-layer in series.
- 3. The multi-agent learning method for a dynamic game environment of claim 2, wherein the spatial attention sub-layer is configured to: for the input sequence $X^l_t$ at time step $t$, where $l$ is the Transformer block index and $T$ is the number of time steps, resolve the hidden-space dimension into an individual representation for each agent, wherein $x^l_{t,i}$ denotes the individual representation of the $i$-th agent at time step $t$ and $N$ is the total number of agents; calculate the attention weight between each agent $i$ and every other agent $j$ at time step $t$: $\alpha^t_{ij} = \mathrm{softmax}_j\!\left(\frac{q^t_i \cdot k^t_j}{\sqrt{d_k}}\right)$, wherein $q^t_i$ denotes the query of agent $i$ at time step $t$, $k^t_j$ denotes the key of agent $j$ at time step $t$, and $d_k$ is the dimension of the key; and weight and sum the individual representations of the agents based on the attention weights to obtain the enhanced representation of each agent, $\tilde{x}^l_{t,i} = \sum_{j=1}^{N} \alpha^t_{ij}\, v^t_j$, wherein $v^t_j = W_V\, x^l_{t,j}$ denotes the value of agent $j$ at time step $t$ and $W_V$ is a learnable matrix (see the attention sketch after the claims).
- 4. The multi-agent learning method for a dynamic game environment of claim 1, wherein training the Transformer decoder comprises: training the Transformer decoder in two stages, wherein in the first stage the Transformer decoder is pre-trained offline using the historical trajectory sequences, and in the second stage the Transformer decoder is fine-tuned using the predicted trajectory sequences that the Transformer decoder has successfully predicted online.
- 5. The multi-agent learning method for a dynamic game environment of claim 4, wherein during training of the Transformer decoder, trajectory sequences are sampled from an experience replay pool, wherein in the first stage the experience replay pool contains only the historical trajectory sequences, and in the second stage the experience replay pool contains both the historical trajectory sequences and the predicted trajectory sequences (see the replay-pool sketch after the claims).
- 6. The multi-agent learning method for a dynamic game environment of claim 5, wherein in the first stage of training the Transformer decoder, for a historical trajectory sequence, at each time step $t$ the joint action $\hat{a}_{t+1}$ of the next time step $t+1$ predicted by the Transformer decoder is decomposed and distributed to the individual agents for execution, the joint observation state $o_{t+1}$ and team reward $r_{t+1}$ returned by the dynamic game environment at the next time step $t+1$ are obtained, and the input representation of the next time step is constructed from $o_{t+1}$ and $r_{t+1}$ to predict the joint action $\hat{a}_{t+2}$ of the further next time step $t+2$.
- 7. The multi-agent learning method for a dynamic game environment of claim 5, wherein in the second stage of training the Transformer decoder, for the real dynamic game environment, at each time step $t$ the joint action $\hat{a}_{t+1}$ of the next time step $t+1$ predicted by the Transformer decoder is decomposed and distributed to the individual agents for execution, the joint observation state $o_{t+1}$ and team reward $r_{t+1}$ returned by the dynamic game environment at the next time step $t+1$ are obtained, and the successfully predicted data are written to the experience replay pool to construct the predicted trajectory sequence (see the rollout sketch after the claims).
- 8. The multi-agent learning method for a dynamic game environment of any of claims 5-7, wherein the experience replay pool stores the historical trajectory sequences and the predicted trajectory sequences hierarchically.
- 9. A multi-agent learning device for a dynamic game environment, comprising a processor and a storage medium, wherein the storage medium stores computer instructions, and the processor executes the computer instructions to perform the multi-agent learning method for a dynamic game environment according to any one of claims 1 to 8.
- 10. A computer program product comprising a computer program, the computer program being executable by a processor to perform the multi-agent learning method for a dynamic game environment as claimed in any one of claims 1-8.
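The following sketches are illustrative only and do not form part of the claims. First, a minimal Python sketch of the fixed-window trajectory construction of claim 1. The names `Step` and `build_window` are assumptions introduced for illustration, and reading the target reward $\hat{R}_t$ as a return-to-go (the sum of future team rewards, as in Decision-Transformer-style models) is likewise an assumption; the claim itself does not define it.

```python
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class Step:
    """One time step of a multi-agent trajectory (hypothetical layout)."""
    joint_obs: np.ndarray      # shape (N, obs_dim): observations of all N agents
    joint_action: np.ndarray   # shape (N, act_dim): actions of all N agents
    team_reward: float         # scalar team reward r_t

def build_window(trajectory: List[Step], start: int, window: int
                 ) -> List[Tuple[float, np.ndarray, np.ndarray]]:
    """Build the fixed-window input: one (target reward R_t, joint
    observation o_t, joint action a_t) triple per time step, where R_t is
    assumed to be the return-to-go (sum of future team rewards)."""
    rewards = [s.team_reward for s in trajectory]
    seq = []
    for t in range(start, min(start + window, len(trajectory))):
        target_reward = float(sum(rewards[t:]))   # R_t, return-to-go
        seq.append((target_reward,
                    trajectory[t].joint_obs,
                    trajectory[t].joint_action))
    return seq
```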
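Next, a minimal PyTorch sketch of the spatio-temporal hierarchical attention block of claims 2-3. The spatial sub-layer follows the claimed formula $\alpha^t_{ij} = \mathrm{softmax}_j(q^t_i \cdot k^t_j / \sqrt{d_k})$; the per-agent causal temporal sub-layer, the residual connections, and the layer normalizations are assumptions, since the excerpt details only the spatial sub-layer.

```python
import math
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Per-time-step attention across the N agents (one reading of claim 3)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)  # W_V, the learnable value matrix
        self.d_k = d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, N, d_model) -- per-agent representations per time step
        q, k, v = self.q(x), self.k(x), self.v(x)
        # alpha_ij^t: attention weights between agents i and j at each step t
        scores = torch.einsum("btid,btjd->btij", q, k) / math.sqrt(self.d_k)
        alpha = scores.softmax(dim=-1)
        return torch.einsum("btij,btjd->btid", alpha, v)  # enhanced x-tilde

class TemporalAttention(nn.Module):
    """Causal attention over the T time steps, separately per agent (assumed)."""
    def __init__(self, d_model: int, n_heads: int = 1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, T, n, d = x.shape
        xt = x.permute(0, 2, 1, 3).reshape(b * n, T, d)  # fold agents into batch
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        out, _ = self.attn(xt, xt, xt, attn_mask=mask)   # mask future steps
        return out.reshape(b, n, T, d).permute(0, 2, 1, 3)

class SpatioTemporalBlock(nn.Module):
    """Temporal then spatial sub-layer in series (claim 2), with residuals."""
    def __init__(self, d_model: int):
        super().__init__()
        self.temporal = TemporalAttention(d_model)
        self.spatial = SpatialAttention(d_model)
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.temporal(self.ln1(x))
        x = x + self.spatial(self.ln2(x))
        return x
```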
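A sketch of the two-tier experience replay pool of claims 5 and 8: historical and predicted trajectory sequences are stored in separate tiers, and the stage-1/stage-2 rule of claims 4-5 decides which tiers are sampled. Class and method names are illustrative assumptions.

```python
import random
from typing import List, Sequence

class HierarchicalReplayPool:
    """Experience replay pool storing historical and predicted trajectory
    sequences hierarchically (one reading of claims 5 and 8). In stage 1
    only the historical tier is sampled; in stage 2 both tiers are."""

    def __init__(self):
        self.historical: List[Sequence] = []  # offline trajectory sequences
        self.predicted: List[Sequence] = []   # successful online predictions

    def add_historical(self, seq: Sequence) -> None:
        self.historical.append(seq)

    def add_predicted(self, seq: Sequence) -> None:
        self.predicted.append(seq)

    def sample(self, batch_size: int, stage: int) -> List[Sequence]:
        pool = self.historical if stage == 1 else self.historical + self.predicted
        return random.sample(pool, min(batch_size, len(pool)))
```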
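Finally, a sketch of the stage-2 online rollout of claims 6-7: the predicted joint action $\hat{a}_{t+1}$ is decomposed and distributed to the agents, the environment returns $o_{t+1}$ and $r_{t+1}$, and successful rollouts are written to the replay pool as predicted trajectory sequences. The `env` and `model` interfaces and the return-based success test are assumptions; the excerpt does not define "successfully predicted".

```python
import numpy as np

def online_finetune_rollout(model, env, pool, window: int,
                            target_return: float, success_threshold: float):
    """Hypothetical stage-2 rollout: predict, decompose, execute, and keep
    the episode as a predicted trajectory sequence only if successful."""
    obs = env.reset()                     # joint observation o_0 (assumed API)
    episode, total_reward, done = [], 0.0, False
    remaining = target_return             # target reward R_t, decremented below
    while not done:
        context = episode[-window:]       # fixed window of (R, o, a) triples
        joint_action = model.predict(context, remaining, obs)  # \hat{a}_{t+1}
        per_agent = np.split(joint_action, env.n_agents)  # decompose to agents
        obs, team_reward, done = env.step(per_agent)      # o_{t+1}, r_{t+1}
        episode.append((remaining, obs, joint_action))
        remaining -= team_reward
        total_reward += team_reward
    if total_reward >= success_threshold:  # assumed success criterion
        pool.add_predicted(episode)        # write predicted trajectory sequence
```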
Description
Multi-agent learning method, device and product for a dynamic game environment

Technical Field

The invention relates to the technical field of artificial intelligence, and in particular to a multi-agent learning method, device and product for a dynamic game environment.

Background

Multi-agent collaboration and competitive game strategy learning are core challenges in the fields of reinforcement learning and multi-agent systems. Prior art solutions focus mainly on multi-agent reinforcement learning (Multi-Agent Reinforcement Learning, MARL) methods based on value functions and multi-agent reinforcement learning methods based on policy gradients. The value-function-based methods can be broadly divided into two categories: value decomposition methods, such as the Value Decomposition Network (VDN) and the monotonic value-function factorisation algorithm QMIX, which learn by decomposing the team's overall joint action-value function into individual value functions of the individual agents; and centralized-critic methods, such as the Multi-Agent Deep Deterministic Policy Gradient algorithm (MADDPG), which use a centralized critic network to guide decentralized policy execution during training. Policy-gradient-based methods, such as the Counterfactual Multi-Agent Policy Gradients algorithm (COMA) and the Multi-Agent Proximal Policy Optimization algorithm (MAPPO), learn the policy by directly optimizing the policy parameters and handling the multi-agent credit assignment problem.

However, when these prior arts are applied to dynamic game environments in which agents need to make long-term plans and handle complex cooperative and competitive relationships (e.g., autonomous driving cooperation and gaming, robot swarm collaboration, real-time strategy game agent development, etc.), the following specific drawbacks are exposed:

1. Environmental non-stationarity and training instability: under the traditional decentralized execution framework, each agent updates its policy independently, and from the perspective of a single agent, the policy updates of the other agents make the environment non-stationary. This directly violates the basic assumption of environmental stationarity in traditional reinforcement learning, resulting in inaccurate Q-value estimates, oscillation of the policy training process, and difficult convergence. Research finds that the root cause is that existing methods lack explicit modeling and smoothing of the dynamic interaction relationships between agents.

2. Credit assignment and long-term dependency modeling are difficult: in sparse-reward or long-horizon tasks, value-decomposition-based methods struggle to accurately evaluate the real contribution of a single agent's actions to the team's long-term reward. Such methods typically rely on decomposing the immediate reward, or consider only limited temporal dependencies, and fail to effectively capture complex causal chains spanning multiple time steps. This makes agent policies prone to short-sighted behavior, and high-level strategies that require long-term collaborative coordination or strategic deception are difficult to learn.

3. Policy representation and generalization capabilities are limited: existing methods mostly rely on task-specific value functions or parameterized policy-network representations, whose policy characterization capability is limited. When the strategy of a game opponent changes or the task scenario is slightly adjusted, the trained policy network cannot adapt quickly and shows poor generalization. This is because its learning paradigm relies heavily on implicit estimation of the dynamics of a particular environment rather than on explicit generation of optimal behavior sequences, resulting in policies that are inherently "fragile" and overfitted.

These specific technical bottlenecks severely limit the ability of multi-agent systems to learn high-performance, highly robust strategies in complex dynamic game environments.

Disclosure of Invention

In view of all or part of the above problems, the invention aims to provide a multi-agent learning method, device and product for a dynamic game environment, which solve the problems that traditional multi-agent reinforcement learning trains unstably and converges with difficulty when facing a dynamic game environment, making decision results inaccurate and unreliable, and which improve the long-term decision-making and interaction capability of multiple agents in a dynamic game environment.

The technical scheme adopted by the invention is as follows: in a first aspect, the present application provides a multi-agent learning method for a dynamic game environment, comprising: S1, acquiring multi-agent historical trajectory data from a dynamic game environment, and constructing a historical trajectory sequence over a fixed time window.