
CN-121981197-A - Multi-agent rewarding function automatic generation and optimization method and system

CN 121981197 A

Abstract

The invention provides a method and a system for automatically generating and optimizing multi-agent reward functions, and relates to the technical field of artificial intelligence and reinforcement learning. The method extracts state description information of an environment through an environment context construction module, generates reward function code using a large language model, evaluates multiple reward function candidates simultaneously through a parallelized reward function evaluation module, and converts statistical information from the training process into structured natural-language feedback through a reflection report generation module, enabling the large language model to understand training dynamics and improve the reward function accordingly; through an iterative optimization process, it finally produces a reward function that meets the task objective and has good training characteristics. The invention combines the code generation capability of large language models with the reinforcement learning training process, thereby realizing automatic generation and iterative optimization of reward functions.

Inventors

  • ZHANG LIHUA
  • LI XIU
  • YANG QINGZE

Assignees

  • Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院)

Dates

Publication Date
2026-05-05
Application Date
2026-02-12

Claims (10)

  1. A method for automatically generating and optimizing a multi-agent reward function, characterized by comprising the following steps: S1, constructing an environment context, which comprises acquiring and processing program code or state description information of a target multi-agent reinforcement learning environment to form context information containing observable and interactable variables; S2, generating a reward function, namely inputting the context information and task objective description information into a large language model, the large language model generating reward function code; S3, evaluating the reward function, namely performing reinforcement learning training with the reward function code in the target multi-agent reinforcement learning environment, and performing performance evaluation on the trained reward function code based on a preset evaluation index to obtain an evaluation result; S4, generating a reflection report, namely generating a structured natural-language reflection report according to the evaluation result and the statistical information of each component of the reward function during training; and S5, context updating and iteration, which comprises feeding the reflection report back to the large language model, forming a new input together with the context information to guide the large language model to generate improved reward function code in subsequent iterations, and repeatedly executing steps S2-S5 until a termination condition is met.
  2. The multi-agent reward function automatic generation and optimization method of claim 1, wherein in step S2, the large language model is a large language model capable of generating executable program code.
  3. The multi-agent reward function automatic generation and optimization method according to claim 1, wherein in step S3, performing performance evaluation on the trained reward function code based on a preset evaluation index comprises performing parallel computation using a graphics processor to evaluate a plurality of the reward function codes simultaneously.
  4. The method for automatically generating and optimizing a multi-agent reward function according to claim 1, wherein in step S4, the statistical information comprises value sequences and statistics of each component of the reward function at different time points during training.
  5. The method according to claim 1, wherein in step S5, the termination condition is that a preset number of iterations is reached or the evaluation result exceeds a preset threshold.
  6. The multi-agent reward function automatic generation and optimization method of claim 1, further comprising: S6, human feedback integration, namely receiving and processing natural-language feedback information input by a human, and either integrating the feedback information into the reflection report or inputting it into the large language model separately.
  7. The multi-agent reward function automatic generation and optimization method of claim 1, wherein in step S2, the reward function code includes a local reward component for a single agent and a global reward component for a team of agents.
  8. The multi-agent reward function automatic generation and optimization method of claim 7, wherein in step S3, the reward function evaluation includes evaluating the trained reward function code against an opponent pool containing self-play strategies to test its robustness.
  9. A system for implementing the multi-agent reward function automatic generation and optimization method of any one of claims 1 to 8, comprising: an environment context construction module comprising an interface communicating with a large language model service, the environment context construction module configured to perform the environment context construction step; a reward function generation module comprising a large language model, configured to perform the reward function generation step; a reward function evaluation module comprising a graphics processor, configured to perform the reward function evaluation step; a reflection report generation module comprising a data statistics unit and a natural-language template library, wherein the data statistics unit is used to collect statistics on each component of the reward function during training, and the natural-language template library is configured to generate the structured natural-language reflection report; and an iteration control module comprising an interface communicating with a large language model service, the iteration control module configured to perform the context updating and iteration step.
  10. The system of claim 9, further comprising a human feedback interface module comprising an interface communicating with a large language model service, the human feedback interface module configured to receive and process natural-language feedback information input by a human and to integrate the feedback information into the reflection report or provide it as a separate input to the large language model.
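The generate-evaluate-reflect loop of steps S2-S5 in claim 1 can be sketched minimally as follows. This is an illustrative reading of the claims, not the patent's implementation; the names `llm_generate` and `train_and_evaluate` are hypothetical placeholders for the large language model service and the reinforcement learning evaluation module.

```python
def optimize_reward_function(env_context, task_description, llm_generate,
                             train_and_evaluate, max_iters=5, target_score=0.9):
    """Iteratively generate, evaluate, and refine reward function code (S2-S5)."""
    reflection = ""  # S4 output from the previous round; empty on the first pass
    best_code, best_score = None, float("-inf")
    for _ in range(max_iters):
        # S2: prompt the LLM with environment context, task goal, and any
        # prior reflection report (S5 context updating)
        prompt = "\n\n".join(filter(None, [env_context, task_description, reflection]))
        reward_code = llm_generate(prompt)
        # S3: train in the target environment and score against the preset index;
        # component_stats maps each reward component to its value sequence
        score, component_stats = train_and_evaluate(reward_code)
        if score > best_score:
            best_code, best_score = reward_code, score
        # S5 termination: preset threshold reached
        if score >= target_score:
            break
        # S4: turn per-component training statistics into structured feedback
        reflection = "Reflection report:\n" + "\n".join(
            f"- {name}: mean={sum(v) / len(v):.3f}, final={v[-1]:.3f}"
            for name, v in component_stats.items())
    return best_code, best_score
```

The key design point the claims emphasize is that the reflection report is natural language, so the same LLM interface consumes both the initial context and the training feedback.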

Description

Multi-agent reward function automatic generation and optimization method and system

Technical Field

The invention relates to the technical field of artificial intelligence and reinforcement learning, in particular to a method and a system for automatically generating and optimizing multi-agent reward functions based on a large language model.

Background

In reinforcement learning (RL) tasks, an appropriate reward function is critical to learning the desired policy. Actual task evaluation metrics are often sparse or non-smooth (e.g., returning 1 only on success and 0 otherwise), and using them directly as the learning objective can make training difficult, so practice often employs "reward shaping" or manual trial-and-error design to accelerate learning. However, recent studies have shown that manually tuned, trial-and-error reward functions are prone to overfitting to specific algorithms or hyperparameters and may constitute invalid or erroneous task specifications (i.e., rewards that do not reflect the designer's actual intent), resulting in unpredictable or irreproducible learning outcomes and unfair comparative evaluations. Traditional methods for automatically generating reward functions (e.g., templated or parametric search) are limited by predefined reward templates and their expressivity, and struggle to cover complex behaviors or high-dimensional manipulation. Prior art such as EUREKA (an automated reward design framework with a large language model at its core) can automatically generate and iteratively improve a white-box reward program by performing evolutionary search and reward reflection using the environment source code as context.
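To illustrate the sparse-reward problem described above, consider a hypothetical goal-reaching task (this example is not from the patent): the raw task metric pays 1 only on success, while a shaped reward adds a dense distance-based term that gives the agent a gradient to follow.

```python
import math

def sparse_reward(agent_pos, goal_pos, tol=0.1):
    """Raw task metric used directly: 1 on success, 0 otherwise (sparse, non-smooth)."""
    return 1.0 if math.dist(agent_pos, goal_pos) < tol else 0.0

def shaped_reward(agent_pos, goal_pos, tol=0.1, weight=0.1):
    """Reward shaping: keep the task reward but add a dense progress signal."""
    distance = math.dist(agent_pos, goal_pos)
    return sparse_reward(agent_pos, goal_pos, tol) - weight * distance
```

The shaped variant increases as the agent approaches the goal, whereas the sparse variant is flat almost everywhere, which is exactly the property that makes trial-and-error shaping (and its overfitting risks) so common.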
However, the specific application and systematic realization of such frameworks in multi-agent scenarios remain to be perfected, with shortcomings in handling multi-agent cooperation, competition, human feedback integration, and the like.

Disclosure of Invention

In view of the above, the invention provides a method and a system for automatically generating and optimizing multi-agent reward functions, which aims to provide a general approach capable of automatically generating high-quality, interpretable reward function code that adapts to training dynamics and is compatible with multi-agent characteristics, so as to address the time-consuming and error-prone nature of manual design, the limited expressivity of traditional automatic methods, and the insufficient adaptability of existing schemes in multi-agent scenarios. The method provided by the invention comprises the following steps: S1, constructing an environment context, which comprises acquiring and processing program code or state description information of a target multi-agent reinforcement learning environment to form context information containing observable and interactable variables; S2, generating a reward function, namely inputting the context information and task objective description information into a large language model, the large language model generating reward function code; S3, evaluating the reward function, namely performing reinforcement learning training with the reward function code in the target multi-agent reinforcement learning environment and performing performance evaluation on the trained reward function code based on a preset evaluation index to obtain an evaluation result; S4, generating a reflection report; and S5, context updating and iteration, namely feeding the reflection report back to the large language model to form a new input together with the context information, guiding the large language model to generate improved reward function code in subsequent iterations, and repeating steps S2-S5 until a termination condition is met. Preferably, in step S2, the large language model is a large language model capable of generating executable program code. Preferably, in step S3, the reward function evaluation further includes performing parallel computation using a graphics processor to evaluate a plurality of reward function codes at the same time. Preferably, in step S4, the statistical information includes value sequences and statistics of each component of the reward function at different time points during training. Preferably, in step S5, the termination condition is that a preset number of iterations is reached or the evaluation result exceeds a preset threshold. Preferably, the method further comprises step S6, human feedback integration, which comprises receiving and processing natural-language feedback