CN-121997969-A - Hierarchical decision method and system for fusion of large language model and reinforcement learning
Abstract
The invention relates to the technical field of multi-agent autonomous decision-making, and in particular to a hierarchical decision-making method and system that fuse a large language model with reinforcement learning. The method comprises the following steps: constructing a hierarchical decision framework comprising an upper-layer agent and a bottom-layer execution module; designing the bottom-layer execution module of the hierarchical decision framework based on reinforcement learning training; designing the upper-layer agent of the hierarchical decision framework based on a large language model; designing a prompt-instruction optimization and iteration mechanism that takes environmental feedback as a signal and achieves continuous evolution of the prompt instruction through self-reflection by the large language model; and introducing a chain-of-thought-based multi-agent sequential collaborative decision mechanism to achieve explicit reasoning about, and modeling of, the multi-agent collaborative relationship. The method and system provided by the invention preserve the decision performance of the system while markedly improving its decision capability and collaborative efficiency in complex multi-agent scenarios.
Inventors
- Zhang Xuebo
- Jian Chenxu
- Zhao Minghui
- Wei Yongsen
Assignees
- Nankai University (南开大学)
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-01-28
Claims (9)
- 1. A hierarchical decision method for fusing a large language model and reinforcement learning, characterized by comprising the following steps: S1, constructing a hierarchical decision framework comprising an upper-layer agent and a bottom-layer execution module; S2, designing the bottom-layer execution module of the hierarchical decision framework based on reinforcement learning training; S3, designing the upper-layer agent of the hierarchical decision framework based on a large language model; S4, the upper-layer agent outputting a macroscopic plan to the bottom-layer execution module, the bottom-layer execution module generating a specific execution action according to the macroscopic plan and applying it to the environment to obtain environmental feedback, transmitting the environmental feedback to a memory feedback optimization module of the large language model, and the memory feedback optimization module iteratively optimizing the system prompt instruction of the large language model based on the environmental feedback to generate an optimal system prompt instruction, thereby obtaining the upper-layer agent under the optimal system prompt instruction; S5, a multi-agent sequential collaborative decision module of the large language model prescribing a decision sequence for the upper-layer agents under the optimal system prompt instruction, generating the decision actions of the upper-layer agents according to that decision sequence, then performing explicit reasoning based on those decision actions to obtain the actions and analyses of the upper-layer agents under the optimal system prompt instruction, inputting the actions and analyses into the bottom-layer execution module as the planning information of the upper-layer agents, and the bottom-layer execution module generating and executing the corresponding actions.
- 2. The hierarchical decision method for fusing a large language model and reinforcement learning according to claim 1, wherein in step S1 the upper-layer agent is responsible for the task planning of the hierarchical decision framework, the bottom-layer execution module is responsible for executing specific actions, and information interaction and instruction transmission between the upper-layer agent and the bottom-layer execution module are performed through a structured interface.
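The structured interface of claim 2 is not specified further in the text; a minimal sketch, assuming a JSON message format and the field names `agent_id`, `action`, and `analysis` (all hypothetical), could look like this:

```python
import json

# Hypothetical sketch of the "structured interface" between the upper-layer
# agent and the bottom-layer execution module: the upper agent serializes its
# macroscopic plan as a JSON message, and the bottom module parses and
# validates it before acting. Field names are assumptions, not from the patent.

def encode_macro_plan(agent_id: int, action: str, analysis: str) -> str:
    """Upper-layer agent side: serialize a macroscopic plan."""
    return json.dumps({"agent_id": agent_id, "action": action, "analysis": analysis})

def decode_macro_plan(message: str) -> dict:
    """Bottom-layer module side: parse the plan and check required fields."""
    plan = json.loads(message)
    for field in ("agent_id", "action", "analysis"):
        if field not in plan:
            raise ValueError(f"missing field: {field}")
    return plan

msg = encode_macro_plan(0, "advance_to_zone_B", "zone B is undefended")
plan = decode_macro_plan(msg)
print(plan["action"])  # advance_to_zone_B
```

Such a schema keeps the two layers decoupled: the reinforcement-learning executor only needs to parse a fixed message format, not the free-form text of the language model.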
- 3. The hierarchical decision method for fusing a large language model and reinforcement learning according to claim 1, wherein in step S2, reinforcement learning is performed using a proximal gradient-clipping algorithm to design the bottom-layer execution module of the hierarchical decision framework.
- 4. The hierarchical decision method for fusing a large language model and reinforcement learning according to claim 3, wherein in step S2, when reinforcement learning is performed with the proximal gradient-clipping algorithm to design the bottom-layer execution module of the hierarchical decision framework, the loss function is calculated according to formula (1):
$L_{\mathrm{value}} = \mathbb{E}\big[(V_\theta(s) - V_{\mathrm{target}}(s))^2\big],\quad L_{\mathrm{policy}} = -\mathbb{E}\big[\min\big(r(\theta)A,\ \mathrm{clip}(r(\theta),\, 1-\varepsilon,\, 1+\varepsilon)A\big)\big]$ (1)
wherein: $L_{\mathrm{value}}$ represents the value-network loss function; $\mathbb{E}[\cdot]$ represents the averaging operation; $s$ represents the system state; $V_\theta(s)$ represents the value function of the system state $s$; $V_{\mathrm{target}}$ represents the target value function; $L_{\mathrm{policy}}$ represents the policy-network loss; $r(\theta)$ represents the probability ratio; $A$ represents the advantage function; $\mathrm{clip}(\cdot)$ represents the clipping function; $\varepsilon$ represents the gradient clipping factor.
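The loss of formula (1) can be illustrated with a small plain-Python sketch. The clipping factor $\varepsilon = 0.2$, the 0.5 weight on the value loss, and the sample values below are assumptions for illustration, not parameters stated in the patent:

```python
# Illustrative computation of a proximal-clipping (PPO-style) loss in the
# shape of formula (1): a mean-squared value loss plus a clipped policy
# surrogate. All numeric choices here are assumptions.

def clip(x, lo, hi):
    """clip(.) of formula (1): bound x to the interval [lo, hi]."""
    return max(lo, min(hi, x))

def ppo_loss(ratios, advantages, values, targets, eps=0.2, vf_coef=0.5):
    n = len(ratios)
    # Value loss: E[(V(s) - V_target(s))^2].
    value_loss = sum((v - t) ** 2 for v, t in zip(values, targets)) / n
    # Clipped policy loss: -E[min(r*A, clip(r, 1-eps, 1+eps)*A)].
    policy_loss = -sum(
        min(r * a, clip(r, 1 - eps, 1 + eps) * a)
        for r, a in zip(ratios, advantages)
    ) / n
    return vf_coef * value_loss + policy_loss

loss = ppo_loss(ratios=[1.3, 0.7], advantages=[1.0, -1.0],
                values=[0.5, 0.2], targets=[0.4, 0.3])
print(loss)  # -0.195
```

The clipping keeps the probability ratio $r(\theta)$ inside $[1-\varepsilon,\ 1+\varepsilon]$, so a single update cannot move the policy too far from the one that collected the data.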
- 5. The hierarchical decision method for fusing a large language model and reinforcement learning according to claim 4, wherein in step S4 the memory feedback optimization module iteratively optimizes the system prompt instruction of the large language model based on feedback from the environment, and comprises two parts, decision storage and reflective iteration; the decision storage part records the complete decision trajectory of the upper-layer agent according to formula (2), and the reflective iteration part generates prompt instructions through multiple rounds of reflective iteration and evaluates them according to formula (3) to generate the optimal prompt instruction:
$\tau = \{(s_t,\ a_t,\ r_t,\ s_{t+1})\}_{t=1}^{T}$ (2)
$p^{*} = \arg\max_{p_i} E(p_i),\quad i = 1, \dots, N$ (3)
wherein: $\tau$ represents the complete decision trajectory of the upper-layer agent; $s_t$ represents the textualized game situation at time $t$; $a_t$ represents the upper-layer decision generated at time $t$; $r_t$ represents the return at time $t$; $s_{t+1}$ represents the textualized game situation at time $t+1$; $T$ represents the trajectory length; $p^{*}$ represents the optimal prompt instruction; $\arg\max$ denotes finding the prompt instruction that maximizes the evaluation function; $E(\cdot)$ represents the evaluation function; $p_i$ represents the prompt instruction generated by the $i$-th iteration; $N$ represents the total number of iterations.
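The selection step of formula (3) can be sketched as follows. The evaluation function $E$ is not defined in the text; here we assume, purely for illustration, that it is the mean return of the stored trajectories collected under each candidate prompt:

```python
# Hedged sketch of the reflective-iteration part of step S4: each of the N
# iterations proposes a candidate prompt, the returns stored for that prompt
# are scored by an evaluation function E, and the argmax over all candidates
# is kept as the optimal prompt p* (formula (3)). The mean-return evaluator
# is an assumption, not the patent's actual E.

def evaluate(prompt_returns):
    """E(p): here, the mean return of trajectories collected under prompt p."""
    return sum(prompt_returns) / len(prompt_returns)

def select_optimal_prompt(candidates):
    """candidates: list of (prompt_text, returns) pairs from N iterations."""
    return max(candidates, key=lambda c: evaluate(c[1]))[0]

candidates = [
    ("v1: act greedily", [1.0, 0.5]),
    ("v2: coordinate with allies first", [2.0, 1.5]),
    ("v3: defend only", [0.2, 0.1]),
]
best = select_optimal_prompt(candidates)
print(best)  # the candidate whose mean return is highest
```

The stored trajectories of formula (2) supply the returns that feed this evaluation, closing the loop between environmental feedback and prompt evolution.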
- 6. The hierarchical decision method for fusing a large language model and reinforcement learning according to claim 1, wherein in step S5 the multi-agent sequential collaborative decision module performs explicit reasoning according to formula (4) to obtain the actions and analyses of the upper-layer agents under the optimal system prompt instruction:
$(a_i,\ c_i) = \pi_{\mathrm{LLM}}\big(s,\ (a_1, c_1), \dots, (a_{i-1}, c_{i-1})\big)$ (4)
wherein: $a_i$ represents the action generated by the $i$-th upper-layer agent; $c_i$ represents the analysis generated by the $i$-th upper-layer agent; $\pi_{\mathrm{LLM}}$ represents the high-level game strategy based on the large language model; $s$ represents the system state; $(a_1, c_1), \dots, (a_{i-1}, c_{i-1})$ represents the action-and-analysis sequence of the preceding upper-layer agents.
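The sequential structure of formula (4) can be sketched with a stand-in policy. The rule-based `policy` below substitutes for the large-language-model strategy $\pi_{\mathrm{LLM}}$, which cannot be reproduced here; what the sketch shows is only the chaining, in which agent $i$ conditions on the (action, analysis) pairs of all preceding agents:

```python
# Sketch of the chain-of-thought sequential decision of formula (4): agents
# decide in a prescribed order, each conditioned on the shared state and on
# the (action, analysis) pairs already produced by its predecessors. The
# rule-based policy is a hypothetical stand-in for the LLM strategy.

def policy(agent_id, state, predecessors):
    """Stand-in for pi_LLM(s, (a_1,c_1), ..., (a_{i-1},c_{i-1}))."""
    if any(action == "attack" for action, _ in predecessors):
        return "support", f"agent {agent_id} covers the attacker"
    return "attack", f"agent {agent_id} opens the engagement"

def sequential_decisions(state, num_agents):
    decisions = []  # accumulated (action, analysis) pairs of earlier agents
    for i in range(num_agents):
        action, analysis = policy(i, state, decisions)
        decisions.append((action, analysis))
    return decisions

for action, analysis in sequential_decisions("enemy at zone B", 3):
    print(action, "-", analysis)
```

Because each agent sees its predecessors' explicit actions and analyses rather than only the raw state, the collaborative relationship is reasoned about in the open, which is the interpretability benefit the patent claims for this mechanism.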
- 7. The hierarchical decision method of fusion of large language model and reinforcement learning of claim 1, wherein the number of training rounds is 1000 when designing the bottom execution module of the hierarchical decision framework based on reinforcement learning training in step S2.
- 8. The hierarchical decision method of fusion of a large language model and reinforcement learning of claim 1, wherein the number of iterative optimization rounds of system prompt instructions of the large language model based on feedback of environment in step S4 is 100.
- 9. A hierarchical decision system for fusing a large language model and reinforcement learning, configured to execute the hierarchical decision method for fusing a large language model and reinforcement learning according to any one of claims 1 to 8, comprising a hierarchical decision framework, a bottom-layer execution module design unit, a large language model, and an upper-layer agent design unit; the hierarchical decision framework comprises an upper-layer agent and a bottom-layer execution module connected through a structured interface, the upper-layer agent being responsible for the task planning of the hierarchical decision framework and the bottom-layer execution module for executing specific actions; the bottom-layer execution module design unit designs the bottom-layer execution module based on reinforcement learning training; the upper-layer agent design unit designs the upper-layer agent based on the large language model to generate the task plan of the hierarchical decision framework; the large language model comprises a memory feedback optimization module and a multi-agent sequential collaborative decision module, wherein the memory feedback optimization module iteratively optimizes the system prompt instructions of the large language model based on environmental feedback to generate the optimal system prompt instructions, and the multi-agent sequential collaborative decision module prescribes the decision sequence of the upper-layer agents under the optimal system prompt instructions, generates the decision actions of the upper-layer agents according to that decision sequence, and then performs explicit reasoning based on those decision actions to obtain the actions and analyses of the upper-layer agents under the optimal system prompt instructions.
Description
Technical Field
The invention relates to the technical field of multi-agent autonomous decision-making, and in particular to a hierarchical decision-making method and system integrating a large language model and reinforcement learning.
Background
Collaborative decision-making with large language models and reinforcement learning is a key direction for improving the cognition and execution capability of agents in complex tasks. A large language model, pre-trained on massive knowledge, has strong semantic understanding and task-planning capability and high interpretability in its decision process, but suffers from insufficient precision and slow response in real-time control tasks. Reinforcement learning can optimize a decision policy through autonomous interaction with the environment and excels at precise bottom-layer control, but faces a large policy search space, dependence on large amounts of interaction data, and poor interpretability of its decision logic. Fusing the large language model with reinforcement learning can effectively combine the high-level reasoning capability of the former with the bottom-layer execution advantage of the latter, enhancing the interpretability of the system while improving its decision intelligence, and therefore has broad application prospects in open-ended complex decision tasks.
Prior research on collaborative decision-making with large language models and reinforcement learning has mostly adopted static combination strategies: the prompt instructions of the large language model often depend on manual design and remain fixed, unable to evolve autonomously according to environmental feedback. Alternatively, reinforcement learning is used to fine-tune the large language model so that it outputs instructions better conforming to the task reward signal, but the computational demand is too large for practical deployment. Moreover, in complex collaborative tasks most such methods make implicit decisions and lack a transparent and efficient collaboration mechanism. This relatively loose, static coupling combines the reasoning advantages of a large language model with the execution capability of reinforcement learning to a certain extent, but fails to form an organically coordinated closed decision loop. This architectural limitation prevents traditional systems from achieving efficient online learning and multi-agent coordination while maintaining interpretability, severely restricting their practical effectiveness and reliability in complex decision tasks.
Disclosure of Invention
The technical problem addressed by the invention is to provide a hierarchical decision method and system for fusing a large language model and reinforcement learning that markedly improve the decision capability and collaborative efficiency of the system in complex multi-agent scenarios while guaranteeing its decision performance.
A hierarchical decision method for fusing a large language model and reinforcement learning comprises the following steps: S1, constructing a hierarchical decision framework comprising an upper-layer agent and a bottom-layer execution module; S2, designing the bottom-layer execution module of the hierarchical decision framework based on reinforcement learning training; S3, designing the upper-layer agent of the hierarchical decision framework based on a large language model; S4, the upper-layer agent outputting a macroscopic plan to the bottom-layer execution module, the bottom-layer execution module generating a specific execution action according to the macroscopic plan and applying it to the environment to obtain environmental feedback, transmitting the environmental feedback to a memory feedback optimization module of the large language model, and the memory feedback optimization module iteratively optimizing the system prompt instruction of the large language model based on the environmental feedback to generate an optimal system prompt instruction, thereby obtaining the upper-layer agent under the optimal system prompt instruction; S5, a multi-agent sequential collaborative decision module of the large language model prescribing a decision sequence for the upper-layer agents under the optimal system prompt instruction, generating the decision actions of the upper-layer agents according to that decision sequence, then performing explicit reasoning based on those decision actions to obtain the actions and analyses of the upper-layer agents under the optimal system prompt instruction, the a