CN-122021905-A - Multi-agent reasoning optimization method and system based on Token calculation acceleration
Abstract
The application provides a multi-agent reasoning optimization method and system based on Token calculation acceleration, relating to the field of intelligent decision-making. The method comprises: generating and maintaining a Token-level directed acyclic graph online; constructing an inference-path consistency sharing strategy, an approximate semantic sharing strategy and a forced sharing strategy to perform KV cache sharing; processing the locally-changing characteristics of context data in an incremental encoding mode, performing context difference detection, incremental encoding calculation and KV cache splicing; defining a contribution degree function for each Token to evaluate its contribution, and introducing a contribution-based dynamic pruning mechanism; and realizing parallel execution of dependency-free Tokens through topology analysis and dynamic scheduling based on the Token-level directed acyclic graph. The method avoids redundant Token generation, saves the spacecraft's communication bandwidth and computing resources, shortens critical-path delay, improves the decision efficiency of the multi-agent collaboration system, and meets the stringent requirements of spacecraft tasks on decision efficiency and decision safety.
Inventors
- Bai Xue
- Jia Xiaoleng
- Xu Ming
- Wang Xiaoyi
- Ding Jixin
- Chen Zhaoyue
- Qi Xuguang
Assignees
- Beihang University (北京航空航天大学)
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-01-30
Claims (10)
- 1. A multi-agent reasoning optimization method based on Token calculation acceleration, characterized by comprising the following steps: based on the task dynamics and decision-security requirements of the spacecraft, generating and maintaining a Token-level directed acyclic graph online, wherein the Token-level directed acyclic graph is an inference graph used to identify the dependency relationships of Token nodes; constructing an inference-path consistency sharing strategy, an approximate semantic sharing strategy and a forced sharing strategy to perform KV cache sharing, realizing cross-agent KV cache multiplexing and reducing repeated encoding; processing the locally-changing characteristics of spacecraft context data in an incremental encoding mode, and performing context difference detection, incremental encoding calculation and KV cache splicing, so as to avoid repeated encoding of the full context and reduce the calculation amount; for the task characteristics of the spacecraft, defining a contribution degree function for each Token to evaluate the Token's contribution, and introducing a contribution-based dynamic pruning mechanism to avoid generating redundant Tokens in the spacecraft's decision process; and, based on the spacecraft's computing resources and taking the Token-level directed acyclic graph as a basis, realizing parallel execution of dependency-free Tokens through topology analysis and dynamic scheduling, so as to shorten critical-path delay.
- 2. The multi-agent reasoning optimization method based on Token calculation acceleration of claim 1, wherein the Token-level directed acyclic graph generation process comprises: based on the deep-space exploration task type, calling a preset spacecraft agent role template and instantiating it to obtain an agent set, and generating initial prompt words and strategy parameters; performing semantic embedding on the prompt template of each agent to obtain embedded vectors, and calculating the semantic similarity between the prompt template and the task type; generating an initial inference framework based on the semantic similarity to obtain an initial directed graph consisting of an initial node set and an initial edge set; creating a Token node when an agent starts reasoning and generating, adding it to the node set and recording the corresponding KV cache, wherein the node is the t-th Token node generated by that agent; automatically establishing sequence-dependent edges within the same agent according to the generation order, so as to preserve the generation order; if the prompt template of one agent explicitly references another agent at time t, establishing a directed edge across the agents; establishing a background process to calculate in real time the semantic similarity between any two Token nodes, and, if the similarity exceeds a preset first adjustable threshold, merging the two Token nodes and deleting their related edges; and performing loop detection: using a topological sorting algorithm to check whether the inference graph is still a directed acyclic graph by topologically sorting it, wherein if the sorting length is smaller than the total number of nodes |P|, a loop exists, in which case a loop-breaking strategy is adopted, the contribution degree of each node in the loop is calculated, the node path with the highest contribution degree is retained, and the other conflicting edges are deleted, thereby generating the Token-level directed acyclic graph.
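The loop detection and loop-breaking steps in claim 2 above can be sketched in Python. This is a minimal illustration, not the patent's implementation: it uses Kahn's algorithm for topological sorting (a cycle exists exactly when the sort covers fewer than |P| nodes), and approximates the loop-breaking strategy by repeatedly removing the incoming edges of the lowest-contribution node still stuck in a cycle, which preserves the higher-contribution paths. All function names and the edge-list representation are illustrative assumptions.

```python
from collections import defaultdict, deque

def topological_order(nodes, edges):
    """Kahn's algorithm. If the returned order is shorter than len(nodes),
    the graph contains at least one cycle (sorting length < |P|)."""
    indeg = {n: 0 for n in nodes}
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    queue = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return order

def break_cycles(nodes, edges, contribution):
    """Loop-breaking sketch: while a cycle exists, delete the incoming edges
    of the lowest-contribution stuck node, keeping high-contribution paths."""
    edges = list(edges)
    while True:
        order = topological_order(nodes, edges)
        if len(order) == len(nodes):
            return edges  # the graph is now a DAG
        stuck = set(nodes) - set(order)          # nodes caught in/behind cycles
        victim = min(stuck, key=lambda n: contribution[n])
        # A stuck node's remaining in-edges necessarily come from stuck nodes,
        # so at least one edge is removed per iteration and the loop terminates.
        edges = [(u, v) for (u, v) in edges if not (v == victim and u in stuck)]
```

For example, a cycle a→b→c→a with contributions 0.1, 0.5, 0.9 loses the edge (c, a), leaving the chain through the higher-contribution nodes intact.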
- 3. The multi-agent reasoning optimization method based on Token calculation acceleration of claim 2, wherein the semantic embedding of the prompt template of each agent satisfies a preset expression in which the embedding-layer parameters are pre-trained, the prompt template belongs to a given agent, and that agent is one member of the agent set; and the calculation of the semantic similarity satisfies a preset expression in which semantic embedding is also applied to the task type, processed as a d-dimensional vector, and to the deep-space detection task instruction sequence of length L_dire.
- 4. The multi-agent reasoning optimization method based on Token calculation acceleration of claim 2, wherein the inference-path consistency sharing strategy comprises: defining, for each node, its inference path, which represents the dependency path from the initial node to that node; and, if two nodes have the same inference path in the inference graph, one node fully multiplexes the KV cache of the other; the approximate semantic sharing strategy comprises: determining the semantic vector distance between Token nodes based on an Embedding technique, and, if the semantic vector distance between two Token nodes satisfies a condition set by a second preset adjustable threshold, one node multiplexes the other node's KV cache in a weighted-interpolation manner, wherein the weight represents the semantic similarity and the interpolation combines the node's local KV cache with the other node's KV cache; and the forced sharing strategy comprises: if the prompt template of a downstream node contains an explicit placeholder reference to an upstream node, forcibly establishing a dependency between the downstream node and the upstream node, and causing the downstream node to multiplex the corresponding KV cache of the upstream node.
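The three-tier KV cache sharing decision described in claim 4 can be sketched as follows. This is an illustrative assumption, not the patent's code: the claim does not fully specify the interpolation weight, so the sketch assumes the weight equals the semantic similarity score, and paths are modeled as tuples of ancestor node IDs.

```python
import numpy as np

def share_kv(kv_i_local, kv_j, similarity, path_i, path_j, tau):
    """Decide how node i reuses node j's KV cache.

    Tier 1 (path consistency): identical inference paths -> full reuse.
    Tier 2 (approximate semantics): similarity >= tau -> weighted interpolation.
    Tier 3: otherwise keep the local KV cache (no sharing).
    """
    if path_i == path_j:
        return kv_j.copy()                       # full multiplexing
    if similarity >= tau:
        w = similarity                           # assumed weight = similarity
        return w * kv_j + (1.0 - w) * kv_i_local # weighted interpolation
    return kv_i_local
```

The forced sharing strategy would simply bypass these checks whenever the downstream prompt template contains an explicit placeholder reference to the upstream node, returning `kv_j` directly.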
- 5. The multi-agent reasoning optimization method based on Token calculation acceleration of claim 1, wherein the context difference detection comprises: at agent-inference time, comparing the current context with the previous-round context via spacecraft data timestamps and semantic-feature comparison to obtain the Token semantic difference; the incremental encoding calculation comprises: performing incremental encoding based on the Token semantic difference, and, during incremental encoding, invoking agent-model Transformer encoder parameters consistent with those used at generation time, so as to ensure that the generated KV vectors are consistent in dimension and semantic space with those of the historical Token sequence; and the KV cache splicing comprises: sequentially splicing the agent's newly generated KV onto the old KV, so that subsequent reasoning can directly invoke the intermediate state of the complete context.
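The detect–encode–splice pipeline of claim 5 can be illustrated with a minimal sketch. Assumptions not from the patent: difference detection is reduced to a longest-common-prefix comparison over token sequences (the claim's timestamp and semantic-feature comparison is richer), and `encode` stands in for the agent's Transformer encoder producing one KV row per token.

```python
import numpy as np

def context_diff(prev_tokens, cur_tokens):
    """Return the index where the current context first differs from the
    previous round's context (longest common prefix length)."""
    n = 0
    for a, b in zip(prev_tokens, cur_tokens):
        if a != b:
            break
        n += 1
    return n

def incremental_kv(prev_kv, cur_tokens, split, encode):
    """Incremental encoding: reuse the cached KV rows for the unchanged
    prefix, encode only the changed suffix, and splice the two in sequence
    so subsequent reasoning sees the full-context intermediate state."""
    kept = prev_kv[:split]               # KV cache for the unchanged prefix
    new = encode(cur_tokens[split:])     # encode only the delta
    return np.concatenate([kept, new], axis=0)
```

Only the suffix after the split point is re-encoded, which is what avoids full-context repeated encoding when successive spacecraft contexts overlap heavily.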
- 6. The multi-agent reasoning optimization method based on Token calculation acceleration of claim 1, wherein the contribution degree function is a weighted sum of an attention contribution, a semantic contribution and a task contribution, with weighting coefficients α = 0.4, β = 0.3 and γ = 0.3; the attention contribution represents the average attention weight of the Token in the subsequent generation process, wherein L is the total number of Tokens generated, the self-attention weight ATTNWEIGHT has value range [0, 1], and the average is taken over the Token's position within the agent up to the node's successor nodes; the semantic contribution is calculated as the embedded-vector similarity representing the degree of association between the Token and the task target, obtained by semantically embedding the deep-space probe task instruction sequence; the task contribution satisfies a preset expression combining a task-type weight in [0.1, 1] with an indicator function; and, if a Token's contribution degree falls below a preset third adjustable threshold, the subsequent generation path of that Token is terminated immediately and a pruning operation is performed, with the dependent nodes deleted.
- 7. The multi-agent reasoning optimization method based on Token calculation acceleration of claim 1, wherein realizing, based on the spacecraft's computing resources and on the Token-level directed acyclic graph, parallel execution of dependency-free Tokens through topology analysis and dynamic scheduling, and shortening critical-path delay, comprises: defining a reachability function, performing topology analysis on the inference graph, establishing a hierarchy, and judging Token nodes that lie in the same topology layer and have no dependency path between them as parallelizable groups; according to the spacecraft's current computing-resource load, allocating adapted computing resources to each parallel group, defining the total contribution degree and resource-allocation proportion of each parallel group, and preferentially allocating GPU resources to high-priority parallel groups with high overall contribution degree, so as to maximize resource utilization; and, when computing resources are limited, determining a value density for each Token node and, based on the value density, selecting which Token nodes to execute preferentially.
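The topology layering and value-density scheduling of claim 7 can be sketched as below. Assumptions not stated in the patent: layers are computed as longest-path depth in the DAG, and value density is taken as contribution divided by compute cost, with a greedy selection under a resource budget.

```python
from collections import defaultdict, deque

def topo_layers(nodes, edges):
    """Assign each node a topology layer (longest-path depth). Nodes in the
    same layer have no dependency path between them and are parallelizable."""
    indeg = {n: 0 for n in nodes}
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    layer = {n: 0 for n in nodes}
    queue = deque(n for n in nodes if indeg[n] == 0)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            layer[v] = max(layer[v], layer[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    groups = defaultdict(list)
    for n in nodes:
        groups[layer[n]].append(n)
    return dict(groups)

def schedule_by_value_density(candidates, contribution, cost, budget):
    """Under limited resources, greedily pick nodes by value density
    (contribution per unit compute cost) until the budget is exhausted."""
    ranked = sorted(candidates, key=lambda n: contribution[n] / cost[n],
                    reverse=True)
    chosen, spent = [], 0.0
    for n in ranked:
        if spent + cost[n] <= budget:
            chosen.append(n)
            spent += cost[n]
    return chosen
```

In a diamond DAG a→{b, c}→d, nodes b and c land in the same layer and can run in parallel; when the budget only covers one of them, the higher-density node is executed first.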
- 8. A multi-agent reasoning optimization system based on Token calculation acceleration, comprising: a generation and maintenance module, for generating and maintaining a Token-level directed acyclic graph online based on the task dynamics and decision-security requirements of the spacecraft, wherein the Token-level directed acyclic graph is an inference graph used to identify the dependency relationships of Token nodes; a strategy construction module, for constructing an inference-path consistency sharing strategy, an approximate semantic sharing strategy and a forced sharing strategy to perform KV cache sharing, realize cross-agent KV cache multiplexing and reduce repeated encoding; an incremental encoding module, for processing the locally-changing characteristics of spacecraft context data in an incremental encoding mode and performing context difference detection, incremental encoding calculation and KV cache splicing, so as to avoid repeated encoding of the full context and reduce the calculation amount; an evaluation and pruning module, for defining, in view of the task characteristics of the spacecraft, a contribution degree function for each Token to evaluate Token contribution, and introducing a contribution-based dynamic pruning mechanism to avoid generating redundant Tokens in the spacecraft's decision process; and a topology scheduling module, for realizing, based on the spacecraft's computing resources and on the Token-level directed acyclic graph, parallel execution of dependency-free Tokens through topology analysis and dynamic scheduling, so as to shorten critical-path delay.
- 9. An electronic device, comprising a processor, a memory, and a program stored on the memory and executable on the processor, wherein the program, when executed by the processor, implements the multi-agent reasoning optimization method based on Token calculation acceleration of any one of claims 1 to 7.
- 10. A computer-readable storage medium, wherein a program or instructions are stored on the computer-readable storage medium, which, when executed by a processor, implement the multi-agent reasoning optimization method based on Token calculation acceleration of any one of claims 1 to 7.
Description
Multi-agent reasoning optimization method and system based on Token calculation acceleration

Technical Field

The application relates to the technical field of intelligent decision-making, in particular to a multi-agent reasoning optimization method and system based on Token calculation acceleration.

Background

With the development of large language model technology, its powerful capability in complex reasoning, decision and planning tasks has gradually been exploited, and multi-agent collaboration systems based on large language models have become a key technical path for solving complex engineering problems; typical multi-agent collaboration systems such as MetaGPT and ChatDev have appeared successively. Such systems are widely applied in autonomous spacecraft decision-making systems and provide technical support for critical tasks such as task decision planning and fault diagnosis: the overall complex task is decomposed into multiple subtasks, agents with specialized roles such as planner, executor, auditor and coordinator are configured, and cooperative operation of multiple agents is realized through a natural-language dialogue mechanism, so that the application value and application demand in the spacecraft engineering field continue to grow. In the prior art, large language model multi-agent collaboration systems applied to autonomous spacecraft decision systems mainly work in a serial collaboration mode at the dialogue level or the task level.
In this collaboration mode, each agent must take the complete historical dialogue information as input when executing a reasoning task, strict message dependencies form between upstream and downstream agents, and in each reasoning pass an agent must encode the long context information and generate new, complete natural-language text as output; no Token-level intermediate-state sharing mechanism is established between different agents, and only simple collaboration scheduling is achieved through dialogue- or task-level workflow settings. However, this large language model multi-agent collaboration architecture exposes several defects in practical engineering application and can hardly meet the stringent requirements of spacecraft tasks on decision efficiency and resource utilization. First, the highly serial multi-agent collaboration mode causes the system's overall response delay to accumulate linearly with the number of agents; high decision latency can easily cause the spacecraft to miss the optimal window for critical tasks such as orbit adjustment and observation, directly affecting mission success rates. Second, each agent repeatedly performs complete encoding of highly overlapping long context sequences, so the KV cache is recomputed again and again, greatly increasing the calculation amount and over-occupying GPU memory resources, which conflicts with the spacecraft's limited computing-resource allocation. Third, the content generated by different agents contains a large amount of semantically similar material, including unnecessary chain-of-thought content; the total number of Tokens grows linearly with the number of agents, consuming the spacecraft's precious communication bandwidth and computing resources and weakening the system's real-time decision-making capability. Fourth, collaboration scheduling exists only at the dialogue or task level and lacks a Token-level fine-grained optimization mechanism, so intermediate states cannot be fully shared between agents and Token-level parallelism cannot be exploited under the spacecraft's limited computing resources.

Disclosure of Invention

In view of the defects of the prior art, the application provides a multi-agent reasoning optimization method and system based on Token calculation acceleration, which solve the problems that the large language model multi-agent collaboration architecture applied to spacecraft suffers from decision delay, wastes computing and GPU memory resources, consumes communication and computing resources through semantic redundancy in generated content, lacks a Token-level fine-grained optimization mechanism, and can hardly meet the stringent requirements of spacecraft tasks on decision efficiency and resource utilization. In order to achieve the above purpose, the application is realized by the following technical scheme: