CN-122021886-A - Method for fine-tuning intelligent agent and computing equipment

CN122021886A

Abstract

A method for fine-tuning an agent, and a computing device. The method obtains an inference result output by an inference agent for a target task, where the inference result comprises at least an inference conclusion and a logic path, the logic path showing the logical support relationships among the rounds of inference executed by the inference agent for the target task during the inference process. A reward score corresponding to the target task is determined according to the consistency of the logic path, and the parameters of the large language model in the inference agent are adjusted by a reinforcement-learning method according to the reward score. This can improve the logical consistency of the inference agent when facing complex tasks, and improve the reliability of the inference agent's output results.

Inventors

  • Deng Yong
  • Ying Chenzhe
  • Meng Changhua
  • Wang Weiqiang

Assignees

  • 支付宝(杭州)数字服务技术有限公司 (Alipay (Hangzhou) Digital Services Technology Co., Ltd.)

Dates

Publication Date
2026-05-12
Application Date
2026-01-15

Claims (10)

  1. A method of fine-tuning an agent, the method comprising: obtaining an inference result output by an inference agent for a target task, wherein the inference result comprises at least an inference conclusion and a logic path, and the logic path shows the logical support relationships among the rounds of inference executed by the inference agent for the target task during the inference process; determining a reward score corresponding to the target task according to the consistency of the logic path; and adjusting, according to the reward score, the parameters of the large language model in the inference agent by a reinforcement-learning method.
  2. The method of claim 1, wherein obtaining an inference result output by an inference agent for a target task specifically comprises: inputting the target task into the inference agent, so that the inference agent iteratively executes multiple rounds of inference according to the target task and determines the logic path from the multiple rounds of inference, wherein each round of inference comprises executing an inference action and acquiring a corresponding execution result, the inference actions comprise at least an evidence-acquisition action and a conclusion-deduction action, the execution result of the evidence-acquisition action is evidence information, the execution result of the conclusion-deduction action is a sub-conclusion, and the inference conclusion is the sub-conclusion of the final round.
  3. The method of claim 2, wherein the logic path is represented as graph-structured data comprising a number of nodes and directed edges between the nodes, the nodes including acquisition nodes representing evidence-acquisition actions, evidence nodes representing evidence information, and conclusion nodes representing sub-conclusions, and the edges between nodes showing the logical support relationships between the corresponding nodes.
  4. The method of claim 3, wherein the method involves an evaluation model, the evaluation model being a large language model, and determining the reward score corresponding to the target task according to the consistency of the logic path specifically comprises: inputting the logic path into the evaluation model and determining a consistency score of the evaluation model for the logic path; and determining the reward score corresponding to the target task according to the consistency score.
  5. The method of claim 4, wherein inputting the logic path into the evaluation model and determining a consistency score of the evaluation model for the logic path specifically comprises: inputting the logic path and a preset scoring prompt into the evaluation model and determining the consistency score of the evaluation model for the logic path, wherein the scoring prompt specifies a plurality of scoring dimensions comprising at least an evidence-coverage dimension, a reasoning-consistency dimension, an evidence-matching dimension, and a reasoning-efficiency dimension, wherein the evidence-coverage dimension measures how conclusion nodes are connected to evidence nodes, the reasoning-consistency dimension measures the logical consistency between interconnected conclusion nodes, the evidence-matching dimension measures the degree of semantic match between conclusion nodes and the evidence nodes connected to them, and the reasoning-efficiency dimension measures how acquisition nodes are connected to evidence nodes.
  6. The method of claim 4, wherein the inference actions of the first round of inference further comprise a task-decomposition action, the execution result of the task-decomposition action is a plurality of sub-questions obtained by splitting the target task, and the nodes further include question nodes representing the sub-questions.
  7. The method of claim 4, wherein determining the reward score corresponding to the target task according to the consistency score specifically comprises: inputting the inference conclusion into the evaluation model and determining a result score of the evaluation model for the inference conclusion; and determining the reward score corresponding to the target task according to the consistency score and the result score.
  8. The method of claim 3, wherein adjusting, according to the reward score, the parameters of the large language model in the inference agent by a reinforcement-learning method specifically comprises: decomposing the reward score into sub-reward scores corresponding to each node according to the connection relationships among the nodes in the logic path; determining the sub-reward score corresponding to each round according to the sub-reward scores of the corresponding nodes; and adjusting the parameters of the large language model in the inference agent by a reinforcement-learning method according to the sub-reward scores corresponding to each round.
  9. The method of claim 8, wherein adjusting, according to the reward score, the parameters of the large language model in the inference agent by a reinforcement-learning method specifically comprises: determining an actual return according to the reward score and a preset cost-constraint term, wherein the cost-constraint term is determined from the number of times the evidence-acquisition action is executed during the inference process and a preset execution-count threshold; substituting the actual return into a pre-constructed expected-return function and estimating the gradient of the expected-return function by Monte Carlo sampling, wherein the expected-return function represents the expected value of the return obtained under different model parameters; and adjusting the parameters of the large language model in the inference agent according to the estimated gradient.
  10. A computing device comprising a memory and a processor, the memory having executable code stored therein, wherein the processor, when executing the executable code, implements the method of any one of claims 1-9.
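As an illustration only (not part of the patent), the graph-structured logic path of claim 3 and the per-node reward decomposition of claim 8 might be sketched as follows. The node kinds, the out-degree weighting used to split the reward, and all names are assumptions made for the example, not details disclosed in the claims.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    kind: str  # "acquire" | "evidence" | "conclusion"

@dataclass
class LogicPath:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (src_id, dst_id) directed edges

    def add_node(self, node_id, kind):
        self.nodes[node_id] = Node(node_id, kind)

    def add_edge(self, src, dst):
        # A directed edge src -> dst means "src logically supports dst".
        self.edges.append((src, dst))

def decompose_reward(path: LogicPath, total_reward: float) -> dict:
    """Split the task-level reward into per-node sub-rewards in proportion
    to each node's out-degree (a simple stand-in for weighting by the
    'connection relation' among nodes, as claim 8 describes)."""
    out_deg = {nid: 0 for nid in path.nodes}
    for src, _ in path.edges:
        out_deg[src] += 1
    total_deg = sum(out_deg.values()) or 1
    return {nid: total_reward * d / total_deg for nid, d in out_deg.items()}

# Example: one acquisition -> evidence -> sub-conclusion chain.
path = LogicPath()
path.add_node("a1", "acquire")
path.add_node("e1", "evidence")
path.add_node("c1", "conclusion")
path.add_edge("a1", "e1")  # the acquisition action produced this evidence
path.add_edge("e1", "c1")  # the evidence supports the sub-conclusion
subs = decompose_reward(path, total_reward=1.0)
```

Under this toy weighting the terminal conclusion node receives no share of the reward itself; a real decomposition would presumably also credit the node that emits the final conclusion.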

Description

Method for fine-tuning intelligent agent and computing equipment

Technical Field

The embodiments of this specification belong to the technical field of data processing, and in particular relate to a method for fine-tuning an agent and a computing device.

Background

A deep research agent (Deep Research Agent, DR Agent) is an agent system that, like a human researcher, can autonomously complete the full workflow of "propose a hypothesis → collect evidence → repeatedly verify → output a conclusion" in long-horizon, multi-step, cross-source information retrieval, reasoning, and writing tasks. The introduction of deep research agents has turned the large language model from an auxiliary tool in the research process into an end-to-end problem-solving expert: a user can obtain a complete task report simply by describing a complex data-analysis task to the deep research agent (for example, an analysis of the current state of development of an industry, or of the market prospects of an emerging technology), without needing to participate in the analysis process. However, when facing complex tasks that require long chains of reasoning, current deep research agents suffer from problems such as conclusions that do not match the evidence, or even conclusions fabricated out of thin air. This greatly reduces the trustworthiness of the deep research agent's output. The embodiments of this specification therefore provide a technical solution for fine-tuning an agent that at least partially solves the above problems.
Disclosure of Invention

The embodiments of this specification are directed to a method of fine-tuning an agent and a computing device.

A first aspect of this specification provides a method of fine-tuning an agent, the method comprising: obtaining an inference result output by an inference agent for a target task, wherein the inference result comprises at least an inference conclusion and a logic path, and the logic path shows the logical support relationships among the rounds of inference executed by the inference agent for the target task during the inference process; determining a reward score corresponding to the target task according to the consistency of the logic path; and adjusting, according to the reward score, the parameters of the large language model in the inference agent by a reinforcement-learning method.

A second aspect of this specification provides a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method according to the first aspect.

A third aspect of this specification provides a computing device comprising a memory and a processor, the memory having executable code stored therein, wherein the processor, when executing the executable code, implements the method according to the first aspect.

The agent fine-tuning scheme provided by these embodiments can improve the logical consistency of the inference agent when facing complex tasks, and improve the reliability of the inference agent's output results.
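A rough, hypothetical sketch of the training signal in the first aspect: the reward score is reduced by a cost-constraint term when evidence-acquisition calls exceed a threshold (as claim 9 describes), and the gradient of the expected return is estimated by Monte Carlo sampling over rollouts, in the style of a REINFORCE estimator. The function names, the linear penalty form, and all numeric values are illustrative assumptions, not taken from the patent.

```python
def actual_return(reward_score, n_acquisitions, max_acquisitions=5, penalty=0.1):
    """Reward score minus a cost-constraint term that penalizes
    evidence-acquisition calls beyond a preset threshold."""
    overshoot = max(0, n_acquisitions - max_acquisitions)
    return reward_score - penalty * overshoot

def mc_gradient_estimate(log_prob_grads, returns):
    """Monte Carlo (REINFORCE-style) estimate of the gradient of the
    expected return: average of grad(log pi) * return over rollouts."""
    n = len(returns)
    return sum(g * r for g, r in zip(log_prob_grads, returns)) / n

# Toy usage with scalar stand-ins for the policy-gradient terms of two
# sampled rollouts (real gradients would be parameter-sized vectors).
returns = [actual_return(0.8, 7), actual_return(0.9, 3)]
grads = [0.5, -0.2]
g_hat = mc_gradient_estimate(grads, returns)
```

The estimated gradient would then drive the parameter update of the large language model inside the inference agent; the sketch deliberately omits the model itself.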
Drawings

To explain the technical solutions of the embodiments of this specification more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some of the embodiments of this specification; a person skilled in the art could derive other drawings from them without inventive effort.

FIG. 1 is a schematic diagram of the inference process of an inference agent in an embodiment of this specification;

FIG. 2 is a schematic diagram of a node-construction method, taking one round as an example, in an embodiment of this specification;

FIG. 3 is a flow chart of a method for fine-tuning an agent according to an embodiment of this specification.

Detailed Description

To help those skilled in the art better understand the technical solutions in this specification, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of this specification; all other embodiments obtained by a person of ordinary skill in the art without undue burden are intended to fall within the scope of this specification.

First, the technical terms used in this specification are briefly explained. An agent is a software system that can run autonomously with little manual intervention, and has the core capabilities of