CN-121998099-A - Method and computing device for performing agent reasoning based on skill library


Abstract

A method and computing device for performing agent reasoning based on a skill library. The skill library comprises a plurality of planning skills and a plurality of functional skills. Each planning skill comprises a plurality of subtasks corresponding to a class of tasks. Each functional skill corresponds to a class of subtasks and comprises skill description information and tool invocation information corresponding to a plurality of tools. The method comprises: retrieving, for a first task on which reasoning is to be performed, a first planning skill from the skill library, the first planning skill comprising a plurality of first subtasks; retrieving a plurality of first functional skills from the skill library based on the plurality of first subtasks; and performing reasoning, by the agent, based on the first task and the plurality of first functional skills to obtain a first trajectory.
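
The two-stage retrieval summarized in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration, not taken from the patent: the class names `PlanningSkill`, `FunctionalSkill` and `SkillLibrary`, the example skills and tool names, and in particular the token-overlap (Jaccard) similarity, which merely stands in for whatever similarity measure an implementation would actually use. The final reasoning step by the agent is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class PlanningSkill:
    task_class: str        # description of the class of tasks this plan covers
    subtasks: list[str]    # subtasks for that class of tasks


@dataclass
class FunctionalSkill:
    description: str       # skill description information
    tool_calls: list[str]  # tool invocation information for several tools


@dataclass
class SkillLibrary:
    planning: list[PlanningSkill] = field(default_factory=list)
    functional: list[FunctionalSkill] = field(default_factory=list)


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity (assumed stand-in for the real measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve_planning_skill(lib: SkillLibrary, task: str) -> PlanningSkill:
    # Stage 1: pick the planning skill whose task class best matches the task.
    return max(lib.planning, key=lambda p: jaccard(task, p.task_class))


def retrieve_functional_skills(lib: SkillLibrary, subtasks: list[str]) -> list[FunctionalSkill]:
    # Stage 2: for each subtask of the plan, pick the best-matching functional skill.
    return [max(lib.functional, key=lambda f: jaccard(s, f.description)) for s in subtasks]


# Hypothetical library contents, for illustration only.
lib = SkillLibrary(
    planning=[
        PlanningSkill("book a flight ticket", ["search flights", "pay for ticket"]),
        PlanningSkill("summarize a report", ["load report file", "write summary text"]),
    ],
    functional=[
        FunctionalSkill("search flights by date and city", ["flight_api.search", "flight_api.filter"]),
        FunctionalSkill("pay for ticket with a bank card", ["bank_api.auth", "bank_api.pay"]),
        FunctionalSkill("load report file from disk", ["fs.open", "fs.read"]),
    ],
)

plan = retrieve_planning_skill(lib, "book a flight ticket to Shanghai")
skills = retrieve_functional_skills(lib, plan.subtasks)
for subtask, skill in zip(plan.subtasks, skills):
    print(subtask, "->", skill.tool_calls)
```

In this sketch the retrieved functional skills, together with the original task, would then be handed to the agent as context for its reasoning; that call is omitted because the source does not specify its interface.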

Inventors

WANG CHENXI
Yu Zhuoyun
XIE XIN
YAO WUGUANNAN
ZHANG NINGYU
QI XIANG
ZHANG PENG

Assignees

蚂蚁区块链科技(上海)有限公司

Dates

Publication Date: 20260508
Application Date: 20260214

Claims (10)

1. A method of performing agent reasoning based on a skill library, the skill library comprising a plurality of planning skills and a plurality of functional skills, each of the planning skills comprising a plurality of subtasks corresponding to a class of tasks, each of the functional skills corresponding to a class of subtasks and comprising skill description information and tool invocation information corresponding to a plurality of tools, the method comprising: retrieving, for a first task on which reasoning is to be performed, a first planning skill from the skill library, the first planning skill comprising a plurality of first subtasks; retrieving a plurality of first functional skills from the skill library based on the plurality of first subtasks; and performing reasoning, by the agent, based on the first task and the plurality of first functional skills to obtain a first trajectory.
2. The method of claim 1, wherein the retrieving a plurality of first functional skills from the skill library based on the plurality of first subtasks comprises: rewriting the plurality of first subtasks based on the first task to obtain a plurality of second subtasks; and retrieving the plurality of first functional skills from the skill library based on the similarity of each of the second subtasks to each of the functional skills included in the skill library.
3. The method of claim 1, wherein the skill library further comprises a plurality of atomic skills, each of the atomic skills comprising skill description information and tool invocation information corresponding to a single tool, the method further comprising: retrieving a plurality of first atomic skills from the skill library based on the similarity of each of the second subtasks to each of the atomic skills; wherein the performing reasoning, by the agent, based on the first task and the plurality of first functional skills comprises: performing reasoning, by the agent, based on the first task, the plurality of first functional skills, and the plurality of first atomic skills.
4. The method of claim 3, further comprising: extracting, by the agent, a second planning skill from the first trajectory, the second planning skill comprising a plurality of third subtasks; extracting, by the agent, a plurality of second functional skills based on the plurality of third subtasks; and updating the skill library based on the second planning skill and the plurality of second functional skills.
5. The method of claim 4, wherein the performing reasoning, by the agent, based on the first task and the plurality of first functional skills to obtain a first trajectory comprises: performing reasoning a plurality of times, by the agent, based on the first task and the plurality of first functional skills to obtain a plurality of first trajectories; and wherein the extracting, by the agent, a second planning skill from the first trajectory comprises: extracting, by the agent, a plurality of second planning skills from the plurality of first trajectories, respectively.
6. The method of claim 4, wherein the updating the skill library based on the second planning skill and the plurality of second functional skills comprises: clustering the plurality of second functional skills to obtain a plurality of clusters; aggregating, by the agent, the second functional skills in each cluster to obtain a plurality of third functional skills; and updating the skill library based on the plurality of third functional skills.
7. The method of claim 6, wherein the updating the skill library based on the second planning skill and the plurality of second functional skills comprises: filtering the plurality of third functional skills based on any of a function definition, whether hard-coded values are included, whether external code is relied upon, a packaging format, and tool call information, to obtain a plurality of fourth functional skills; and updating the skill library based on the plurality of fourth functional skills obtained by the filtering.
8. The method of claim 7, wherein the updating the skill library based on the plurality of fourth functional skills comprises any of: modifying functional skills in the skill library based on the plurality of fourth functional skills, adding new functional skills to the skill library based on the plurality of fourth functional skills, and leaving the functional skills in the skill library unchanged.
9. The method of claim 1, further comprising: determining a target tool according to historical data of calls made by the agent to tools, wherein a failure rate of the target tool is higher than a preset threshold, or a number of calls to the target tool is lower than a preset threshold; constructing an exploration trajectory based on the target tool; and generating, by the agent, a second task based on the exploration trajectory, the second task being used to update the skill library.
10. A computing device comprising a memory and a processor, the memory having executable code stored therein, wherein the processor, when executing the executable code, implements the method of any one of claims 1-9.
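
The target-tool selection recited in claim 9 can be sketched as follows. This is a minimal illustration under stated assumptions: the `ToolStats` record, the function `select_target_tools`, the tool names, and the two threshold values (`max_failure_rate=0.3`, `min_calls=5`) are all hypothetical. The claim requires only that a tool be selected when its failure rate is above a preset threshold or its call count is below a preset threshold.

```python
from dataclasses import dataclass


@dataclass
class ToolStats:
    calls: int     # total number of times the agent invoked the tool
    failures: int  # how many of those invocations failed


def select_target_tools(
    history: dict[str, ToolStats],
    max_failure_rate: float = 0.3,  # assumed threshold, not from the source
    min_calls: int = 5,             # assumed threshold, not from the source
) -> list[str]:
    """Return tools that fail too often or have been called too rarely."""
    targets = []
    for name, stats in history.items():
        failure_rate = stats.failures / stats.calls if stats.calls else 0.0
        if failure_rate > max_failure_rate or stats.calls < min_calls:
            targets.append(name)
    return sorted(targets)


# Hypothetical call history, for illustration only.
history = {
    "bank_api.pay": ToolStats(calls=40, failures=20),      # fails half the time
    "flight_api.search": ToolStats(calls=100, failures=2),  # healthy and well used
    "fs.read": ToolStats(calls=1, failures=0),             # rarely exercised
}

targets = select_target_tools(history)
print(targets)
```

Under these assumptions, the selected tools would seed the exploration trajectories from which the agent generates second tasks; how those trajectories are constructed is not specified in the claim and is left out here.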

Description

Method and computing device for performing agent reasoning based on skill library

Technical Field

The embodiments of this specification belong to the technical field of large models, and in particular relate to a method and computing device for performing agent reasoning based on a skill library.

Background

A large language model (LLM) may also be referred to simply as a large model. A large language model is a natural language processing model based on deep learning that typically has billions of parameters or more, which gives it strong language understanding and generation capabilities. A large language model may employ the Transformer architecture or variants thereof (e.g., GPT, BERT, etc.), which uses an attention mechanism to model sequence data globally and can handle long-range dependencies efficiently, and therefore performs well on natural language tasks. By pre-training on a large-scale corpus, the large language model learns the statistical characteristics and semantic relationships of language, which gives it generalization capability. The core capabilities of large language models include, but are not limited to, understanding contextual semantics, generating coherent and grammatically correct text, performing logical reasoning, and handling multi-task scenarios. Large language models are generally used in two modes: direct inference and fine-tuning. In the direct inference mode, the user guides the large language model to generate a specific output by designing a prompt. The prompt can be a task description or instruction in text form that is used to elicit the semantic understanding and generation capabilities of the large language model.
In the fine-tuning mode, the large language model is further trained on a small-scale dataset from a particular domain to optimize its performance on a particular task. The strong generalization capability and flexibility of the large language model make it an important tool in the field of artificial intelligence, providing an efficient and accurate solution for automatic text generation and understanding.

In some embodiments, the large language model may also be able to understand and generate data in other modalities (e.g., visual, audio, etc.), in which case it may also be referred to as a multimodal large language model (MLLM). MLLMs provide a richer and more natural interactive experience by integrating multiple types of inputs and outputs, such as text, images, and sound. The core advantage of MLLMs is that they can process and understand information from different modalities and fuse this information to accomplish complex tasks. For example, an MLLM may analyze a picture and generate descriptive text, or generate a corresponding image from a text description. This cross-modal understanding and generation capability gives MLLMs broad application prospects in many fields.

It should be noted that the key technologies of large language models are described in the paper "A Survey of Large Language Models" (arXiv:2303.18223v16, published March 11, 2025), and are not repeated here.

An agent is a concept in artificial intelligence and computer science, generally referring to an autonomous system or entity that is able to sense its environment, make decisions, and take actions to achieve a certain goal. An agent may be a software program, a robot, or another intelligent system. Agents accomplish specific tasks or goals by perceiving and interacting with their environment.
An agent can acquire environmental information through tools such as sensors, cameras, and data interfaces in order to understand the current state of the environment; make decisions through an algorithm or a rule system to determine what action to take in a specific environment to achieve a goal; and execute the corresponding operations through an actuator or an output interface. The agent may output actions based on observations or states, for example by using a large language model (LLM) as its policy model. An agent may call an external tool (e.g., a banking API, etc.) based on the Model Context Protocol (MCP). MCP provides a unified, machine-readable way of describing external tools; that is, it specifies how an agent interacts with external services in LLM applications. The MCP architecture comprises an MCP host, an MCP client, and an MCP server. The MCP host, typically an AI application, is the initiator of the interaction. The MCP client is located in the MCP host and is used for discovering services provided by th