CN-122021694-A - Task processing method, system and training method of tool selector

CN122021694A

Abstract

The invention discloses a task processing method, a task processing system, and a training method for a tool selector. The method comprises: receiving a target subtask, wherein the target subtask is obtained by decomposing a demand task based on a large language model; determining a target probability distribution corresponding to a target tool sequence according to the target subtask and a target tool selector; determining a target selection tool from a plurality of target candidate tools according to the target probability distribution; processing the target subtask with the target selection tool to obtain a task processing result; and sending the task processing result to the large language model. The invention addresses the technical problem that, in related-art schemes in which a large language model calls a plurality of agent models and the agent models select tools to process tasks, the tools selected by the agent models perform poorly.

Inventors

  • ZHOU XIAOMAO
  • JIA QINGMIN
  • XIE RENCHAO
  • ZHANG YAN
  • WANG LIWEN

Assignees

  • 紫金山实验室 (Purple Mountain Laboratories)

Dates

Publication Date
2026-05-12
Application Date
2026-01-08

Claims (11)

  1. A method of task processing, comprising: receiving a target subtask, wherein the target subtask is obtained by decomposing a demand task based on a large language model; determining a target probability distribution corresponding to a target tool sequence according to the target subtask and a target tool selector, wherein the target probability distribution is determined according to selection probabilities respectively corresponding to a plurality of target candidate tools in the target tool sequence, the target tool sequence comprises the plurality of target candidate tools, the target tool selector is obtained by adjusting selector network parameters of an initial tool selector according to a probability difference index between a first probability distribution and a second probability distribution, the first probability distribution is determined according to a sample subtask and a reconstruction model, the second probability distribution is determined according to the sample subtask and the initial tool selector, and a target decision strategy of the large language model is replicated in the reconstruction model; determining a target selection tool from the plurality of target candidate tools according to the target probability distribution; processing the target subtask with the target selection tool to obtain a task processing result; and sending the task processing result to the large language model.
  2. The method of claim 1, wherein before determining the target probability distribution corresponding to the target tool sequence according to the target subtask and the target tool selector, the method further comprises: acquiring the sample subtask; determining the first probability distribution corresponding to a sample tool sequence according to the sample subtask and the reconstruction model, wherein the sample tool sequence comprises a plurality of sample candidate tools; determining the second probability distribution corresponding to the sample tool sequence according to the sample subtask and the initial tool selector, wherein the initial tool selector comprises an objective function, and the objective function aims to maximize an expected reward value obtained by adopting a corresponding sample candidate tool in a preset state; and determining the probability difference index between the first probability distribution and the second probability distribution, and, when the probability difference index is greater than or equal to a first threshold, adjusting the selector network parameters of the initial tool selector until the determined probability difference index is smaller than the first threshold, so as to obtain the target tool selector with the adjusted selector network parameters.
  3. The method of claim 2, wherein determining the second probability distribution corresponding to the sample tool sequence according to the sample subtask and the initial tool selector comprises: determining a current state according to the sample subtask and the selector network parameters of the initial tool selector; controlling an action selection network in the initial tool selector to determine an initial probability distribution corresponding to the sample tool sequence in the current state; controlling a value evaluation network in the initial tool selector to determine, according to the initial probability distribution and the objective function, expected reward values respectively corresponding to the plurality of sample candidate tools in the current state, wherein the value evaluation network is an evaluation network obtained by training an initial evaluation network under a ranking calibration constraint; and adjusting, according to the expected reward values respectively corresponding to the plurality of sample candidate tools in the current state, action selection network parameters corresponding to the action selection network to determine an updated probability distribution, until the second probability distribution that maximizes the expected reward corresponding to the updated probability distribution is obtained.
  4. The method of claim 3, wherein before controlling the value evaluation network in the initial tool selector to determine, according to the initial probability distribution and the objective function, the expected reward values respectively corresponding to the plurality of sample candidate tools in the current state, the method further comprises: determining the ranking calibration constraint when the sample tool sequence comprises the plurality of sample candidate tools, wherein the ranking calibration constraint is used to constrain a ranking similarity between a predicted tool sequence and the sample tool sequence, and a tool ranking scheme corresponding to the predicted tool sequence is the same as a tool ranking scheme corresponding to the sample tool sequence; training the initial evaluation network under the ranking calibration constraint, and determining the ranking similarity between the predicted tool sequence and the sample tool sequence; and when the ranking similarity is smaller than a second threshold, adjusting evaluation network parameters of the initial evaluation network until the determined ranking similarity is greater than or equal to the second threshold, so as to obtain the value evaluation network with the adjusted evaluation network parameters.
  5. The method according to any one of claims 1 to 4, wherein before determining the target probability distribution corresponding to the target tool sequence according to the target subtask and the target tool selector, the method comprises: inputting the target subtask into a tool screening model to obtain the target tool sequence, wherein the tool screening model is obtained through training data and a loss function, the loss function comprises a tool prediction loss term and a reason generation loss term, the tool prediction loss term characterizes a loss value between a predicted tool sequence and an actual tool sequence, the reason generation loss term characterizes a loss value between a predicted ranking reason and an actual ranking reason, and the predicted tool sequence and the predicted ranking reason in the training data are generated by the large language model.
  6. A method of task processing, comprising: acquiring a demand task; decomposing the demand task to obtain a plurality of target subtasks; sending the plurality of target subtasks to corresponding agent models, so that each corresponding agent model receives a target subtask, determines a target probability distribution corresponding to a target tool sequence according to the target subtask and a target tool selector, determines a target selection tool from a plurality of target candidate tools according to the target probability distribution, and processes the target subtask with the target selection tool, wherein the target probability distribution is determined according to selection probabilities respectively corresponding to the plurality of target candidate tools in the target tool sequence, the target tool sequence comprises the plurality of target candidate tools, the target tool selector is obtained by adjusting selector network parameters of an initial tool selector according to a probability difference index between a first probability distribution and a second probability distribution, the first probability distribution is determined according to a sample subtask and a reconstruction model, the second probability distribution is determined according to the sample subtask and the initial tool selector, and a target decision strategy of a large language model is replicated in the reconstruction model; receiving task processing results respectively sent by the plurality of agent models; and obtaining a target result according to the plurality of task processing results.
  7. The method of claim 6, wherein before sending the plurality of target subtasks to the corresponding agent models, the method further comprises: when a corresponding agent model comprises a tool screening model, determining a plurality of preset task topics and selection tools corresponding to the plurality of preset task topics; constructing a plurality of instruction-response pairs according to the plurality of preset task topics and the selection tools corresponding to the plurality of preset task topics; simulating task processing according to the plurality of instruction-response pairs, and determining called tools and un-called tools respectively from the corresponding selection tools; generating training data according to the instruction-response pairs, the called tools and the un-called tools corresponding to the plurality of preset task topics, wherein the training data comprises a predicted tool sequence and a predicted ranking reason corresponding to the plurality of preset task topics; and sending the training data to the agent model.
  8. A method of training a tool selector, comprising: receiving a sample subtask, wherein the sample subtask is obtained by decomposing a sample task with a large language model; determining a first probability distribution corresponding to a sample tool sequence according to the sample subtask and a reconstruction model, wherein the first probability distribution is determined according to selection probabilities respectively corresponding to a plurality of sample candidate tools, the sample tool sequence comprises the plurality of sample candidate tools, and a target decision strategy of the large language model is replicated in the reconstruction model; determining a second probability distribution corresponding to the sample tool sequence according to the sample subtask and an initial tool selector, wherein the initial tool selector comprises an objective function that aims to maximize an expected reward value, and the objective function is used to determine the expected reward value corresponding to a corresponding tool in a preset state; and adjusting selector network parameters of the initial tool selector according to a probability difference index between the first probability distribution and the second probability distribution to obtain a target tool selector with the adjusted selector network parameters, wherein the target tool selector is used to determine a target probability distribution corresponding to a target tool sequence, determine a target selection tool from a plurality of target candidate tools according to the target probability distribution, and process a target subtask with the target selection tool to obtain a task processing result.
  9. A task processing system, comprising a large language model and a plurality of agent models, wherein any one of the plurality of agent models comprises a target tool selector; any one of the plurality of agent models is configured to: receive a target subtask, wherein the target subtask is obtained by decomposing a demand task based on the large language model; determine a target probability distribution corresponding to a target tool sequence according to the target subtask and the target tool selector, wherein the target probability distribution is determined according to selection probabilities respectively corresponding to a plurality of target candidate tools in the target tool sequence, the target tool sequence comprises the plurality of target candidate tools, the target tool selector is obtained by adjusting selector network parameters of an initial tool selector according to a probability difference index between a first probability distribution and a second probability distribution, the first probability distribution is determined according to a sample subtask and a reconstruction model, and the second probability distribution is determined according to the sample subtask and the initial tool selector; determine a target selection tool from the plurality of target candidate tools according to the target probability distribution; process the target subtask with the target selection tool to obtain a task processing result; and send the task processing result to the large language model.
  10. An electronic device, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 8.
  11. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of any one of claims 1 to 8.

Description

Task Processing Method, System and Training Method of Tool Selector

Technical Field

The invention relates to the field of model processing, and in particular to a task processing method, a task processing system and a training method of a tool selector.

Background

In related-art schemes, a large language model calls a plurality of agent models, and the agent models select tools to process tasks. When the agent models select tools for processing, the tools they select can perform poorly. No effective solution to this problem has been proposed so far.

Disclosure of Invention

The embodiments of the invention provide a task processing method, a task processing system and a training method of a tool selector, so as to at least solve the technical problem that, in the related-art scheme in which a large language model calls a plurality of agent models and the agent models select tools to process tasks, the tools selected by the agent models perform poorly.

According to one aspect of the embodiments of the invention, a task processing method is provided, comprising: receiving a target subtask, wherein the target subtask is obtained by decomposing a demand task based on a large language model; determining a target probability distribution corresponding to a target tool sequence according to the target subtask and a target tool selector, wherein the target probability distribution is determined according to selection probabilities respectively corresponding to a plurality of target candidate tools in the target tool sequence, the target tool sequence comprises the plurality of target candidate tools, the target tool selector is obtained by adjusting selector network parameters of an initial tool selector according to a probability difference index between a first probability distribution and a second probability distribution, the first probability distribution is determined according to a sample subtask and a reconstruction model, the second probability distribution is determined according to the sample subtask and the initial tool selector, and a target decision strategy of the large language model is replicated in the reconstruction model; determining a target selection tool from the plurality of target candidate tools according to the target probability distribution; processing the target subtask with the target selection tool to obtain a task processing result; and sending the task processing result to the large language model.
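The patent discloses no source code; the following is a minimal, hypothetical Python sketch of the inference-time flow described above (a subtask comes in, the tool selector outputs a probability distribution over the candidate tool sequence, the highest-probability tool processes the subtask, and the result is returned to the large language model). All names here (Tool, ToolSelector, process_subtask, the two-layer policy network) are illustrative assumptions, not part of the disclosure.

```python
# Hypothetical sketch of the inference-time flow; names and network shapes are assumptions.
from dataclasses import dataclass
from typing import Callable, List
import torch
import torch.nn as nn

@dataclass
class Tool:
    name: str
    run: Callable[[str], str]  # processes a subtask string, returns a result string

class ToolSelector(nn.Module):
    """Maps an embedded subtask to a probability distribution over candidate tools."""
    def __init__(self, embed_dim: int, num_tools: int):
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(), nn.Linear(256, num_tools)
        )

    def forward(self, subtask_embedding: torch.Tensor) -> torch.Tensor:
        # Target probability distribution over the target tool sequence.
        return torch.softmax(self.policy(subtask_embedding), dim=-1)

def process_subtask(subtask: str,
                    subtask_embedding: torch.Tensor,
                    selector: ToolSelector,
                    candidate_tools: List[Tool]) -> str:
    """Select the highest-probability tool and use it to process the subtask."""
    probs = selector(subtask_embedding)            # target probability distribution
    chosen = candidate_tools[int(probs.argmax())]  # target selection tool
    return chosen.run(subtask)                     # task processing result, returned to the LLM
```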
Optionally, before determining the target probability distribution corresponding to the target tool sequence according to the target subtask and the target tool selector, the method further comprises: acquiring the sample subtask; determining the first probability distribution corresponding to a sample tool sequence according to the sample subtask and the reconstruction model, wherein the sample tool sequence comprises a plurality of sample candidate tools; determining the second probability distribution corresponding to the sample tool sequence according to the sample subtask and the initial tool selector, wherein the initial tool selector comprises an objective function, and the objective function aims to maximize an expected reward value obtained by adopting a corresponding sample candidate tool in a preset state; and determining the probability difference index between the first probability distribution and the second probability distribution, and, when the probability difference index is greater than or equal to a first threshold, adjusting the selector network parameters of the initial tool selector until the determined probability difference index is smaller than the first threshold, so as to obtain the target tool selector with the adjusted selector network parameters.

Optionally, determining the second probability distribution corresponding to the sample tool sequence according to the sample subtask and the initial tool selector comprises: determining a current state according to the sample subtask and the selector network parameters of the initial tool selector; controlling an action selection network in the initial tool selector to determine an initial probability distribution corresponding to the sample tool sequence in the current state; controlling a value evaluation network in the initial tool selector to determine, according to the initial probability distribution and the objective function, expected reward values respectively corresponding to the plurality of sample candidate tools in the current state, wherein the value evaluation network is an evaluation network obtained by training an initial evaluation network under a ranking calibration constraint; and adjusting, according to the expected reward values respectively corresponding to the plurality of sample candidate tools in the current state, action selection network parameters corresponding to the action selection network to determine an updated probability distribution, until the second probability distribution that maximizes the expected reward corresponding to the updated probability distribution is obtained.
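The excerpt does not give the "probability difference index" a concrete form. The sketch below assumes it is a KL divergence between the reconstruction model's distribution (which replicates the large language model's decision strategy) and the initial tool selector's distribution, and it adjusts the selector's parameters until the index falls below the first threshold. It is a hypothetical illustration of the training loop described above under those assumptions, not the patented implementation; the function names, the 1e-2 threshold, and the Adam optimizer are placeholders.

```python
# Hypothetical training sketch: distill the reconstruction model's tool-choice
# distribution into the tool selector. KL divergence stands in for the patent's
# "probability difference index" -- an assumption, not the disclosed formula.
import torch

def probability_difference_index(p_first: torch.Tensor, p_second: torch.Tensor) -> torch.Tensor:
    """Assumed form: KL(p_first || p_second) averaged over the batch."""
    eps = 1e-8
    return torch.sum(p_first * (torch.log(p_first + eps) - torch.log(p_second + eps)), dim=-1).mean()

def train_selector(selector, reconstruction_model, sample_subtasks, embed,
                   first_threshold: float = 1e-2, lr: float = 1e-3, max_steps: int = 10_000):
    """Adjust the selector network parameters until the index drops below the first threshold."""
    optimizer = torch.optim.Adam(selector.parameters(), lr=lr)
    for _ in range(max_steps):
        x = torch.stack([embed(s) for s in sample_subtasks])  # embedded sample subtasks
        with torch.no_grad():
            p_first = reconstruction_model(x)                 # first probability distribution (teacher)
        p_second = selector(x)                                # second probability distribution (student)
        index = probability_difference_index(p_first, p_second)
        if index.item() < first_threshold:                    # stop once the distributions are close enough
            break
        optimizer.zero_grad()
        index.backward()
        optimizer.step()
    return selector  # target tool selector with adjusted selector network parameters
```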