CN-122021702-A - Agent construction method, agent construction device, electronic device, and storage medium
Abstract
The invention relates to the technical field of natural language processing and provides an agent construction method, an agent construction device, an electronic device, and a storage medium. The method comprises: generating a response result with an agent model for a training sample, and calculating the agent model's sample mastery of the training sample according to a reward score for the response result; determining a target training mode for the training sample from a supervised learning mode and a reinforcement learning mode according to the sample mastery; determining a target loss value based on the target training mode; and updating parameters of the agent model with the target loss value to obtain a target agent. Because the sample mastery is computed in real time for each training sample and the target training mode is switched adaptively between the supervised learning mode and the reinforcement learning mode accordingly, the method effectively mitigates the training imbalance caused by long-tail data distributions and avoids the data-flow splitting and performance bottlenecks caused by simple staged training.
Inventors
- YAN HAN
- LIU QUAN
- WEI SI
- LIU CONG
Assignees
- iFLYTEK Co., Ltd. (科大讯飞股份有限公司)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-01-29
Claims (13)
- 1. A method of constructing an agent, comprising: generating a response result with an agent model for a training sample, and calculating the agent model's sample mastery of the training sample according to a reward score for the response result; determining a target training mode for the training sample from a supervised learning mode and a reinforcement learning mode according to the sample mastery; determining a target loss value based on the target training mode, wherein, when the target training mode is the supervised learning mode, the target loss value is determined according to a label of the training sample; and updating parameters of the agent model with the target loss value to obtain a target agent.
- 2. The agent construction method according to claim 1, wherein generating a response result with the agent model for a training sample and calculating the agent model's sample mastery of the training sample according to the reward score for the response result comprises: controlling the agent model to perform multiple inference generations for the training sample to obtain multiple response results, and determining a reward score for each response result; and counting the proportion of response results whose reward score exceeds a preset score threshold, and taking that proportion as the agent model's sample mastery of the training sample.
- 3. The method of claim 2, wherein determining the reward score for each response result comprises: obtaining a format-correctness score, a semantic-accuracy score, and a chain-of-thought density penalty term for each response result; and performing a weighted summation of the format-correctness score, the semantic-accuracy score, and the chain-of-thought density penalty term to obtain the reward score corresponding to the response result; wherein the chain-of-thought density penalty term applies a negative penalty to the corresponding reward score when the length of the corresponding response result exceeds a preset length threshold.
- 4. The method of claim 3, wherein determining the chain-of-thought density penalty term comprises: determining the actual generated length of the chain-of-thought portion of the response result; calculating the length difference between the actual generated length and the preset length threshold; and calculating the chain-of-thought density penalty term based on the length difference.
- 5. The method of any one of claims 1 to 4, wherein determining a target training mode for the training sample from a supervised learning mode and a reinforcement learning mode according to the sample mastery comprises: taking the supervised learning mode as the target training mode when the sample mastery is less than or equal to a mastery threshold; and taking the reinforcement learning mode as the target training mode when the sample mastery is greater than the mastery threshold.
- 6. The agent construction method of any one of claims 1 to 4, further comprising, when the target training mode is the supervised learning mode, before determining the target loss value based on the target training mode: constructing an input part of the training sample, wherein the input part comprises a user-intent classification label, a short chain-of-thought guide, and a tool-call instruction; and constructing a label part of the training sample, wherein the label part comprises chain-of-thought analysis content for the user intent and reply generation content; wherein the chain-of-thought analysis content is constrained within a preset first character-count range, the tool-call instruction is constrained to a preset data-exchange format, and the reply generation content is constrained within a preset second character-count range.
- 7. The method of any one of claims 1 to 4, further comprising, after updating the parameters of the agent model with the target loss value to obtain the target agent: receiving a natural-language request input by a user; performing streaming inference with the target agent to generate chain-of-thought content containing an intent category; and, upon detecting that the chain-of-thought content outputs an intent-category end marker, asynchronously triggering an external tool-call request corresponding to the intent category.
- 8. The agent construction method of any one of claims 1 to 4, further comprising, before generating a response result with the agent model for the training sample: acquiring an initial large language model and an initialization data set, wherein the initialization data set comprises instruction-following data with intent classification labels and standard data-format instructions; and performing supervised fine-tuning on the initial large language model with the initialization data set to obtain the agent model.
- 9. The method of any one of claims 1 to 4, wherein determining the target loss value based on the target training mode comprises: when the target training mode is the supervised learning mode, calculating a basic supervised loss based on a standard label of the training sample, and taking the product of the basic supervised loss and a fine-tuning weight parameter as the target loss value; and when the target training mode is the reinforcement learning mode, calculating a basic reinforcement loss based on the reward score, and taking the product of the basic reinforcement loss and a reinforcement weight parameter as the target loss value.
- 10. The agent construction method of claim 9, wherein calculating the basic reinforcement loss based on the reward score comprises: acquiring a reward baseline calculated for the training sample, wherein the reward baseline represents the average reward level of the agent model in its current state; calculating the difference between the reward score and the reward baseline to obtain an advantage value; and calculating the basic reinforcement loss based on the product of the log-probability gradient of the response result generated by the agent model and the advantage value.
- 11. An agent construction device, comprising: a calculation module for generating a response result with an agent model for a training sample, and calculating the agent model's sample mastery of the training sample according to a reward score for the response result; a selection module for determining a target training mode for the training sample from a supervised learning mode and a reinforcement learning mode according to the sample mastery; a determination module for determining a target loss value based on the target training mode, wherein, when the target training mode is the supervised learning mode, the target loss value is determined according to a label of the training sample; and an update module for updating parameters of the agent model with the target loss value to obtain a target agent.
- 12. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the agent construction method of any one of claims 1 to 10 when executing the computer program.
- 13. A non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the agent construction method of any one of claims 1 to 10.
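The mode-selection and loss-formation logic of claims 5, 9, and 10 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the parameter names, the default weights, and the REINFORCE-style form `-log_prob * advantage` for the otherwise unspecified "basic reinforcement loss" are all assumptions.

```python
def target_loss(sample_mastery, mastery_threshold,
                sft_loss, reward, reward_baseline, log_prob,
                ft_weight=1.0, rl_weight=1.0):
    """Select the training mode from the sample mastery (claim 5),
    then form the target loss (claims 9 and 10).

    - mastery <= threshold: supervised mode; loss = sft_loss * ft_weight.
    - mastery >  threshold: reinforcement mode; advantage = reward - baseline,
      basic reinforcement loss = -log_prob * advantage (a REINFORCE-style
      assumption), scaled by rl_weight.
    Returns (mode, loss).
    """
    if sample_mastery <= mastery_threshold:
        return "sft", ft_weight * sft_loss
    advantage = reward - reward_baseline
    return "rl", rl_weight * (-log_prob * advantage)
```

For example, with a mastery threshold of 0.5, a sample at mastery 0.2 stays in the supervised branch, while a sample at mastery 0.8 switches to the reinforcement branch, so well-mastered samples are refined by exploration while poorly mastered ones keep imitating labels.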
Description
Agent construction method, agent construction device, electronic device, and storage medium
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular to an agent construction method, an agent construction device, an electronic device, and a storage medium.
Background
With the rapid development of artificial intelligence technology, agents based on large language models (LLMs) are increasingly used in the field of financial customer service. Existing agent construction methods are typically trained with supervised fine-tuning (SFT) or reinforcement learning (RL). However, SFT relies on high-quality labeled data, has limited generalization capability, and performs poorly on long-tail problems, while pure RL training suffers from cold-start difficulty, training instability, and blind exploration.
Disclosure of Invention
The invention provides an agent construction method, an agent construction device, an electronic device, and a storage medium to address the above defects in the prior art.
The invention provides an agent construction method comprising the following steps: generating a response result with an agent model for a training sample, and calculating the agent model's sample mastery of the training sample according to a reward score for the response result; determining a target training mode for the training sample from a supervised learning mode and a reinforcement learning mode according to the sample mastery; determining a target loss value based on the target training mode, wherein, when the target training mode is the supervised learning mode, the target loss value is determined according to a label of the training sample; and updating parameters of the agent model with the target loss value to obtain a target agent. According to the agent construction method, generating a response result with the agent model for a training sample and calculating the agent model's sample mastery of the training sample according to the reward score for the response result comprises: controlling the agent model to perform multiple inference generations for the training sample to obtain multiple response results, and determining a reward score for each response result; and counting the proportion of response results whose reward score exceeds a preset score threshold, and taking that proportion as the agent model's sample mastery of the training sample.
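The mastery computation described above can be sketched as a small helper: roll the agent model out several times on one sample, score each response, and take the fraction of responses above a preset score threshold. The function name and the threshold default are illustrative assumptions, not values from the patent.

```python
def sample_mastery(reward_scores, score_threshold=0.8):
    """Fraction of sampled responses whose reward exceeds the threshold.

    reward_scores: reward score of each of the multiple inference
    generations for one training sample. The returned proportion is
    used as the agent model's sample mastery of that sample.
    (score_threshold default is an illustrative assumption.)
    """
    if not reward_scores:
        return 0.0
    passed = sum(1 for s in reward_scores if s > score_threshold)
    return passed / len(reward_scores)
```

For instance, four rollouts scoring `[0.9, 0.7, 0.95, 0.85]` give a mastery of 0.75, since three of the four responses clear the 0.8 threshold.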
According to the agent construction method, determining the reward score for each response result comprises the following steps: obtaining a format-correctness score, a semantic-accuracy score, and a chain-of-thought density penalty term for each response result; and performing a weighted summation of the format-correctness score, the semantic-accuracy score, and the chain-of-thought density penalty term to obtain the reward score corresponding to the response result; wherein the chain-of-thought density penalty term applies a negative penalty to the corresponding reward score when the length of the corresponding response result exceeds a preset length threshold. According to the agent construction method, determining the chain-of-thought density penalty term comprises the following steps: determining the actual generated length of the chain-of-thought portion of the response result; calculating the length difference between the actual generated length and the preset length threshold; and calculating the chain-of-thought density penalty term based on the length difference. According to the agent construction method, determining a target training mode for the training sample from a supervised learning mode and a reinforcement learning mode according to the sample mastery comprises the following steps: taking the supervised learning mode as the target training mode when the sample mastery is less than or equal to a mastery threshold; and taking the reinforcement learning mode as the target training mode when the sample mastery is greater than the mastery threshold.
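The reward computation above can be sketched as a weighted sum with a length-overrun penalty. The weights, the penalty rate, and the length threshold below are illustrative assumptions; the patent only specifies that the penalty is negative, applies beyond a preset length threshold, and is computed from the length difference.

```python
def cot_density_penalty(cot_length, length_threshold=256, penalty_rate=0.001):
    """Chain-of-thought density penalty term: zero at or below the
    preset length threshold, and a negative value proportional to the
    length difference above it (rate/threshold are assumptions)."""
    excess = cot_length - length_threshold
    return -penalty_rate * excess if excess > 0 else 0.0

def reward_score(format_score, semantic_score, cot_length,
                 w_format=0.3, w_semantic=0.6, w_penalty=1.0):
    """Weighted sum of the format-correctness score, the semantic-accuracy
    score, and the chain-of-thought density penalty term (weights are
    illustrative assumptions)."""
    penalty = cot_density_penalty(cot_length)
    return (w_format * format_score
            + w_semantic * semantic_score
            + w_penalty * penalty)
```

With these assumed weights, a fully correct response whose chain-of-thought stays under the threshold scores 0.9, while the same response with a 100-token overrun loses 0.1 of reward, discouraging padded reasoning chains.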
According to the agent construction method, when the target training mode is the supervised learning mode, before the target loss value is determined based on the target training mode, the method further comprises the following steps: constructing an input part of the training sample, wherein the input part comprises a user-intent classification label