CN-121981254-A - Training method of reply model, reply acquisition method, device and equipment

CN121981254A

Abstract

The invention provides a training method for a reply model, a reply acquisition method, an apparatus, and a device, relating to the technical field of large models and in particular to artificial intelligence fields such as natural language processing. The method comprises: obtaining a candidate reply model to be trained and a corresponding sample question; obtaining, based on the sample question, output replies of the candidate reply model under each of a plurality of thought-chain segments, wherein the segment lengths of the plurality of thought-chain segments differ; and iterating the candidate reply model based on the output replies and segment lengths of the plurality of thought-chain segments, so as to obtain a trained target reply model.

Inventors

  • Bian Tingcheng
  • Luo Jinchang
  • Cheng Mingquan
  • Wan Fan
  • Xia Xiaoling
  • Li Shu
  • Wang Haiwei

Assignees

  • Beijing Baidu Netcom Science and Technology Co., Ltd. (北京百度网讯科技有限公司)

Dates

Publication Date
2026-05-05
Application Date
2025-12-19

Claims (16)

  1. A method of training a reply model, wherein the method comprises: obtaining a candidate reply model to be trained and a corresponding sample question; obtaining, based on the sample question, output replies of the candidate reply model under each of a plurality of thought-chain segments, wherein the segment lengths of the plurality of thought-chain segments differ; and iterating the candidate reply model based on the output replies of the plurality of thought-chain segments and their segment lengths, so as to obtain a trained target reply model.
  2. The method of claim 1, wherein obtaining, based on the sample question, the output replies of the candidate reply model under each of the plurality of thought-chain segments comprises: performing multiple rounds of thought-chain truncation on the candidate reply model based on a plurality of preset thought-chain truncation positions to obtain a thought-chain segment after each round of truncation, wherein the plurality of truncation positions differ from one another; and obtaining, based on the thought-chain segment of each truncation round, the output reply to the sample question under that thought-chain segment.
  3. The method of claim 2, wherein performing multiple rounds of thought-chain truncation based on the preset plurality of truncation positions comprises: determining the thought-chain truncation position corresponding to any given round, and inserting a stop-thinking token into a calibrated thought chain of the candidate reply model at that position, so as to truncate the calibrated thought chain and obtain a truncated thought-chain segment.
  4. The method of claim 3, further comprising: inputting the sample question into the candidate reply model to obtain a plurality of candidate thought chains of the candidate reply model, and determining, from the plurality of candidate thought chains, the calibrated thought chain for the current training round based on the respective lengths of the candidate thought chains.
  5. The method of claim 1, wherein iterating the candidate reply model based on the output replies and segment lengths of the plurality of thought-chain segments comprises: obtaining a target reward value combination corresponding to the plurality of thought-chain segments based on their output replies and segment lengths; and iteratively optimizing the candidate reply model based on the target reward value combination to obtain the target reply model.
  6. The method of claim 5, wherein obtaining the target reward value combination comprises: obtaining candidate thought-chain segments based on the label answer corresponding to the sample question and the output replies of the plurality of thought-chain segments; determining, from the candidate thought-chain segments, a target thought-chain segment that satisfies a preset thought-chain length constraint of the candidate reply model based on the segment lengths of the candidate thought-chain segments; and obtaining the target reward value combination based on a first reward value for the target thought-chain segment and a second reward value for the remaining thought-chain segments.
  7. The method of claim 6, wherein obtaining the candidate thought-chain segments comprises: determining, from the output replies of the plurality of thought-chain segments, a target output reply that matches the label answer, and determining the thought-chain segment corresponding to the target output reply as a candidate thought-chain segment.
  8. The method of claim 6, wherein determining the target thought-chain segment comprises: acquiring a reference thought-chain segment length of the candidate reply model for the current training round; and, in response to the segment length of a candidate thought-chain segment being less than or equal to the reference length, determining that candidate segment as the target thought-chain segment satisfying the length constraint.
  9. The method of claim 5, wherein iteratively optimizing the candidate reply model based on the target reward value combination comprises: acquiring a historical reward value combination corresponding to historical thought-chain segments of the candidate reply model; obtaining a variance based on the historical and target reward value combinations, and obtaining an advantage value for each of the plurality of thought-chain segments based on that variance; and adjusting the model parameters of the candidate reply model based on the advantage values, then returning to acquire the next sample question to continue training the parameter-adjusted model, so as to obtain the trained target reply model.
  10. The method of claim 5, wherein obtaining the target reward value combination further comprises: in response to the target reward value combination being an abnormal reward value combination, ending the current training round and returning to acquire the next sample question for the next round of training, wherein abnormal reward value combinations include at least an all-1 reward value combination and an all-0 reward value combination.
  11. A reply acquisition method, comprising: acquiring a user question to be replied to and inputting it into a corresponding target reply model, and parsing the user question through the target reply model to determine a target thought-chain segment corresponding to the user question, wherein the target reply model is obtained by the training method of any one of claims 1-10; and obtaining the target reply output by the target reply model based on the target thought-chain segment.
  12. A training apparatus for a reply model, comprising: a first acquisition module configured to acquire a candidate reply model to be trained and a corresponding sample question; a first reply module configured to obtain, based on the sample question, output replies of the candidate reply model under each of a plurality of thought-chain segments, wherein the segment lengths of the thought-chain segments differ; and a training module configured to iterate the candidate reply model based on the output replies and segment lengths of the plurality of thought-chain segments, so as to obtain a trained target reply model.
  13. A reply acquisition device, comprising: a second acquisition module configured to acquire a user question to be replied to, input it into a corresponding target reply model, and parse the user question through the target reply model to determine a target thought-chain segment corresponding to the user question, wherein the target reply model is obtained by the training apparatus of claim 12; and a second reply module configured to obtain the target reply output by the target reply model based on the target thought-chain segment.
  14. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10 and/or claim 11.
  15. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-10 and/or claim 11.
  16. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-10 and/or claim 11.
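The truncation, reward, and round-skipping steps of claims 2-10 can be sketched roughly as follows. This is an illustrative reading, not the patent's implementation: the stop-thinking marker string, the binary correct-and-within-length reward scheme, and the variance-normalized advantage are all assumptions layered on the claim language.

```python
# Illustrative sketch (assumed, not the patented implementation) of the
# claimed training step: truncate a thought chain at several preset
# positions, score each truncated segment's answer, and turn the reward
# combination into per-segment advantage values.

STOP_THINKING = "</think>"  # assumed stop-thinking token


def truncate_chain(thought_tokens, positions):
    """Claims 2-3: one truncated segment per preset truncation position."""
    return [thought_tokens[:p] + [STOP_THINKING] for p in positions]


def reward_combination(answers, lengths, label, max_len):
    """Claims 6-8 (assumed binary scheme): reward 1.0 for a reply that
    matches the label answer with segment length within the reference
    length, else 0.0."""
    return [1.0 if a == label and n <= max_len else 0.0
            for a, n in zip(answers, lengths)]


def advantages(rewards):
    """Claim 9 (sketch): center rewards and scale by their spread."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against an all-equal combination
    return [(r - mean) / std for r in rewards]


def training_step(rewards):
    """Claim 10: skip degenerate all-0 / all-1 reward combinations."""
    if all(r == 0.0 for r in rewards) or all(r == 1.0 for r in rewards):
        return None  # abnormal combination: move on to the next sample
    return advantages(rewards)
```

An all-0 or all-1 combination carries no learning signal under this scheme (every advantage would be zero), which is consistent with the claim's choice to end the round and fetch the next sample.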

Description

Training method of reply model, reply acquisition method, device and equipment

Technical Field

The present disclosure relates to the field of large-model technology, and in particular to artificial intelligence fields such as natural language processing.

Background

With the development of technology, more and more people obtain answers to their questions through intelligent agents. In this scenario, the model deployed in the agent generates answer results based on a pre-built knowledge base; however, in complex scenarios the model may exhibit response delays, which degrades the user's question-answering experience.

Disclosure of Invention

The present disclosure provides a training method for a reply model, a reply acquisition method, an apparatus, and a device. According to a first aspect of the disclosure, a training method for a reply model is provided, comprising: obtaining a candidate reply model to be trained and a corresponding sample question; obtaining, based on the sample question, output replies of the candidate reply model under each of a plurality of thought-chain segments, wherein the segment lengths of the thought-chain segments differ; and iterating the candidate reply model based on the output replies and segment lengths of the plurality of thought-chain segments to obtain a trained target reply model.
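The iteration described in the first aspect relies on a variance computed over reward value combinations (detailed in claim 9). One self-contained, assumed reading is to pool the historical combinations with the current target combination and use the pooled spread to normalize per-segment advantages; the function names and the pooling choice below are illustrative, not confirmed by the disclosure.

```python
# Sketch (assumed reading of the claimed iteration step): pool historical
# reward combinations with the current target combination, estimate their
# variance, and normalize per-segment advantages by the resulting spread.

def pooled_std(historical, target):
    """Standard deviation over all historical reward combinations plus
    the current target combination."""
    rewards = [r for combo in historical for r in combo] + list(target)
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return var ** 0.5


def segment_advantages(historical, target):
    """Advantage value of each thought-chain segment in the target
    reward combination, centered on the target mean."""
    std = pooled_std(historical, target) or 1.0  # guard all-equal rewards
    mean = sum(target) / len(target)
    return [(r - mean) / std for r in target]
```

Segments with above-average reward receive positive advantages and would be reinforced when the model parameters are adjusted; below-average segments are penalized.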
According to a second aspect of the present disclosure, a reply acquisition method is provided, comprising: acquiring a user question to be replied to and inputting it into a corresponding target reply model; parsing the user question through the target reply model to determine a target thought-chain segment corresponding to the user question, wherein the target reply model is obtained by the training method provided in the first aspect; and obtaining the target reply output by the target reply model based on the target thought-chain segment. According to a third aspect of the disclosure, a training apparatus for a reply model is provided, comprising a first acquisition module, a first reply module, and a training module. The first acquisition module is configured to acquire a candidate reply model to be trained and a corresponding sample question; the first reply module is configured to obtain, based on the sample question, output replies of the candidate reply model under each of a plurality of thought-chain segments, wherein the segment lengths of the thought-chain segments differ; and the training module is configured to iterate the candidate reply model based on the output replies and segment lengths of the plurality of thought-chain segments to obtain a trained target reply model.
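The second-aspect reply flow (parse the question into a thought-chain segment, then produce the reply from that segment) can be sketched as follows. The `think`/`answer` interface and the stand-in model class are purely hypothetical illustrations; the disclosure does not specify the model's API.

```python
# Hypothetical sketch of the second-aspect reply acquisition flow.
# DummyReplyModel is an illustrative stand-in for the trained target
# reply model; its method names are assumptions, not the patent's API.

class DummyReplyModel:
    """Stand-in for the trained target reply model (illustrative only)."""

    def think(self, question):
        # Parse the question to determine its target thought-chain segment.
        return f"analysis of: {question}"

    def answer(self, question, thought_segment):
        # Produce the final reply from the chosen thought-chain segment.
        return f"reply to '{question}' using [{thought_segment}]"


def acquire_reply(model, user_question):
    """Second aspect: determine the target thought-chain segment for the
    user question, then obtain the reply output under that segment."""
    segment = model.think(user_question)
    return model.answer(user_question, segment)
```

Because the model was trained on truncated thought chains of varying lengths, the intent appears to be that the deployed model reaches a reply with a shorter thinking phase, reducing the response delay described in the Background.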
According to a fourth aspect of the present disclosure, a reply acquisition device is provided, comprising: a second acquisition module configured to acquire a user question to be replied to, input it into a corresponding target reply model, and parse the user question through the target reply model to determine a target thought-chain segment corresponding to the user question, wherein the target reply model is obtained by the training apparatus provided in the third aspect; and a second reply module configured to obtain the target reply output by the target reply model based on the target thought-chain segment. According to a fifth aspect, an electronic device is provided, comprising at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the training method of the first aspect and/or the reply acquisition method of the second aspect. According to a sixth aspect, a non-transitory computer-readable storage medium is provided, storing computer instructions for causing a computer to execute the training method of the first aspect. According to a seventh aspect, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the training method of the first aspect and/or the reply acquisition method of the second aspect. It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor to limit its scope.
Other features of the present disclosure will become apparent from the following specification.

Drawings

The drawings are