CN-121997937-A - Model training method, device, equipment and storage medium

CN121997937A

Abstract

The application discloses a model training method, apparatus, device, and storage medium in the technical field of computers. The method comprises: generating at least one output text based on an input prompt text through a reference policy model; for each output text, obtaining context entropy information of the output text and determining update weight information of the output text based on that context entropy information, wherein the update weight information comprises an update weight of at least one token in the output text and the update weight of a token indicates the degree of influence of the token on the parameters of the current policy model; and adjusting the parameters of the current policy model based on the update weight information of each output text. The method realizes token-level fine-grained adjustment of the parameters of the current policy model, thereby improving the learning efficiency of the current policy model.
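The abstract describes computing, for each token of a generated output text, the entropy of the reference model's generation distribution as a measure of its uncertainty. A minimal sketch of such a per-token context entropy computation (the function name and input layout are illustrative assumptions, not from the patent):

```python
# Hypothetical sketch: per-token context entropy as the Shannon entropy of
# the reference model's next-token probability distribution at each position.
import math

def token_entropies(prob_dists):
    """prob_dists: one probability distribution per generated token,
    each a list of candidate-token probabilities summing to 1."""
    entropies = []
    for dist in prob_dists:
        # H = -sum(p * log p); zero-probability candidates contribute nothing.
        h = -sum(p * math.log(p) for p in dist if p > 0.0)
        entropies.append(h)
    return entropies
```

A uniform distribution over four candidates yields the maximum entropy log 4, while a one-hot distribution yields 0, matching the intuition that high entropy marks tokens the reference model was uncertain about.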

Inventors

  • Request for anonymity

Assignees

  • 摩尔线程智能科技(北京)股份有限公司

Dates

Publication Date
2026-05-08
Application Date
2026-01-28

Claims (15)

  1. A model training method, the method comprising: generating at least one output text based on an input prompt text through a reference policy model, the reference policy model being used to assist in training a current policy model; for each output text, obtaining context entropy information of the output text, wherein the context entropy information comprises a context entropy of at least one token in the output text, and the context entropy of a token indicates the uncertainty of the reference policy model when generating the token; determining update weight information of the output text based on the context entropy information of the output text, wherein the update weight information comprises an update weight of at least one token in the output text, and the update weight of a token indicates the degree of influence of the token on the parameters of the current policy model; and adjusting the parameters of the current policy model based on the update weight information of each output text.
  2. The method of claim 1, wherein obtaining the context entropy information of the output text comprises: acquiring reference probability distribution information of the output text, which indicates the generation probability distribution of the reference policy model when generating the output text; and determining the context entropy information of the output text based on the reference probability distribution information of the output text.
  3. The method of claim 2, wherein the reference probability distribution information of the output text includes token reference probability information of at least one token in the output text, the token reference probability information indicating the generation probability distribution of the reference policy model over at least one candidate token when generating the token; and determining the context entropy information of the output text based on the reference probability distribution information comprises: for each token in the output text, determining the context entropy of the token based on the token reference probability information of the token; and determining the context entropy of at least one token in the output text as the context entropy information of the output text.
  4. The method of claim 1, wherein obtaining the context entropy information of the output text comprises: acquiring attention information of the output text, which indicates the degree of attention of the reference policy model to at least one token in the output text; and determining the context entropy information of the output text based on the attention information of the output text.
  5. The method of claim 4, wherein the attention information of the output text includes attention weight information of at least one token in the output text, the attention weight information of a token comprising an attention weight of at least one preceding token, a preceding token being a token generated before the token is generated, and the attention weight of a preceding token indicating its degree of importance to generating the token; and determining the context entropy information of the output text based on the attention information comprises: for each token in the output text, determining the context entropy of the token based on the attention weight information of the token; and determining the context entropy of at least one token in the output text as the context entropy information of the output text.
  6. The method of claim 1, wherein determining the update weight information of the output text based on the context entropy information comprises: for each token in the output text, determining a correction coefficient of the token based on the context entropy of the token, the correction coefficient indicating how critical the token is when the reference policy model generates the output text; acquiring a soft gating weight of the token, the soft gating weight controlling the adjustment amplitude of the parameters of the current policy model; determining the update weight of the token based on the correction coefficient of the token and the soft gating weight of the token; and determining the update weight of at least one token in the output text as the update weight information of the output text.
  7. The method of claim 6, wherein determining the correction coefficient of the token based on the context entropy of the token comprises: acquiring a preset maximum context entropy and a preset minimum context entropy; and determining the correction coefficient of the token based on the maximum context entropy, the minimum context entropy, and the context entropy of the token.
  8. The method of claim 6, wherein acquiring the soft gating weight of the token comprises: acquiring a reference generation probability of the token, which indicates the likelihood that the reference policy model generates the token in the process of generating the output text; acquiring a current generation probability of the token, which indicates the likelihood that the current policy model generates the token in the process of generating the output text; calculating an importance ratio of the token based on the reference generation probability and the current generation probability of the token, the importance ratio indicating the degree of deviation between the reference policy model and the current policy model when generating the token; and determining the soft gating weight of the token based on the importance ratio of the token.
  9. The method of claim 6, wherein determining the update weight of the token based on the correction coefficient of the token and the soft gating weight of the token comprises: determining the product of the correction coefficient of the token and the soft gating weight of the token as the update weight of the token.
  10. The method according to any one of claims 1 to 9, wherein adjusting the parameters of the current policy model based on the update weight information of each output text comprises: determining a loss function value of the current policy model based on the update weight information of each output text; and adjusting the parameters of the current policy model based on the loss function value.
  11. The method of claim 10, wherein determining the loss function value of the current policy model based on the update weight information of each output text comprises: acquiring a reward score of each output text, the reward score indicating the degree of matching between the output text and the target task indicated by the input prompt text; and determining the loss function value of the current policy model based on the update weight information, the reward score, and the importance information of each output text, wherein the importance information of an output text comprises an importance ratio of at least one token in the output text, and the importance ratio of a token indicates the degree of deviation between the reference policy model and the current policy model when generating the token.
  12. A model training apparatus, the apparatus comprising: a generation module configured to generate at least one output text based on an input prompt text through a reference policy model, the reference policy model being used to assist in training a current policy model; an acquisition module configured to acquire, for each output text, context entropy information of the output text, wherein the context entropy information comprises a context entropy of at least one token in the output text, and the context entropy of a token indicates the uncertainty of the reference policy model when generating the token; a determination module configured to determine update weight information of the output text based on the context entropy information of the output text, wherein the update weight information comprises an update weight of at least one token in the output text, and the update weight of a token indicates the degree of influence of the token on the parameters of the current policy model; and an adjustment module configured to adjust the parameters of the current policy model based on the update weight information of each output text.
  13. A computer device comprising a processor and a memory, the memory having stored therein a computer program that is loaded and executed by the processor to implement the method of any one of claims 1 to 11.
  14. A computer readable storage medium, characterized in that the storage medium has stored therein a computer program that is executed by a processor to implement the method of any one of claims 1 to 11.
  15. A computer program product, characterized in that the computer program product comprises a computer program that is loaded and executed by a processor to implement the method of any one of claims 1 to 11.
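Claims 6 to 9 describe combining an entropy-based correction coefficient with a soft gating weight derived from the importance ratio, with the update weight as their product. The claims do not specify the normalization or gating functions, so the min-max normalization and PPO-style clipping below are illustrative assumptions:

```python
# Hedged sketch of claims 6-9. The exact functional forms are not given in
# the claims; min-max normalization and a clipped ratio are assumptions.

def correction_coefficient(entropy, h_min, h_max):
    # Claim 7: map the token's context entropy into [0, 1] using preset
    # minimum and maximum context entropy values.
    h = min(max(entropy, h_min), h_max)
    return (h - h_min) / (h_max - h_min)

def soft_gating_weight(p_current, p_reference, clip=0.2):
    # Claim 8: the importance ratio measures the deviation between the
    # current and reference policy models on this token; a PPO-style clip
    # is assumed as the gating function.
    ratio = p_current / p_reference
    return min(max(ratio, 1.0 - clip), 1.0 + clip)

def update_weight(entropy, p_current, p_reference, h_min=0.0, h_max=5.0):
    # Claim 9: update weight = correction coefficient * soft gating weight.
    return correction_coefficient(entropy, h_min, h_max) * \
        soft_gating_weight(p_current, p_reference)
```

High-entropy tokens (uncertain for the reference model) thus receive larger update weights, while the gating term damps tokens where the two models already diverge strongly.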

Description

Model training method, device, equipment and storage medium

Technical Field

The embodiments of the application relate to the technical field of computers, and in particular to a model training method, apparatus, device, and storage medium.

Background

Currently, with the development of computer technology, large language models are widely used in many fields. Large language models require extensive training to acquire strong language understanding capabilities. In the related art, at least one output text is generated from an input prompt text by a large language model, and the at least one output text is then scored by a reward model to obtain a reward value for each output text. The parameters of the large language model are adjusted based on the reward value of the at least one output text to train the large language model. Because the related art evaluates each output text as a whole and adjusts the large language model at the level of whole output texts, learning is poor at key reasoning steps during training, resulting in low learning efficiency of the large language model.

Disclosure of Invention

The embodiments of the application provide a model training method, apparatus, device, and storage medium.
The technical scheme provided by the embodiments of the application is as follows. According to one aspect of the embodiments of the application, there is provided a model training method comprising: generating at least one output text based on an input prompt text through a reference policy model, the reference policy model being used to assist in training a current policy model; for each output text, obtaining context entropy information of the output text, wherein the context entropy information comprises a context entropy of at least one token in the output text, and the context entropy of a token indicates the uncertainty of the reference policy model when generating the token; determining update weight information of the output text based on the context entropy information of the output text, wherein the update weight information comprises an update weight of at least one token in the output text, and the update weight of a token indicates the degree of influence of the token on the parameters of the current policy model; and adjusting the parameters of the current policy model based on the update weight information of each output text.
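Claims 10 and 11 describe a loss that combines each output text's per-token update weights and importance ratios with a sequence-level reward score. A minimal sketch under assumed simplifications (the aggregation by mean over tokens and the use of the reward as a sequence-level advantage are not specified in the text):

```python
# Hedged sketch of claims 10-11: weight each token's policy-gradient term
# by its update weight, importance ratio, and the output's reward score.
# Input layout and aggregation are illustrative assumptions.

def policy_loss(outputs):
    """outputs: list of dicts with per-token lists 'update_weights',
    'log_probs' (under the current policy model), 'ratios' (importance
    ratios), and a scalar 'reward' for the whole output text."""
    total, count = 0.0, 0
    for out in outputs:
        for w, lp, r in zip(out["update_weights"],
                            out["log_probs"],
                            out["ratios"]):
            # Negated weighted term: minimizing the loss increases the
            # log-probability of tokens in high-reward outputs.
            total += -w * r * out["reward"] * lp
            count += 1
    return total / max(count, 1)
```

Gradient descent on this value adjusts the current policy model's parameters token by token, with the update weights scaling each token's contribution as described in claim 1.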
According to one aspect of the embodiments of the application, there is provided a model training apparatus comprising: a generation module configured to generate at least one output text based on an input prompt text through a reference policy model, the reference policy model being used to assist in training a current policy model; an acquisition module configured to acquire, for each output text, context entropy information of the output text, wherein the context entropy information comprises a context entropy of at least one token in the output text, and the context entropy of a token indicates the uncertainty of the reference policy model when generating the token; a determination module configured to determine update weight information of the output text based on the context entropy information of the output text, wherein the update weight information comprises an update weight of at least one token in the output text, and the update weight of a token indicates the degree of influence of the token on the parameters of the current policy model; and an adjustment module configured to adjust the parameters of the current policy model based on the update weight information of each output text. According to one aspect of the embodiments of the application, there is provided a computer device comprising a processor and a memory, the memory having stored therein a computer program that is loaded and executed by the processor to implement the above model training method. According to one aspect of the embodiments of the application, there is provided a computer readable storage medium having stored therein a computer program that is loaded and executed by a processor to implement the above model training method.
According to one aspect of the embodiments of the application, there is provided a computer program product comprising a computer program that is loaded and executed by a processor to implement the above model training method. The technical scheme provided by the embodiments of the application has at least the following beneficial effects: after at least one output text is generated through the reference policy model, the context entropy of each token in the output text is obtained to evaluate the uncertainty of the reference policy model in