
CN-122021968-A - Reinforcement learning method and device for a large model


Abstract

A reinforcement learning method and device for a large model. In the method, a user question is input into a large model to be trained to obtain a plurality of responses and a confidence corresponding to each response, and correctness rewards of the responses are determined according to the correct answer to the user question. For any response, a weighting operation and a clipping operation are performed based on the confidence of the response and the correctness reward of the response to obtain the reward of the response. The weighting operation makes the reward of a high-confidence correct response greater than that of a low-confidence correct response, and the reward of a high-confidence incorrect response smaller than that of a low-confidence incorrect response. The clipping operation takes the lower bound as the reward when the reward of a correct response is below the lower bound, and takes the upper bound as the reward when the reward of an incorrect response is above the upper bound. A model update based on a reinforcement learning algorithm is then performed according to the rewards of the plurality of responses to update the large model. The training data used in the training process requires privacy protection.
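In the notation of the claims below, writing c for the confidence, L ∈ [0, 1] for the lower bound and U ∈ [−1, 0] for the upper bound, the weighting and clipping operations compose into a single piecewise rule (a restatement of the abstract, not an additional step):

r = max(c, L) if the response is correct; r = min(−c, U) if the response is incorrect.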

Inventors

  • CHEN ZHONGQI
  • ZHANG BAINAN
  • SONG BOWEN

Assignees

支付宝(杭州)数字服务技术有限公司 (Alipay (Hangzhou) Digital Service Technology Co., Ltd.)

Dates

Publication Date
20260512
Application Date
20260121

Claims (10)

  1. A reinforcement learning method for a large model, comprising: inputting a user question into a large model to be trained to obtain a plurality of responses and a confidence corresponding to each response; determining correctness rewards respectively corresponding to the plurality of responses according to the correct answer to the user question, wherein a correctness reward indicates whether the corresponding response is a correct response or an incorrect response; for any response, performing a weighting operation and a clipping operation based on the confidence of the response and the correctness reward of the response to obtain the reward of the response, wherein the weighting operation weights the correctness reward of the response with the confidence of the response as the weight, so that the reward of a high-confidence correct response is greater than that of a low-confidence correct response, and the reward of a high-confidence incorrect response is smaller than that of a low-confidence incorrect response; and performing a model update based on a reinforcement learning algorithm according to the rewards of the plurality of responses to update the large model.
  2. The method of claim 1, wherein the step of inputting the user question into the large model to be trained comprises: inputting the user question into the large model to obtain the plurality of responses; and inputting the user question and the plurality of responses into a version-0 large model to obtain the confidences respectively corresponding to the plurality of responses, wherein the version-0 large model is the large model before reinforcement learning is performed on it.
  3. The method of claim 2, wherein the step of inputting the user question and the plurality of responses into the version-0 large model comprises: inputting the user question and the plurality of responses into the version-0 large model, and determining, through the version-0 large model, the confidence of each text unit in any response; and for any response, calculating the geometric mean of the confidences of the text units in the response and taking the geometric mean as the confidence of the response.
  4. The method of claim 1, wherein the step of performing a weighting operation and a clipping operation based on the confidence of the response and the correctness reward of the response comprises: weighting the correctness reward of the response with the confidence of the response as the weight to obtain a confidence-weighted reward; and clipping the confidence-weighted reward according to a lower bound and an upper bound to obtain the reward of the response.
  5. The method of claim 4, wherein the step of weighting the correctness reward of the response with the confidence of the response as the weight comprises: multiplying the confidence of the response by the correctness reward of the response, and taking the resulting product as the confidence-weighted reward.
  6. The method of claim 4, wherein the step of clipping the confidence-weighted reward according to the lower bound and the upper bound comprises: when the response is a correct response and the confidence-weighted reward is below the lower bound, taking the lower bound as the reward of the response; when the response is an incorrect response and the confidence-weighted reward is above the upper bound, taking the upper bound as the reward of the response; and taking the confidence-weighted reward as the reward of the response when the response is a correct response and the confidence-weighted reward is not less than the lower bound, or when the response is an incorrect response and the confidence-weighted reward is not greater than the upper bound.
  7. The method of claim 1, wherein the correctness reward of a correct response has a value of 1, the correctness reward of an incorrect response has a value of -1, the confidence has a value in the range of 0 to 1, the lower bound has a value in the range of 0 to 1, and the upper bound has a value in the range of -1 to 0.
  8. A reinforcement learning device for a large model, comprising: a response determining module configured to input a user question into a large model to be trained to obtain a plurality of responses and a confidence corresponding to each response; a correctness judgment module configured to determine correctness rewards respectively corresponding to the plurality of responses according to the correct answer to the user question, wherein a correctness reward indicates whether the corresponding response is a correct response or an incorrect response; a reward determining module configured to perform, for any response, a weighting operation and a clipping operation based on the confidence of the response and the correctness reward of the response to obtain the reward of the response, wherein the weighting operation weights the correctness reward of the response with the confidence of the response as the weight, so that the reward of a high-confidence correct response is greater than that of a low-confidence correct response, and the reward of a high-confidence incorrect response is smaller than that of a low-confidence incorrect response; and a model update module configured to perform a model update based on a reinforcement learning algorithm according to the rewards of the plurality of responses to update the large model.
  9. A computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any one of claims 1-7.
  10. A computing device comprising a memory and a processor, wherein executable code is stored in the memory, and the processor, when executing the executable code, implements the method of any one of claims 1-7.
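
To make the reward construction of claims 4 through 7 concrete, the following is a minimal Python sketch, not the patent's own implementation. The default bound values 0.2 and -0.2 are illustrative assumptions; claim 7 only constrains the lower bound to the range [0, 1] and the upper bound to the range [-1, 0], and all function and parameter names are invented here.

```python
def shaped_reward(confidence: float, is_correct: bool,
                  lower_bound: float = 0.2, upper_bound: float = -0.2) -> float:
    """Confidence-weighted, clipped reward for one response (claims 4-7).

    confidence is in [0, 1]; the correctness reward is +1 for a correct
    response and -1 for an incorrect one (claim 7). The default bounds are
    illustrative assumptions within the ranges given in claim 7.
    """
    correctness_reward = 1.0 if is_correct else -1.0
    weighted = confidence * correctness_reward     # weighting (claim 5)
    if is_correct and weighted < lower_bound:      # clipping, first case (claim 6)
        return lower_bound
    if not is_correct and weighted > upper_bound:  # clipping, second case (claim 6)
        return upper_bound
    return weighted                                # otherwise unchanged (claim 6)

# High-confidence correct responses earn more than low-confidence ones,
# and high-confidence errors are penalized harder than low-confidence ones:
print(shaped_reward(0.9, True))   # 0.9
print(shaped_reward(0.1, True))   # 0.2 (raised to the lower bound)
print(shaped_reward(0.1, False))  # -0.2 (lowered to the upper bound)
print(shaped_reward(0.9, False))  # -0.9
```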

Description

Reinforcement learning method and device for a large model

Technical Field

One or more embodiments of the present disclosure relate to the field of machine learning, and in particular to a reinforcement learning method and apparatus for a large model.

Background

Large language models (LLMs) have made breakthrough progress in recent years, and their excellent performance on various complex tasks has attracted a great deal of attention. However, to better align LLMs with human intent or a particular task goal, a reinforcement learning (RL) phase is typically entered after pre-training and instruction fine-tuning. In the RL phase, the quality of the reward signal is critical to the learning effect and stability of the model. Privacy protection is also required when the sample data used in the training process contains private data. At present, an improved scheme is desired that can improve the training effect and stability when reinforcement learning is performed on a large model, thereby improving the performance of the large model.

Disclosure of Invention

One or more embodiments of the present specification describe a reinforcement learning method and apparatus for a large model, to improve the training effect and stability when reinforcement learning is performed on the large model, thereby improving the performance of the large model. The specific technical scheme is as follows.

In a first aspect, an embodiment provides a reinforcement learning method for a large model, including: inputting a user question into a large model to be trained to obtain a plurality of responses and a confidence corresponding to each response, wherein the confidence represents the fluency and certainty of the corresponding response; determining correctness rewards respectively corresponding to the plurality of responses according to the correct answer to the user question, wherein a correctness reward indicates whether the corresponding response is a correct response or an incorrect response; for any response, performing a weighting operation and a clipping operation based on the confidence of the response and the correctness reward of the response to obtain the reward of the response, wherein the weighting operation weights the correctness reward of the response with the confidence of the response as the weight, so that the reward of a high-confidence correct response is greater than that of a low-confidence correct response, and the reward of a high-confidence incorrect response is smaller than that of a low-confidence incorrect response; and performing a model update based on a reinforcement learning algorithm according to the rewards of the plurality of responses to update the large model.
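The first aspect leaves the choice of reinforcement learning algorithm open. As one hedged possibility only, the sketch below applies the shaped rewards (see the sketch following the claims) in a REINFORCE-style policy-gradient update with a group-mean baseline; the baseline choice and all names are illustrative assumptions, not the patent's specified algorithm.

```python
import torch

def policy_gradient_loss(response_logprobs: torch.Tensor,
                         rewards: torch.Tensor) -> torch.Tensor:
    """Illustrative policy-gradient loss over one group of sampled responses.

    response_logprobs: summed log-probability of each sampled response under
                       the large model being trained, shape (num_responses,).
    rewards:           shaped rewards for the same responses, shape
                       (num_responses,).
    Subtracting the group mean as a baseline is an assumption (the patent
    does not name the RL algorithm); it reduces gradient variance within
    the group of responses to the same question.
    """
    advantages = rewards - rewards.mean()
    # Only the log-probabilities carry gradient; advantages are constants.
    return -(advantages.detach() * response_logprobs).mean()
```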
In one implementation, the step of inputting the user question into the large model to be trained includes: inputting the user question into the large model to obtain the plurality of responses; and inputting the user question and the plurality of responses into a version-0 large model to obtain the confidences respectively corresponding to the plurality of responses, wherein the version-0 large model is the large model before reinforcement learning is performed on it.

In one implementation, the step of inputting the user question and the plurality of responses into the version-0 large model includes: inputting the user question and the plurality of responses into the version-0 large model, and determining, through the version-0 large model, the confidence of each text unit in any response; and for any response, calculating the geometric mean of the confidences of the text units in the response and taking the geometric mean as the confidence of the response.

In one implementation, the step of performing a weighting operation and a clipping operation based on the confidence of the response and the correctness reward of the response includes: weighting the correctness reward of the response with the confidence of the response as the weight to obtain a confidence-weighted reward; and clipping the confidence-weighted reward according to a lower bound and an upper bound to obtain the reward of the response.

In one implementation, the step of weighting the correctness reward of the response with the confidence of the response as the weight includes: multiplying the confidence of the response by the correctness reward of the response, and taking the resulting product as the confidence-weighted reward.

In one implementation, the step of clipping the confidence-weighted reward according to the lower bound and the upper bound includes: when the response is a correct response and the confidence-weighted reward is below the lower bound, taking the lower bound as the reward of the response; when the response is an incorrect response and the confidence-weighted reward is above the upper bound, taking the upper bound as the reward of the response; and taking the confidence-weighted reward as the reward of the response when the response is a correct response and the confidence-weighted reward is not less than the lower bound, or when the response is an incorrect response and the confidence-weighted reward is not greater than the upper bound.
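A minimal sketch of the geometric-mean confidence described above, assuming per-token log-probabilities have already been obtained by scoring the (question, response) pair with the frozen version-0 model; the function name and the choice of tokens as text units are illustrative assumptions.

```python
import torch

def response_confidence(token_logprobs: torch.Tensor) -> float:
    """Geometric mean of per-token confidences, computed in log space.

    token_logprobs: log-probabilities assigned by the frozen version-0 model
                    to the tokens of one response, shape (num_tokens,).
    exp(mean(log p_i)) equals the geometric mean of p_1..p_n and stays in
    [0, 1] regardless of response length, so long and short responses
    remain comparable.
    """
    return torch.exp(token_logprobs.mean()).item()

# Example: three tokens with probabilities 0.9, 0.8, 0.7.
probs = torch.tensor([0.9, 0.8, 0.7])
print(response_confidence(torch.log(probs)))  # ≈ 0.796
```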