CN-122021928-A - Method and device for updating a reward rule set and for reinforcement-learning-based model training
Abstract
The embodiments of this specification disclose a method and device for updating a reward rule set, and a reinforcement-learning-based model training method and device. The updating method first samples, from a training sample library, a batch of abnormal samples whose reward scores meet a preset abnormal condition, where each training sample comprises a sample question, an answer output by a first large language model for the sample question, and a reward score obtained by using a second large language model to evaluate the reasoning steps in the sample answer according to a current reward rule set. The batch of abnormal samples is then processed with a third large language model to obtain an optimization suggestion for the current reward rule set, and the current reward rule set is updated based on the optimization suggestion. The updated reward rule set is used for reinforcement learning of the first large language model.
Inventors
- WANG YUAN
- WEI PENG
- GU JINJIE
- LIU JUNWEI
- CHEN ZHE
- LIU JINGNAN
- YIN JIAJUN
- LIAO XINHAO
- YU AILING
- XIAO HANSONG
- ZHOU HUALEI
- GUO CHUNXIAO
Assignees
- 支付宝(杭州)数字服务技术有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20260213
Claims (16)
- 1. A method of updating a reward rule set, comprising: sampling, based on a training sample library, a batch of abnormal samples whose reward scores meet a preset abnormal condition, wherein each training sample comprises a sample question, an answer output by a first large language model for the sample question, and a reward score, the reward score being obtained by evaluating the reasoning steps in the sample answer with a second large language model according to a current reward rule set; processing the batch of abnormal samples with a third large language model to obtain an optimization suggestion for the current reward rule set; and updating the current reward rule set based on the optimization suggestion.
- 2. The method of claim 1, wherein the preset abnormal condition comprises the reward score being below a lower score threshold and/or the reward score being above an upper score threshold.
- 3. The method of claim 1, wherein the preset abnormal condition comprises a generation probability of the second large language model for the reward score being less than a probability threshold.
- 4. The method of claim 1, wherein the preset abnormal condition comprises a variance among a plurality of reward scores corresponding to a plurality of reasoning steps in the answer being greater than a variance threshold.
- 5. The method of claim 1, wherein the preset abnormal condition comprises the answer being identified by a fourth large language model as exploiting a rule loophole to obtain an excessively high reward score.
- 6. The method of claim 1, wherein sampling, based on a training sample library, a batch of abnormal samples whose reward scores meet a preset abnormal condition comprises: sampling the batch of abnormal samples from an abnormal sample sub-library in the training sample library.
- 7. The method of claim 1, wherein processing the batch of abnormal samples with a third large language model to obtain an optimization suggestion for the current reward rule set comprises: inputting the batch of abnormal samples, the current reward rule set and a task description of the rule optimization task together into the third large language model to obtain the optimization suggestion.
- 8. The method of claim 1, wherein the optimization suggestion includes one or more of adding new rules, modifying existing rules, and deleting existing rules.
- 9. A reinforcement-learning-based model training method, comprising: querying a current reward rule set updated by the method of claim 1; evaluating, with a second large language model, the reasoning steps in historical answers based on the current reward rule set to obtain corresponding reward scores, wherein the historical answers are obtained by processing historical questions with the first large language model; and training the first large language model based on the historical questions, the historical answers and the reward scores.
- 10. The method of claim 9, wherein a training period of the first large language model is shorter than an update period of the current reward rule set.
- 11. The method of claim 9, wherein evaluating, with a second large language model, the reasoning steps in the historical answers based on the current reward rule set to obtain corresponding reward scores comprises: filling the current reward rule set, the historical question and the historical answer into a preset prompt template to obtain a complete prompt; and inputting the prompt into the second large language model to obtain the reward score.
- 12. The method of claim 9, wherein training the first large language model based on the historical questions, historical answers and reward scores comprises: updating parameters of the first large language model with a policy gradient algorithm, based on training samples consisting of the historical questions, the historical answers and the reward scores.
- 13. An apparatus for updating a reward rule set, comprising: an abnormal sample sampling unit configured to sample, based on a training sample library, a batch of abnormal samples whose reward scores meet a preset abnormal condition, wherein each training sample comprises a sample question, an answer output by a first large language model for the sample question, and a reward score, the reward score being obtained by evaluating the reasoning steps in the sample answer with a second large language model according to a current reward rule set; an optimization suggestion prediction unit configured to process the batch of abnormal samples with a third large language model to obtain an optimization suggestion for the current reward rule set; and a reward rule updating unit configured to update the current reward rule set based on the optimization suggestion.
- 14. A reinforcement-learning-based model training apparatus, comprising: a reward rule querying unit configured to query a current reward rule set updated by the apparatus of claim 13; a reward score prediction unit configured to evaluate, with a second large language model, the reasoning steps in historical answers based on the current reward rule set to obtain corresponding reward scores; and a policy model updating unit configured to train the first large language model based on the historical questions, the historical answers and the reward scores.
- 15. A computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the method of any of claims 1-12.
- 16. A computing device comprising a memory and a processor, wherein the memory has executable code stored therein, which when executed by the processor, implements the method of any of claims 1-12.
Description
Method and device for updating a reward rule set and for reinforcement-learning-based model training

Technical Field

Embodiments of the present disclosure relate to the field of large model technologies, and in particular to a method and an apparatus for updating a reward rule set, a method and an apparatus for reinforcement-learning-based model training, a computer readable storage medium, and a computing device.

Background

Large language models (LLMs) achieve powerful language generation and understanding capabilities by pre-training on massive text data. By virtue of these capabilities, LLMs are widely applied in fields such as intelligent customer service, content creation, code assistance, education and information retrieval, and have markedly raised the level of intelligence in human-machine interaction. However, the behavior patterns that a pre-trained LLM learns from open-domain data carry a high degree of uncertainty, and its raw output rarely meets the stability, safety and controllability requirements needed for integration into a production system. In particular, an LLM may produce content that contradicts known facts, contains logical contradictions, or conflicts with system security protocols, which seriously hampers reliable deployment and efficient use. "Alignment" is therefore a key technical processing stage: its core task is to correct these defects and ensure that the output behavior of the model is consistent with established factual references, logical rules and security boundaries, so that the model can be integrated and deployed in actual software systems and hardware platforms. There is thus a need for an improved LLM alignment scheme that meets the higher requirements of practical applications.

Disclosure of Invention

The embodiments of this specification describe a method and device for updating a reward rule set and a reinforcement-learning-based model training method and device, which can solve the above technical problems.

According to a first aspect, a method of updating a reward rule set is provided. The method comprises: sampling, based on a training sample library, a batch of abnormal samples whose reward scores meet a preset abnormal condition, wherein each training sample comprises a sample question, an answer output by a first large language model for the sample question, and a reward score obtained by evaluating the reasoning steps in the sample answer with a second large language model according to a current reward rule set; processing the batch of abnormal samples with a third large language model to obtain an optimization suggestion for the current reward rule set; and updating the current reward rule set based on the optimization suggestion.
In one embodiment, the preset abnormal condition includes the reward score being below a lower score threshold and/or above an upper score threshold. In one embodiment, the preset abnormal condition includes a generation probability of the second large language model for the reward score being less than a probability threshold. In one embodiment, the preset abnormal condition includes a variance among the reward scores corresponding to the reasoning steps in the answer being greater than a variance threshold. In one embodiment, the preset abnormal condition includes the answer being identified by a fourth large language model as exploiting a rule loophole to obtain an excessively high reward score.

In one embodiment, sampling, based on a training sample library, a batch of abnormal samples whose reward scores meet a preset abnormal condition includes sampling the batch of abnormal samples from an abnormal sample sub-library in the training sample library. In one embodiment, processing the batch of abnormal samples with a third large language model to obtain an optimization suggestion for the current reward rule set comprises inputting the batch of abnormal samples, the current reward rule set and a task description of the rule optimization task together into the third large language model to obtain the optimization suggestion. In one embodiment, the optimization suggestion includes one or more of adding new rules, modifying existing rules, and deleting existing rules.

According to a second aspect, a reinforcement-learning-based model training method is provided. The method comprises: querying a current reward rule set updated by the method of the first aspect; evaluating, with a second large language model, the reasoning steps in historical answers based on the current reward rule set to obtain corresponding reward scores, wherein the historical answers are obtained by processing historical questions with the first large language model; and training the first large language model based on the historical questions, the historical answers and the reward scores.
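For illustration only, the second-aspect scoring and training step can be sketched as filling the current reward rule set, the historical question and the historical answer into a prompt template, querying the judge model (the second large language model), and then applying a policy-gradient update. The prompt wording, the `judge_llm` callable and the `policy_gradient_update` interface in this sketch are assumptions, not the claimed implementation.

```python
# Illustrative sketch of the second-aspect training step. The prompt wording,
# the judge interface and the policy-gradient routine are assumptions.

PROMPT_TEMPLATE = (
    "Reward rules:\n{rules}\n\n"
    "Question:\n{question}\n\n"
    "Answer (with reasoning steps):\n{answer}\n\n"
    "Score each reasoning step against the rules and return an overall reward in [0, 1]."
)

def score_answer(judge_llm, current_rules: list[str], question: str, answer: str) -> float:
    """Fill the prompt template and let the judge model produce a reward score."""
    prompt = PROMPT_TEMPLATE.format(
        rules="\n".join(f"- {r}" for r in current_rules),
        question=question,
        answer=answer,
    )
    # judge_llm is an assumed callable that returns a numeric score as text.
    return float(judge_llm(prompt))

def training_step(policy_model, judge_llm, current_rules: list[str], history_batch):
    """Score historical answers and update the policy model (first LLM) once."""
    samples = []
    for question, answer in history_batch:
        reward = score_answer(judge_llm, current_rules, question, answer)
        samples.append((question, answer, reward))
    # One policy-gradient update over the scored samples; the concrete algorithm is
    # not fixed here and policy_gradient_update is an assumed interface.
    policy_model.policy_gradient_update(samples)
```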