
CN-121981298-A - Method and device based on reinforcement learning training model

CN 121981298 A

Abstract

Embodiments of this specification relate to a method and device for training a model based on reinforcement learning. The method comprises: inputting a sample question into a target model to be trained to obtain a first target answer; inputting the sample question into a plurality of trained control models to obtain a plurality of control answers; inputting the sample question, the first target answer, and any first control answer among the plurality of control answers into an evaluation model to obtain D relative preference scores of the first target answer relative to the first control answer on D preset evaluation dimensions; aggregating the D relative preference scores of the first target answer relative to each control answer to determine a first reward score corresponding to the first target answer; and updating parameter values of the target model at least according to the first reward score.
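
For orientation, the following is a minimal Python sketch of the reward pipeline described in the abstract. The objects target_model, control_models, and evaluation_model, and their generate/compare methods, are hypothetical placeholders introduced only for illustration; the patent does not define such an API.

def compute_reward(sample_question, target_model, control_models,
                   evaluation_model, dims):
    """Return the target answer and its scalar reward for one sample question."""
    # Step 1: one answer from the model being trained, one from each trained
    # control model (hypothetical .generate interface).
    target_answer = target_model.generate(sample_question)
    control_answers = [m.generate(sample_question) for m in control_models]

    # Step 2: for every control answer, ask the evaluation model for D relative
    # preference scores of the target answer, one per evaluation dimension
    # (hypothetical .compare interface returning a list of length len(dims)).
    per_control_scores = [
        evaluation_model.compare(sample_question, target_answer, answer, dims)
        for answer in control_answers
    ]

    # Step 3: aggregate all relative preference scores into one reward score;
    # a plain mean is used here, a dimension-weighted mean is equally possible.
    flat = [s for scores in per_control_scores for s in scores]
    reward = sum(flat) / len(flat)
    return target_answer, reward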

Inventors

  • XU JIAO

Assignees

  • 支付宝(杭州)数字服务技术有限公司

Dates

Publication Date
2026-05-05
Application Date
2026-01-12

Claims (14)

  1. A method of training a model based on reinforcement learning, comprising: inputting a sample question into a target model to be trained to obtain a first target answer, and inputting the sample question into a plurality of trained control models to obtain a plurality of control answers, wherein the sample question comprises text content; inputting the sample question, the first target answer, and any first control answer among the plurality of control answers into an evaluation model to obtain D relative preference scores of the first target answer relative to the first control answer on D preset evaluation dimensions; aggregating the D relative preference scores of the first target answer relative to each control answer to determine a first reward score corresponding to the first target answer; and updating parameter values of the target model at least according to the first reward score.
  2. The method of claim 1, wherein the evaluation model is a pre-trained large language model, and wherein inputting the sample question, the first target answer, and any first control answer into the evaluation model to obtain D relative preference scores of the first target answer relative to the first control answer on D preset evaluation dimensions comprises: filling the sample question, the first target answer, and the first control answer into a first prompt template to obtain a first prompt, wherein the first prompt template comprises description text describing each evaluation dimension and instruction text instructing the evaluation model to compare and score the two answers on each evaluation dimension; and inputting the first prompt into the evaluation model to obtain the D relative preference scores.
  3. The method of claim 1, wherein the evaluation dimensions comprise initiative, accuracy, utility, and language quality.
  4. The method of claim 1, wherein the relative preference score is a binary score, wherein a first value indicates that the first target answer is better than the first control answer and a second value indicates that the first target answer is worse than the first control answer.
  5. The method of claim 1, wherein the aggregation operation comprises: calculating the arithmetic mean of the D relative preference scores for each control answer; or calculating the weighted average of the D relative preference scores for each control answer according to a preset weight coefficient for each evaluation dimension (a minimal numerical sketch of this operation follows the claims).
  6. The method of claim 1, wherein the first target answer is one of a plurality of target answers generated by the target model for the sample question, and updating the parameter values of the target model based at least on the first reward score comprises: determining an advantage value corresponding to the first target answer according to the relation between the first reward score and the average reward score of the plurality of target answers; determining a probability ratio corresponding to the first target answer according to the ratio of output probabilities that the target model and a corresponding baseline model assign to the first target answer; and determining a training loss according to the advantage values and probability ratios corresponding to the respective target answers, and updating the parameter values according to the training loss.
  7. The method of claim 6, wherein the training loss further comprises a KL-divergence term measuring the difference between the output probability distributions of the target model and the baseline model for the sample question.
  8. The method of claim 1, wherein the plurality of control models and the target model are multimodal large language models, and wherein the sample question further comprises image content.
  9. The method of claim 8, wherein the plurality of control models and the target model are medical-domain multimodal large language models, the text content comprises a medical question, and the image content comprises a medical image.
  10. The method of claim 1, wherein the target model is a multi-round dialog model and the sample question is taken from any one of a plurality of rounds of dialog.
  11. The method of claim 10, wherein the sample question corresponds to an Nth round of dialog and includes the follow-up question of the Nth round and the context content of the previous N-1 rounds of dialog.
  12. An apparatus for training a model based on reinforcement learning, comprising: an answer generation unit configured to input a sample question into a target model to be trained to obtain a first target answer, and to input the sample question into a plurality of trained control models to obtain a plurality of control answers, wherein the sample question comprises text content; a preference scoring unit configured to input the sample question, the first target answer, and any first control answer among the plurality of control answers into an evaluation model to obtain D relative preference scores of the first target answer relative to the first control answer on D preset evaluation dimensions; a score aggregation unit configured to aggregate the D relative preference scores of the first target answer relative to each control answer to determine a first reward score corresponding to the first target answer; and a model updating unit configured to update the parameter values of the target model at least according to the first reward score.
  13. A computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any one of claims 1-11.
  14. A computing device comprising a memory and a processor, wherein the memory stores executable code which, when executed by the processor, implements the method of any one of claims 1-11.
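
The aggregation operation of claim 5 admits either a plain arithmetic mean or a dimension-weighted average. Below is a minimal sketch, assuming the D relative preference scores for each control answer have already been collected; the helper name aggregate_reward, the example scores, and the weight values are illustrative only.

import numpy as np

def aggregate_reward(rel_scores, dim_weights=None):
    """Aggregate relative preference scores into a single reward score.

    rel_scores: one length-D score vector per control answer, shape (num_controls, D).
    dim_weights: optional per-dimension weight coefficients of length D.
    """
    scores = np.asarray(rel_scores, dtype=float)
    if dim_weights is None:
        # Arithmetic mean of the D scores for each control answer.
        per_control = scores.mean(axis=1)
    else:
        # Weighted mean using the preset weight coefficient of each dimension.
        w = np.asarray(dim_weights, dtype=float)
        per_control = (scores * w).sum(axis=1) / w.sum()
    # Average over the control answers to obtain the first reward score.
    return float(per_control.mean())

# Example: 3 control answers, D = 4 dimensions, binary +1 / -1 scores as in claim 4.
print(aggregate_reward([[1, 1, -1, 1], [1, -1, -1, 1], [1, 1, 1, -1]]))
print(aggregate_reward([[1, 1, -1, 1]], dim_weights=[0.4, 0.3, 0.2, 0.1]))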

Description

Method and device based on reinforcement learning training model

Technical Field

One or more embodiments of the present disclosure relate to the field of machine learning, and more particularly, to a method and apparatus for training a model based on reinforcement learning.

Background

In recent years, large language models (LLMs) have advanced significantly, exhibiting powerful language generation and understanding capabilities through self-supervised pre-training on massive text data. However, pre-training alone has difficulty ensuring that model outputs meet human preferences, ethical norms, or the practical requirements of a particular task or application scenario. To this end, the related art introduced Reinforcement Learning from Human Feedback (RLHF) to fine-tune and align pre-trained large language models. RLHF builds a reward model from human feedback and uses the resulting reward signal to guide the reinforcement learning process, steering the model toward responses more consistent with human intent. However, reward models in the related art suffer from subjective scoring, large score fluctuations, and inconsistent scales across different samples, evaluators, and times, so the trained model does not perform well on complex tasks. Thus, there is a need for a method to train a model better and improve its overall performance.

Disclosure of Invention

One or more embodiments of the present specification describe methods and apparatus for training models based on reinforcement learning, using more accurate reward signals.

In a first aspect, a method for training a model based on reinforcement learning is provided, including: inputting a sample question into a target model to be trained to obtain a first target answer, and inputting the sample question into a plurality of trained control models to obtain a plurality of control answers, wherein the sample question comprises text content; inputting the sample question, the first target answer, and any first control answer among the plurality of control answers into an evaluation model to obtain D relative preference scores of the first target answer relative to the first control answer on D preset evaluation dimensions; aggregating the D relative preference scores of the first target answer relative to each control answer to determine a first reward score corresponding to the first target answer; and updating parameter values of the target model at least according to the first reward score.

In some possible embodiments, the evaluation model is a pre-trained large language model, and inputting the sample question, the first target answer, and any first control answer into the evaluation model to obtain D relative preference scores of the first target answer relative to the first control answer on D preset evaluation dimensions comprises: filling the sample question, the first target answer, and the first control answer into a first prompt template to obtain a first prompt, wherein the first prompt template comprises description text describing each evaluation dimension and instruction text instructing the evaluation model to compare and score the two answers on each evaluation dimension; and inputting the first prompt into the evaluation model to obtain the D relative preference scores.
In some possible implementations, the evaluation dimensions include initiative, accuracy, utility, and language quality.

In some possible embodiments, the relative preference score is a binary score, wherein a first value indicates that the first target answer is better than the first control answer and a second value indicates that the first target answer is worse than the first control answer.

In some possible embodiments, the aggregation operation comprises: calculating the arithmetic mean of the D relative preference scores for each control answer; or calculating the weighted average of the D relative preference scores for each control answer according to a preset weight coefficient for each evaluation dimension.

In some possible implementations, the first target answer is one of a plurality of target answers generated by the target model for the sample question, and updating the parameter values of the target model based at least on the first reward score includes: determining an advantage value corresponding to the first target answer according to the relation between the first reward score and the average reward score of the plurality of target answers; determining a probability ratio corresponding to the first target answer according to the ratio of output probabilities that the target model and a corresponding baseline model assign to the first target answer; and determining a training loss according to the advantage values and probability ratios corresponding to the respective target answers, and updating the parameter values according to the training loss.
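
A minimal sketch of the update step just described, combining the group-relative advantage, the target/baseline probability ratio, and a KL-divergence penalty in the spirit of claims 6 and 7. The PPO-style clipping, the coefficient values, and the simple Monte-Carlo KL estimate are assumptions added for illustration; the patent does not specify them.

import torch

def rl_loss(target_logprobs, baseline_logprobs, rewards, clip_eps=0.2, kl_coef=0.1):
    """target_logprobs / baseline_logprobs: log-probabilities that the target model
    and the baseline model assign to each sampled target answer, shape (num_answers,).
    rewards: the reward score of each target answer, shape (num_answers,)."""
    # Advantage value: how each answer's reward compares with the group average.
    advantages = rewards - rewards.mean()

    # Probability ratio of the target model vs. the baseline model per answer.
    ratios = torch.exp(target_logprobs - baseline_logprobs)

    # Clipped policy objective (the clipping itself is an added assumption);
    # the loss is the negative of the objective.
    unclipped = ratios * advantages
    clipped = torch.clamp(ratios, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.minimum(unclipped, clipped).mean()

    # Crude Monte-Carlo estimate of the KL divergence between the target and
    # baseline output distributions over the sampled answers.
    kl_term = (target_logprobs - baseline_logprobs).mean()
    return policy_loss + kl_coef * kl_term

# Example with 4 sampled target answers for one sample question.
loss = rl_loss(torch.tensor([-4.0, -5.0, -3.5, -6.0]),
               torch.tensor([-4.2, -4.8, -3.9, -5.5]),
               torch.tensor([0.5, -0.25, 0.75, -0.5]))
print(float(loss))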