
CN-121998021-A - Reinforcement learning method and system based on COT prompt and self-supervision reward generation

CN 121998021 A

Abstract

The invention discloses a reinforcement learning method and system based on COT (chain-of-thought) prompting and self-supervised reward generation. A preset number of high-quality COT samples are prepended to the input prompt to form an enhanced prompt, and a policy model generates a plurality of outputs based on the enhanced prompt. An anchor model synthesizes the plurality of outputs to generate a synthetic reference, and the task type is determined by whether the answer of the synthetic reference can be verified by a deterministic program. When the task is verifiable, a programmed verifier compares the consistency of each output with the answer of the synthetic reference to generate the corresponding reward signal; when the task is unverifiable, the anchor model generates evaluation criteria for the synthetic reference, and an independent language model judges the degree to which each output satisfies the criteria to generate the corresponding reward signal. The parameters of the policy model are then updated according to the reward signal of each output. The invention improves the efficiency of reward-signal generation and the adaptability of the model in complex professional fields.
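For orientation, the following is a minimal Python sketch of the pipeline summarized above. It is a sketch under assumptions: the answer format, the numeric verifier, and the caller-supplied callables (policy_sample, anchor_synthesize, anchor_criteria, judge) are all hypothetical placeholders, not implementations disclosed by the patent.

```python
import re

def extract_answer(text: str) -> str:
    """Hypothetical answer extractor: takes whatever follows a final
    'Answer:' marker (an assumed output format)."""
    m = re.search(r"Answer:\s*(.+)", text, flags=re.IGNORECASE)
    return m.group(1).strip() if m else text.strip()

def deterministic_verify(candidate: str, reference: str):
    """Deterministic program check, here numeric equivalence. Returns a
    1.0/0.0 reward, or None when the reference answer is not
    program-checkable (which signals an unverifiable task)."""
    try:
        ref = float(reference)
    except ValueError:
        return None  # no deterministic check available for this answer
    try:
        return 1.0 if abs(float(candidate) - ref) < 1e-9 else 0.0
    except ValueError:
        return 0.0   # candidate answer is not numeric, cannot match

def compute_group_rewards(enhanced_prompt, policy_sample, anchor_synthesize,
                          anchor_criteria, judge, n_outputs=8):
    """policy_sample, anchor_synthesize, anchor_criteria and judge are
    caller-supplied callables wrapping the respective models."""
    # The policy model samples several candidate outputs for one prompt.
    outputs = [policy_sample(enhanced_prompt) for _ in range(n_outputs)]
    # The frozen anchor model merges the candidates into a synthetic reference.
    reference = anchor_synthesize(enhanced_prompt, outputs)
    ref_answer = extract_answer(reference)

    if deterministic_verify(ref_answer, ref_answer) is not None:
        # Verifiable task: binary reward per output.
        rewards = [deterministic_verify(extract_answer(o), ref_answer)
                   for o in outputs]
    else:
        # Unverifiable task: reward is the proportion of anchor-generated
        # criteria that an independent judge model deems satisfied.
        criteria = anchor_criteria(reference)
        rewards = [sum(bool(judge(o, c)) for c in criteria) / len(criteria)
                   for o in outputs]
    return outputs, rewards
```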

Inventors

  • TAN QIAO
  • ZHANG BO
  • GONG MENGCHUN
  • SHI WENZHAO

Assignees

  • 神州医疗科技股份有限公司 (Digital China Health Technologies Co., Ltd.)

Dates

Publication Date
2026-05-08
Application Date
2025-12-08

Claims (10)

  1. A reinforcement learning method based on COT prompting and self-supervised reward generation, comprising: acquiring an input prompt, and prepending a preset number of high-quality COT sample data to the input prompt to form an enhanced prompt; generating a plurality of outputs based on the enhanced prompt using a policy model; synthesizing the plurality of outputs through an anchor model to generate a synthetic reference, wherein the anchor model is the initial version of the policy model with frozen parameters; determining a task type based on whether the answer of the synthetic reference can be verified by a deterministic program; when the task type is a verifiable task, comparing the consistency of each output with the answer of the synthetic reference through a programmed verifier to generate a corresponding reward signal; when the task type is an unverifiable task, generating evaluation criteria for the synthetic reference through the anchor model, and judging the satisfaction degree of each output against the evaluation criteria using an independent language model to generate a corresponding reward signal; and updating the parameters of the policy model according to the reward signal corresponding to each output.
  2. The reinforcement learning method based on COT prompting and self-supervised reward generation according to claim 1, wherein the step of acquiring an input prompt and prepending a preset number of high-quality COT sample data to form the enhanced prompt comprises: receiving the input prompt provided by a user; generating the preset number of high-quality COT sample data based on expert knowledge of the target application field; and combining the high-quality COT sample data with the input prompt to form the enhanced prompt.
  3. The method according to claim 1, wherein the step of generating a plurality of outputs based on the enhanced prompt using the policy model comprises: performing parallel reasoning on the enhanced prompt through the policy model to generate a plurality of different outputs, each output representing one possible solution to the input prompt.
  4. The method according to claim 3, wherein the step of synthesizing the plurality of outputs through the anchor model to generate the synthetic reference comprises: performing information integration and contradiction reconciliation on the plurality of outputs through the anchor model to obtain consistency information, performing detail supplementation on the consistency information to obtain complete reasoning content, and generating a new comprehensive answer based on the complete reasoning content as the synthetic reference.
  5. The method according to claim 4, wherein the step of determining the task type based on whether the answer of the synthetic reference can be verified by a deterministic program comprises: performing deterministic program verification on the answer of the synthetic reference through the programmed verifier, and judging whether the answer passes the verification according to the verification result; determining the task type as a verifiable task when the answer passes the deterministic program verification, and as an unverifiable task when it does not.
  6. The method according to claim 5, wherein the step of comparing the consistency of each output with the answer of the synthetic reference through the programmed verifier to generate the corresponding reward signal comprises: judging, through the programmed verifier, whether the answer of each output is equivalent to the answer of the synthetic reference, and generating a corresponding reward signal in binary form; and the step of generating evaluation criteria for the synthetic reference through the anchor model and judging the satisfaction degree of each output against the evaluation criteria using the independent language model to generate the corresponding reward signal comprises: generating the evaluation criteria based on the content of the synthetic reference through the anchor model, performing a binary judgment on each output against each criterion through the independent language model, and counting the proportion of the evaluation criteria satisfied by each output to generate a corresponding reward signal in continuous-value form.
  7. The method according to any one of claims 1 to 6, wherein the step of updating the parameters of the policy model according to the reward signal corresponding to each output comprises: calculating a corresponding advantage function estimate based on the reward signal of each output; calculating a policy probability ratio between the current policy model and the policy model used when each output was generated; computing a weighted average of the products of each output's advantage function estimate and its corresponding policy probability ratio to obtain the overall policy gradient; and updating the parameters of the policy model according to the overall policy gradient.
  8. A reinforcement learning system based on COT prompting and self-supervised reward generation, comprising: an enhancement module, configured to acquire an input prompt and prepend a preset number of high-quality COT sample data to the input prompt to form an enhanced prompt; a processing module, configured to generate a plurality of outputs based on the enhanced prompt using a policy model; a synthesis module, configured to synthesize the plurality of outputs through an anchor model to generate a synthetic reference, wherein the anchor model is the initial version of the policy model with frozen parameters; a judging module, configured to determine the task type according to whether the answer of the synthetic reference passes deterministic program verification; a generation module, configured to, when the task type is a verifiable task, compare the consistency of each output with the answer of the synthetic reference through a programmed verifier to generate a corresponding reward signal, and, when the task type is an unverifiable task, generate evaluation criteria for the synthetic reference through the anchor model and judge the satisfaction degree of each output against the evaluation criteria using an independent language model to generate a corresponding reward signal; and an updating module, configured to update the parameters of the policy model according to the reward signal corresponding to each output.
  9. An electronic device, comprising a processor coupled to a memory, wherein the memory stores at least one computer program that is loaded and executed by the processor to cause the electronic device to implement the reinforcement learning method based on COT prompting and self-supervised reward generation according to any one of claims 1 to 7.
  10. A computer-readable storage medium, wherein at least one computer program is stored in the computer-readable storage medium, and when executed by a processor, the computer program implements the reinforcement learning method based on COT prompting and self-supervised reward generation according to any one of claims 1 to 7.
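To make the parameter-update step of claim 7 concrete, here is a hedged PyTorch sketch. Group-normalizing the rewards into the advantage estimate is an assumption (the claim only requires "an advantage function estimate"), and policy_update_loss, logp_new, and logp_old are illustrative names, not terms from the patent.

```python
import torch

def policy_update_loss(logp_new: torch.Tensor,
                       logp_old: torch.Tensor,
                       rewards: list[float]) -> torch.Tensor:
    """logp_new: per-output log-probabilities under the current policy
    (requires grad); logp_old: log-probabilities under the policy that
    generated the outputs; rewards: one reward signal per output."""
    r = torch.as_tensor(rewards, dtype=logp_new.dtype, device=logp_new.device)
    # Advantage function estimate: reward centered and scaled within the
    # group of outputs (one common choice; the claim does not fix it).
    advantage = (r - r.mean()) / (r.std() + 1e-8)
    # Policy probability ratio between the current policy and the policy
    # used when each output was generated.
    ratio = torch.exp(logp_new - logp_old.detach())
    # Weighted average of the ratio-advantage products; negating it turns
    # ascent along the overall policy gradient into a minimizable loss.
    return -(ratio * advantage).mean()
```

Minimizing this loss with any gradient-based optimizer (loss.backward() followed by an optimizer step over the policy parameters) realizes the final step of claim 7, updating the policy model along the overall policy gradient.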

Description

Reinforcement learning method and system based on COT prompt and self-supervision reward generation

Technical Field

The invention relates to the technical field of artificial intelligence and machine learning, in particular to a reinforcement learning method and system based on COT (chain-of-thought) prompting and self-supervised reward generation.

Background

In recent years, large reasoning models (LRMs) have exhibited excellent complex reasoning capabilities through reinforcement learning (RL) techniques, achieving deep reasoning behavior even with simple rule-based reward signals. In the post-training phase of large language models (LLMs), reinforcement learning is considered a core technical path for pushing models from "generic capabilities" to "specialized capabilities". However, current reinforcement learning training paradigms depend heavily on demonstration data or finely annotated reward functions provided by human experts, which exposes significant limitations in open, dynamic, real-world tasks. First, the cost of human supervision grows exponentially with task complexity; in scenarios involving higher-order cognitive abilities such as language understanding and multi-step reasoning, it is difficult for human supervisors to continuously provide consistent and accurate feedback signals. Second, manually designed reward mechanisms tend to introduce subjective bias or oversimplified targets, leading to reward hacking, in which the agent maximizes the reward score rather than actually completing the intended task.

In vertical professional fields such as medicine, these challenges are even more pronounced. Medical decision-making involves complex reasoning with ill-defined steps that is difficult to verify; unlike mathematical problems, the reasoning behind medical questions, especially rare-disease diagnosis and treatment, is hard to verify formally, while high-quality annotated data in the field is scarce and expensive to acquire. For the post-training of LLMs, the core bottleneck is the difficulty of acquiring high-quality supervision signals. Whether supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF) is used, the effect depends heavily on externally provided "standard answers" or "preference labels": SFT requires massive amounts of annotated data, whereas RLHF relies on human annotations that are expensive and may be inconsistent. In professional scenarios such as advanced mathematics, medical consultation, and legal document drafting, acquiring such supervision signals is extremely costly, and expert knowledge barriers can make it infeasible. Accordingly, there is a need for a solution to the above problems.

Disclosure of Invention

To solve the above technical problems, the invention provides a reinforcement learning method and system based on COT prompting and self-supervised reward generation.
In a first aspect, the invention provides a reinforcement learning method based on COT prompting and self-supervised reward generation, comprising the following steps: acquiring an input prompt, and prepending a preset number of high-quality COT sample data to the input prompt to form an enhanced prompt; generating a plurality of outputs based on the enhanced prompt using a policy model; synthesizing the plurality of outputs through an anchor model to generate a synthetic reference, wherein the anchor model is the initial version of the policy model with frozen parameters; determining a task type based on whether the answer of the synthetic reference can be verified by a deterministic program; when the task type is a verifiable task, comparing the consistency of each output with the answer of the synthetic reference through a programmed verifier to generate a corresponding reward signal; when the task type is an unverifiable task, generating evaluation criteria for the synthetic reference through the anchor model, and judging the satisfaction degree of each output against the evaluation criteria using an independent language model to generate a corresponding reward signal; and updating the parameters of the policy model according to the reward signal corresponding to each output.

The reinforcement learning method based on COT prompting and self-supervised reward generation has the following beneficial effects: the method constructs an enhanced prompt by introducing high-quality COT samples, uses a parameter-frozen anchor model to generate a synthetic reference in a self-supervised manner, and applies a programmed verifier to verifiable tasks and criterion-based satisfaction judgment to unverifiable tasks, thereby mitigating the problems of high cost and subjective bias associated with human supervision.
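The synthesis step of the first aspect (information integration, contradiction reconciliation, and detail supplementation, as recited in claim 4) can be pictured as a single instruction to the frozen anchor model. The template below is a hypothetical illustration of such an instruction; the patent does not disclose its actual prompt wording.

```python
# Hypothetical prompt template for the anchor model's synthesis step.
SYNTHESIS_TEMPLATE = """You are given {n} candidate solutions to the same problem.

Problem:
{problem}

Candidate solutions:
{candidates}

1. Integrate the information that the candidates agree on.
2. Where candidates contradict each other, reconcile the contradiction and
   keep the best-supported claim.
3. Supplement any missing details so the reasoning content is complete.
4. Based on that complete reasoning, output one new comprehensive answer.
"""

def build_synthesis_prompt(problem: str, outputs: list[str]) -> str:
    # Number the candidates so the anchor model can reference them.
    candidates = "\n\n".join(f"[{i + 1}] {o}" for i, o in enumerate(outputs))
    return SYNTHESIS_TEMPLATE.format(n=len(outputs), problem=problem,
                                     candidates=candidates)
```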