CN-121981201-A - Large language model training method, system and medium based on hybrid verification
Abstract
The invention discloses a large language model training method, system and medium based on hybrid verification, relating to the technical fields of quantum information and artificial intelligence. The method comprises: constructing a quantum science data set and dividing it into a first subset and a second subset; performing supervised fine-tuning of a large language model on the first subset to obtain an initial policy model with injected quantum knowledge; and performing reinforcement learning optimization of the policy model based on a verification-aware reward model and the second subset to obtain a target large language model with scientific reasoning capability. The verification-aware reward model evaluates a plurality of candidate answers generated by the policy model and outputs a plurality of reward values; an optimization objective is constructed from these reward values, and the parameters of the policy model are updated by maximizing expected return. The method aims to solve the reward over-optimization and hallucination problems caused by scarce training data, missing physical constraints, and traditional reinforcement learning from human feedback.
Inventors
- Qu Songxin
- Chen Zhaoyun
- Xue Cheng
Assignees
- 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) [Institute of Artificial Intelligence, Hefei Comprehensive National Science Center (Anhui Provincial Artificial Intelligence Laboratory)]
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2026-04-03
Claims (10)
- 1. A large language model training method based on hybrid verification, characterized by comprising the following steps: constructing a quantum science data set comprising quantum knowledge questions and answer labels, and dividing the quantum science data set into a first subset and a second subset; performing supervised fine-tuning of the large language model using the first subset to obtain an initial policy model with injected quantum knowledge; and performing reinforcement learning optimization of the policy model based on a verification-aware reward model and the second subset to obtain a large language model with scientific reasoning capability; wherein the verification-aware reward model evaluates a plurality of candidate answers generated by the policy model for the quantum knowledge questions in the second subset and outputs a plurality of reward values that fuse a deterministic verification signal from a scientific execution suite with a multidimensional semantic evaluation signal; an optimization objective is constructed from the reward values, and the parameters of the policy model are updated by maximizing expected return, so as to increase the probability that the policy model generates high-reward answers; and the scientific execution suite is a symbolic and numerical solver that performs deterministic verification of mathematical operations and physical constraints (an illustrative verifier sketch appears after the claims).
- 2. The method according to claim 1, characterized in that the quantum science data set is constructed as follows: expanding topic breadth based on a seed evolution paradigm to obtain initial data; processing the initial data with a task-adaptive strategy, namely generating concise answer labels for retrieval-intensive tasks, and generating answer labels together with a chain of thought comprising detailed derivation steps for complex reasoning tasks; performing dual hybrid verification on the processed data, specifically: a first layer of automated hybrid verification, in which a scientific execution suite performs deterministic verification of physical consistency and mathematical correctness on the processed data, and an independent large language model performs semantic evaluation of logic and format; and a second layer of human-machine collaborative audit, in which data passing the first-layer verification undergoes stratified-sampled manual review based on difficulty labels; data passing this dual hybrid verification filtering mechanism is taken as the quantum science data set.
- 3. The method according to claim 3, wherein the second layer of human-machine collaborative audit comprises a quality-feedback closed-loop mechanism, in particular: setting a batch rejection threshold; counting the verification error rate of the stratified sampling of the current data batch, and judging the current batch invalid if the verification error rate exceeds the batch rejection threshold; and analyzing the error pattern of the failed data, correcting the generation instructions in the seed evolution paradigm according to the error pattern, and triggering regeneration of the initial data.
- 4. The method according to claim 1, wherein the verification-aware reward model is constructed as follows: configuring a scientific execution suite to perform deterministic verification on the verifiable dimensions of the plurality of candidate answers generated by the policy model and to output verification indicators; constructing a dual-head parallel prediction network based on a pre-trained Transformer encoder, the network comprising a shared encoder backbone, a multidimensional scoring head and a dynamic weight allocation head; the shared encoder backbone generates a context representation from the input question and the corresponding candidate answer; the multidimensional scoring head maps the context representation to semantic evaluation scores assessing the semantic quality of the generation; the dynamic weight allocation head concatenates the context representation with the verification indicators and outputs a dynamic weight for each evaluation dimension; a dynamic reward calibration mechanism computes, for each evaluation dimension, a fusion score from the semantic evaluation score and a confidence coefficient; and a final reward value is calculated by weighted aggregation of the dynamic weights and the fusion scores (see the fusion sketch after the claims).
- 5. The method of claim 4, wherein the fusion score f_i of the i-th evaluation dimension is calculated as: f_i = α_i · v_i + (1 − α_i) · s_i; wherein α_i is the confidence adjustment coefficient, s_i is the semantic evaluation score of the i-th dimension, and v_i is the verification indicator of the i-th dimension; and the confidence adjustment coefficient α_i satisfies: when the i-th dimension is fully verifiable by the scientific execution suite, α_i = 1 and the verification-aware reward model fully adopts the deterministic reward; when the dimension is only partially verifiable, 0 < α_i < 1; and when the dimension is unverifiable, α_i is a preset value close to 0.
- 6. The method according to claim 4, wherein the verification-aware reward model is initialized by Oracle-guided distillation training, in particular: using a heterogeneous set of large language models as referees to generate soft target scores and ideal-sample importance weights for sample data containing execution results of the scientific execution suite; and training the verification-aware reward model with the soft target scores and ideal-sample importance weights as supervision signals through a multi-task loss function, so that the verification-aware reward model fits the referees' evaluation distribution and aligns with the referees' discrimination preferences.
- 7. The method according to claim 1, wherein the division into the first subset and the second subset is performed in one of two ways: directly dividing the quantum science data set into a first subset and a second subset; or constructing an auxiliary data set containing general-domain instruction data, mixing preset amounts of data from the quantum science data set and the auxiliary data set in proportion to construct the first subset, and mixing the remaining quantum science data, a preset amount of auxiliary data and a preset amount of high-difficulty long-chain reasoning samples from the first subset in proportion to construct the second subset.
- 8. The method according to claim 1, wherein the reinforcement learning optimization of the policy model based on the verification-aware reward model and the second subset specifically comprises: performing advantage function estimation based on the reward values output by the verification-aware reward model, wherein the reward calculation assigns the deterministic verification signal from the scientific execution suite a weight coefficient higher than that of the multidimensional semantic evaluation signal; constructing a proximal policy optimization objective function containing a clipping function, the objective function taking the advantage function as optimization guidance (a clipped-objective sketch appears after the claims); and finally updating the parameters of the policy model by maximizing the objective function.
- 9. A computer system comprising a memory, a processor and a computer program stored on the memory, wherein the processor executes the computer program to implement the method of any one of claims 1-8.
- 10. A computer-readable storage medium, characterized in that it stores programs to be called by a processor to perform the method according to any one of claims 1-8.
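The claims repeatedly reference a "scientific execution suite", described in claim 1 as a symbolic and numerical solver performing deterministic verification of mathematical operations and physical constraints. The patent gives no implementation; the following is a minimal sketch of what such deterministic checks could look like, assuming Python with sympy, and with illustrative function names (`verify_normalization`, `verify_expression_equality`) that are not from the patent.

```python
# Minimal sketch of "scientific execution suite" style deterministic checks,
# using sympy as the symbolic solver. Function names are illustrative only.
import sympy as sp

def verify_normalization(amplitudes):
    """Check that a candidate quantum state satisfies sum |a_i|^2 = 1."""
    total = sum(sp.Abs(sp.sympify(a))**2 for a in amplitudes)
    return sp.simplify(total - 1) == 0

def verify_expression_equality(candidate, reference):
    """Check that a candidate symbolic answer equals the reference expression."""
    diff = sp.simplify(sp.sympify(candidate) - sp.sympify(reference))
    return diff == 0

# Example: verify a candidate answer describing an equal superposition state.
print(verify_normalization(["1/sqrt(2)", "1/sqrt(2)"]))          # True
print(verify_expression_equality("sin(x)**2 + cos(x)**2", "1"))  # True
```

Each check returns a boolean that can serve as the 0/1 verification indicator consumed by the reward model of claim 4.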
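Claims 4 and 5 describe fusing the semantic scores with the verification indicators through per-dimension confidence coefficients, then aggregating with dynamic weights. Below is a minimal numerical sketch of that aggregation under the reconstructed formula f_i = α_i · v_i + (1 − α_i) · s_i; the dimension names, example weights and function signature are illustrative assumptions, not taken from the patent.

```python
# Minimal sketch of the reward fusion in claims 4-5. For each dimension i:
#   f_i = alpha_i * v_i + (1 - alpha_i) * s_i
# and the final reward is the weighted aggregation sum_i w_i * f_i, where the
# w_i would come from the dynamic weight allocation head.
import numpy as np

def fuse_reward(sem_scores, verify_indicators, alphas, dyn_weights):
    """sem_scores s_i: semantic evaluation scores from the scoring head.
    verify_indicators v_i: 0/1 deterministic results from the execution suite.
    alphas: per-dimension confidence coefficients (1 = fully verifiable,
            near 0 = unverifiable dimension).
    dyn_weights w_i: normalized output of the dynamic weight head."""
    s = np.asarray(sem_scores, dtype=float)
    v = np.asarray(verify_indicators, dtype=float)
    a = np.asarray(alphas, dtype=float)
    w = np.asarray(dyn_weights, dtype=float)
    fusion = a * v + (1.0 - a) * s      # per-dimension fusion scores f_i
    return float(np.dot(w, fusion))     # final reward value

# Example: three dimensions -- mathematical correctness (verifiable),
# physical consistency (verifiable), logical coherence (semantic only).
r = fuse_reward(sem_scores=[0.8, 0.7, 0.9],
                verify_indicators=[1, 0, 0],
                alphas=[1.0, 1.0, 0.05],
                dyn_weights=[0.5, 0.3, 0.2])
print(r)  # 0.5*1.0 + 0.3*0.0 + 0.2*(0.95*0.9) = 0.671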
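Claim 8's "proximal policy optimization objective function containing a clipping function" matches the standard PPO clipped surrogate objective. A minimal PyTorch sketch of that objective follows; the tensor values and names are illustrative, not taken from the patent.

```python
# Minimal sketch of the clipped (PPO-style) objective referenced in claim 8,
# assuming PyTorch; tensor names and values are illustrative.
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate: the policy is updated by maximizing
    E[min(r*A, clip(r, 1-eps, 1+eps)*A)], with r the probability ratio and
    A the advantage estimated from the verification-aware reward values."""
    ratio = torch.exp(logp_new - logp_old)   # r = pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Return a loss to minimize (negative of the objective to maximize).
    return -torch.min(unclipped, clipped).mean()

# Example usage with dummy per-token log-probabilities and advantages.
logp_new = torch.tensor([-1.0, -0.5, -2.0], requires_grad=True)
logp_old = torch.tensor([-1.1, -0.7, -1.8])
adv = torch.tensor([0.6, -0.2, 1.1])   # from the verification-aware rewards
loss = ppo_clip_loss(logp_new, logp_old, adv)
loss.backward()
```

The clipping keeps each update close to the previous policy, which is what the claim's "truncated function" accomplishes.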
Description
Large language model training method, system and medium based on hybrid verification

Technical Field

The invention relates to the technical fields of quantum information and artificial intelligence, in particular to a large language model training method, system and medium based on hybrid verification.

Background

In recent years, large language models (LLMs) based on the Transformer architecture have made breakthrough progress in natural language understanding and generation tasks. Existing general-purpose large language models acquire excellent intent understanding and text generation capabilities through pre-training on massive data followed by supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). However, when a large language model is applied to vertical scientific fields typified by quantum mechanics and high-energy physics, existing technical schemes face serious challenges. Scientific reasoning tasks differ from daily conversation: they demand extremely high logical rigor, accurate symbolic computation and strict compliance with physical laws. In this scenario, the prior art has the following significant drawbacks:

1. Extreme scarcity of high-quality process data in the field (data scarcity), which is the primary bottleneck restricting the development of scientific large language models. Existing open-source data sets (such as GSM8K and MATH) are mainly concentrated on elementary mathematics or general common sense, and lack professional data for higher-order physics fields such as quantum mechanics. Existing physics data typically contain only "questions" and "final answers", severely lacking rigorously verified, step-by-step intermediate reasoning processes. Unstructured knowledge is difficult to exploit: although a great deal of physical knowledge exists in textbooks and papers, it exists in unstructured natural-language form, and a large language model can hardly internalize strict operator algebra rules or state-evolution logic directly through self-supervised learning. Manual annotation costs are prohibitive: fields such as quantum mechanics have an extremely high cognitive threshold, and large-scale process-level annotation by domain experts is infeasible in both time and economic cost.

2. Failure of existing reward models in professional domains. The currently prevailing RLHF training paradigm relies on reward models to provide the optimization signal, but this is hard to make work in scientific domains. Sparsity of outcome supervision: rewarding only the final answer yields too sparse a signal; for complex quantum computing questions, even if the final answer happens to be correct, the intermediate process may be spurious, which can mislead the model's optimization direction. Limitations of human preference: traditional reward models mainly fit subjective human preferences (e.g. helpfulness and safety of replies) and lack the ability to judge objective physical truth values; the reward model itself also lacks quantum-mechanical knowledge and cannot accurately identify formula-derivation errors in the reasoning steps.
3. Decoupling of verification and training. While the prior art attempts to introduce external calculators or solvers as auxiliary tools, these tools are typically used only as aids at the inference stage; their verification results are not translated into valid gradient signals back-propagated to the large language model, so the scientific reasoning capability of the large language model itself is not fundamentally improved.

In summary, how to overcome the scarcity of high-quality process data and construct a training method capable of automatically verifying the reasoning steps and converting domain knowledge into dense reward signals is a key technical problem to be solved in current research on scientific large language models.

Disclosure of Invention

Based on the technical problems in the background art, the invention provides a large language model training method, system and medium based on hybrid verification, aiming to solve the problems of sparse training data, missing physical constraints, and the reward over-optimization and hallucination caused by traditional reinforcement learning from human feedback (RLHF) when existing large language models are applied to rigorous scientific fields such as quantum mechanics.

The invention provides a large language model training method based on hybrid verification, comprising the following steps: constructing a quantum science data set comprising quantum knowledge questions and answer labels, and dividing the quantum science data set into a first subset and a second subset (an illustrative split sketch follows); and performing supervised fine-tuning of the large language model using the first subset to obtain an initial policy model with injected quantum knowledge.
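The second splitting option of claim 7 mixes quantum data with general-domain instruction data for the SFT subset, then builds the RL subset from the remaining quantum data, further auxiliary data, and high-difficulty long-chain reasoning samples reused from the first subset. Below is a minimal sketch of that mixing; the ratios, counts and parameter names are purely illustrative assumptions, not values from the patent.

```python
# Minimal sketch of the second data-split option in claim 7. All counts and
# names are illustrative assumptions.
import random

def split_subsets(quantum_data, general_data, hard_samples,
                  sft_quantum_n=8000, sft_general_n=2000,
                  rl_general_n=1000, seed=0):
    rng = random.Random(seed)
    quantum_data = list(quantum_data)
    rng.shuffle(quantum_data)
    # First subset (SFT): quantum data mixed with general instruction data.
    first = quantum_data[:sft_quantum_n] + rng.sample(general_data, sft_general_n)
    # Second subset (RL): remaining quantum data, more auxiliary data, plus
    # high-difficulty long-chain reasoning samples reused from the first subset.
    second = (quantum_data[sft_quantum_n:]
              + rng.sample(general_data, rl_general_n)
              + list(hard_samples))
    rng.shuffle(first)
    rng.shuffle(second)
    return first, second
```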