CN-122021790-A - Training method and system for large model self-iteration
Abstract
Embodiments of this application disclose a self-iterative training method and system for large models. The method comprises: generating a collaborative dual-instance architecture; generating a training task and a plurality of candidate outputs for that task from a teacher model instance; generating a plurality of structured supervision signals; constructing a training data set based on the structured supervision signals; performing optimization training on the learner model instance with a mixed loss objective; judging the deviation between the performance improvement level and a preset performance improvement threshold; performing a security audit on target flows in the current iteration cycle; updating the relevant parameters of the learner model instance that passes the audit into the teacher model instance via a controlled synchronization strategy; and repeating these steps until a preset iteration termination condition is reached.
Inventors
- ZHU ZHAOPENG
- XIAO LIYANG
- LIU DI
- LIANG ZHAOHUI
- LIANG YONGJI
- HU YINGFENG
- LIU YIJIA
- CUI YIQUN
- WANG WENQING
- DENG NANDIE
- LIU CHAOFEI
Assignees
- Huaneng Tongchuan Zhaojin Coal Power Co., Ltd. (华能铜川照金煤电有限公司)
- Xi'an Thermal Power Research Institute Co., Ltd. (西安热工研究院有限公司)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-01-09
Claims (10)
- 1. A self-iterative training method for a large model, comprising: Step S1, performing architecture instantiation based on the same pre-trained model to generate a collaborative dual-instance architecture, wherein the dual-instance architecture comprises a teacher model instance and a learner model instance; Step S2, generating a training task and a plurality of candidate outputs corresponding to the training task from the teacher model instance according to a preset curriculum scheduling strategy; Step S3, performing automated multi-dimensional quality evaluation on each candidate output, generating a plurality of structured supervision signals, and constructing a training data set based on the plurality of structured supervision signals; Step S4, performing optimization training on the learner model instance with a mixed loss objective based on the training data set; Step S5, constructing a validation set independent of Step S2 and Step S3, evaluating the performance improvement level of the optimized learner model instance relative to the teacher model instance on the validation set, and judging the deviation of the performance improvement level from a preset performance improvement threshold; Step S6, when the deviation satisfies a preset threshold, performing a security audit on target flows in the current iteration cycle, and updating the relevant parameters of the learner model instance that passes the audit into the teacher model instance via a controlled synchronization strategy; and Step S7, repeatedly executing Steps S1 to S6 to form a continuous self-iterative optimization loop, until a preset iteration termination condition is reached.
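Steps S1 to S7 can be sketched as the following control loop. Everything here is an illustrative stand-in, not the patented implementation: the model is a toy parameter dict, the helpers are hypothetical stubs, and the constants (score bump, EMA weight) are invented for demonstration.

```python
import copy

def self_iterate(pretrained, max_rounds=3, improvement_threshold=0.01):
    teacher = copy.deepcopy(pretrained)               # S1: dual-instance architecture
    for round_idx in range(max_rounds):               # S7: repeat until termination
        learner = copy.deepcopy(teacher)              # S1: learner starts from teacher
        tasks, candidates = generate_tasks(teacher)   # S2: tasks + candidate outputs
        signals = evaluate_candidates(candidates)     # S3: structured supervision signals
        train(learner, build_dataset(tasks, signals)) # S4: mixed-loss optimization
        delta = validate(learner, teacher)            # S5: improvement on held-out set
        if delta >= improvement_threshold:            # S6: gated, audited sync
            if security_audit(round_idx):
                controlled_sync(teacher, learner)
    return teacher

def generate_tasks(teacher):
    # S2 stub: the teacher samples tasks and several candidates per task.
    tasks = ["task-0", "task-1"]
    return tasks, {t: [f"{t}/cand-{i}" for i in range(3)] for t in tasks}

def evaluate_candidates(candidates):
    # S3 stub: multi-dimensional quality evaluation, here a constant score.
    return {t: [1.0] * len(cands) for t, cands in candidates.items()}

def build_dataset(tasks, signals):
    return [(t, s) for t in tasks for s in signals[t]]

def train(learner, dataset):
    # S4 stub: a fixed "capability" bump per training sample.
    learner["score"] += 0.05 * len(dataset)

def validate(learner, teacher):
    # S5 stub: signed performance gap on an independent validation set.
    return learner["score"] - teacher["score"]

def security_audit(round_idx):
    # S6 stub: audit of data generation, training, and update flows.
    return True

def controlled_sync(teacher, learner, alpha=0.5):
    # S6 stub: EMA-style controlled parameter synchronization.
    teacher["score"] = alpha * learner["score"] + (1 - alpha) * teacher["score"]
```

The gate in S6 is the key design point: parameters flow back to the teacher only when the measured improvement clears the threshold and the audit passes, which is what prevents the loop from solidifying regressions.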
- 2. The method of claim 1, wherein the automated multi-dimensional quality evaluation of each candidate output comprises: performing the automated multi-dimensional quality evaluation of each candidate output through an automated evaluation pipeline, wherein the automated evaluation pipeline comprises a rule checker for checking whether a candidate output conforms to predefined formats and rules, an executable checker for performing run verification on code-class or executable-class outputs, a retrieval consistency checker for checking the consistency of output content with retrieval evidence in retrieval-augmented scenarios, and a safety screener for screening out harmful, biased, or privacy-leaking content.
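The four checkers of claim 2 can be sketched as independent predicates. The concrete checks below (an "answer:" format rule, a compile-only executable check, a substring evidence test, a keyword blocklist) are toy assumptions; the claim only fixes the four checker roles, not their logic.

```python
def rule_check(output):
    # Rule checker: require a predefined format, here an "answer:" prefix.
    return output.lower().startswith("answer:")

def executable_check(output):
    # Executable checker: verify code-class output compiles (without running it).
    try:
        compile(output, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

def retrieval_consistency_check(output, evidence):
    # Retrieval consistency checker: toy test that some retrieved evidence
    # string actually appears in the output.
    return any(e in output for e in evidence)

def safety_screen(output, blocklist=("password", "ssn")):
    # Safety screener: reject outputs matching a (toy) harmful-content list.
    return not any(term in output.lower() for term in blocklist)
```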
- 3. The method of claim 2, wherein generating the plurality of structured supervision signals comprises: fusing the outputs of the rule checker, the executable checker, the retrieval consistency checker, and the safety screener to generate quality scores for the plurality of structured supervision signals; generating confidence scores for the plurality of structured supervision signals from the teacher model instance; and generating pairwise preference relationships for the plurality of structured supervision signals based on quality comparisons between the plurality of candidate outputs.
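The three signal types of claim 3 might be produced as follows. The fusion weights and the mean-token-probability confidence proxy are assumptions for illustration; the claim does not specify a fusion rule.

```python
import math
from itertools import combinations

def quality_score(rule_ok, exec_ok, retrieval_ok, safe,
                  weights=(0.2, 0.3, 0.3, 0.2)):
    # Weighted fusion of the four checker verdicts into one scalar in [0, 1].
    checks = (rule_ok, exec_ok, retrieval_ok, safe)
    return sum(w * float(ok) for w, ok in zip(weights, checks))

def confidence_score(token_logprobs):
    # Teacher confidence, here as mean per-token probability (one common proxy).
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)

def pairwise_preferences(scored_candidates):
    # (winner, loser) pairs from quality comparison between candidates;
    # ties produce no preference.
    prefs = []
    for (a, sa), (b, sb) in combinations(scored_candidates, 2):
        if sa != sb:
            prefs.append((a, b) if sa > sb else (b, a))
    return prefs
```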
- 4. The method of claim 3, wherein the mixed loss objective is a weighted sum of a plurality of loss functions and comprises at least: a knowledge distillation loss function weighted by the confidence scores from the teacher model instance, and a preference alignment loss function with a margin constraint constructed from the pairwise preference relationships; wherein the margin constraint is used to force the output of the learner model instance in a pairwise preference relationship to be judged better than the candidate output generated by the teacher model instance for the same task.
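A minimal sketch of the two claim-4 terms, under assumed functional forms the claim does not fix: confidence-weighted cross-entropy for the distillation term, and a hinge loss for the margin-constrained preference term.

```python
import math

def distillation_loss(teacher_probs, learner_probs, confidence):
    # Confidence-weighted cross-entropy between teacher and learner
    # output distributions (assumed distillation form).
    ce = -sum(pt * math.log(pl) for pt, pl in zip(teacher_probs, learner_probs))
    return confidence * ce

def margin_preference_loss(score_preferred, score_rejected, margin=0.5):
    # Hinge loss: zero only when the preferred output beats the
    # rejected one by at least `margin`.
    return max(0.0, margin - (score_preferred - score_rejected))

def mixed_loss(kd_term, pref_term, w_kd=1.0, w_pref=0.5):
    # Weighted sum, as in the claimed mixed loss objective (weights assumed).
    return w_kd * kd_term + w_pref * pref_term
```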
- 5. The method of claim 4, wherein the mixed loss objective further comprises: a KL divergence constraint loss function for minimizing the KL divergence between the output probability distributions of the learner model instance and the teacher model instance; and an entropy regularization loss function for maximizing the entropy of the output probability distribution of the learner model instance.
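The two claim-5 regularizers, written out for discrete output distributions. The KL direction (learner relative to teacher) and the weights are assumptions; note that maximizing entropy appears as a negative entropy term in a minimized loss.

```python
import math

def kl_divergence(p, q):
    # KL(p || q) over discrete output distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    # Shannon entropy of a discrete distribution.
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def constraint_loss(learner_probs, teacher_probs, w_kl=0.1, w_ent=0.01):
    # The KL term keeps the learner close to the teacher; the negative
    # entropy term (under minimization) maximizes the learner's entropy.
    return (w_kl * kl_divergence(learner_probs, teacher_probs)
            - w_ent * entropy(learner_probs))
```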
- 6. The method of claim 1, wherein constructing the validation set independent of Step S2 and Step S3, evaluating the performance improvement level of the optimized learner model instance relative to the teacher model instance on the validation set, and judging the deviation of the performance improvement level from the preset performance improvement threshold comprises: obtaining a validation data set whose samples differ from the samples of the plurality of candidate outputs generated in Step S2 and Step S3; based on the validation set, computing in parallel the performance difference between the teacher model instance and the optimized learner model instance under the same evaluation metric, and taking the performance difference as the performance improvement level; and comparing the performance improvement level with the preset performance improvement threshold to obtain the deviation between the two.
- 7. The method of claim 1, wherein performing the security audit on target flows in the current iteration cycle when the deviation satisfies the preset threshold, and updating the relevant parameters of the learner model instance that passes the audit into the teacher model instance via the controlled synchronization strategy, comprises: when the performance improvement level reaches the performance improvement threshold, performing a security audit on the endogenous data generation, model training, and parameter update flows in the current iteration cycle; and after the security audit passes, updating the relevant parameters of the learner model instance into the teacher model instance via a controlled synchronization strategy based on at least one of exponential moving average, low-rank parameter differential merging, or hierarchical selective synchronization.
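Two of the claim-7 synchronization strategies can be sketched over parameter dicts standing in for model weights; the decay value and layer names are illustrative.

```python
def ema_sync(teacher_params, learner_params, decay=0.99):
    # Exponential moving average: teacher <- decay*teacher + (1-decay)*learner,
    # per parameter, so the teacher absorbs learner updates gradually.
    return {name: decay * teacher_params[name] + (1 - decay) * learner_params[name]
            for name in teacher_params}

def selective_sync(teacher_params, learner_params, layers_to_sync):
    # Hierarchical selective synchronization: only audited/whitelisted
    # layers are overwritten; all others keep the teacher's values.
    return {name: (learner_params[name] if name in layers_to_sync
                   else teacher_params[name])
            for name in teacher_params}
```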
- 8. The method of claim 1, wherein the iteration termination condition comprises at least one of: a preset maximum number of iteration rounds being reached, the performance improvement level failing to reach the performance improvement threshold for a preset number of consecutive rounds, or a preset performance goal being reached on a core evaluation task.
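The three claim-8 conditions combine with logical OR; the sketch below assumes illustrative defaults (`max_rounds`, `patience`, `performance_goal` are invented names and values).

```python
def should_terminate(round_idx, improvements, core_scores, *,
                     max_rounds=50, improvement_threshold=0.01,
                     patience=3, performance_goal=0.95):
    if round_idx >= max_rounds:
        return True  # condition 1: maximum iteration rounds reached
    recent = improvements[-patience:]
    if len(recent) == patience and all(i < improvement_threshold for i in recent):
        return True  # condition 2: below threshold for `patience` consecutive rounds
    if core_scores and core_scores[-1] >= performance_goal:
        return True  # condition 3: core evaluation task goal reached
    return False
```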
- 9. The method of claim 3, further comprising: screening, according to the quality scores, high-quality samples whose quality scores exceed a preset threshold from all candidate outputs generated by the teacher model instance, to form a supervised fine-tuning training set used to compute the knowledge distillation loss function.
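The claim-9 screening step is a simple threshold filter over the fused quality scores; the 0.8 default threshold is an illustrative assumption.

```python
def build_sft_set(candidates_with_scores, quality_threshold=0.8):
    # Keep only (task, output) pairs whose fused quality score exceeds the
    # threshold; the result feeds the knowledge distillation loss.
    return [(task, output) for task, output, score in candidates_with_scores
            if score > quality_threshold]
```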
- 10. A self-iterative training system for a large model, characterized in that it comprises functional modules for implementing the steps of the self-iterative training method for a large model according to any one of claims 1 to 9.
Description
Training method and system for large model self-iteration

Technical Field
The application relates to the technical field of artificial intelligence and machine learning, and in particular to a self-iterative training method and system for large models.

Background
With the wide application of large-scale language models in natural language processing, code generation, knowledge question answering, and other fields, continuous optimization and alignment of models have become key technical challenges. The prior art has the following main problems.

First, heavy dependence on manual annotation. Traditional supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) pipelines rely heavily on manual preference labeling and suffer from inherent drawbacks such as high cost, poor scalability, and large subjective bias. As model size and task complexity grow, the manual-labeling bottleneck becomes increasingly apparent.

Second, poor self-training stability. Existing self-distillation and self-training methods, while able to reduce reliance on humans, tend to solidify error patterns and induce capability degradation in the absence of robust quality filtering and constraint mechanisms, exhibiting so-called "mode collapse".

Third, lack of a uniform progress metric. Existing schemes lack a unified, executable "progress" measurement standard and a controlled synchronization mechanism, so performance oscillates during iteration and stable training convergence is difficult to guarantee.

Fourth, insufficient safety and compliance risk control. Iterative capability amplification is often accompanied by safety and compliance risks, and the prior art lacks a security arbitration and auditing mechanism covering the full chain of data generation, model training, performance evaluation, and parameter synchronization.

Fifth, high engineering implementation complexity. Existing schemes have insufficient suitability and extensibility in practical settings such as distributed environments, resource-constrained scenarios, and multi-task transfer.

Therefore, a self-iterative training solution is needed that reduces dependence on humans, ensures training stability, provides a measurable progress standard, and guarantees safety compliance.

Disclosure of Invention
The application provides a self-iterative training method and system for large models, to address the above defects of the prior art. According to a first aspect of an embodiment of the present application, there is provided a self-iterative training method for a large model, comprising: Step S1, performing architecture instantiation based on the same pre-trained model to generate a collaborative dual-instance architecture, wherein the dual-instance architecture comprises a teacher model instance and a learner model instance; Step S2, generating a training task and a plurality of candidate outputs corresponding to the training task from the teacher model instance according to a preset curriculum scheduling strategy; Step S3, performing automated multi-dimensional quality evaluation on each candidate output, generating a plurality of structured supervision signals, and constructing a training data set based on the plurality of structured supervision signals; Step S4, performing optimization training on the learner model instance with a mixed loss objective based on the training data set; Step S5, constructing a validation set independent of Step S2 and Step S3, evaluating the performance improvement level of the optimized learner model instance relative to the teacher model instance on the validation set, and judging the deviation of the performance improvement level from a preset performance improvement threshold; Step S6, when the deviation satisfies a preset threshold, performing a security audit on target flows in the current iteration cycle, and updating the relevant parameters of the learner model instance that passes the audit into the teacher model instance via a controlled synchronization strategy; and Step S7, repeatedly executing Steps S1 to S6 to form a continuous self-iterative optimization loop, until a preset iteration termination condition is reached. In some embodiments, the automated multi-dimensional quality evaluation of each candidate output comprises: performing the automated multi-dimensional quality evaluation of each candidate output through an automated evaluation pipeline, wherein the automated evaluation pipeline comprises a rule checker for checking whether a candidate output conforms to predefined formats and rules, an executable checker for performing run verification on code-class or executable-class outputs, a retrieval consistency checker for checking the consistency of output content with retrieval evidence in a retrieval