CN-121724108-B - Dynamic adversarial large-model alignment training method and system based on meta-learning
Abstract
The invention discloses a dynamic adversarial large-model alignment training method and system based on meta-learning, belonging to the technical field of artificial-intelligence safety and alignment. The invention dynamically generates adversarial samples through a generative adversarial network (GAN) mechanism and adaptively adjusts the spoofing intensity according to the current capability of the model, avoiding the sample solidification of static testing; adopts a three-level progressive task distribution to cover the complete adversarial gradient from weak to strong and systematically evaluate model performance under different adversarial intensities; reveals the neural signatures of deceptive behavior through neural-activation analysis, establishing a mapping between deception patterns and neural activation patterns to realize mechanistic interpretation; and, through an evaluation-training closed loop, enables the training process to respond dynamically to changes in the model's alignment state, discovering and correcting potential deception tendencies in time.
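The adaptive intensity adjustment described above can be illustrated with a minimal sketch. The intervals come from claim 2 of this patent; the capability-biased sampling rule (`sample_spoof_intensity` and its `model_capability` interpolation) is a hypothetical choice for illustration, since the patent only states that the spoofing-intensity parameter β is adapted to the current capability of the model within each interval.

```python
import random

# Spoofing-intensity intervals per adversarial level (claim 2):
# weak [0.1, 0.3], medium (0.3, 0.6], strong (0.6, 1.0].
INTENSITY_INTERVALS = {
    "weak":   (0.1, 0.3),
    "medium": (0.3, 0.6),
    "strong": (0.6, 1.0),
}

def sample_spoof_intensity(level: str, model_capability: float,
                           rng: random.Random) -> float:
    """Sample a spoofing-intensity parameter beta inside the level's interval.

    `model_capability` in [0, 1] biases beta toward the upper end of the
    interval as the model under training grows stronger -- a hypothetical
    adaptation rule standing in for the patent's unspecified one.
    """
    lo, hi = INTENSITY_INTERVALS[level]
    base = rng.uniform(lo, hi)
    # Pull the draw partway toward the interval's upper bound.
    return base + model_capability * (hi - base) * 0.5

rng = random.Random(0)
beta = sample_spoof_intensity("medium", model_capability=0.4, rng=rng)
assert 0.3 <= beta <= 0.6  # stays inside the medium-adversarial interval
```

Because the interpolation only moves the draw toward the interval's upper bound, the sampled β never leaves its category's interval, preserving the non-overlap property of the three levels.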
Inventors
- YANG HAOMIAO
- TANG DIANHUA
- Xiang Kunlan
- HUANG YUNFAN
- PENG YI
- LIU XINYU
- JIANG HONGKUN
- WANG MINGYU
- QIU WEIHAO
Assignees
- University of Electronic Science and Technology of China (电子科技大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-02-26
Claims (10)
- 1. A dynamic adversarial large-model alignment training method based on meta-learning, characterized by comprising the following steps: Step 1: construct a three-level progressive task distribution comprising three categories of test tasks, the adversarial-intensity categories being weak, medium, and strong; each category of test task corresponds to a non-overlapping spoofing-intensity interval, and the values of the spoofing-intensity intervals increase progressively with the adversarial intensity of the test task; generate a test-task set covering the three categories, each test task comprising an input scenario, an expected output, and an alignment constraint; Step 2: dynamically generate adversarial samples through a generative adversarial network (GAN) mechanism, adaptively adjusting the spoofing-intensity parameter β used to generate adversarial samples within the spoofing-intensity interval of each task category according to the current capability of the model to be trained; concatenate the current input noise vector with the current spoofing-intensity parameter β as the input vector of the GAN's generator network, and generate multiple adversarial samples for test tasks of different adversarial intensities; the model to be trained is a large language model, and the processed data are text data and/or image data; Step 3: perform outer-loop optimization of the meta-learning framework over the three-level progressive task distribution: sample a number of specified test tasks from the specified task distribution, perform inner-loop adaptation on the sampled test tasks, compute the loss of the adapted model on a validation set, and update the meta-parameters through gradient backpropagation; Step 4: during inner-loop adaptation of the meta-learning framework, record in real time the neural activation patterns, attention distribution, and evolution of self-other overlap of the model to be trained, and construct scene-decision-activation triplet data; each triplet comprises a text description of the adversarial scenario, the model's decision output, a neural activation pattern, a self-other overlap value, and a deception label, where the neural activation pattern consists of activation tensors output by recording layers selected from the model's bottom, middle, and top layers; Step 5: perform deception analysis on the triplet data through machine learning to identify neural signatures of deception, and establish a mapping between deception patterns and neural activation patterns; and Step 6: integrate the feedback signals of the dynamic evaluation benchmark into the training process, and dynamically adjust the training strategy of the meta-learning framework according to the deception detection rate and the capability retention.
- 2. The meta-learning-based dynamic adversarial large-model alignment training method according to claim 1, wherein in step 1 the weak-adversarial spoofing-intensity interval is [0.1, 0.3], the medium-adversarial spoofing-intensity interval is (0.3, 0.6], and the strong-adversarial spoofing-intensity interval is (0.6, 1.0].
- 3. The meta-learning-based dynamic adversarial large-model alignment training method according to claim 1, wherein in step 2 the generative adversarial network is trained using a minimax objective function.
- 4. The meta-learning-based dynamic adversarial large-model alignment training method according to claim 1, wherein in step 2, when generating multiple adversarial samples, the generator network instead conditions its generation by scaling the current input noise vector with the condition β via a feature-wise linear modulation (FiLM) layer.
- 5. The meta-learning-based dynamic adversarial large-model alignment training method according to claim 1, wherein in step 3, when performing inner-loop adaptation, the optimization objective of the meta-parameters θ over a specified task distribution p(T) is defined as: min_θ E_{T_i ~ p(T)} [ L_{T_i}(f_{θ'_i}) ], with θ'_i = θ − α ∇_θ L_{T_i}(f_θ); wherein θ denotes the meta-parameters, θ'_i denotes the meta-parameters after inner-loop adaptation on the i-th specified test task T_i, α is the preset inner-loop learning rate, f_θ is the model with meta-parameters θ, f_{θ'_i} is the model with adapted meta-parameters θ'_i, ∇_θ is the gradient operator with respect to θ, L_{T_i} is the inner-loop loss function on the current test task T_i, comprising an alignment loss, a self-other overlap loss, and an adversarial-sample discrimination loss, and E_{T_i ~ p(T)} is the mathematical expectation over test tasks T_i drawn from the specified distribution p(T).
- 6. The meta-learning-based dynamic adversarial large-model alignment training method according to claim 5, wherein in the loss function L_{T_i} the initial weight of the adversarial-sample discrimination loss is set to 0.1 and the initial weight of the self-other overlap loss is set to 0.05.
- 7. The meta-learning-based dynamic adversarial large-model alignment training method according to claim 1, wherein in step 4 the model bottom, middle, and top layers are defined as follows: with N denoting the total number of layers of the model, the bottom layers comprise the first 20% of layers, the middle layers the next 50%, and the top layers the final 30%.
- 8. The meta-learning-based dynamic adversarial large-model alignment training method according to claim 1, wherein in step 5 the deception patterns include goal deviation, supervision evasion, and safety bypass.
- 9. The meta-learning-based dynamic adversarial large-model alignment training method according to claim 1, wherein in step 6, dynamically adjusting the training strategy according to the deception detection rate and the capability retention comprises: computing a composite efficiency from the deception detection rate and the capability retention, and automatically triggering an intervention mechanism when the composite efficiency falls below a preset threshold; wherein the intervention mechanism comprises: increasing the spoofing-intensity parameter β within the spoofing-intensity interval of each task category; increasing the weight, in the sampling distribution, of the task category corresponding to the current test task; reducing the inner-loop learning rate α; and increasing the weight of the self-other overlap loss.
- 10. A dynamic adversarial large-model alignment training system based on meta-learning, characterized by comprising a task generation module, an adversarial-sample injection module, a meta-learning optimization module, a neural-activation analysis module, and a dynamic evaluation benchmark module; the task generation module constructs a three-level progressive task distribution comprising three categories of test tasks, the adversarial-intensity categories being weak, medium, and strong, each category corresponding to a non-overlapping spoofing-intensity interval whose values increase progressively with the adversarial intensity of the test task; the adversarial-sample injection module dynamically generates adversarial samples through a generative adversarial network (GAN) mechanism, adaptively adjusts the spoofing-intensity parameter β within the spoofing-intensity interval of each task category constructed by the task generation module according to the current capability of the model to be trained, concatenates the current input noise vector with the current spoofing-intensity parameter β as the input vector of the GAN's generator network, and generates multiple adversarial samples for test tasks of different adversarial intensities; the model to be trained is a large language model, and the processed data are text data and/or image data; the meta-learning optimization module performs outer-loop optimization of a meta-learning framework over the three-level progressive task distribution constructed by the task generation module, samples a number of specified test tasks from the specified task distribution, performs inner-loop adaptation on the sampled test tasks, computes the loss of the adapted model on a validation set, and updates the meta-parameters through gradient backpropagation; during inner-loop adaptation it records in real time the neural activation patterns, attention distribution, and self-other overlap evolution of the model to be trained, and constructs scene-decision-activation triplet data, each triplet comprising a text description of the adversarial scenario, the model's decision output, a neural activation pattern, a self-other overlap value, and a deception label; the neural-activation analysis module analyzes the triplet data through machine learning to identify neural signatures of deception, and establishes and outputs a mapping between deception patterns and neural activation patterns; the dynamic evaluation benchmark module integrates feedback signals of the dynamic evaluation benchmark into the meta-learning optimization module, and dynamically adjusts the training strategy of the meta-learning optimization module according to the deception detection rate and the capability retention.
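The FiLM-style conditioning of claim 4 can be sketched as a scale-and-shift of the noise vector by coefficients derived from β. The affine weights `w_gamma` and `w_delta` below are fixed toy values standing in for what would, in the patent's setting, be learned functions of the condition β.

```python
# Feature-wise linear modulation (FiLM) of the generator's noise input by the
# spoofing-intensity condition beta (claim 4). The coefficients here are
# illustrative constants, not the patent's learned parameters.
def film_condition(z, beta, w_gamma=0.5, w_delta=0.2):
    """Scale and shift each noise dimension: z' = gamma(beta)*z + delta(beta)."""
    gamma = 1.0 + w_gamma * beta   # hypothetical learned scale
    delta = w_delta * beta         # hypothetical learned shift
    return [gamma * zi + delta for zi in z]

z = [0.0, 1.0, -1.0]
z_weak = film_condition(z, beta=0.1)
z_strong = film_condition(z, beta=0.9)
# A stronger beta perturbs the noise vector more, steering the generator
# toward higher-intensity adversarial samples.
assert sum(abs(a - b) for a, b in zip(z_strong, z)) > \
       sum(abs(a - b) for a, b in zip(z_weak, z))
```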
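The inner/outer-loop structure of claim 5 follows the standard MAML pattern: adapt per task with a gradient step, then update the meta-parameters through the post-adaptation loss. The toy sketch below replaces the patent's large language model and composite loss with scalar quadratic "tasks" L(θ) = (θ − t)², so the meta-gradient can be written analytically; only the loop structure, not the model, reflects the claim.

```python
# Toy MAML inner/outer loop on scalar quadratic tasks (illustrative only).
ALPHA = 0.1   # inner-loop learning rate (claim 5's preset rate alpha)
ETA = 0.05    # outer-loop (meta) learning rate

def inner_adapt(theta, t):
    """One inner-loop step on task loss L(theta) = (theta - t)^2."""
    return theta - ALPHA * 2.0 * (theta - t)

def meta_step(theta, tasks):
    """Outer-loop update: differentiate the post-adaptation loss w.r.t. theta."""
    grad = 0.0
    for t in tasks:
        adapted = inner_adapt(theta, t)
        # Chain rule: d/dtheta (adapted - t)^2 = 2*(adapted - t)*(1 - 2*ALPHA)
        grad += 2.0 * (adapted - t) * (1.0 - 2.0 * ALPHA)
    return theta - ETA * grad / len(tasks)

theta = 0.0
tasks = [1.0, 2.0, 3.0]          # stand-ins for sampled test tasks T_i
for _ in range(200):
    theta = meta_step(theta, tasks)
# The meta-parameters converge toward the task mean, 2.0: a point from which
# one inner-loop step adapts well to every task in the distribution.
assert abs(theta - 2.0) < 1e-3
```

In the patent's setting the scalar loss would be the composite of alignment loss, self-other overlap loss, and adversarial-sample discrimination loss, and the meta-gradient would flow through backpropagation rather than the hand-derived chain rule used here.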
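The closed-loop intervention of claim 9 can be sketched as a threshold check on a composite efficiency. The patent does not fix how detection rate and capability retention are combined, so the harmonic mean below, the 0.6 threshold, and the specific adjustment magnitudes are all hypothetical choices for illustration.

```python
# Evaluation-training closed loop (claim 9): when a composite efficiency
# falls below a preset threshold, trigger the intervention mechanism.
def composite_efficiency(detect_rate, capability_retention):
    """Harmonic mean of the two signals -- an assumed combination rule."""
    if detect_rate + capability_retention == 0:
        return 0.0
    return 2 * detect_rate * capability_retention / (detect_rate + capability_retention)

def maybe_intervene(detect_rate, retention, threshold=0.6, state=None):
    """Apply claim 9's interventions to a (hypothetical) training state."""
    state = dict(state or {"beta_boost": 0.0, "inner_lr": 0.1, "overlap_w": 0.05})
    if composite_efficiency(detect_rate, retention) < threshold:
        state["beta_boost"] += 0.05   # raise beta within its intensity interval
        state["inner_lr"] *= 0.5      # reduce the inner-loop learning rate alpha
        state["overlap_w"] *= 2.0     # upweight the self-other overlap loss
    return state

s = maybe_intervene(detect_rate=0.4, retention=0.5)  # efficiency ~0.44 < 0.6
assert s["beta_boost"] > 0.0 and s["inner_lr"] < 0.1
```

Claim 9's remaining intervention, reweighting the current task category in the sampling distribution, would act on the task sampler rather than on this scalar state, and is omitted from the sketch.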
Description
Dynamic adversarial large-model alignment training method and system based on meta-learning
Technical Field
The invention relates to the technical field of artificial intelligence (AI) safety and alignment, in particular to a dynamic adversarial large-model alignment training method and system based on meta-learning.
Background
As large language models (LLMs, hereinafter "large models") such as GPT and Claude exhibit surprisingly versatile capabilities, research into large-model alignment techniques is becoming more urgent. Traditional alignment evaluation mainly relies on static benchmark tests, assessing a model's alignment through a predefined set of test samples. This approach, however, has significant limitations. First is the problem of test-sample solidification: the sample library of a static benchmark never changes once fixed, so a model can obtain an inflated evaluation score through memorization or overfitting, and its alignment capability in dynamic environments cannot be truly reflected. Related studies have shown that advanced commercial models perform well in certain test scenarios, yet their deception rates can reach as high as 85% in out-of-distribution scenarios. Second is the problem of single adversarial intensity: existing test methods generally use adversarial samples of fixed strength and cannot systematically evaluate model robustness across different adversarial intensities, so a model may perform well under weak attack but collapse rapidly under strong attack. Third is the lack of mechanistic interpretation: traditional testing focuses only on the model's output behavior and cannot reveal the internal decision mechanism and the neural basis of deceptive tendencies.
Tests by research institutions show that, when facing goal conflicts, models attempt to shut down the supervision mechanism and afterwards deny the behavior, yet the internal mechanisms remain unclear. Fourth is the problem of separating evaluation from training: existing methods treat alignment evaluation as an independent stage after training, so no evaluation-training closed loop is formed; the training process lacks real-time feedback, and alignment problems cannot be discovered and corrected in time. Given the current lack of a systematic meta-learning framework and of neural-activation analysis, there is a need for a new training environment that can dynamically generate adversarial samples, systematically evaluate alignment robustness, reveal the neural mechanisms of deception, and form an evaluation-training closed loop.
Disclosure of Invention
The invention aims to provide a dynamic adversarial large-model alignment training method and system based on meta-learning, which, by simulating the complex alignment challenges of the real world and systematically testing and optimizing the alignment robustness of an AI model, solves the problem that traditional static testing cannot capture dynamic deceptive behavior.
The invention provides a dynamic adversarial large-model alignment training method based on meta-learning, comprising the following steps: Step 1: construct a three-level progressive task distribution comprising three categories of test tasks, the adversarial-intensity categories being weak, medium, and strong; each category corresponds to a non-overlapping spoofing-intensity interval, and the values of the spoofing-intensity intervals increase progressively with the adversarial intensity of the test task; generate a test-task set covering the three categories, each test task comprising an input scenario, an expected output, and an alignment constraint; Step 2: dynamically generate adversarial samples through a generative adversarial network (GAN) mechanism, adaptively adjusting the spoofing-intensity parameter β within the spoofing-intensity interval of each task category according to the current capability of the model to be trained; concatenate the current input noise vector with the current spoofing-intensity parameter β as the input vector of the GAN's generator network, and generate multiple adversarial samples for test tasks of different adversarial intensities; the model to be trained is a large language model, and the processed data are text data and/or image data; Step 3: perform outer-loop optimization of the meta-learning framework (MAML) over the three-level progressive task distribution: sample a number of specified test tasks from the specified task distribution, perform inner-loop adaptation (i.e. model fine-tuning) on the sampled test tasks, and calculate the loss of the adapted model on a ver