CN-122020653-A - Large model instruction attack sample generation method and system

CN122020653ACN 122020653 ACN122020653 ACN 122020653ACN-122020653-A

Abstract

The invention discloses a large model instruction attack sample generation method and a system, wherein the method firstly generates a strategy output attack instruction according to a current attack instruction sample; the method comprises the steps of generating a large model, collecting a large model, submitting the generated and output instruction to the large model, collecting a returned result of the large model, carrying out harmful evaluation on the collected returned result of the large model, returning a success or failure label, updating a generation strategy according to a preset reward function, and storing a high-value sample in a sample set. According to the scheme provided by the invention, the attack prompt can be automatically generated and optimized according to the feedback of the large model, the effectiveness and diversity of the attack sample can be improved, and the automatic evaluation of the attack success rate is realized, so that the safety evaluation capability of the large model is improved.

Inventors

CAO SIWEI
CHEN GUANGYONG
WANG XUE

Assignees

公安部第三研究所

Dates

Publication Date: 20260512
Application Date: 20250919

Claims (10)

1. A method for generating a large model instruction attack sample, the method comprising: (1) Generating a strategy output attack instruction according to the current attack instruction sample; (2) Submitting the generated and output instruction to the large model and collecting a returned result of the large model; (3) Carrying out harmful evaluation on the collected returned result of the large model and returning a success or failure label; (4) And updating the generation strategy according to a preset reward function, and storing the high-value sample into a sample set.
2. The method for generating large model instruction attack samples according to claim 1, wherein in the step (1), the action space is constructed by predefined mutation operators, and the calling frequency of each operator in the promt generation is controlled by a strategy weight, so that the countermeasure samples are automatically and structurally generated.
3. The method for generating large model command attack samples according to claim 2, wherein the deep neural network classification model trained based on the harmful information data set in the step (3) is combined with a manual rule engine to automatically determine the harmfulness of the large model returned text, and generates an "attack success/attack failure" label and a confidence score for downstream reward calculation and classifier retraining.
4. The method for generating large model instruction attack samples according to claim 3, wherein in the step (4), the label and the confidence coefficient generated in the step (3) are mapped into a function of positive or negative scalar rewards, so as to complete quantitative evaluation of each anti-hint attack effect, and generate a scalar rewards signal.
5. The method of claim 4, wherein in the step (4), the weights of the mutation operators are automatically updated according to the latest rewarding signal and the historical performance smoothing result, and the sampling probability is dynamically adjusted to quickly converge to the optimal attack path under different model versions and application scenarios.
6. The large model instruction attack sample generation system is characterized by comprising an instruction attack sample generation module, a reward feedback module, an output judgment classifier module and a sample variation strategy module; the instruction attack sample generation module is configured to generate a strategy output attack instruction according to a current attack instruction sample; the rewarding feedback module is configured to interact with the instruction attack sample generation module and the output judgment classifier module in data, can submit the attack instruction generated by the instruction attack sample generation module to the large model, and can generate a corresponding rewarding signal according to the evaluation label generated by the output judgment classifier module; the output judgment classifier module can acquire a large model return result, evaluate the harmfulness and return a success or failure label; The sample mutation strategy module is configured to interact with the instruction attack sample generation module and the rewarding feedback module in data, and can update the attack instruction sample generation strategy based on a preset rewarding function according to the rewarding signal fed back by the rewarding feedback module, and meanwhile, store the high-value sample into the sample set.
7. The large model instruction attack sample generation system of claim 6 in which the instruction attack sample generation module is configured to build an action space by predefined mutation operators and control the frequency of invocation of each operator in the Prompt generation with policy weights, thereby to automatically, structurally generate the challenge sample.
8. The large model instruction attack sample generation system of claim 6 wherein the output decision classifier module is based on a deep neural network classification model trained on a harmful information data set and in combination with a manual rules engine, performs automated harm decision on large model return text, generates an "attack success/failure" tag and a confidence score for downstream reward calculation and classifier retraining.
9. The large model instruction attack sample generation system of claim 8 wherein the reward feedback module is configured to map classifier labels and confidence levels as a function of positive or negative scalar rewards, complete a quantitative evaluation of each anti-hint attack effect, generate a scalar reward signal, and communicate the reward signal to the sample variation policy module in real time, affording the benefit.
10. The large model instruction attack sample generation system according to claim 9, wherein the sample mutation policy module automatically updates each mutation operator weight according to the latest rewarding signal and the historical performance smoothing result, and dynamically adjusts the sampling probability to quickly converge to the optimal attack path under different model versions and application scenarios.

Description

Large model instruction attack sample generation method and system Technical Field The invention relates to the technical field of artificial intelligence system security assessment and AI attack and defense, in particular to a command attack sample generation scheme based on reinforcement learning. Background With the wide application of large models in the fields of natural language processing, intelligent question answering, code generation and the like, the security problem of the large models is increasingly focused. Since the large model (Large Language Models) has a strong language generation capability, if not controlled effectively, harmful, false or improper content may be output, bringing serious security risks. Currently, the industry mainly carries out security assessment on a large model through red team testing and other modes. Team testing typically relies on artificially constructed attack cues (Prompts) to induce the model to generate inappropriate content, thereby evaluating the security capabilities of the model. However, the method has the problems of high construction cost, limited coverage, lagging update and the like, and is difficult to cope with the challenges of rapid iteration of a large model and diversified application scenes. To increase the efficiency of attack hint generation, researchers have attempted to generate challenge samples using automated methods. For example, the BERT-atlack method utilizes a pre-trained BERT model to generate countermeasures, successfully misdirects the target model to make a misprediction, and the generated samples perform well in terms of language fluency and semantic retention. In addition, researchers have also proposed the attack tip generation framework that combines manual and automated methods, and through contextual learning, the large model mimics the human generated tip, thereby improving the quality and diversity of attack tips. However, existing automated attack hint generation schemes still have the following problems: 1. The adaptability of the attack sample is poor, and the fixed attack sample is difficult to adapt to a new model due to the continuous change of the architecture and training data of a large model, so that the attack effect is reduced. 2. The lack of an efficient automated evaluation mechanism is that the accuracy of most of the existing automated tools is low, and most of the existing automated tools also need to rely on manual review to judge whether the attack is successful or not, so that the efficiency is low, and subjective deviation is easy to introduce. 3. Sample generation lacks a feedback mechanism, namely, when an attack sample is generated by the existing method, the existing method lacks an optimization mechanism based on model feedback, and the attack effect is difficult to continuously improve. The method and the system have the advantages that the existing large-model safety evaluation exposes various technical bottlenecks in practical application, firstly, the writing difficulty and accuracy of attack prompts depend on experience or static scripts of safety specialists, so that an antagonism prompt set is difficult to timely cover new model versions and diversified application scenes, the comprehensiveness and timeliness of the evaluation are affected, secondly, a closed loop feedback mechanism is absent between prompt generation and evaluation result judgment, dynamic optimization prompt strategies cannot be output according to actual models, local adjustment can only be carried out by relying on manual compound disks or offline scripts, the efficiency is low, subjective deviation is easy to introduce, and in addition, maintenance and update of an antagonism sample library mainly depend on periodic manual checking operation, new shortages brought by model architecture and training data iteration are not reflected, so that the sample set is lagged behind model development. The defects restrict the deep mining of the security risk of the large model language model by the evaluation system, and reduce the accuracy and reliability of the evaluation. Therefore, a method for automatically generating and optimizing attack prompts according to feedback of a large model is needed to improve effectiveness and diversity of attack samples and realize automatic evaluation of attack success rate, so that safety evaluation capability of the large model is improved. Disclosure of Invention Aiming at the problems of the existing automatic attack prompt generation scheme for large model safety evaluation, the invention aims to provide the large model instruction attack sample generation scheme which can automatically generate and optimize attack prompts according to feedback of a large model, can improve the effectiveness and diversity of attack samples, and can realize automatic evaluation of attack success rate, thereby improving the safety evaluation capability of the large model. In order to achieve the above ob