
CN-121997335-A - Reinforcement-learning-based large language model jailbreak testing method

CN 121997335 A

Abstract

The invention belongs to the field of reinforcement learning within artificial intelligence and the field of natural language processing, and discloses a large language model jailbreak testing method based on reinforcement learning. The continuous composite reward comprises three parts: (1) token-level refusal probability, (2) semantic unsafety probability from an external safety model, and (3) multi-anchor semantic alignment, yielding continuous, dense, and stable training feedback. The test model is updated with the Group Relative Policy Optimization (GRPO) reinforcement learning method, significantly improving the jailbreak success rate and reducing query cost. The invention offers stable training, strong cross-model generalization, and good performance against strictly safety-aligned models.

Inventors

  • XU YANG
  • ZHANG LEI
  • WANG JIAJIE
  • ZHANG SICONG

Assignees

  • Guizhou Normal University (贵州师范大学)
  • China Information Technology Security Evaluation Center (中国信息安全测评中心)

Dates

Publication Date
2026-05-08
Application Date
2026-01-13

Claims (5)

  1. A reinforcement-learning-based large language model jailbreak testing method, used in the field of prompt-based jailbreak testing of large language models, characterized by comprising the following steps: Step 1: input an initial test text, call a test model to generate multiple groups of candidate suffixes, splice each candidate suffix with the initial test text to form a test request, and send the test request to a target model to obtain output content; Step 2: compute a continuous composite reward based on the output content returned by the target model, the reward comprising a token-level refusal score, an unsafe-probability score from an external safety model, and a multi-anchor semantic alignment score, combined by weighting into a complete reward signal; Step 3: update the parameters of the test model with the Group Relative Policy Optimization (GRPO) reinforcement learning method, constructing relative advantage signals via within-group normalization, stably amplifying the reward by combining an importance-sampling ratio with a clipping strategy, and controlling policy distribution drift with a KL regularization constraint against a reference policy; Step 4: repeat Steps 1 through 3, iteratively updating the test policy until training converges and jailbreak suffixes that can stably induce the target model to generate unsafe content are obtained.
  2. The method for jailbreak testing of a large language model based on reinforcement learning of claim 1, wherein in Step 1 the test input content is fed into the test model, the test model generates a plurality of candidate suffixes, and each suffix is appended to the test input content to form a test request, which is input to the target model to obtain the model output.
  3. The method for jailbreak testing of a large language model based on reinforcement learning of claim 1, wherein Step 2 comprises: first, computing a token-level refusal score, based on semantic analysis, on the output sequence returned by the target model, to measure the model's tendency to emit refusal vocabulary at the leading positions of the sequence; the score accumulates, over a set of refusal-dictionary words, the probability of generating each refusal word at each of the leading output positions, weighted by position weights that decrease with sequence position and sum to one, so as to emphasize the refusal tendency of the leading tokens; second, performing a safety judgment on the output text with an external safety model to obtain the probability that the output belongs to the "unsafe" category, taken as the unsafe-probability score, where the safety classifier judges from the content semantics whether an unsafe category is triggered; then, computing a multi-anchor semantic alignment score measuring the similarity between the output text and a preset set of jailbreak semantic anchors: the output is embedded by a text encoder, projected by a principal component analysis projection matrix, and compared to each anchor by cosine similarity, with a heat-kernel term, parameterized by a kernel-width parameter and a heat-kernel weight, used to attract alignment in semantic space and suppress mode collapse; finally, the three parts are combined by weighting into the continuous composite reward, which provides a dense, discriminative training signal that drives the test model toward stable, controllable jailbreak behavior learning.
  4. The method for jailbreak testing of a large language model based on reinforcement learning of claim 1, wherein Step 3 comprises: in round t of training, the test model uses the current policy to generate a group of candidate suffixes, each spliced with the corresponding test input content to form an input sequence; the input sequences are sent to the target model to obtain output texts, and a reward value is computed for each output using the continuous composite reward function of Step 2; the rewards are then normalized within the same round to construct the relative advantage signal: the group-relative advantage of each candidate equals its reward minus the mean reward of the round's group, divided by the group standard deviation plus a small stabilizing term that prevents a zero denominator, and measures the quality of each sample relative to the group's average performance; next, the previous round's policy is taken as the old policy, and an importance-sampling ratio is constructed over the action sequence corresponding to each candidate suffix, namely the probability of the current policy generating the action sequence given the input, divided by the corresponding probability under the old policy; from the ratio and the group-relative advantage, a policy optimization objective in the style of truncated (clipped) importance sampling is constructed, in which a clipping-range hyperparameter limits how far the ratio may deviate from 1, and a KL regularization weight together with a reference model distribution constrains the test policy from deviating excessively from the original language model via the Kullback-Leibler divergence; by performing gradient-descent updates of the test policy parameters on this objective, the probability of candidate suffixes with higher rewards is increased and that of candidate suffixes with lower rewards is suppressed under the guidance of the group-relative advantage signals, while the KL regularization term preserves generation quality and language fluency, thereby achieving stable reinforcement-learning optimization of the test policy under a limited query budget.
  5. The method for jailbreak testing of a large language model based on reinforcement learning of claim 1, wherein Step 4 comprises: in each training round, the updated test-model policy regenerates candidate suffixes, new output texts are obtained by querying the target model, and the corresponding continuous composite rewards are computed; convergence criteria are constructed from the unsafe probability of the target model's output, the jailbreak success rate, and the magnitude of the policy update, and convergence is declared when any one of the following conditions is satisfied: the change in reward falls below a reward-change threshold, the policy drift falls below a policy-drift threshold, or the unsafe-generation success rate, as judged by the external safety model, reaches a preset cross-safety-model jailbreak success-rate target; when the test policy satisfies any one of these conditions, it is deemed to have achieved stable jailbreak capability and the policy-update training flow stops; the converged policy, representing a successful strategy, is then used to generate the final set of jailbreak suffixes, which can stably induce the target model to produce unsafe content without further training, thereby completing the jailbreak test.
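
The composite reward of claim 3 can be illustrated with a minimal sketch. All names (`refusal_score`, `composite_reward`, the `REFUSAL_WORDS` set), the placeholder weights, and the omission of the PCA projection are illustrative assumptions, not the patent's exact formulas.

```python
import math

# Hypothetical sketch of the continuous composite reward of claim 3.
# Weights, the refusal dictionary, and all stubs are illustrative only.

REFUSAL_WORDS = {"sorry", "cannot", "unable", "refuse"}

def refusal_score(token_probs, k=5):
    """Token-level refusal score: probability mass on refusal words at the
    first k output positions, with position weights that decrease and sum to 1."""
    weights = [2 * (k - i) / (k * (k + 1)) for i in range(k)]
    score = 0.0
    for w, dist in zip(weights, token_probs[:k]):
        score += w * sum(dist.get(t, 0.0) for t in REFUSAL_WORDS)
    return score

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def alignment_score(embedding, anchors, sigma=1.0):
    """Multi-anchor semantic alignment: heat-kernel-weighted cosine similarity
    of the output embedding to a set of jailbreak semantic anchors."""
    sims = [cosine(embedding, a) for a in anchors]
    # heat-kernel weights attract alignment and damp collapse onto one anchor
    ks = [math.exp(-((1 - s) ** 2) / (2 * sigma ** 2)) for s in sims]
    z = sum(ks)
    return sum(k * s for k, s in zip(ks, sims)) / z if z else 0.0

def composite_reward(token_probs, unsafe_prob, embedding, anchors,
                     w_refusal=0.3, w_unsafe=0.5, w_align=0.2):
    """Weighted combination: penalize refusal, reward judged unsafety
    and alignment with the jailbreak anchors."""
    return (-w_refusal * refusal_score(token_probs)
            + w_unsafe * unsafe_prob
            + w_align * alignment_score(embedding, anchors))
```

In practice `token_probs` would come from the target model's per-position token distributions, `unsafe_prob` from the external safety classifier, and `embedding`/`anchors` from a text encoder; the sketch only shows how the three signals combine into one dense scalar.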
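
The group-relative advantage and clipped objective of claim 4 can be sketched as follows; function names, hyperparameter values (`clip`, `kl_weight`), and the per-sample KL estimate are illustrative assumptions rather than the patent's formulas.

```python
import math

# Hypothetical sketch of claim 4: within-group reward normalization and a
# clipped importance-sampling surrogate with a KL penalty toward a reference.

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantage: (r - group mean) / (group std + eps),
    where eps prevents a zero denominator."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]

def grpo_objective(logp_new, logp_old, logp_ref, advantages,
                   clip=0.2, kl_weight=0.05):
    """Clipped surrogate averaged over the group (to be maximized):
    min(ratio * A, clip(ratio) * A) - kl_weight * KL-estimate."""
    total = 0.0
    for ln, lo, lr, adv in zip(logp_new, logp_old, logp_ref, advantages):
        ratio = math.exp(ln - lo)                      # pi_new / pi_old
        clipped = max(min(ratio, 1 + clip), 1 - clip)  # keep ratio near 1
        surrogate = min(ratio * adv, clipped * adv)    # pessimistic choice
        kl = ln - lr          # crude per-sample estimate of KL(new || ref)
        total += surrogate - kl_weight * kl
    return total / len(advantages)
```

Because the advantage is normalized within the round, a candidate suffix is only rewarded for beating the other suffixes generated in that same round, which matches the group-relative character of GRPO described in the claim.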
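
Claim 5's stopping rule, where training halts when any one criterion holds, reduces to a simple disjunction; the threshold names and default values below are illustrative assumptions, not values from the patent.

```python
# Hypothetical sketch of claim 5's convergence check: stop when ANY of the
# three criteria is met. Thresholds are placeholders, not the patent's.

def converged(reward_delta, policy_drift, unsafe_success_rate,
              reward_tol=1e-3, drift_tol=1e-3, success_target=0.9):
    """reward_delta: change in mean composite reward between rounds;
    policy_drift: magnitude of the policy update (e.g. a KL estimate);
    unsafe_success_rate: jailbreak success rate as judged by the
    external safety model."""
    return (abs(reward_delta) < reward_tol
            or policy_drift < drift_tol
            or unsafe_success_rate >= success_target)
```

Using a disjunction means the loop can stop early as soon as the success-rate target is reached, even if the reward is still moving, which is what keeps the query budget bounded.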

Description

Reinforcement-learning-based large language model jailbreak testing method

Technical Field

The invention belongs to the field of reinforcement learning within artificial intelligence and the field of natural language processing, in particular to policy optimization methods in reinforcement learning and semantic analysis in natural language processing, and more particularly relates to a reinforcement-learning-based large language model jailbreak testing method in the direction of Large Language Model (LLM) security testing.

Background

Large language models (LLMs) have in recent years been widely deployed in dialog systems, decision support, code generation, and other fields, and the security and controllability of their generation behavior has become a key issue in artificial intelligence system design. In practice, an attacker can craft specific prompts that steer a model around its built-in safety constraints so that it outputs non-compliant content; such attack behavior is commonly called a jailbreak attack. With the wide deployment of large language models in question-answering systems, intelligent customer service, content generation, decision support, and similar fields, detecting whether a model risks evading refusal and producing unsafe output when faced with malicious prompts has become a core problem in building model safety systems. Attack testing and adversarial evaluation of large language model security provide a reliable basis for improving defense strategies, enhancing robustness, and quantifying safety capability. Existing jailbreak attack methods can be broadly divided into the following three categories: (1) White-box attack methods based on gradient optimization.
Representative methods include GCG (Greedy Coordinate Gradient), COLD-Attack, and others, which iteratively optimize input prompts or suffixes by accessing gradient information of the target model or a proxy model to weaken the model's refusal behavior. Although such methods optimize quickly, they rely on the model's internal structure and are hard to apply to the strict black-box scenarios of commercial deployment. The patent application CN202510906532.X discloses a large language model jailbreak attack method based on implicit gradient optimization, which uses the Gumbel-Softmax technique to achieve continuous gradient optimization of adversarial tokens and improve the attack success rate against large models. Such a method cannot attack an inaccessible target model, often depends too heavily on the model's internal structure, and has long attack iteration times. (2) Attack methods based on evolution or random search, for example the AutoDAN series and TAP (Tree of Attacks with Pruning), which generate candidate prompts through random mutation, population evolution, or tree-structured search. For example, the patent application CN202411138306.3 discloses a method and system for automatically generating large-model jailbreak prompts based on persuasion techniques, in which a large model automatically generates the jailbreak prompts and different strategies are selected, according to the large model's answers, to vary the persuasion technique and achieve a jailbreak. The patent application CN20241153724.X discloses a large-model attack method, apparatus, electronic device, and storage medium, applied to a red-team attack model that attacks a model under test by gradually rewriting seed questions over multiple rounds.
Such methods have the advantage of not requiring access to model parameters, but they suffer from low search efficiency, high query counts, and unstable convergence, and are hard to use effectively in a query-limited environment. (3) Adversarial prompt generation methods based on reinforcement learning. These include RLbreaker, RLTA, and other preliminary explorations, whose basic idea is to construct a reward signal from the model's output and generate jailbreak suffixes through a reinforcement-learning policy. However, existing reinforcement-learning methods generally depend on sparse or binary reward mechanisms and cannot characterize fine-grained features of the jailbreak process (such as early refusal tendency, semantic consistency, and potential harmfulness), so the training process exhibits marked oscillation, poor convergence, and weak transferability. For example, the patent application CN202510481810.1 discloses a method and system for generating large-model jailbreak attack test samples, comprising constructing multiple data sets and training a safety protection remov