CN-122020144-A - Jail-break attack detection and mitigation method and device based on multi-level embedding analysis

Abstract

The invention discloses a jail-break attack detection and mitigation method and device based on multi-level embedding analysis. The method comprises multi-level embedding computation and key-layer localization, anchor vector generation, jail-break input determination with anchor vector updating, and diffusion-model-driven jail-break mitigation. By computing the L2-norm difference at each embedding layer and extracting principal components through singular value decomposition, the multi-level embedding computation overcomes the limitations of traditional shallow detection, so that jail-break attacks can be accurately identified and mitigated. The method blends recently encountered jail-break attacks into the anchor vectors, so that the attack detection benchmark is continuously optimized and the ability to capture novel jail-break attack patterns improves. A diffusion model with a U-Net variant architecture learns the mapping from anomalous embeddings to safe embeddings. When an attack is detected, the model predicts the noise perturbation of the anomalous layer and performs denoising, restoring the jail-break embedding to a benign semantic representation.
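The detection stage summarized above (per-layer L2-norm comparison, then a principal direction from singular value decomposition) can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the toy embeddings and the threshold are hypothetical, the difference matrix is assumed to hold harmful sample embeddings minus the benign mean, and power iteration on AᵀA stands in for a full SVD routine. For a single input, the difference matrix has one row, so its first right singular vector is simply that row normalized (up to sign); the sketch therefore uses the plain difference vector as the input's direction.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equal-length embedding vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def l2_distance(a, b):
    """L2 norm of the difference between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def key_layer(input_by_layer, mean_by_layer, threshold):
    """Index of the layer where the input is farthest (L2) from the reference
    mean, or None when even the largest distance does not exceed the threshold."""
    dists = [l2_distance(x, m) for x, m in zip(input_by_layer, mean_by_layer)]
    best = max(range(len(dists)), key=lambda i: dists[i])
    return best if dists[best] > threshold else None

def first_right_singular_vector(rows, iters=500):
    """Power iteration on A^T A (a stand-in for a full SVD).
    Assumes the start vector is not orthogonal to the dominant direction."""
    dim = len(rows[0])
    v = [1.0 + 0.01 * j for j in range(dim)]
    for _ in range(iters):
        av = [sum(r * x for r, x in zip(row, v)) for row in rows]              # A v
        w = [sum(row[j] * a for row, a in zip(rows, av)) for j in range(dim)]  # A^T (A v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("difference matrix has no dominant direction")
        v = [x / norm for x in w]
    return v

def anchor_vector(sample_embeddings, reference_mean):
    """Anchor = first right singular vector of the rows (sample - reference_mean)."""
    diff = [[s - m for s, m in zip(e, reference_mean)] for e in sample_embeddings]
    return first_right_singular_vector(diff)

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy last-token embeddings (2 layers, 2 dimensions); all values are made up.
benign_by_layer = [[[0.0, 0.1], [0.0, 0.1]], [[0.0, 0.1], [0.1, 0.0]]]
harmful_by_layer = [[[0.1, 0.1], [0.0, 0.2]], [[1.0, 0.1], [1.2, 0.0], [0.9, 0.05]]]
benign_means = [mean_vector(layer) for layer in benign_by_layer]
input_by_layer = [[0.0, 0.0], [1.0, 0.0]]

layer = key_layer(input_by_layer, benign_means, threshold=0.5)
anchor = anchor_vector(harmful_by_layer[layer], benign_means[layer])
input_direction = [x - m for x, m in zip(input_by_layer[layer], benign_means[layer])]
score = cosine(input_direction, anchor)
print(layer, score)
```

A `score` close to 1 means the input departs from the benign mean along the same direction as the harmful samples do. In practice a numerical library's SVD would replace the power iteration, and the score would be compared against a calibrated threshold rather than inspected by eye.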

Inventors

WANG XUN
Shou Xuemian
QIAN PENG
LI YUFENG

Assignees

Zhejiang Gongshang University (浙江工商大学)

Dates

Publication Date: 2026-05-12
Application Date: 2026-04-14

Claims (10)

1. A jail-break attack detection and mitigation method based on multi-level embedding analysis, characterized by comprising the following steps: S1, calculating the benign average embedding and the harmful average embedding at each layer of a large language model, and calculating the L2-norm difference between the input and each average embedding to determine a harmful key layer and a jail-break key layer; S2, calculating the difference matrices of the benign embeddings and the harmful embeddings in the harmful key layer and the jail-break key layer, and extracting principal components by singular value decomposition to obtain a harmful anchor vector and a jail-break anchor vector; S3, constructing a difference matrix between the last-token embedding of the input at the harmful key layer and the benign average embedding, constructing a difference matrix between the last-token embedding of the input at the jail-break key layer and the harmful average embedding, and performing singular value decomposition on each of the two difference matrices to obtain a harmful principal component and a jail-break principal component; S4, constructing a diffusion model for mitigating jail-break attacks, constructing a data set to train the diffusion model, using the trained model to denoise the jail-break key layer embedding, which is treated as T-step noise, and generating a sample close to the real data distribution to achieve jail-break mitigation.
2. The jail-break attack detection and mitigation method based on multi-level embedding analysis according to claim 1, wherein S1 comprises: calculating the last-token embedding of each benign input sample at the l-th layer and averaging the results; calculating the last-token embedding of each harmful input sample at the l-th layer and averaging the results; and selecting, as the harmful key layer, the layer at which the L2 norm between the input and the benign average embedding is largest and exceeds a preset threshold, and selecting, as the jail-break key layer, the layer at which the L2 norm between the input and the harmful average embedding is largest and exceeds the preset threshold.
3. The jail-break attack detection and mitigation method based on multi-level embedding analysis according to claim 1, wherein the harmful anchor vector is obtained by constructing a difference matrix of the benign embeddings and the harmful embeddings in the harmful key layer, performing singular value decomposition, and selecting the first column of the right singular vectors as the harmful anchor vector; the jail-break anchor vector is obtained by constructing a difference matrix of the jail-break embeddings and the harmful embeddings in the jail-break key layer and performing singular value decomposition; the harmful principal component is obtained by performing singular value decomposition on the difference matrix between the last-token embedding of the input at the harmful key layer and the average embedding of benign inputs at the harmful key layer; and the jail-break principal component is obtained by performing singular value decomposition on the difference matrix between the last-token embedding of the input at the jail-break key layer and the average embedding of harmful inputs at the jail-break key layer.
4. The jail-break attack detection and mitigation method based on multi-level embedding analysis according to claim 1, wherein judging whether the input sequence constitutes a jail-break attack is specifically as follows: the cosine similarity between the harmful principal component vector and the harmful anchor vector is calculated as a first judgment parameter, and the cosine similarity between the jail-break principal component vector and the jail-break anchor vector is calculated as a second judgment parameter; the input sequence is judged to constitute a jail-break attack when the first judgment parameter is greater than the sensitivity and the second judgment parameter is greater than the specificity; the sensitivity and the specificity are each determined by the Youden index, the sensitivity being the rate at which target samples are correctly detected and the specificity being the rate at which non-target samples are correctly left undetected.
5. The jail-break attack detection and mitigation method based on multi-level embedding analysis according to claim 1, wherein the diffusion model for mitigating jail-break attacks adopts a U-Net variant architecture as its backbone, comprising an encoder, a bottleneck layer and a decoder; time-step embedding conditioning is introduced into each layer, and semantic fidelity is enhanced through residual connections; the convolution layers of the traditional U-Net are replaced with fully connected layers, the time-step embedding is introduced into each network layer, and skip connections are added between corresponding layers of the encoder and the decoder.
6. The jail-break attack detection and mitigation method based on multi-level embedding analysis according to claim 5, wherein the time-step embedding is constructed by randomly selecting one time step from a preset time-step range and building the corresponding time-step embedding vector through sine or cosine encoding; specifically, a sine time-step encoding is used when the dimension index of the embedding vector is even, and a cosine time-step encoding is used when the dimension index of the embedding vector is odd.
7. The jail-break attack detection and mitigation method based on multi-level embedding analysis according to claim 1, wherein the data set used in training the diffusion model is constructed as follows: based on the benign inputs, harmful inputs and jail-break inputs in the training sample set, the last-token embedding of a benign input at layer 0 is used as the step-0 benign embedding; the last-token embedding of a jail-break input at the l-th layer is used as the step-t jail-break embedding; and the current layer number is treated as the time step, so as to construct a training set comprising the step-0 benign embedding, the step-t jail-break embedding and the time step.
8. The jail-break attack detection and mitigation method based on multi-level embedding analysis according to claim 1, wherein denoising, as T-step noise, the jail-break key layer embedding of an input judged to be a jail-break attack is specifically as follows: after the jail-break attack is detected, the jail-break key layer j is located and the corresponding embedding is regarded as T-step noise; iterative denoising is carried out according to the noise predicted by the model, with a random noise term added in each iteration; and a high-quality sample close to the real data distribution is gradually generated from the random noise through an iterative formula.
9. A jail-break attack detection and mitigation device based on multi-level embedding analysis, comprising a memory and one or more processors, the memory having executable code stored therein, wherein the one or more processors, when executing the executable code, implement the jail-break attack detection and mitigation method based on multi-level embedding analysis according to any one of claims 1 to 8.
10. A computer-readable storage medium having a program stored thereon, wherein the program, when executed by a processor, implements the jail-break attack detection and mitigation method based on multi-level embedding analysis according to any one of claims 1 to 8.
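The mitigation side of the claims (sine/cosine time-step encoding and iterative denoising with a random noise term at every iteration) can be illustrated with a minimal sketch. Several parts are assumptions, not taken from the claims: the geometric frequency schedule with base 10000 is borrowed from Transformer positional encodings, the linear beta schedule and the update rule follow standard DDPM ancestral sampling, and `toy_eps_model` is a hypothetical placeholder for the trained fully connected U-Net variant.

```python
import math
import random

def timestep_embedding(t, dim, base=10000.0):
    """Sine at even dimension indices, cosine at odd ones.
    The frequency schedule (base ** (-2*(i//2)/dim)) is an assumption."""
    emb = []
    for i in range(dim):
        freq = base ** (-(2 * (i // 2)) / dim)
        emb.append(math.sin(t * freq) if i % 2 == 0 else math.cos(t * freq))
    return emb

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (assumed; the claims do not fix one)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + k * step for k in range(num_steps)]

def denoise(x_T, T, eps_model, emb_dim=8, rng=None):
    """DDPM-style ancestral sampling: treat x_T as step-T noise and walk back
    to step 0, subtracting predicted noise and adding fresh Gaussian noise."""
    rng = rng or random.Random(0)
    betas = linear_betas(T)
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)
    x = list(x_T)
    for t in range(T, 0, -1):
        beta, alpha, alpha_bar = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
        eps = eps_model(x, timestep_embedding(t, emb_dim))
        coef = beta / math.sqrt(1.0 - alpha_bar)
        mean = [(xi - coef * ei) / math.sqrt(alpha) for xi, ei in zip(x, eps)]
        sigma = math.sqrt(beta) if t > 1 else 0.0  # no extra noise on the last step
        x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return x

# Hypothetical placeholder for the trained noise predictor: it treats the
# deviation from a made-up benign mean as noise and ignores the time embedding.
benign_mean = [0.0, 0.0, 0.0]

def toy_eps_model(x, t_emb):
    return [xi - bi for xi, bi in zip(x, benign_mean)]

# The key-layer index plays the role of T, following the layer-as-time-step idea.
jailbreak_embedding = [1.5, -0.8, 2.0]
restored = denoise(jailbreak_embedding, T=4, eps_model=toy_eps_model, rng=random.Random(0))
print(restored)
```

With a placeholder predictor the restored values carry no semantic meaning; the sketch only shows the control flow, in which the key-layer embedding enters as the starting point and each step combines the predicted noise with a fresh random term.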

Description

Jail-break attack detection and mitigation method and device based on multi-level embedding analysis

Technical Field

The invention relates to the technical field of artificial intelligence security, and in particular to a jail-break attack detection and mitigation method and device based on multi-level embedding analysis.

Background

In recent years, as large language models have been widely applied to dialogue generation, code writing, content creation and similar scenarios, jail-break attacks against these models have become a major threat in the field of artificial intelligence security. In a jail-break attack, an attacker carefully designs input prompts that induce the model to bypass its original alignment policy and security boundaries and to generate responses that are harmful, sensitive or in violation of usage policies. Such attacks are often disguised as ordinary interactive instructions; they are highly concealed and misleading, which greatly increases the difficulty of detection and may lead to serious ethical and security consequences. By constructing ingenious prompts, the attacker gets the model to bypass its built-in security rules or alignment mechanisms and to output prohibited or sensitive content that would otherwise be restricted, such as violent threats, ethnic discrimination, guidance for illegal acts and privacy disclosure. These attacks usually require no access to model parameters and can trigger high-risk output through input sentences alone. They are strongly black-box in nature, flexible in form and difficult to detect, and they pose a substantial threat to the widespread deployment of large language models.
In response to this challenge, researchers have proposed solutions at both the detection and defense levels, such as keyword filtering, refusal templates, safety fine-tuning and adversarial training, which attempt to suppress jail-break behavior through input control or behavior constraints. However, as jail-break attacks grow more complex and prompt designs keep evolving, existing systems still face a number of challenges.

Existing jail-break detection methods have the following shortcomings:

1) Coarse modeling granularity, which makes deep jail-break prompts difficult to identify. Most jail-break detection methods adopt shallow strategies based on word lists or pattern matching and cannot identify violating intent hidden in a prompt through logic nesting, semantic blurring, role playing and similar techniques, so a large number of evasive prompts are missed.
2) Neglect of representation differences inside the model, which leads to insufficient detection robustness. Current detection is mostly based on features of the input text or the output response and lacks deep modeling of the model's intermediate representations during processing (such as embedding layers and multi-layer attention), so systematic deviations between benign inputs and jail-break inputs along the representation path are difficult to capture.
3) Poor generalization, which makes it difficult to adapt to evolving attack strategies. The trigger patterns of jail-break attacks are highly diverse and evolve rapidly, while existing detection models often depend on training with static data and lack a dynamic adaptation mechanism, so their ability to recognize novel attack samples drops markedly.
Jail-break mitigation also faces several problems:

1) Missing semantic intervention mechanism. Most defense methods cannot effectively intervene in the model's intermediate representations, so even when jail-break behavior is detected, its continued propagation along the generation path cannot be prevented.
2) Lack of controllable restoration capability. Once the model's response deviates, there is no "error correction and regeneration" mechanism; even when adversarial training or interpolation intervention is introduced, the scope of its influence is ill-defined and semantic drift or semantic breaks occur easily.

Disclosure of Invention

The invention addresses the limitations of existing jail-break attack detection and defense methods for large language models, in particular coarse detection granularity, insufficient modeling of internal model representations and poor mitigation effect, and provides a large language model jail-break attack detection and mitigation method and system based on multi-level embedding analysis. The method aims to provide a general-purpose safety protection scheme for large language models with strong identification of deep threats, accurate modeling of model representations and good defensive effect, thereby improving the security robustness of mainstream large language models.

The object of the invention is achieved by the following technical scheme: the jail-break attack detection and mitigation method based on multi-level embedding analysis co