CN-122027256-A - Multi-module collaborative large language model jail-breaking method and system based on inconsistent characterization
Abstract
The invention relates to the technical field of large language model safety protection and countermeasures, in particular to a multi-module collaborative large language model jail-breaking method and system based on inconsistent characterization. The method comprises: S1, an attack initialization process, in which the system receives a target malicious task or problem description and fills a thinking-chain hijack template to generate an initial attack prompt text; S2, a disturbance attack generation process, in which a hierarchical disturbance injection mechanism is set and the optimal disturbance scheme is searched across levels and intensities; S3, an attack evaluation process, which judges whether the generated attack prompt successfully bypasses the guard rail and evaluates the harmfulness of the attack on the target large language model; and S4, an iterative optimization process, which is carried out when the prompt fails to penetrate the guard rail, or the harmfulness after penetration is insufficient, and the hierarchical disturbance injection mechanism has reached the maximum disturbance level. The invention realizes a coordinated breakthrough of the double defenses formed by the guard rail and the safety alignment of the large language model.
Inventors
- LIU CHUANYI
- ZHU ZHIBIN
- WANG JUANYI
- HUANG WENJIE
Assignees
- Harbin Institute of Technology (Shenzhen) (Harbin Institute of Technology Shenzhen Institute of Science and Technology Innovation) [哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院)]
Dates
- Publication Date: 20260512
- Application Date: 20260129
Claims (10)
- 1. A multi-module collaborative large language model jail-breaking method based on inconsistent characterization, characterized by comprising the following steps: S1, an attack initialization process, in which the system receives a target malicious task or a user-defined problem description, and an auxiliary model fills a preset thinking-chain hijack template according to a harmful thinking chain to generate an initial attack prompt text; S2, a disturbance attack generation process, in which a hierarchical disturbance injection mechanism is set and the optimal disturbance scheme is searched across levels and intensities, the hierarchical disturbance injection mechanism comprising character disturbance attack generation, a safe continuation mechanism, an adaptive harmful content enhancement strategy and a hierarchical scheduler; S3, an attack evaluation process, which judges whether the generated attack prompt successfully bypasses the guard rail and evaluates the harmfulness of the generated attack after penetrating the target large language model; S4, an iterative optimization process, in which, when the attack evaluation result shows that the guard rail has not been penetrated, or that the harmfulness after penetration is insufficient, and the hierarchical disturbance injection mechanism has reached the maximum disturbance level, the attack flow enters the reinforcement stage and the attack is further iteratively reinforced.
- 2. The multi-module collaborative large language model jail-breaking method based on inconsistent characterization according to claim 1, characterized in that in step S2, character disturbance attack generation specifically comprises detecting and disturbing malicious fragments layer by layer at different granularities, so as to generate a controllable perturbed input that destroys the classification features of the guard rail while its semantics remain recoverable by the large language model.
- 3. The multi-module collaborative large language model jail-breaking method based on inconsistent characterization according to claim 2, wherein in step S2 the character disturbance attack generation process specifically comprises: a1. injection level selection and harmful fragment detection, in which the initialized attack prompt is subjected to fragment analysis and the attack-relevant semantic regions in the text are located, the hierarchical character disturbance injection function being P̃ = Ψ(P, ℓ, σ), where P denotes the initial attack prompt, ℓ denotes the injection level, σ denotes the perturbation intensity, and Φ is a hierarchical harmful-localization operator used to identify the target fragment of the local disturbance; when ℓ = 1, a lightweight model is adopted, and when ℓ = 2, a larger-scale model is invoked; a2. disturbance intensity selection and character disturbance injection, in which, after the harmful fragments are determined, a suitable perturbation intensity σ is selected based on guard rail feedback and the scheduling strategy, the disturbance methods being as follows: when σ = σ₁ (weak disturbance), bidirectional text injection is adopted, reversing the character order within the text; when σ = σ₂ (strong disturbance), flipped text injection is adopted, rotating the text 180 degrees.
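The two perturbation modes named in a2 amount to plain string transforms. The following is an illustrative sketch, not the patent's implementation: the function names, the partial flip table, and the segment-replacement step are assumptions; "bidirectional text injection" is read as reversing the character order, and "flipped text injection" as additionally mapping characters to upside-down Unicode counterparts.

```python
# Partial mapping from ASCII letters to rotated-180-degree Unicode lookalikes;
# a full table would cover the whole alphabet (illustrative subset only).
FLIP_TABLE = str.maketrans("abdehmnpquw", "ɐqpǝɥɯudbnʍ")

def bidirectional_injection(segment: str) -> str:
    """Weak perturbation (sigma_1): reverse the character order of the segment."""
    return segment[::-1]

def flipped_injection(segment: str) -> str:
    """Strong perturbation (sigma_2): rotate the segment 180 degrees, i.e.
    reverse it and substitute each character with its upside-down counterpart."""
    return segment[::-1].translate(FLIP_TABLE)

def inject(prompt: str, segment: str, strength: str) -> str:
    """Apply the selected perturbation to one localized fragment of the prompt
    (the fragment is assumed to come from the harmful-localization operator)."""
    transform = bidirectional_injection if strength == "weak" else flipped_injection
    return prompt.replace(segment, transform(segment))
```

For example, `inject("discuss the chemistry question", "chemistry", "weak")` replaces the located fragment with `"yrtsimehc"`, which a guard-rail classifier may no longer match while a large language model can still restore it.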
- 4. The multi-module collaborative large language model jail-breaking method based on inconsistent characterization according to claim 1, wherein in step S2 the safe continuation mechanism specifically consists in automatically appending semantically neutral, convergent summary language to the generated perturbed text after character disturbance attack generation.
- 5. The multi-module collaborative large language model jail-breaking method based on inconsistent characterization according to claim 4, wherein the safe continuation mechanism in step S2 appends, given the perturbed text P̃, a benign semantic closure at its end: P_safe = P̃ ⊕ C, where ⊕ denotes the concatenation operation and C is a forward-looking summary fragment independent of the specific content.
- 6. The multi-module collaborative large language model jail-breaking method based on inconsistent characterization according to claim 1, wherein in step S2 the adaptive harmful content enhancement strategy specifically consists in selecting explicitly or implicitly guided variants through semantic guidance prompts according to the harmfulness of the model output.
- 7. The multi-module collaborative large language model jail-breaking method based on inconsistent characterization according to claim 6, wherein in step S2 the adaptive harmful content enhancement strategy defines two complementary variants, namely an explicitly guided variant V_exp, into which reverse-engineering instructions can be inserted, and an implicitly guided variant V_imp, in which only the perturbations are preserved; the final attack prompt is constructed as Prompt = V(P̃) ⊕ C, where C is the safe continuation mechanism, P̃ is the given perturbed text, and V denotes the variant, taking its value from the explicitly guided variant V_exp and the implicitly guided variant V_imp; the constructed Prompt is passed to the attack evaluation process for harmfulness evaluation, and if the harmfulness score does not reach the preset threshold, the model is considered to have sufficiently strong resistance to character interference.
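The construction Prompt = V(P̃) ⊕ C in claims 5 and 7 reduces to string concatenation around the chosen variant. A minimal sketch with placeholder strings; every name and placeholder text below is an illustrative assumption, not the patent's wording:

```python
# Placeholder stand-ins; the real template texts are not specified here.
SAFE_CONTINUATION = "<semantically neutral, convergent summary C>"
REVERSE_INSTRUCTION = "<instruction asking the model to restore the perturbed fragments>"

def explicit_variant(perturbed: str) -> str:
    """V_exp: the variant that inserts a reverse-engineering instruction."""
    return REVERSE_INSTRUCTION + "\n" + perturbed

def implicit_variant(perturbed: str) -> str:
    """V_imp: the variant that preserves only the perturbations themselves."""
    return perturbed

def build_prompt(perturbed: str, variant) -> str:
    """Prompt = V(perturbed) (+) C, with (+) as plain concatenation."""
    return variant(perturbed) + "\n" + SAFE_CONTINUATION
```

Passing `explicit_variant` or `implicit_variant` to `build_prompt` selects between the two complementary variants; the adaptive strategy of claim 6 would choose between them based on the harmfulness score.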
- 8. The multi-module collaborative large language model jail-breaking method based on inconsistent characterization according to claim 1, wherein in step S2 the hierarchical scheduler coordinates all components by searching over the injection level ℓ and the perturbation intensity σ; the hierarchical scheduler follows a minimal-perturbation-first principle, i.e. it starts from the lowest-cost configuration (ℓ₁, σ₁) and raises the parameters gradually, and only when necessary; the cases regarded as necessary include: during the search, if a candidate prompt is intercepted by the guard rail, the hierarchical scheduler raises the parameters to the next configuration; otherwise, if the guard rail allows the prompt but the resulting harmfulness score is still below the preset threshold, the hierarchical scheduler further escalates and switches to the alternative variant; when an attack prompt passes the guard rail and reaches the harmfulness evaluation threshold, the process terminates early and the hierarchical scheduler returns a successful adversarial sample.
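The scheduler's minimal-perturbation-first search can be sketched as a loop over configurations with early termination. In this sketch, `guard_blocks` and `harm_score` are abstract stand-ins for the two decision logics of claim 9; the function names, the level/strength values and the threshold are all illustrative assumptions:

```python
LEVELS = [1, 2]                 # injection level: lightweight vs larger locator model
STRENGTHS = ["weak", "strong"]  # perturbation intensity, lowest cost first
HARM_THRESHOLD = 0.5            # preset harmfulness threshold (placeholder value)

def schedule(prompt, perturb, guard_blocks, harm_score, variants):
    """Walk configurations from the lowest-cost one upward, escalating only
    when a candidate is intercepted or scores below the threshold, and
    terminating early once a candidate passes both checks."""
    for level in LEVELS:
        for strength in STRENGTHS:
            for variant in variants:
                candidate = variant(perturb(prompt, level, strength))
                if guard_blocks(candidate):
                    continue  # intercepted: move on to the next configuration
                if harm_score(candidate) >= HARM_THRESHOLD:
                    return candidate  # successful adversarial sample, stop early
                # allowed but below threshold: the loop tries the alternative
                # variant next, then escalates strength and level
    return None  # maximum disturbance level reached: hand off to S4 reinforcement
```

Returning `None` corresponds to the condition in S4 under which the attack flow enters the reinforcement stage.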
- 9. The multi-module collaborative large language model jail-breaking method based on inconsistent characterization according to claim 1, wherein step S3 comprises two types of decision logic: b1. guard rail evaluation, which checks whether the input-side guard rail returns an interception or risk warning for the current prompt, determining whether the disturbance effectively evades detection; b2. harmfulness evaluation, which, when the prompt successfully passes the guard rail, evaluates the harmfulness of the content generated by the large language model, including whether the original attack intention is restored and whether the harmfulness of the content reaches the preset threshold.
- 10. A multi-module collaborative large language model jail-breaking system based on inconsistent characterization, characterized by comprising the following modules: an attack initialization module, by which the system receives a target malicious task or user-defined problem description, and an auxiliary model fills a preset thinking-chain hijack template according to a harmful thinking chain to generate an initial attack prompt text; a disturbance attack generation module, which sets a hierarchical disturbance injection mechanism and searches for the optimal disturbance scheme across levels and intensities, the hierarchical disturbance injection mechanism comprising a character disturbance attack generation sub-module, a safe continuation sub-module, an adaptive harmful content enhancement strategy sub-module and a hierarchical scheduling sub-module; an attack evaluation module, which judges whether the generated attack prompt successfully bypasses the guard rail and evaluates the harmfulness of the generated attack after penetrating the target large language model; and an iterative optimization module, which further iteratively reinforces the attack when the attack evaluation result shows that the prompt fails to penetrate the guard rail, or that the harmfulness after penetration is insufficient, and the hierarchical disturbance injection mechanism has reached the maximum disturbance level.
Description
Multi-module collaborative large language model jail-breaking method and system based on inconsistent characterization
Technical Field
The invention relates to the technical field of large language model safety protection and countermeasures, in particular to a multi-module collaborative large language model jail-breaking method and system based on inconsistent characterization.
Background
Large language models (such as ChatGPT, DeepSeek and Gemini) are widely applied in fields such as search engines, intelligent customer service and content creation by virtue of their strong language understanding, reasoning and generation capabilities, and have profoundly changed the modes of human-machine interaction. However, the security and reliability problems of large language models are also becoming increasingly prominent: since their training data originates from the open internet and model behavior is driven primarily by probabilistic prediction, large models may generate non-compliant content containing bias, discrimination, false information, or even harmful and illegal material. This security problem is exploited by jail-break attacks, which use carefully constructed prompts to trigger harmful responses. To cope with jail-break attacks, the security defense mechanism of current large language models is constructed around the concept of combining internal and external defenses, forming a defense system of "guard rail + large language model safety alignment". Although many attack modes have been proposed, the existing attack modes are designed for a single defense layer, and little attention is paid to the linkage effect between the two layers, so that no end-to-end attack scheme capable of penetrating both the inner and outer layers of defense is currently available.
Studying the characterization asymmetry between the guard rail and the large model, and proposing an attack method that breaks through the guard rail and the large model simultaneously, helps to deeply understand the essential security vulnerabilities of large models and their protection systems, and provides an empirical basis and direction guidance for the improvement of subsequent defense mechanisms. Current jail-breaking attack techniques mainly comprise template-based jail-breaking prompt generation methods, optimization-based search methods, gradient methods based on adversarial suffixes, hijack attack methods based on thinking chains, and disturbance attack methods based on character injection. These techniques achieve a certain effect in breaking through the protection of large language models, but have the following disadvantages: (1) semantic attack methods are easily intercepted directly by the input-side guard rail; (2) character-injection disturbances can only bypass the shallow detection of the guard rail, but are recognized and refused by the strong semantic restoration capability of the large language model; (3) in the face of a "guard rail + large language model safety alignment" defense system, an effective adversarial attack strategy is lacking; (4) many methods rely on model gradients or internal structures and are difficult to apply under the black-box conditions of real commercial environments; (5) existing methods lack a multi-stage disturbance mechanism capable of adaptive escalation, so the attack effect is unstable.
Disclosure of Invention
The invention provides a multi-module collaborative large language model jail-breaking method and system based on inconsistent characterization, which target the "guard rail + large language model safety alignment" defense system, creatively propose a jail-breaking attack method based on the asymmetric characterization of the guard rail and the large model, and realize a coordinated breakthrough of the double defenses formed by the guard rail and the safety alignment of the large language model. The invention provides a multi-module collaborative large language model jail-breaking method based on inconsistent characterization, comprising the following steps: S1, an attack initialization process, in which the system receives a target malicious task or a user-defined problem description, and an auxiliary model fills a preset thinking-chain hijack template according to a harmful thinking chain to generate an initial attack prompt text; S2, a disturbance attack generation process, in which a hierarchical disturbance injection mechanism is set and the optimal disturbance scheme is searched across levels and intensities, the hierarchical disturbance injection mechanism comprising character disturbance attack generation, a safe continuation mechanism, an adaptive harmful content enhancement strategy and a hierarchical scheduler; S3, an attack evaluation process, which judges whether the generated attack prompt successfully bypasses the guard rail and evaluates the damage effect of the gen