CN-121598428-B - Method and device for defending large language model context injection attack

CN121598428BCN 121598428 BCN121598428 BCN 121598428BCN-121598428-B

Abstract

The invention discloses a method and a device for defending large language model context injection attacks, and relates to the technical field of artificial intelligent network safety protection. The method comprises the steps of non-uniformly sampling input contexts, extracting a head system instruction area, a tail user query area and a plurality of middle representative text fragments, capturing structural anomalies in a non-semantic understanding mode by calculating text compression ratio characteristic values and N-gram repetition rate characteristic values of the middle fragments, calculating semantic similarity characteristic values of tail query and the middle fragments, carrying out intention call detection, calculating total risk scores, and executing hierarchical non-invasive intervention from adding safety instructions at the upper part and the lower part at foot tail to applying dynamic Logit bias in a model decoding stage according to different risk grades. The invention realizes the safety detection and defense of the ultra-long context millisecond delay, does not need to modify the internal weight of the model, and has strong universality and low false alarm rate.

Inventors

FANG YINGYING
ZHANG JINYU
YUAN CHONGJIE
GUO SHICHAO

Assignees

国泰海通证券股份有限公司

Dates

Publication Date: 20260512
Application Date: 20260127

Claims (7)

1. A defending method of large language model context injection attack is characterized by comprising the following steps: Non-uniform sampling is carried out on the context input into the large language model, a plurality of Token in the front of a command area of a context head system, a plurality of Token in the back of a user query area of a context tail and a plurality of representative text fragments in a context middle area are respectively extracted; Calculating the text compression ratio of each representative text segment according to the ratio of the original text size to the compressed text size of the representative text segment in the context middle area, extracting the maximum value in the text compression ratio of a plurality of representative text segments as a text compression ratio characteristic value, segmenting the plurality of representative text segments in the context middle area into a plurality of continuous word sequences through an N-gram model, carrying out vectorization processing and clustering on all word sequences, counting the number of word sequences in each clustered class cluster, and taking the ratio of the maximum value of the number of word sequences in all class clusters to the number of all word sequences as an N-gram repetition ratio characteristic value; Calculating semantic similarity characteristic values between a plurality of Token at the rear part of the tail user query area and a plurality of representative text fragments in the context middle area; calculating a total risk score according to the text compression ratio characteristic value, the N-gram repetition rate characteristic value and the characteristic value of the semantic similarity of a plurality of representative text fragments in the context region by the following formula : Wherein, the The text compression ratio characteristic value; is a text compression ratio threshold; is a semantic similarity threshold; For semantic similarity feature values between the last number of tokens of the tail user query area and a plurality of representative text fragments of the context middle area, The query vector is represented as a result of which, A malicious content vector representing suspected injection; The characteristic value of the repetition rate of the N-gram; 、、 Is a weight coefficient; Is a smooth transfer function; And if the total risk score is greater than or equal to a preset threshold value, judging that the attack behavior exists, and executing the intervention action of the attack behavior.
2. The method for defending a large language model context injection attack according to claim 1, wherein if the total risk score is greater than or equal to a preset threshold, the step of judging that an attack exists and executing an intervention action of the attack comprises: if the total risk score is larger than or equal to a first preset threshold value and smaller than a second preset threshold value, splicing preset safety strengthening instructions at the tail of the context; if the total risk score is greater than or equal to the second preset threshold, splicing preset security reinforcement instructions at the end of the context, and simultaneously further reducing the generation probability of Token in the predefined compliance vocabulary in a large language model decoding stage.
3. The method for defending against a large language model context injection attack of claim 1, further comprising: If the first sampling detection finds that the text compression ratio characteristic values and the N-gram repetition rate characteristic values of a plurality of representative text fragments in the context middle area deviate from the normal statistical threshold, expanding the representative text fragments by taking the positions in the context as the reference to form a key detection interval, adopting a sampling step length smaller than that of the first sampling in the key detection interval, extracting a larger number of representative text fragments, and carrying out recalculation of the text compression ratio characteristic values and the N-gram repetition rate characteristic values.
4. The method of claim 2, wherein the step of further reducing the probability of Token generation in the predefined compliance vocabulary during the large language model decoding stage comprises: In the large language model decoding stage, negative Logit bias is applied to Token in the predefined compliance vocabulary, and the strength of the Logit bias is dynamically adjusted according to the following formula: Wherein, the For a scaling factor to be configurable, For said second preset threshold value, Scoring the total risk.
5. A large language model context injection attack defending apparatus, comprising: The structured sampling module is used for carrying out non-uniform sampling on the context input into the large language model, respectively extracting a plurality of Token at the front of a context head system instruction area, a plurality of Token at the back of a context tail user inquiry area and a plurality of representative text fragments in a context middle area; The lightweight characteristic analysis module is used for calculating the text compression ratio of each representative text segment according to the ratio of the original text size to the compressed text size of the representative text segment in the context middle area, extracting the maximum value in the text compression ratio of a plurality of representative text segments as a text compression ratio characteristic value, segmenting the plurality of representative text segments in the context middle area into a plurality of continuous word sequences through an N-gram model, carrying out vectorization processing and clustering on all word sequences, counting the number of word sequences in each class cluster after clustering, and taking the ratio of the maximum value of the number of word sequences in all class clusters to the number of all word sequences as an N-gram repetition ratio characteristic value; the intention call detection module is used for calculating semantic similarity characteristic values between a plurality of tokens at the back of the tail user query area and a plurality of representative text fragments in the context middle area; the comprehensive risk assessment module is used for calculating a total risk score according to the text compression ratio characteristic value, the N-gram repetition rate characteristic value and the characteristic value of the semantic similarity of the plurality of representative text fragments in the context region by the following formula : Wherein, the The text compression ratio characteristic value; is a text compression ratio threshold; is a semantic similarity threshold; For semantic similarity feature values between the last number of tokens of the tail user query area and a plurality of representative text fragments of the context middle area, The query vector is represented as a result of which, A malicious content vector representing suspected injection; The characteristic value of the repetition rate of the N-gram; 、、 Is a weight coefficient; Is a smooth transfer function; And the dynamic reasoning intervention module is used for judging that the attack behavior exists and executing the intervention action of the attack behavior if the total risk score is greater than or equal to a preset threshold value.
6. The apparatus for defending against a large language model context injection attack according to claim 5, wherein the dynamic inference intervention module is further configured to: if the total risk score is larger than or equal to a first preset threshold value and smaller than a second preset threshold value, splicing preset safety strengthening instructions at the tail of the context; if the total risk score is greater than or equal to the second preset threshold, splicing preset security reinforcement instructions at the end of the context, and simultaneously further reducing the generation probability of Token in the predefined compliance vocabulary in a large language model decoding stage.
7. The large language model context injection attack defending apparatus of claim 5, further comprising: And the self-adaptive sampling module is used for expanding the representative text fragments by taking the positions of the representative text fragments in the context as the reference to form a key detection interval if the first sampling detection finds that the text compression ratio characteristic values and the N-gram repetition rate characteristic values of a plurality of representative text fragments in the context intermediate region deviate from the normal statistical threshold, extracting a larger number of representative text fragments by adopting a sampling step length smaller than the first sampling in the key detection interval, and re-calculating the text compression ratio characteristic values and the N-gram repetition rate characteristic values.

Description

Method and device for defending large language model context injection attack Technical Field The invention relates to the technical field of artificial intelligent network security protection, in particular to a method and a device for defending large language model context injection attacks. Background With the rapid development of Large Language Model (LLM) technology, the length of the supported contextual windows continues to increase, extending from tens of thousands to hundreds of thousands or even millions of Token levels. This advancement brings about strong application capabilities, and at the same time, also promotes a new type of security threat, a multi-instance jail-break attack. By injecting a large number of forged question-answer examples in the ultra-long context window, an attacker utilizes the context learning capability of the model to induce the model to ignore the built-in security alignment rules, so that harmful contents are output. Currently, defense techniques against such long window attacks face two major core bottlenecks. First, in terms of detection, the conventional method relies on deep semantic coding or attention analysis of the whole text, and for hundreds of thousands of Token inputs, the computational overhead is huge, resulting in a significant increase in first word generation delay, which cannot meet the severe requirements of online services on real-time. Secondly, in the aspect of defense, the traditional method mostly adopts invasive schemes such as modifying internal parameters (such as fine adjustment and attention weight adjustment) of a model, which not only needs white box access rights of the model, but also is difficult to adapt to a main closed source business model and a standardized reasoning engine which are called by an API, so that engineering floor property of a defense means is poor. Therefore, a technical solution for effectively detecting and defending a long context injection attack with extremely low delay without depending on full scan and modifying internal parameters of a model is needed in the art. Disclosure of Invention In view of the above-mentioned drawbacks or shortcomings in the prior art, the present invention provides a method and apparatus for defending a context injection attack of a large language model, which can delay processing an input long text in millisecond level, and detect and block the context injection attack through a standard inference interface, thereby effectively solving the technical problems mentioned in the background art. The invention provides a defending method of a large language model context injection attack, which comprises the following steps: Non-uniform sampling is carried out on the context input into the large language model, a plurality of Token in the front of a command area of a context head system, a plurality of Token in the back of a user query area of a context tail and a plurality of representative text fragments in a context middle area are respectively extracted; calculating text compression ratio characteristic values and N-gram repetition rate characteristic values of a plurality of representative text fragments of the context middle area in a non-semantic understanding mode; Calculating semantic similarity characteristic values between a plurality of Token at the rear part of the tail user query area and a plurality of representative text fragments in the context middle area; calculating a total risk score according to the text compression ratio characteristic values, the N-gram repetition rate characteristic values and the characteristic values of the semantic similarity of the plurality of representative text fragments in the context middle region; And if the total risk score is greater than or equal to a preset threshold value, judging that the attack behavior exists, and executing the intervention action of the attack behavior. In another aspect of the present invention, there is also provided a defending apparatus for a large language model context injection attack, including: The structured sampling module is used for carrying out non-uniform sampling on the context input into the large language model, respectively extracting a plurality of Token at the front of a context head system instruction area, a plurality of Token at the back of a context tail user inquiry area and a plurality of representative text fragments in a context middle area; the lightweight characteristic analysis module is used for calculating text compression ratio characteristic values and N-gram repetition rate characteristic values of a plurality of representative text fragments in the context middle area in a non-semantic understanding mode; the intention call detection module is used for calculating semantic similarity characteristic values between a plurality of tokens at the back of the tail user query area and a plurality of representative text fragments in the context middle area; The comprehensive risk