CN-121981198-A - Universal indirect prompt injection method for multimodal large models

CN121981198A

Abstract

The invention discloses a universal indirect prompt injection method for multimodal large models. Addressing the non-differentiability problem caused by the mixed modeling of discrete tokens and continuous features in multimodal input, the method introduces a sampling-based gradient estimation mechanism that replaces discrete token selection and embedding-lookup operations with differentiable probability sampling and matrix multiplication, realizing end-to-end adversarial sample optimization. Combined with multi-context expectation optimization and an attention-guidance mechanism, the method controls, both implicitly and explicitly, the model's attention distribution over the multimodal data during the inference stage, significantly improving the generalization of indirect prompt injection under unknown user instructions and complex context conditions. The method is simultaneously applicable to multimodal large models with continuous representations, discrete representations, and continuous-discrete hybrid architectures, and can stably induce the model to generate the target response in multimodal scenarios such as speech and images, thereby systematically revealing the indirect prompt injection security risks that multimodal large models face at inference time.

Inventors

  • CHEN MENG
  • LU LI
  • WANG KUN
  • REN KUI

Assignees

  • Zhejiang University (浙江大学)

Dates

Publication Date
2026-05-05
Application Date
2026-01-14

Claims (9)

  1. A universal indirect prompt injection method for multimodal large models, characterized by comprising the following steps: obtaining multimodal data as a benign sample, initializing an adversarial perturbation as Gaussian noise, and adding it to the multimodal data to construct an adversarial sample; inputting the adversarial sample into the modal encoder to obtain a sequence of hidden vectors, computing the negative Euclidean distance between each hidden vector and each codebook vector, adding Gumbel noise and applying temperature scaling to obtain a probability distribution, and multiplying the probability distribution by the decoder's vocabulary embedding matrix to obtain a multimodal embedding representation; encoding a plurality of text instructions to obtain text embedding representations of the corresponding text tokens, concatenating each text embedding representation with the multimodal embedding representation and feeding the result into the decoder to generate tokens, and computing the expected cross entropy between the generated tokens and the target tokens as the adversarial loss; for each layer and each attention head of the decoder, extracting the attention weights from the target tokens to the multimodal tokens, summing these weights, averaging layer by layer and head by head, and computing a marginal attention loss; computing the L2 distance between the benign sample and the adversarial sample as a perceptual concealment loss; forming a weighted sum of the adversarial loss, the marginal attention loss, and the perceptual concealment loss, back-propagating gradients to compute the gradient sign at the adversarial sample, updating the adversarial sample by gradient descent, and iterating the above process until the maximum number of steps is reached or the target tokens are successfully generated.
  2. The method of claim 1, wherein the multimodal data comprises speech data or image data; the speech data is normalized and mapped to a fixed interval and the image data is normalized and mapped to a fixed interval, serving as the unperturbed benign sample; and the initial adversarial perturbation is random noise of the same dimension as the original multimodal data, drawn from a Gaussian distribution.
  3. The method of claim 1 or 2, wherein the codebook is the multimodal large model's pre-trained codebook used to vectorize continuous multimodal features; the negative Euclidean distance between hidden vectors and codebook vectors serves as the logits for discrete token selection, and a Gumbel-Softmax function is applied to the logits for continuous sampling to obtain the multimodal embedding; in the multimodal embedding computation, discrete one-hot weights are used in the model's forward inference pass to keep the model's original inference behavior consistent, while continuous probability weights are used in the gradient back-propagation pass to achieve end-to-end adversarial optimization.
  4. The method of claim 3, wherein the plurality of text instructions form an auxiliary instruction set that simulates diverse inputs across different user scenarios, and joint optimization under a plurality of different text instruction contexts improves the generalization of the adversarial sample under unknown user instruction conditions.
  5. The method of claim 4, wherein the adversarial loss is defined as the expected value, over the multiple text instruction contexts, of the cross entropy loss between the generated tokens and the target tokens.
  6. The method of claim 1, 2, 4 or 5, wherein the marginal attention loss constrains the attention weight that the model allocates to the multimodal tokens when generating the target tokens to be no lower than a preset threshold.
  7. The method of claim 6, wherein computing the marginal attention loss comprises: during inference of the multimodal large model's decoder, for each layer and each attention head, extracting the attention weights from the target tokens to the multimodal tokens; summing the attention weights over the target-token and multimodal-token dimensions and averaging over layers and attention heads to obtain a global multimodal attention weight; and comparing the global multimodal attention weight against a preset attention threshold, computing the marginal attention loss from their difference so as to penalize insufficient attention to the multimodal tokens when the target tokens are generated.
  8. The method of claim 7, wherein the perceptual concealment loss is obtained by element-wise subtraction of the benign sample from the adversarial sample and computing the L2 distance.
  9. The method of claim 1, 2, 4, 5, 7 or 8, wherein the adversarial loss weight is preferably set to 1, and the marginal attention loss weight and the perceptual concealment loss weight are determined by ablation experiments.
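The differentiable token-selection step of claims 1 and 3 can be illustrated with a minimal NumPy sketch. This is a forward-pass illustration only; the function names, tensor shapes, and demo values are assumptions, and in practice an autograd framework would propagate gradients through the soft probabilities:

```python
import numpy as np

def gumbel_softmax_embed(hidden, codebook, embed_matrix, tau=1.0, rng=None):
    """Differentiable surrogate for discrete codebook token selection.

    hidden:       (T, d) hidden vectors from the modal encoder
    codebook:     (V, d) pre-trained codebook vectors
    embed_matrix: (V, e) decoder vocabulary embedding matrix
    Returns (hard_embed, soft_embed), both of shape (T, e).
    """
    rng = rng or np.random.default_rng(0)
    # Negative squared Euclidean distance to each codebook vector = logits.
    diff = hidden[:, None, :] - codebook[None, :, :]           # (T, V, d)
    logits = -(diff ** 2).sum(-1)                              # (T, V)
    # Add Gumbel noise and temperature-scale, then softmax -> soft one-hot.
    g = -np.log(-np.log(rng.uniform(1e-9, 1.0, logits.shape)))
    z = (logits + g) / tau
    z -= z.max(-1, keepdims=True)                              # numerical stability
    probs = np.exp(z) / np.exp(z).sum(-1, keepdims=True)       # (T, V)
    # Straight-through idea: hard one-hot weights in the forward pass
    # (preserving the model's original inference behavior), continuous
    # probability weights in the backward pass (enabling gradient flow).
    hard = np.eye(codebook.shape[0])[probs.argmax(-1)]         # (T, V)
    # Multimodal embedding = weights @ vocabulary embedding matrix.
    return hard @ embed_matrix, probs @ embed_matrix

# Toy demo: 4 hidden vectors, a 10-entry codebook, embedding dimension 16.
rng = np.random.default_rng(42)
hidden = rng.normal(size=(4, 8))
codebook = rng.normal(size=(10, 8))
embed_matrix = rng.normal(size=(10, 16))
hard_e, soft_e = gumbel_softmax_embed(hidden, codebook, embed_matrix)
```

Note the design intent described in claim 3: the hard one-hot product selects exactly one codebook entry per position, while the soft product is a convex combination of all vocabulary embeddings, which is what makes end-to-end gradient-based optimization of the adversarial sample possible.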

Description

Universal indirect prompt injection method for multimodal large models

Technical Field

The invention belongs to adversarial machine learning within the field of artificial intelligence security, and in particular relates to a universal indirect prompt injection method for multimodal large models.

Background

Multimodal large models are increasingly integrated into agent frameworks and endowed with capabilities such as long-term memory management, tool calling, interaction with external environments, and multi-round autonomous decision making, which significantly expands their autonomy and sphere of influence in complex task scenarios. In this context, multimodal input channels have become an important entry point for attackers to manipulate model behavior, with the more covert indirect prompt injection (Indirect Prompt Injection) attacks being particularly prominent. In multimodal indirect prompt injection, an attacker embeds malicious instructions into speech, images, or other multimodal content that the model must analyze, rather than supplying them as explicit text instructions, so that the model unknowingly executes the attacker's preset implicit instructions while interpreting the multimodal information. Such attacks generally do not require tampering with the user's explicit input and are therefore highly covert and deceptive; once successful, they can induce the multimodal large model to perform unauthorized operations, leak sensitive information, mis-invoke tools, or pollute agent memory over the long term, posing serious security risks to deployed systems. Existing research has verified the feasibility of indirect prompt injection attacks on multimodal large models both theoretically and empirically.
In the image modality, related work mainly embeds hidden instructions into image content through carefully designed text typesetting, visual layout, or adversarial perturbations to induce the model to perform unintended behaviors; in the speech modality, research has shown that hidden instructions can be injected by constructing adversarial audio samples without affecting human perception. However, existing multimodal indirect prompt injection methods still have obvious limitations. First, most rely on a gradient optimization process and are only applicable to model architectures with continuous multimodal token representations; they are difficult to apply directly to multimodal large models that use discrete token encoding, discrete tokenization operations, or hybrid architectures. Second, because multimodal large models are highly sensitive to the user instruction context, existing attacks often depend on specific or known context settings; when facing unknown or changing user instructions, their attack effectiveness and stability degrade markedly and their generalization is insufficient. Therefore, how to design a universal indirect prompt injection method that, with access only to the multimodal data, applies simultaneously to continuous representation models, discrete representation models, and hybrid architecture models, and that further improves stability and generalization under unknown user instruction contexts, has become a key problem limiting the real-world threat effectiveness of indirect prompt injection attacks, and is of great significance for comprehensively evaluating the security risks multimodal large models face in realistic adversarial environments.
Disclosure of Invention

Aiming at the insufficient universality and generalization of existing multimodal indirect prompt injection research, the invention provides a universal indirect prompt injection method for multimodal large models. Based on a sampling-based gradient estimation mechanism, the method replaces the non-differentiable discrete token mapping and embedding-lookup operations in the multimodal large model with differentiable probability sampling and matrix multiplication, realizing end-to-end adversarial sample optimization without depending on modification of model parameters. Furthermore, by controlling the distribution of attention weights between the user instruction and the multimodal input during model inference, the invention guides the model to form a stable attention bias in the generation stage, and, combined with a joint optimization strategy under multi-context conditions, significantly improves the generalization of indirect prompt injection across different user instruction contexts. The invention thereby realizes a universal and efficient multimodal indirect prompt injection method suitable for continuous representation, discrete representation, and continuous-discrete hybrid architectures.
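The attention-guidance and loss-weighting mechanisms described above (claims 1, 6, 7, and 9) can be sketched as follows. This is a minimal NumPy illustration; the function names, the hinge form of the penalty, and all numeric values are assumptions rather than the patent's actual implementation:

```python
import numpy as np

def marginal_attention_loss(attn, mm_idx, tgt_idx, threshold=0.5):
    """Hinge-style penalty on attention mass flowing from target tokens
    to multimodal tokens (sketch of claims 6-7; the hinge form and the
    threshold value 0.5 are illustrative assumptions).

    attn:    (L, H, T, T) decoder attention (layers, heads, query, key)
    mm_idx:  index array of multimodal token positions (keys)
    tgt_idx: index array of target token positions (queries)
    """
    # Attention from each target token (query) to all multimodal tokens (keys).
    sub = attn[:, :, tgt_idx][..., mm_idx]        # (L, H, |tgt|, |mm|)
    mass = sub.sum(-1)                            # multimodal mass per target token
    global_mass = mass.mean()                     # average over layers/heads/targets
    return max(0.0, threshold - global_mass)      # penalize mass below the threshold

def total_loss(l_adv, l_attn, l_perc, w_attn=1.0, w_perc=1.0):
    """Weighted objective of claim 1; the adversarial-loss weight is fixed
    to 1 as in claim 9, and w_attn / w_perc would come from ablations."""
    return l_adv + w_attn * l_attn + w_perc * l_perc

# Toy demo: 2 layers, 2 heads, 6 tokens, uniform attention rows.
# Positions 0-2 act as multimodal tokens, positions 4-5 as target tokens.
attn = np.full((2, 2, 6, 6), 1.0 / 6.0)
mm_idx, tgt_idx = np.array([0, 1, 2]), np.array([4, 5])
l_attn = marginal_attention_loss(attn, mm_idx, tgt_idx, threshold=0.5)
```

With uniform attention, each target token places exactly half of its attention mass on the three multimodal positions, so the global mass meets the 0.5 threshold and the penalty vanishes; any adversarial sample that lets this mass drop below the threshold incurs a positive loss, which is how the optimization steers the model's attention toward the injected multimodal content.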