
CN-121980366-A - Attention-shift-based multimodal large model adversarial sample detection method


Abstract

To address the high cost and poor detection performance of existing adversarial training, the invention deeply analyses the internal attention mechanism of a multimodal large model to stably characterise the difference in attention distribution between the user instruction and the multimodal data for adversarial versus benign samples, thereby reliably distinguishing the two. The method relies only on the attention weight information already produced during model inference, combined with lightweight feature dimensionality reduction and a simple classifier; it requires no additional large-scale model, adversarial training, or parameter fine-tuning, and therefore has low computational and deployment cost. It provides a non-invasive, extensible adversarial security defence scheme for the inference stage of large models and can effectively improve the inference security and robustness of the model in adversarial environments without affecting the original functions of the multimodal large model.

Inventors

  • CHEN MENG
  • LU LI
  • WANG KUN
  • REN KUI

Assignees

  • Zhejiang University (浙江大学)

Dates

Publication Date
2026-05-05
Application Date
2026-01-14

Claims (8)

  1. An attention-shift-based method for detecting adversarial samples against a multimodal large model, comprising: inputting multimodal data and a user instruction into a multimodal large model for inference, and obtaining a response text sequence through autoregressive generation with random-sampling decoding; during inference, computing, layer by layer and head by head, the attention weights from the generated tokens in the response text sequence to the multimodal-data tokens and the user-instruction tokens; based on these attention weights, computing the layer-head attention distributions of the generated tokens over the multimodal data and over the user instruction, respectively, and constructing an attention-offset feature; performing dimensionality reduction on the attention-offset feature to obtain a low-dimensional feature vector; and performing classification on the low-dimensional feature vector to output a detection result labelling the multimodal data as an adversarial sample or a benign sample.
  2. The attention-shift-based multimodal large model adversarial sample detection method of claim 1, wherein the multimodal data comprises voice data or image data and is either a benign sample that has not been attacked or an adversarial sample constructed by an attacker.
  3. The attention-shift-based multimodal large model adversarial sample detection method of claim 1 or 2, wherein computing, layer by layer and head by head, the attention weights from the generated tokens in the response text sequence to the multimodal-data tokens and the user-instruction tokens comprises: during autoregressive generation, computing the attention weight matrix of each layer and each attention head in the model backbone; and, taking the generated tokens as query tokens and the multimodal-data tokens and user-instruction tokens as key tokens, extracting the corresponding attention weight values from the attention weight matrix.
  4. The attention-shift-based multimodal large model adversarial sample detection method of claim 3, wherein the attention weights are Softmax-normalised attention distribution weights.
  5. The attention-shift-based multimodal large model adversarial sample detection method of claim 4, wherein constructing the attention-offset feature comprises: for each layer and each attention head of the model, computing the mean attention weight of the generated tokens over all multimodal-data tokens to obtain the layer-head attention matrix of the multimodal data; for each layer and each attention head, computing the mean attention weight of the generated tokens over all user-instruction tokens to obtain the layer-head attention matrix of the user instruction; and taking the difference of the two layer-head attention matrices and flattening it to obtain the attention-offset feature vector.
  6. The attention-shift-based multimodal large model adversarial sample detection method of claim 1, 2, 4, or 5, wherein the dimensionality reduction uses principal component analysis, comprising: centring the attention-offset feature vectors; computing the covariance matrix of the centred feature vectors and performing eigenvalue decomposition; and, given a set number of principal components, selecting that many principal-component directions as projection bases and projecting the attention-offset feature vectors into a low-dimensional feature space.
  7. The attention-shift-based multimodal large model adversarial sample detection method of claim 6, wherein the number of principal components is preferably 2, yielding a two-dimensional feature vector for adversarial sample detection.
  8. The attention-shift-based multimodal large model adversarial sample detection method of claim 7, wherein the classification uses a support vector machine classifier that is trained on the features of labelled benign and adversarial samples and outputs the corresponding class label during the attack-detection phase.
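The attention extraction and offset-feature construction of claims 3 to 5 can be sketched as follows (a minimal NumPy illustration under assumed tensor shapes, not the patented implementation; `mm_idx` and `instr_idx` are hypothetical index arrays marking which key tokens belong to the multimodal data and to the user instruction):

```python
import numpy as np

def attention_offset_feature(attn, mm_idx, instr_idx):
    """Build the attention-offset feature vector from per-layer,
    per-head attention weights.

    attn      : array of shape (L, H, T_gen, T_key) -- Softmax-normalised
                attention from each generated (query) token to every key
                token, stacked over L layers and H attention heads.
    mm_idx    : indices of the multimodal-data tokens among the keys.
    instr_idx : indices of the user-instruction tokens among the keys.
    """
    # Mean attention of the generated tokens over all multimodal-data
    # tokens: one scalar per (layer, head) -> layer-head matrix A_mm.
    A_mm = attn[:, :, :, mm_idx].mean(axis=(2, 3))        # shape (L, H)
    # Same over all user-instruction tokens -> layer-head matrix A_instr.
    A_instr = attn[:, :, :, instr_idx].mean(axis=(2, 3))  # shape (L, H)
    # Element-wise difference, flattened into the offset feature vector.
    return (A_mm - A_instr).ravel()                       # shape (L*H,)
```

For a model with L layers and H heads this yields an L*H-dimensional feature per response, which is what the later dimensionality-reduction step consumes.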

Description

Attention-shift-based multimodal large model adversarial sample detection method

Technical Field

The invention belongs to adversarial security defence for the model inference stage in the field of computer artificial-intelligence security, and in particular relates to an attention-shift-based adversarial sample detection method for multimodal large models.

Background

In recent years, with the rapid development of multimodal understanding and generation technology, multimodal large models have made breakthrough progress in speech understanding, image understanding, cross-modal reasoning, complex task execution, and other areas, and are widely applied in scenarios such as intelligent assistants, human-machine interaction, and automatic content generation. A multimodal large model usually integrates multiple input interfaces such as text, voice, and image; while this markedly strengthens its functional capability, it also introduces more complex and covert adversarial security problems. Existing research shows that adversarial samples targeting multimodal large models are the key technical basis of security threats such as prompt injection attacks, model jailbreak attacks, and model stealing attacks. Without changing the model's structure or parameters, an attacker can induce unintended model behaviour through carefully constructed voice, image, or cross-modal inputs, seriously threatening the safety and reliability of the model's inference stage. To guarantee the inference security of multimodal large models in adversarial environments, adversarial defence and detection techniques are therefore of great practical significance.
However, existing methods still have obvious shortcomings. On the one hand, the parameters of a multimodal large model are large in scale, so adversarial training or adversarial fine-tuning incurs high computation and storage costs and is difficult to roll out in real deployments. On the other hand, multimodal large models usually follow an autoregressive text-generation paradigm and adopt decoding strategies such as random sampling at inference time, so the output text is highly uncertain, making it difficult to extract stable, repeatable features from the generated output for adversarial sample detection. How to mine highly distinguishable and stable adversarial-sample detection signals without modifying model parameters, and to design a lightweight, efficient detection method that provides reliable inference-stage security for multimodal large models, is therefore an urgent technical problem.

Disclosure of Invention

Addressing the adversarial security vulnerabilities of existing multimodal large models, the invention provides an attention-shift-based adversarial sample detection method for multimodal large models. The method exploits the interpretability of the attention mechanism inside a multimodal large model: by analysing the difference in the model's attention distribution over the user instruction and the multimodal data during inference, it constructs highly distinguishable adversarial detection features and, combined with feature dimensionality reduction and a support vector machine classifier, achieves stable and efficient detection of adversarial samples against multimodal large models.
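The principal-component dimensionality reduction used for the detection features can be sketched step by step in NumPy (a minimal illustration of centring, covariance, eigendecomposition, and projection, not the patented implementation; the array shapes are assumptions):

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project feature vectors (rows of X) onto their top principal
    components: centring, covariance matrix, eigenvalue decomposition,
    then projection onto the leading directions."""
    Xc = X - X.mean(axis=0)                   # centre the feature vectors
    cov = np.cov(Xc, rowvar=False)            # covariance of centred data
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalue decomposition
    order = np.argsort(eigvals)[::-1]         # sort by descending variance
    basis = eigvecs[:, order[:n_components]]  # principal directions
    return Xc @ basis                         # low-dimensional features
```

With `n_components=2` this reproduces the two-dimensional feature space that the classifier operates on; the first projected coordinate carries at least as much variance as the second by construction.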
The technical scheme of the invention is as follows. The invention discloses an attention-shift-based adversarial sample detection method for multimodal large models, comprising the following steps: inputting multimodal data and a user instruction into a multimodal large model for inference, and obtaining a response text sequence through autoregressive generation with random-sampling decoding; during inference, computing, layer by layer and head by head, the attention weights from the generated tokens in the response text sequence to the multimodal-data tokens and the user-instruction tokens; based on these attention weights, computing the layer-head attention distributions of the generated tokens over the multimodal data and over the user instruction, respectively, and constructing an attention-offset feature; performing dimensionality reduction on the attention-offset feature to obtain a low-dimensional feature vector; and performing classification on the low-dimensional feature vector to output a detection result labelling the multimodal data as an adversarial sample or a benign sample. As a further improvement, the multimodal data of the present invention comprises voice data or image data, the multimodal data being a benign sample that has not been attacked or an adversarial sample constructed by an attacker. As a further improvement
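The dimensionality-reduction and classification stages can be sketched end to end with scikit-learn (an illustrative pipeline on synthetic offset features, not the patented implementation; the feature dimension of 32 and the Gaussian toy data are assumptions standing in for features extracted from real model attention):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical training set: one attention-offset feature vector per
# sample (rows); labels 0 = benign, 1 = adversarial. Real features
# would come from the attention-extraction step described above.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.1, size=(50, 32)),   # benign samples
    rng.normal(0.5, 0.1, size=(50, 32)),   # adversarial samples
])
y = np.array([0] * 50 + [1] * 50)

# PCA down to 2 components (the patent's preferred setting),
# followed by a support vector machine classifier.
detector = make_pipeline(PCA(n_components=2), SVC(kernel="rbf"))
detector.fit(X, y)

# Attack-detection phase: classify a new offset feature vector.
pred = detector.predict(rng.normal(0.5, 0.1, size=(1, 32)))
```

Because the detector consumes only features derived from attention weights that the model already computes, it can run alongside inference without touching model parameters, matching the non-invasive deployment the text describes.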