CN-121980563-A - Backdoor detection method and system based on attention anomalies of a text-to-image model

Abstract

The invention provides a backdoor detection method and system based on attention anomalies of a text-to-image model. The method comprises: acquiring a dataset under test and a pre-trained text-to-image model, wherein the text-to-image model was pre-trained on a dataset containing backdoor samples; generating an image from each prompt text in the dataset under test using the pre-trained text-to-image model, and extracting the cross-attention map sequence of each prompt text at each denoising time step within a preset denoising time step interval; calculating a first decision parameter and a second decision parameter of the prompt text from its cross-attention map sequences; and, if either the first decision parameter or the second decision parameter is less than or equal to a preset threshold, taking the sample to which the prompt text belongs as a backdoor sample. The invention overcomes the limitations of conventional detection methods based on static features, effectively identifies novel backdoor attacks with stronger concealment, and improves the robustness and generalization capability of detection.
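
The first decision parameter and the final decision rule summarized above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the `maps[t][k]` layout (the cross-attention map of Token `k` at the `t`-th step of the interval, stored as a list of rows), the caller-supplied `semantic_indices`, and the sign convention (end Token total minus semantic mean) are hypothetical choices made for the example. Variations are summed only over consecutive steps inside the supplied interval.

```python
import math


def frobenius_norm(matrix):
    """Frobenius norm of a 2-D list of numbers."""
    return math.sqrt(sum(v * v for row in matrix for v in row))


def map_variation(prev_map, cur_map):
    """Frobenius norm of the element-wise difference of two equally sized maps."""
    diff = [
        [c - p for p, c in zip(prev_row, cur_row)]
        for prev_row, cur_row in zip(prev_map, cur_map)
    ]
    return frobenius_norm(diff)


def total_variation(maps, token):
    """Sum of step-to-step map variations of one Token over the interval.

    Assumed layout: maps[t][token] is the map of `token` at step t.
    """
    return sum(
        map_variation(maps[t - 1][token], maps[t][token])
        for t in range(1, len(maps))
    )


def first_parameter(maps, end_index, semantic_indices):
    """End-Token total variation minus the mean over the semantic Tokens.

    Which Tokens count as semantic is decided by the caller (assumption).
    """
    end_total = total_variation(maps, end_index)
    semantic_mean = sum(
        total_variation(maps, i) for i in semantic_indices
    ) / len(semantic_indices)
    return end_total - semantic_mean


def is_backdoor(first_param, second_param, threshold):
    """Flag the sample if either parameter is at or below the threshold."""
    return first_param <= threshold or second_param <= threshold
```

In this sketch a prompt whose end Token map barely changes while its semantic Token maps change strongly yields a negative first parameter, which the rule flags for any threshold at or above that value.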

Inventors

WANG ZHONGQI
ZHANG JIE
SHAN SHIGUANG
CHEN XILIN

Assignees

Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所)

Dates

Publication Date: 20260505
Application Date: 20251217

Claims (10)

1. A backdoor detection method based on attention anomalies of a text-to-image model, for identifying backdoor samples in a dataset under test, the method comprising: S1, acquiring a dataset under test and a pre-trained text-to-image model, wherein the dataset under test comprises a plurality of samples each consisting of a prompt text and an image, and the text-to-image model was pre-trained on a dataset containing backdoor samples; S2, generating an image from each prompt text in the dataset under test using the pre-trained text-to-image model, and extracting the cross-attention map sequence of each prompt text at each denoising time step within a preset denoising time step interval, wherein each cross-attention map sequence comprises the cross-attention map of each Token of one prompt text at the corresponding denoising time step; S3, based on all cross-attention map sequences of each prompt text, obtaining the difference between the total variation of the cross-attention map of the end Token of the prompt text within the preset denoising time step interval and the mean of the total variations of the cross-attention maps of all semantic Tokens, and taking the difference as a first decision parameter; S4, based on all cross-attention map sequences of each prompt text, obtaining the difference between the total variation of the state of the end Token of the prompt text within the preset denoising time step interval and the mean of the total variations of the states of all semantic Tokens, and taking the difference as a second decision parameter; S5, if either the first decision parameter or the second decision parameter of the prompt text is less than or equal to a preset threshold, determining that the sample to which the prompt text belongs is a backdoor sample.
2. The method according to claim 1, wherein S3 comprises: acquiring the variation of the cross-attention map of each Token at each denoising time step within the preset denoising time step interval, wherein the variation of the cross-attention map of a Token at a denoising time step is the Frobenius norm of the difference between its cross-attention maps at the current denoising time step and at the previous denoising time step; and determining the first decision parameter of the prompt text by a first preset method, based on the variations of the cross-attention maps of all Tokens of the prompt text at each time step within the preset denoising time step interval.
3. The method according to claim 2, wherein the first preset method is: $D_1 = V_{\mathrm{end}} - \bar{V}_{\mathrm{sem}}$, with $V_{\mathrm{end}} = \sum_{t=t_s}^{t_e} \delta_t^{\mathrm{end}}$, wherein $D_1$ denotes the first decision parameter; $t_s$ and $t_e$ denote, in order, the first and the last denoising time step of the preset denoising time step interval; $\delta_t^{\mathrm{end}}$ denotes the variation of the cross-attention map of the end Token of the prompt text at the $t$-th denoising time step within the preset interval; $V_{\mathrm{end}}$ denotes the total variation of the cross-attention map of the end Token of the prompt text within the preset interval; $L$ denotes the total number of Tokens of the prompt text; $\delta_t^{i}$ denotes the variation of the cross-attention map of the $i$-th semantic Token of the prompt text at the $t$-th denoising time step within the preset interval; and $\bar{V}_{\mathrm{sem}}$ denotes the mean, over all semantic Tokens $i$ of the prompt text, of the total variation $\sum_{t=t_s}^{t_e} \delta_t^{i}$ of the corresponding cross-attention map within the preset interval.
4. The method according to claim 1, wherein S4 comprises: constructing a state transition model of the prompt text based on all cross-attention map sequences of the prompt text, wherein the state transition model predicts the states of all Tokens of the prompt text at the next denoising time step from their states at the previous denoising time step; acquiring the initial states of all Tokens of the prompt text, wherein the initial state of each Token is the Frobenius norm of the cross-attention map of the Token at the first denoising time step of the preset denoising time step interval; taking the initial states of all Tokens as input and predicting, step by step with the constructed state transition model, the states of all Tokens at each denoising time step within the preset interval; acquiring the state variation of each Token of the prompt text at each denoising time step within the preset interval, wherein the state variation of a Token at a denoising time step is the difference between the states of the Token at the current denoising time step and at the previous denoising time step; and determining the second decision parameter of the prompt text by a second preset method, based on the state variations of all Tokens of the prompt text at each time step within the preset denoising time step interval.
5. The method of claim 4, wherein the state transition model is defined in terms of the following quantities: the states of all Tokens of the prompt text at a given denoising time step within the preset denoising time step interval; the states of all Tokens of the prompt text at the adjacent denoising time step; a preset L-dimensional diagonal matrix; a coupling strength parameter restricted to a specified value range; and a graph Laplacian matrix constructed from the cross-attention map sequence of the given denoising time step.
6. The method of claim 5, wherein the second preset method is: $D_2 = S_{\mathrm{end}} - \bar{S}_{\mathrm{sem}}$, with $S_{\mathrm{end}} = \sum_{t=t_s}^{t_e} \sigma_t^{\mathrm{end}}$, wherein $D_2$ denotes the second decision parameter; $t_s$ and $t_e$ denote, in order, the first and the last denoising time step of the preset denoising time step interval; $\sigma_t^{\mathrm{end}}$ denotes the state variation of the end Token at the $t$-th denoising time step within the preset interval; $S_{\mathrm{end}}$ denotes the total variation of the state of the end Token of the prompt text within the preset interval; $L$ denotes the total number of Tokens of the prompt text; $\sigma_t^{i}$ denotes the state variation of the $i$-th semantic Token at the $t$-th denoising time step within the preset interval; and $\bar{S}_{\mathrm{sem}}$ denotes the mean, over all semantic Tokens $i$ of the prompt text, of the total state variation $\sum_{t=t_s}^{t_e} \sigma_t^{i}$ within the preset interval.
7. The method of any of claims 1-6, wherein the preset denoising time step interval comprises a specified leading portion of all denoising time steps of the text-to-image model.
8. The method according to claim 1, wherein the preset threshold is any value within a specified interval.
9. A backdoor detection system for implementing the method of any of claims 1-8, the system comprising a text-to-image model, a cross-attention map sequence extraction module, a first detection module, a second detection module, and a backdoor sample decision module, wherein: the text-to-image model is used for generating an image from each prompt text in the dataset under test; the cross-attention map sequence extraction module is used for extracting, from the text-to-image model, the cross-attention map sequence of each prompt text in the dataset under test at each denoising time step within a preset denoising time step interval; the first detection module is used for determining, based on all extracted cross-attention map sequences of each prompt text, the difference between the total variation of the cross-attention map of the end Token of the prompt text within the preset denoising time step interval and the mean of the total variations of the cross-attention maps of all semantic Tokens, and taking the difference as the first decision parameter of the prompt text; the second detection module is used for determining, based on all extracted cross-attention map sequences of each prompt text, the difference between the total variation of the state of the end Token of the prompt text within the preset denoising time step interval and the mean of the total variations of the states of all semantic Tokens, and taking the difference as the second decision parameter of the prompt text; and the backdoor sample decision module is used for determining that the sample to which a prompt text belongs is a backdoor sample when either the first decision parameter or the second decision parameter of the prompt text is less than or equal to a preset threshold.
10. A computer readable storage medium, having stored thereon a computer program executable by a processor to implement the steps of the method of any one of claims 1 to 8.
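
The state-based second decision parameter described in claims 4 and 6 can be sketched as follows. This is a hedged illustration only: the state transition model is passed in as a caller-supplied `step_fn` rather than reproduced, `toy_step` is a purely hypothetical stand-in unrelated to the claimed model, and all function names are invented for the example. Taking the absolute value of each step-to-step state difference is also an assumption, made so that the summed variation does not simply telescope to the difference between the last and first states.

```python
def rollout(initial_states, step_fn, num_steps):
    """Predict Token states step by step from the initial states.

    step_fn maps the list of all Token states at one step to the list at
    the next step; it stands in for the state transition model.
    """
    states = [list(initial_states)]
    for _ in range(num_steps):
        states.append(list(step_fn(states[-1])))
    return states


def total_state_variation(states, token):
    """Sum of absolute step-to-step state changes of one Token (assumption)."""
    return sum(
        abs(states[t][token] - states[t - 1][token])
        for t in range(1, len(states))
    )


def second_parameter(states, end_index, semantic_indices):
    """End-Token total state variation minus the mean over semantic Tokens."""
    end_total = total_state_variation(states, end_index)
    semantic_mean = sum(
        total_state_variation(states, i) for i in semantic_indices
    ) / len(semantic_indices)
    return end_total - semantic_mean


def toy_step(states):
    """Hypothetical placeholder dynamics: every state halves at each step."""
    return [0.5 * s for s in states]
```

For example, `rollout([8.0, 4.0], toy_step, 2)` produces three state vectors, and `second_parameter` then compares the end Token's accumulated change with the semantic average. In practice the initial states would be the Frobenius norms of the first-step cross-attention maps, as claim 4 describes.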

Description

Backdoor detection method and system based on attention anomalies of a text-to-image model

Technical Field

The invention relates to the technical field of artificial intelligence security, in particular to backdoor detection for deep learning, and more particularly to a backdoor detection method and system based on attention anomalies of a text-to-image model.

Background

In recent years, text-to-image diffusion models have been widely used in fields such as art design, healthcare, and general-purpose tasks owing to their excellent image synthesis capability. Their successful application has also driven the growth of many open-source platforms, attracting large numbers of users to upload and download third-party models for secondary development. However, the rapid spread of text-to-image diffusion models has been accompanied by a sharp rise in backdoor risk: an attacker can silently implant a text-level backdoor during training or fine-tuning through loss manipulation (Rickrolling, twT and the like) or model editing (EvilEdit and the like). Such a backdoor preserves the generation quality for normal prompts yet is activated instantly by a specific trigger text, forcing the model to output malicious images preset by the attacker and thereby hijacking generation in a targeted, concealed, and hard-to-perceive manner.
Faced with this text backdoor threat, the industry still relies mainly on backdoor detection approaches designed for discriminative models. The main existing techniques extract frequency-domain or spatial-domain features of backdoor and benign image test samples, scan across categories, and observe abnormal behavior of backdoor samples or models. However, unlike a discriminative model, a generative model (such as a text-to-image diffusion model) has no fixed category output: its output is a high-dimensional continuous image distribution, and there are no classification probabilities to compare directly, so conventional category response analysis is difficult to apply. Among current defenses for text-to-image diffusion models, researchers have proposed backdoor sample detection based on output diversity. These existing methods mainly identify static anomalies of backdoor samples in attention structure or output diversity; although they are reasonably effective on early, typical backdoor samples, they cannot adapt to the more concealed backdoor attack algorithms that have appeared in recent years.

In summary, existing backdoor detection schemes rely on static features and struggle with novel concealed attacks, so a backdoor sample detection scheme with greater discriminative power and adaptability is needed.

It should be noted that this background section only describes information related to the invention to facilitate understanding of its technical solution, and does not mean that the information is necessarily prior art.
Where there is no evidence that the related information was disclosed before the filing date of the present application, it should not be considered prior art.

Disclosure of Invention

Therefore, an object of the present invention is to overcome the above drawbacks of the prior art and to provide a backdoor detection method and system based on attention anomalies of a text-to-image model.

The object of the invention is achieved by the following technical solutions:

According to a first aspect of the invention, a backdoor detection method based on attention anomalies of a text-to-image model is provided for identifying backdoor samples in a dataset under test, the method comprising: S1, acquiring a dataset under test and a pre-trained text-to-image model, wherein the dataset under test comprises a plurality of samples each consisting of a prompt text and an image, and the text-to-image model was pre-trained on a dataset containing backdoor samples; S2, generating an image from each prompt text in the dataset under test using the pre-trained text-to-image model, and extracting the cross-attention map sequence of each prompt text within a preset denoising time step interval, wherein each cross-attention map sequence comprises the cross-attention map of each Token of one prompt text at the corresponding denoising time step; S3, obtaining the total variation of the cross-attention map corresponding to the end Token of the prompt text within the preset denoising time step interval and the cross-attention map corresponding to the p