CN-121980270-A - Method, device, processor and medium for generating visual language model adversarial attack samples based on cross-modal consistency perturbation generation


Abstract

The invention relates to a method for generating adversarial attack samples against visual language models based on cross-modal consistency perturbation generation, comprising the following steps: preprocessing the input image and text and extracting their respective feature information; obtaining the model's intermediate representations and cross-modal association information; constructing a joint objective function L_total; computing the gradients of the image and text modalities, ∇δx L_total and ∇δy L_total, respectively, by back propagation; realizing cross-modal gradient feedback; iteratively updating until a termination condition is met; outputting the adversarial sample after superimposing the perturbation; and evaluating the adversarial effect by computing the confidence and misjudgment rate output by the model. By adopting the method, device, processor and computer-readable storage medium disclosed by the invention, the overall success rate of adversarial attacks is remarkably improved, the discrimination capability of the target model is effectively disrupted, and the application range of adversarial attack techniques in visual-language multi-modal tasks is expanded.

Inventors

  • WANG XIAOLU
  • XIA YAN
  • WANG LI
  • LI SHUJUAN
  • LI XIAOFAN

Assignees

  • The Third Research Institute of the Ministry of Public Security (公安部第三研究所)

Dates

Publication Date
2026-05-05
Application Date
2026-02-02

Claims (12)

  1. A method for generating visual language model adversarial samples based on cross-modal consistency perturbation generation, the method comprising the steps of: (1) preprocessing an input image x and a corresponding text y, and extracting their respective feature information; (2) inputting the image features and the text features into a visual language model to obtain model intermediate representations and cross-modal association information; (3) constructing a joint objective function L_total; (4) computing the gradients of the image and text modalities, ∇δx L_total and ∇δy L_total, respectively, using back propagation; (5) realizing cross-modal gradient feedback, namely, feeding the text gradient back into the image perturbation update formula after normalization, and feeding the image gradient back into the text perturbation update formula after normalization; (6) repeating the iterative updating of step (4) and step (5) until a termination condition is met, and outputting the adversarial sample after superimposing the perturbations; (7) after post-processing the output adversarial sample, inputting it into the target visual language model for verification, and evaluating the adversarial effect by computing the confidence and misjudgment rate output by the model.
  2. The method for generating visual language model adversarial samples based on cross-modal consistency perturbation generation according to claim 1, wherein the preprocessing in step (1) includes normalization and resizing of the image x, and word segmentation and vectorization of the text y.
  3. The method of claim 1, wherein the model intermediate representation in step (2) comprises a multi-scale feature representation of the image and a semantic embedding vector of the text, and the cross-modal association information is obtained by computing correlation coefficients between the image features and the text features.
  4. The method for generating visual language model adversarial samples based on cross-modal consistency perturbation generation according to claim 1, wherein constructing the joint objective function L_total in step (3) specifically includes: constructing the joint objective function L_total according to the following formula: L_total = α·L_img(x, x+δx) + β·L_txt(y, y+δy); wherein L_img is the image adversarial error term, L_txt is the text semantic perturbation error term, δx and δy are the image and text perturbations respectively, and α and β are adjustment parameters that are dynamically and adaptively adjusted according to different application scenarios and attack targets.
  5. The method for generating visual language model adversarial samples based on cross-modal consistency perturbation generation according to claim 1, wherein in step (5) the text gradient is fed back into the image perturbation update formula after normalization, and the image gradient is fed back into the text perturbation update formula after normalization, specifically: obtaining the image perturbation update formula according to the following formula: δx^(t+1) = δx^t − η·∇δx L_total − γ·F(∇δy L_total); obtaining the text perturbation update formula according to the following formula: δy^(t+1) = δy^t − η·∇δy L_total − γ·F(∇δx L_total); wherein δx^(t+1) represents the image perturbation after the (t+1)-th iteration update, δx^t represents the image perturbation after the t-th iteration update, δy^(t+1) represents the text perturbation after the (t+1)-th iteration update, δy^t represents the text perturbation after the t-th iteration, η is the learning rate, γ is the gradient feedback adjustment parameter, and F(·) is the gradient normalization function.
  6. The method of claim 1, wherein the normalization in step (5) is performed by a gradient normalization function F(·), i.e., dividing the input gradient vector by its L2 norm.
  7. The method for generating visual language model adversarial samples based on cross-modal consistency perturbation generation according to claim 1, wherein the iterative updating of step (4) and step (5) is repeated in step (6) until any one of the following termination conditions is satisfied: the joint objective function value is smaller than a preset threshold; or the number of perturbation updates reaches a preset upper limit; or the attack success rate of the adversarial sample reaches the intended target.
  8. The method of claim 1, wherein the post-processing in step (7) comprises reducing the perceptibility of the perturbation by smoothing and semantic-constraint filtering.
  9. The method for generating visual language model adversarial samples based on cross-modal consistency perturbation generation according to claim 1, wherein the visual language model is CLIP, BLIP, LLaVA or a variant thereof.
  10. An apparatus for generating visual language model adversarial samples based on cross-modal consistency perturbation generation, the apparatus comprising: a processor configured to execute computer-executable instructions; and a memory storing one or more computer-executable instructions which, when executed by the processor, perform the steps of the method for generating visual language model adversarial samples based on cross-modal consistency perturbation generation of any one of claims 1 to 9.
  11. A processor for generating visual language model adversarial samples based on cross-modal consistency perturbation generation, wherein the processor is configured to execute computer-executable instructions which, when executed by the processor, implement the steps of the method for generating visual language model adversarial samples based on cross-modal consistency perturbation generation of any one of claims 1 to 9.
  12. A computer-readable storage medium having stored thereon a computer program executable by a processor to implement the steps of the method for generating visual language model adversarial samples based on cross-modal consistency perturbation generation of any one of claims 1 to 9.
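The iterative procedure of claims 1 and 5-7 can be sketched as a small NumPy toy. This is not the patented implementation: the real L_img and L_txt would come from a visual language model, whereas here they are hypothetical squared-error terms with closed-form gradients; both modalities are assumed to share the same embedding dimension so the cross-fed normalized gradients can be added directly; and the minus signs assume gradient descent consistent with the termination condition of claim 7 (L_total falling below a threshold). All function and variable names are illustrative.

```python
import numpy as np

def l2_normalize(g, eps=1e-12):
    # F(.): divide the input gradient vector by its L2 norm (claim 6)
    return g / (np.linalg.norm(g) + eps)

def cross_modal_attack(x, y, tx, ty, alpha=1.0, beta=1.0,
                       eta=0.05, gamma=0.001, max_iter=300, tol=1e-3):
    """Toy sketch of the cross-modal perturbation update loop.

    x, y   : image / text feature vectors (assumed same dimension)
    tx, ty : hypothetical adversarial target embeddings standing in for
             whatever L_img and L_txt pull the features toward
    """
    dx = np.zeros_like(x, dtype=float)  # image perturbation delta_x
    dy = np.zeros_like(y, dtype=float)  # text perturbation delta_y
    loss = np.inf
    for _ in range(max_iter):
        # Gradients of the toy L_total = alpha*||x+dx-tx||^2 + beta*||y+dy-ty||^2
        gx = 2.0 * alpha * (x + dx - tx)
        gy = 2.0 * beta * (y + dy - ty)
        # Cross-modal gradient feedback (claim 5): each modality's update
        # mixes in the OTHER modality's normalized gradient.
        dx = dx - eta * gx - gamma * l2_normalize(gy)
        dy = dy - eta * gy - gamma * l2_normalize(gx)
        loss = (alpha * np.sum((x + dx - tx) ** 2)
                + beta * np.sum((y + dy - ty) ** 2))
        if loss < tol:  # termination condition (claim 7)
            break
    return x + dx, y + dy, loss
```

With closed-form gradients the loop contracts toward the targets while the γ-weighted cross terms nudge each modality along the other's gradient direction; in practice the two gradients would be obtained by back propagation through the model instead.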

Description

Method, device, processor and medium for generating visual language model adversarial attack samples based on cross-modal consistency perturbation generation

Technical Field

The invention relates to the field of computer vision, in particular to the field of natural language processing, and specifically relates to a method, a device, a processor and a computer-readable storage medium for generating visual language model adversarial attack samples based on cross-modal consistency perturbation generation.

Background

With the development of deep learning technology, visual language models have achieved remarkable results in tasks such as image understanding and text semantic matching, and are widely applied in fields such as image search, content generation and automatic labeling. However, while these models improve in robustness, they also expose potential risks from adversarial attacks. Traditional adversarial attack methods generally generate perturbations for only a single modality (e.g., the image), ignoring the semantic consistency between the image and text modalities in a visual language model. Because a visual language model aims to maintain cross-modal consistency during training, certain gradient correlation and modal coupling exist within it; if the generated perturbation can simultaneously destroy the cross-modal consistency between image and text, the model can be induced to make erroneous judgments more effectively, thereby realizing the adversarial attack.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a method, a device, a processor and a computer-readable storage medium for generating visual language model adversarial attack samples based on cross-modal consistency perturbation generation, which offer good consistency, good gradient correlation and good modal coupling.
To achieve the above object, the method, apparatus, processor and computer-readable storage medium for generating visual language model adversarial attack samples based on cross-modal consistency perturbation generation according to the present invention are as follows.

The method for generating adversarial attack samples against a visual language model based on cross-modal consistency perturbation generation is mainly characterized by comprising the following steps: (1) preprocessing an input image x and a corresponding text y, and extracting their respective feature information; (2) inputting the image features and the text features into a visual language model to obtain model intermediate representations and cross-modal association information; (3) constructing a joint objective function L_total; (4) computing the gradients of the image and text modalities, ∇δx L_total and ∇δy L_total, respectively, using back propagation; (5) realizing cross-modal gradient feedback, namely, feeding the text gradient back into the image perturbation update formula after normalization, and feeding the image gradient back into the text perturbation update formula after normalization; (6) repeating the iterative updating of step (4) and step (5) until a termination condition is met, and outputting the adversarial sample after superimposing the perturbations; (7) after post-processing the output adversarial sample, inputting it into the target visual language model for verification, and evaluating the adversarial effect by computing the confidence and misjudgment rate output by the model.

Preferably, the preprocessing in step (1) includes normalizing and resizing the image x, and word segmentation and vectorization of the text y.
Preferably, the model intermediate representation in step (2) comprises a multi-scale feature representation of the image and a semantic embedding vector of the text, and the cross-modal association information is obtained by computing correlation coefficients between the image features and the text features.

Preferably, the joint objective function L_total is constructed in step (3), specifically: the joint objective function L_total is constructed according to the following formula: L_total = α·L_img(x, x+δx) + β·L_txt(y, y+δy); wherein L_img is the image adversarial error term, L_txt is the text semantic perturbation error term, δx and δy are the image and text perturbations respectively, and α and β are adjustment parameters that are dynamically and adaptively adjusted according to different application scenarios and attack targets.

Preferably, in step (5), the text gradient is fed back into the image perturbation update formula after normalization, and the image gradient is fed back into the text perturbation update formula after normalization, specifically: obtaining the image perturbation update formula according to the following formula: δx^(t+1) = δx^t − η·∇δx L_total − γ·F(∇δy L_total); obtaining the text perturbation update formula according to the following formula: δy^(t+1) = δy^t − η·∇δy L_total − γ·F(∇δx L_total); wherein δx^(t+1) represents the image perturbation after the (t+1)-th iteration update, δx^t represents the image perturbation after the t