CN-122020634-A - Frequency-constrained visual language pre-training model robustness assessment method and device
Abstract
The invention discloses a robustness assessment method and device for a frequency-constrained visual language pre-training model, aiming to solve problems such as the insufficient coverage of existing multi-modal models when assessed under complex input interference. The method constructs standardized image-text sample pairs, introduces a frequency-constraint strategy, and combines a frequency-constraint loss with an image-text similarity loss during perturbation optimization, improving the visual fidelity and cross-model transferability of the adversarial samples. Finally, the generated adversarial samples are fed into several visual language models with heterogeneous architectures, and their output error rates on multi-modal tasks are counted, thereby quantifying the robustness of the models in complex scenarios. The method requires no access to the internal information of the target model, adapts to multiple classes of architecture, and provides comprehensive robustness assessment, making it suitable for safety testing and capability verification of large-scale multi-modal systems.
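The final evaluation step in the abstract, counting output mismatches between clean and adversarial image-text pairs, can be sketched as a minimal attack-success-rate routine. The model below is a toy stand-in and the function names are illustrative assumptions, not from the patent:

```python
def attack_success_rate(model, clean_pairs, adv_pairs):
    """Fraction of adversarial image-text pairs whose model output
    no longer matches the output on the corresponding clean pair."""
    flips = 0
    for clean, adv in zip(clean_pairs, adv_pairs):
        if model(*clean) != model(*adv):
            flips += 1
    return flips / len(clean_pairs)

# Toy stand-in model: declares a "match" when image id equals text id.
toy_model = lambda img, txt: img == txt

clean = [(1, 1), (2, 2), (3, 3), (4, 4)]
adv   = [(1, 9), (2, 2), (3, 8), (4, 4)]  # two pairs perturbed into a mismatch
print(attack_success_rate(toy_model, clean, adv))  # → 0.5
```

In the patent's setting the model would be a black-box visual language model and the pairs real images and texts; only the counting logic is shown here.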
Inventors
- WANG XUN
- LAI ZEYI
- QIAN PENG
- LI YUFENG
Assignees
- 浙江工商大学 (Zhejiang Gongshang University)
Dates
- Publication Date
- 20260512
- Application Date
- 20260414
Claims (10)
- 1. A method for evaluating the robustness of a frequency-constrained visual language pre-training model, the method comprising: obtaining image-text sample pairs and performing standardization processing; scaling the standardized image to construct a multi-scale image set; constructing text perturbation samples from the text set; inputting the multi-scale image set and the text perturbation samples into an image encoder and a text encoder respectively, and optimizing the encoder outputs to obtain a structured perturbation image; measuring the difference between the perturbation image and the original image in the low-frequency domain with a frequency-constraint mechanism, and constructing a frequency-limiting loss function; constructing an overall optimization objective function and iteratively updating the perturbation image to obtain a final perturbation image, which is combined with the text perturbation sample to obtain an adversarial image-text sample pair; and inputting the clean image-text sample pair and the adversarial image-text sample pair into the model under test simultaneously, comparing the outputs on the two sample pairs, and calculating the attack success rate.
- 2. The method of claim 1, wherein the standardization processing comprises uniformly adjusting the resolution of the image and then scaling the pixel values into the range [0, 1].
- 3. The method of claim 1, wherein the scaling comprises applying a scaling function to the image with a set of scale factors to obtain scaled copies of the image, thereby constructing the multi-scale image set.
- 4. The method of claim 1, wherein constructing the text perturbation samples from the text set comprises replacing words in the text set using a masked language model to obtain the text perturbation samples.
- 5. The method of claim 1, wherein optimizing the encoder outputs comprises inputting the obtained multi-scale image set and the obtained perturbed text set into the image encoder and the text encoder of the model respectively, with maximizing the image-text similarity as the optimization objective.
- 6. The method of claim 1, wherein measuring the difference between the perturbation image and the original image in the low-frequency domain with a frequency-constraint mechanism comprises: performing frequency-domain decomposition on the original image and the perturbation image respectively using a two-dimensional discrete wavelet transform; extracting the low-frequency subband components of each; reconstructing the low-frequency components into approximate images by inverse wavelet transform; and finally calculating the difference between the two reconstructed images to construct the frequency-limiting loss function.
- 7. The method of claim 1, wherein the overall optimization objective function is formed by weighting and summing the image-text similarity loss and the frequency-limiting loss, and the final perturbation image is obtained after iterative optimization.
- 8. The method of claim 1, wherein calculating the attack success rate comprises feeding the clean image-text sample pair and the adversarial image-text sample pair into the visual language pre-training model under test simultaneously, and judging whether the model misjudges by comparing the outputs on the positive and negative sample pairs.
- 9. A system for implementing the method of any one of claims 1-8, comprising: a data preparation module for obtaining raw image-text samples and performing standardized preprocessing; a multi-scale construction module for scaling the image at multiple scales to generate a set of images of varying sizes; a text perturbation generation module for generating perturbed text with a text adversarial algorithm based on a masked language model; an image perturbation generation module for generating an initial perturbation image under the joint guidance of the multi-scale images and the perturbed text; a frequency-constraint calculation module for extracting the low-frequency components of the image and calculating the perceptual difference to form a frequency loss term; a perturbation optimization module for constructing the total loss function and iteratively optimizing the perturbation direction of the image; and a robustness assessment module for assessing the robustness of the model on different multi-modal tasks.
- 10. A frequency-constrained visual language pre-training model robustness assessment device comprising a memory and one or more processors, the memory having executable code stored therein, wherein the processor, when executing the executable code, implements a frequency-constrained visual language pre-training model robustness assessment method according to any one of claims 1-8.
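The frequency-constraint mechanism of claims 6 and 7 can be illustrated with a level-1 Haar wavelet; a minimal numpy sketch, noting that the claims do not fix the wavelet family or the difference metric, so Haar and mean squared error are assumptions here:

```python
import numpy as np

def haar_ll(img):
    """Level-1 Haar DWT low-frequency (LL) subband of a 2-D array
    with even side lengths."""
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    return (a + b + c + d) / 2.0  # orthonormal Haar scaling

def low_freq_reconstruct(ll):
    """Inverse transform keeping only the LL subband: each LL value
    spreads back over its 2x2 block."""
    return np.kron(ll / 2.0, np.ones((2, 2)))

def frequency_loss(orig, perturbed):
    """Mean squared difference between the low-frequency
    reconstructions of the original and perturbed images."""
    r1 = low_freq_reconstruct(haar_ll(orig))
    r2 = low_freq_reconstruct(haar_ll(perturbed))
    return float(np.mean((r1 - r2) ** 2))

rng = np.random.default_rng(0)
x = rng.random((8, 8))

# A checkerboard-style perturbation cancels inside each 2x2 block,
# so it is invisible to the low-frequency loss ...
hf = x.copy()
hf[0::2, 0::2] += 0.1
hf[1::2, 1::2] -= 0.1
print(frequency_loss(x, hf))   # near zero

# ... while a uniform (low-frequency) shift is penalized.
print(frequency_loss(x, x + 0.1))  # clearly positive
```

This is the role of the frequency-limiting loss in the overall objective of claim 7: perturbation energy is pushed toward high-frequency components, where it is less visible, while the low-frequency content that dominates perceived image structure stays close to the original.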
Description
Frequency-constrained visual language pre-training model robustness assessment method and device
Technical Field
The invention relates to the technical field of artificial intelligence safety and multi-modal model evaluation, in particular to a method and device for evaluating the robustness of a frequency-constrained visual language pre-training model.
Background
In recent years, visual language pre-training models (VLPs) have achieved significant results in tasks such as image-text understanding, cross-modal retrieval, and image generation. Representative architectures such as TCL, ALBEF, and CLIP effectively improve multi-modal semantic alignment through large-scale joint training on image-text pairs. However, extensive research shows that such models still lack robustness when facing carefully constructed adversarial samples, and semantic misjudgment is especially likely under transfer attacks in black-box scenarios. Robustness assessment is the process of testing the stability of a model's performance under different input perturbations, measuring the model's capacity to resist perturbation while maintaining task accuracy. Existing adversarial attack methods mostly focus on perturbation optimization under a white-box assumption and often ignore the perceptual impact of the perturbation, easily producing clearly visible artifacts.
In addition, most methods generate perturbations under the guidance of a single-scale input and a single text, without modeling the generalization ability of the perturbation, which makes it difficult for adversarial samples to transfer to heterogeneous models with different architectures. The lack of effective constraints on the distribution of the perturbation in frequency space is likewise a main cause of the low perceptual quality of attack samples and of biased robustness evaluation results. Existing assessment methods have the following problems:
1. Limited practicality: most existing methods assume access to the internal structure or gradient information of the target model, which is difficult to satisfy for closed or commercial large models in actual deployment, so the methods lack generality.
2. Perturbation design lacks structural adaptability and transfers poorly: perturbations are usually optimized on a single-scale image and a fixed text pair without considering spatial scale or semantic diversity, so the attack sample overfits a specific model architecture, easily destroys the overall structure of the image, and transfers poorly across architectures.
3. High perturbation salience with no perceptual constraint mechanism: the perturbation process ignores the frequency-domain perceptual characteristics of the image and easily introduces destructive changes in the low-frequency region, degrading visual quality and reducing the concealment and acceptability of the attack sample.
4. Lack of unified evaluation indexes, making model robustness difficult to quantify: existing methods generally lack a systematic robustness evaluation standard; the evaluation mode depends on the task or on subjective choices, and comparable performance references are hard to form across models and tasks, so evaluation results are scattered and lack credibility.
Therefore, it is necessary to propose an anti-perturbation evaluation method that fuses frequency constraints and supports multi-scale and semantic guidance, so as to realize robustness assessment of visual language pre-training models in a black-box scenario.
Disclosure of Invention
In order to solve the problems in the prior art of poor perturbation transferability, high perceptual salience of attack samples, and incomplete model robustness assessment, the invention enhances the transferability and naturalness of attack samples in cross-task scenarios by incorporating a frequency-constraint mechanism, further improving the reliability of robustness assessment, and provides a frequency-constrained visual language pre-training model robustness assessment method and device, which can improve the transferability and concealment of attack samples while guaranteeing attack effectiveness, realizing systematic assessment of the robustness of multi-modal pre-training models. The object of the invention is achieved by the following technical scheme, a method for evaluating the robustness of a frequency-constrained visual language pre-training model comprising the following steps: S1, obtaining image-text sample pairs and performing standardization processing; S2, scaling the standardized image to construct a multi-scale image set; S3, constructing
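Steps S1 and S2 can be sketched as follows; a minimal numpy illustration in which the target resolution 224, nearest-neighbour resampling, and the scale factors 0.5 / 0.75 / 1.0 are all illustrative assumptions not specified by the patent:

```python
import numpy as np

def standardize(img, size=224):
    """S1: resize to a fixed square resolution (nearest-neighbour
    sampling for simplicity) and scale pixel values into [0, 1]."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = img[rows][:, cols]
    return resized.astype(np.float64) / 255.0

def multi_scale_set(img, factors=(0.5, 0.75, 1.0)):
    """S2: one scaled copy of the standardized image per scale factor."""
    out = []
    for f in factors:
        s = max(1, int(round(img.shape[0] * f)))
        rows = np.arange(s) * img.shape[0] // s
        cols = np.arange(s) * img.shape[1] // s
        out.append(img[rows][:, cols])
    return out

# Synthetic uint8 "image" standing in for a loaded photograph.
img = (np.arange(300 * 400, dtype=np.uint32).reshape(300, 400) % 256).astype(np.uint8)
std = standardize(img)
scales = multi_scale_set(std)
print(std.shape)                    # (224, 224)
print([s.shape for s in scales])    # [(112, 112), (168, 168), (224, 224)]
```

The resulting multi-scale set is what claim 5 feeds into the image encoder, so that the optimized perturbation is not tied to a single spatial resolution.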