CN-122024338-A - Method, apparatus and medium for deep forgery detection using visual language model
Abstract
The invention relates to a method, apparatus and medium for deep forgery detection using a visual language model. The method comprises: constructing a text prompt set containing a plurality of forgery-related text prompts; processing an input image with a visual encoder to obtain an original visual token sequence; performing uncertainty sampling on the original visual token sequence to obtain a plurality of uncertainty-aware visual token sequences; calculating, based on the text prompt set, the similarity between the visual tokens in each visual token sequence and each text prompt in the set; selecting at least one key visual token from the visual token sequences according to the similarity; replacing the global semantic token of specific layers in the visual encoder with the selected key visual tokens to obtain an enhanced visual representation; and outputting a forgery detection result for the input image based on the enhanced visual representation. The method enhances the model's perception of forged regions while preserving its pre-trained knowledge, and significantly improves detection accuracy and robustness.
Inventors
- ZHANG XUEYI
- LI HAIZHOU
- GAO SIHAN
- SUN JIALU
- CAI SIQI
Assignees
- The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳))
Dates
- Publication Date
- 20260512
- Application Date
- 20260413
Claims (10)
- 1. A method for deep forgery detection using a visual language model, comprising the steps of: constructing a text prompt set comprising a plurality of forgery-related text prompts; processing an input image with a visual encoder to obtain an original visual token sequence corresponding to the input image; performing uncertainty sampling on the original visual token sequence to obtain a plurality of uncertainty-aware visual token sequences; calculating, based on the text prompt set, the similarity between the visual tokens in each uncertainty-aware visual token sequence and each text prompt in the text prompt set; selecting at least one key visual token from the plurality of uncertainty-aware visual token sequences according to the similarity; replacing the global semantic token of specific layers in the visual encoder with the selected at least one key visual token to obtain an enhanced visual representation; and outputting, by a classifier, a forgery detection result for the input image based on the enhanced visual representation.
- 2. The method for deep forgery detection using a visual language model of claim 1, wherein the text prompt set includes a fixed text prompt, which is a manually designed template, and a learnable text prompt, which is a mixed sequence of learnable context vectors and fixed context vectors.
- 3. The method for deep forgery detection using a visual language model of claim 1, wherein performing uncertainty sampling on the original visual token sequence to obtain a plurality of uncertainty-aware visual token sequences comprises: inputting the original visual token sequence into a Bayesian adapter, wherein the Bayesian adapter is implemented as a multi-layer perceptron block; performing a plurality of Monte Carlo dropout stochastic forward passes through the Bayesian adapter to approximate Bayesian inference and model predictive uncertainty; and generating a plurality of uncertainty-aware visual token sequences as the sampling results.
- 4. The method for deep forgery detection using a visual language model of claim 2, wherein, when calculating the similarity between the visual tokens in each uncertainty-aware visual token sequence and each text prompt in the text prompt set, a first similarity between each visual token and the fixed text prompts in the text prompt set and a second similarity between each visual token and the learnable text prompts in the text prompt set are calculated respectively, and the first similarity and the second similarity are fused according to a preset weight to obtain an aggregate similarity, which serves as the similarity between the visual token and the text prompt set.
- 5. The method for deep forgery detection using a visual language model of claim 4, wherein selecting at least one key visual token from the plurality of uncertainty-aware visual token sequences according to the similarity comprises: integrating the aggregate similarity of each visual token in each uncertainty-aware visual token sequence to obtain an aggregate similarity score for each visual token; sorting all visual tokens in the sequence by their aggregate similarity scores; and selecting the top K visual tokens with the highest aggregate similarity scores as the key visual tokens, wherein K is a preset positive integer.
- 6. The method for deep forgery detection using a visual language model of claim 1, wherein replacing the global semantic token of specific layers in the visual encoder with the selected at least one key visual token to obtain an enhanced visual representation comprises: determining the range of Transformer blocks subject to the replacement operation according to a preset injection depth; for each Transformer block within the range, replacing the global semantic token in its input sequence with the selected key visual tokens; and for each Transformer block outside the range, passing through the global semantic token output by the preceding block without performing the replacement operation.
- 7. The method for deep forgery detection using a visual language model of claim 3, further comprising a training process in which model parameters are optimized using a total loss function comprising a classification loss, a forgery-localization-stage alignment loss, and a forgery-verification-stage verification loss; wherein the alignment loss of the forgery localization stage is calculated by: calculating a first similarity between the global semantic token output by the visual encoder and the text prompt set, calculating a second similarity between the refined local features, obtained by aggregation over the key visual tokens, and the text prompt set, and taking the sum of the first similarity and the second similarity as the alignment loss; and wherein the verification loss of the forgery verification stage is calculated by: inputting the enhanced visual representation into a second adapter to generate a text-aligned similarity score, and calculating a cross-entropy loss based on the similarity score as the verification loss.
- 8. The method for deep forgery detection using a visual language model of claim 7, wherein the refined local features are computed by performing weighted-average aggregation over all visual tokens in the plurality of uncertainty-aware visual token sequences according to the aggregate similarity score of each visual token, and wherein the second adapter shares the same multi-layer perceptron architecture as the Bayesian adapter.
- 9. An electronic device comprising a memory and a processor, wherein the memory is configured to store a program for supporting the processor to perform the method for deep forgery detection using a visual language model of any one of claims 1 to 8, the processor being configured to execute the program stored in the memory.
- 10. A computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor performs the steps of the method for deep forgery detection using a visual language model as claimed in any of claims 1-8.
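The uncertainty-sampling step of claim 3 can be sketched as follows. This is a minimal illustration, not the patented implementation: the Bayesian adapter is assumed here to be a two-layer residual MLP with ReLU, and the token count, widths, dropout rate, and sample count are all toy values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: N patch tokens of width d, hidden width h (illustrative only).
N, d, h = 16, 8, 32
tokens = rng.standard_normal((N, d))
w1 = rng.standard_normal((d, h)) * 0.1
w2 = rng.standard_normal((h, d)) * 0.1

def adapter_pass(x, drop_p=0.1):
    """One stochastic forward pass through the MLP adapter, dropout kept active."""
    hid = np.maximum(x @ w1, 0.0)              # ReLU hidden layer
    mask = rng.random(hid.shape) >= drop_p     # Monte Carlo dropout mask
    hid = hid * mask / (1.0 - drop_p)          # inverted-dropout rescaling
    return x + hid @ w2                        # residual connection

def mc_dropout_sample(x, n_samples=8):
    """Repeat the stochastic pass to obtain uncertainty-aware token sequences."""
    samples = np.stack([adapter_pass(x) for _ in range(n_samples)])  # (S, N, d)
    uncertainty = samples.var(axis=0).mean(axis=-1)                  # per-token variance
    return samples, uncertainty

samples, unc = mc_dropout_sample(tokens)
```

Keeping dropout active at inference and averaging over stochastic passes is the standard Monte Carlo dropout approximation to Bayesian inference; the per-token variance serves as the predictive-uncertainty signal.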
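The similarity fusion and key-token selection of claims 4 and 5 can likewise be sketched. Assumptions in this sketch: cosine similarity, a max over each prompt group as the per-group similarity, and a single scalar fusion weight; the claims only specify "a preset weight", so these choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def aggregate_similarity(tokens, fixed_prompts, learnable_prompts, weight=0.5):
    """Fuse token/prompt similarities into one aggregate score per token."""
    def cos(a, b):
        a = a / np.linalg.norm(a, axis=-1, keepdims=True)
        b = b / np.linalg.norm(b, axis=-1, keepdims=True)
        return a @ b.T
    s_fixed = cos(tokens, fixed_prompts).max(axis=1)      # first similarity, (N,)
    s_learn = cos(tokens, learnable_prompts).max(axis=1)  # second similarity, (N,)
    return weight * s_fixed + (1.0 - weight) * s_learn    # preset-weight fusion

def select_key_tokens(score, k=4):
    """Sort tokens by aggregate score and keep the top-K as key visual tokens."""
    return np.argsort(score)[::-1][:k]

tokens = rng.standard_normal((16, 8))
fixed = rng.standard_normal((4, 8))
learn = rng.standard_normal((6, 8))
score = aggregate_similarity(tokens, fixed, learn)
key_idx = select_key_tokens(score)
```

The returned indices identify the patch tokens most aligned with the forgery-related prompts, i.e. the candidate forged regions.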
Description
Method, apparatus and medium for deep forgery detection using visual language model

Technical Field

The invention relates to the technical field of digital image forensics and multimedia security, and in particular to a method, apparatus and medium for deep forgery detection using a visual language model.

Background

Deepfake technology uses artificial intelligence methods such as generative adversarial networks and diffusion models to tamper with faces in images or videos; typical operations include face swapping and face reenactment. With the rapid development of generative models, the fidelity of deeply forged content keeps improving, to the point where the human eye struggles to distinguish it, posing a serious threat to public safety, personal privacy, and information credibility. Efficient and accurate deep forgery detection has therefore become a research hotspot in digital image forensics and multimedia security. Currently, deep forgery detection methods based on vision-language models (VLMs) are receiving considerable attention. Vision-language models (e.g., CLIP) are pre-trained on large-scale image-text pairs and learn cross-modal alignment, enabling them to understand visual content and semantic descriptions simultaneously.
Existing deep forgery detection techniques based on vision-language models fall mainly into two categories. The first is adapter fine-tuning, which enhances the sensitivity of the visual backbone to tampering traces by adding task-specific modules: for example, StA improves temporal modeling with a spatio-temporal adapter, forda introduces a forensics-aware adapter to highlight tampered regions, and EFFORT adapts CLIP in an orthogonal subspace to preserve pre-trained knowledge. The second is prompt learning, which adjusts the decision boundary by optimizing text prompts: for example, RepDFD uses learnable visual perturbations and sample-adaptive text prompts, CLIPping tunes compact context tokens, and VLFFD synthesizes fine-grained sentence-level prompts with a text-image generator. However, the prior art still has shortcomings. First, most methods treat the vision-language model as a pure visual feature extractor, exploiting only its visual encoding capability while neglecting the potential of the language modality to reveal forgery cues. Second, adapter fine-tuning strengthens visual feature modeling but uses textual information only indirectly, failing to actively guide the localization of forged regions with text semantics. Third, prompt learning optimizes the classification boundary but does not deeply explore the fine-grained alignment between text prompts and visual tokens, and so cannot directly localize the specific regions the visual model perceives as real or fake. Fourth, as generative models advance, traditional visible artifacts gradually disappear; existing methods are insufficiently sensitive to subtle forgery cues and struggle with high-fidelity and novel tampering techniques.
Disclosure of Invention

The invention provides a method, apparatus and medium for deep forgery detection using a visual language model, and aims to solve the technical problem in the prior art that sensitivity to subtle forgery cues is insufficient because the vision-language model is used only as a pure visual feature extractor and text semantics are not actively used to reveal forged regions. To achieve the above object, a first aspect of the invention provides a method for deep forgery detection using a visual language model, comprising the steps of: constructing a text prompt set comprising a plurality of forgery-related text prompts; processing an input image with a visual encoder to obtain an original visual token sequence corresponding to the input image; performing uncertainty sampling on the original visual token sequence to obtain a plurality of uncertainty-aware visual token sequences; calculating, based on the text prompt set, the similarity between the visual tokens in each uncertainty-aware visual token sequence and each text prompt in the text prompt set; selecting at least one key visual token from the plurality of uncertainty-aware visual token sequences according to the similarity; replacing the global semantic token of specific layers in the visual encoder with the selected at least one key visual token to obtain an enhanced visual representation; and outputting, by a classifier, a forgery detection result for the input image based on the enhanced visual representation. Further, the text prompt set includes a fixed text prompt that is a manually designed template and a learn
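The token-replacement step (injecting key visual tokens into the global semantic slot for the first few Transformer blocks) can be sketched as follows. Assumptions in this sketch: row 0 of the sequence is the global semantic ([CLS]) token, each "block" is a toy linear map standing in for a real Transformer block, and the key tokens are collapsed to their mean before being written into the single global slot; the patent itself does not fix this aggregation.

```python
import numpy as np

rng = np.random.default_rng(2)
d, depth = 8, 6
# Toy encoder: each "Transformer block" is a near-identity linear map — a
# stand-in for attention + MLP, enough to show the injection control flow.
weights = [np.eye(d) + 0.01 * rng.standard_normal((d, d)) for _ in range(depth)]

def encode_with_injection(x, key_idx, inject_depth):
    """Run the block stack; within the injection range, overwrite the global
    semantic slot with the mean of the key visual tokens. Blocks past the
    range simply pass the global token through unmodified."""
    x = x.copy()                                 # (N+1, d); row 0 is the global token
    for i, w in enumerate(weights):
        if i < inject_depth:
            x[0] = x[1:][key_idx].mean(axis=0)   # replacement operation
        x = x @ w                                # pass through the block
    return x

seq = rng.standard_normal((17, d))               # 1 global token + 16 patch tokens
out = encode_with_injection(seq, key_idx=np.array([2, 5, 7]), inject_depth=3)
```

Because each toy block acts on rows independently, only the global-token row differs from an uninjected run; in a real Transformer, attention would propagate the injected semantics to all tokens.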