
CN-121982334-A - Deep forgery detection method, device and medium based on a visual language model

CN121982334A

Abstract

A deep forgery detection method based on a visual language model comprises: obtaining a pre-trained visual-language model comprising a visual encoder and a text encoder; perturbing an original text prompt by a single-hop perturbation in a sensitive-attribute dimension to generate a counterfactual text prompt; keeping the text encoder frozen and extracting text features, while extracting visual features of the input image through the visual encoder and calculating the similarity between the visual features and each text feature; adjusting the affine parameters of the layer normalization layers in the visual encoder to obtain an adjusted visual encoder; and inputting an image to be detected into the adjusted encoder and outputting a deep forgery detection result based on the extracted visual features. By reshaping the feature geometry, the present invention eliminates demographic bias while maintaining high detection accuracy.
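The counterfactual-prompt construction summarized above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the attribute vocabulary, prompt template, and function names (`build_prompts`, `single_hop_perturb`) are all hypothetical stand-ins; the key property shown is that a single-hop perturbation changes exactly one sensitive-attribute dimension.

```python
import random

# Hypothetical sensitive-attribute dimensions and values -- illustrative only,
# not the patent's actual vocabulary.
ATTRIBUTES = {"gender": ["male", "female"], "age": ["young", "old"]}

def build_prompts(attrs: dict) -> tuple:
    """Build a (real, fake) prompt pair containing the sensitive-attribute
    description, as in the original real/fake text prompts."""
    desc = " ".join(attrs[d] for d in sorted(attrs))
    return (f"a photo of a real {desc} face",
            f"a photo of a fake {desc} face")

def single_hop_perturb(attrs: dict, rng: random.Random) -> dict:
    """Single-hop perturbation: change the value of exactly ONE
    sensitive-attribute dimension in a single operation."""
    dim = rng.choice(sorted(attrs))
    alternatives = [v for v in ATTRIBUTES[dim] if v != attrs[dim]]
    new_attrs = dict(attrs)
    new_attrs[dim] = rng.choice(alternatives)
    return new_attrs

rng = random.Random(0)
orig_attrs = {"gender": "male", "age": "young"}
cf_attrs = single_hop_perturb(orig_attrs, rng)
# Exactly one dimension differs between original and counterfactual attributes.
changed = [d for d in orig_attrs if orig_attrs[d] != cf_attrs[d]]
orig_real, orig_fake = build_prompts(orig_attrs)
cf_real, cf_fake = build_prompts(cf_attrs)
```

Both the original pair and the counterfactual pair would then be encoded by the frozen text encoder.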

Inventors

  • Zhang Xueyi
  • Li Haizhou
  • Lu Enqiao
  • Gao Sihan
  • Sun Jialu
  • Cai Siqi

Assignees

  • The Chinese University of Hong Kong, Shenzhen

Dates

Publication Date
2026-05-05
Application Date
2026-04-08

Claims (10)

  1. A deep forgery detection method based on a visual language model, characterized by comprising the following steps: obtaining a pre-trained visual-language model, wherein the visual-language model comprises a visual encoder and a text encoder; for an input image, constructing an original text prompt and at least one counterfactual text prompt, wherein the counterfactual text prompt is generated by applying a single-hop perturbation to the original text prompt in a sensitive-attribute dimension; keeping the text encoder frozen, and extracting text features of the original text prompt and the counterfactual text prompt; extracting visual features of the input image through the visual encoder, and calculating the similarity between the visual features and each text feature; adjusting affine parameters of a layer normalization layer in the visual encoder according to the difference between the similarities of the visual features to the original text prompt and to the counterfactual text prompt, to obtain an adjusted visual encoder; and inputting an image to be detected into the adjusted visual encoder, extracting visual features, and outputting a deep forgery detection result based on the extracted visual features.
  2. The deep forgery detection method based on a visual language model of claim 1, wherein constructing the original text prompt and the counterfactual text prompt comprises: for an input image, generating, based on the actual label of the input image and its corresponding original sensitive attributes, an original real prompt and an original fake prompt containing a sensitive-attribute description, and taking the original real prompt and the original fake prompt as the original text prompt; obtaining at least one group of new sensitive attributes by applying a single-hop perturbation in a sensitive-attribute dimension to the original sensitive attributes, wherein a single-hop perturbation changes the value of only one sensitive-attribute dimension in a single operation; and generating a counterfactual real prompt and a counterfactual fake prompt based on the new sensitive attributes and the real label, and taking the counterfactual real prompt and the counterfactual fake prompt as the counterfactual text prompt; the original sensitive attributes comprise at least a first sensitive attribute and a second sensitive attribute, and the sensitive-attribute dimensions comprise at least a first sensitive-attribute dimension and a second sensitive-attribute dimension.
  3. The deep forgery detection method based on a visual language model of claim 2, wherein extracting the text features of the original text prompt and the counterfactual text prompt comprises: inputting the original real prompt and the original fake prompt respectively into the frozen text encoder to obtain original real text features and original fake text features; and inputting the counterfactual real prompt and the counterfactual fake prompt respectively into the frozen text encoder to obtain counterfactual real text features and counterfactual fake text features.
  4. The deep forgery detection method based on a visual language model of claim 3, wherein calculating the similarity between the visual features and each text feature comprises: calculating a first similarity between the visual feature and the original real text feature, a second similarity between the visual feature and the original fake text feature, a third similarity between the visual feature and the counterfactual real text feature, and a fourth similarity between the visual feature and the counterfactual fake text feature; each similarity is obtained by taking the inner product of the visual feature and the corresponding text feature and multiplying the result by a learnable scaling parameter.
  5. The deep forgery detection method based on a visual language model of claim 4, wherein adjusting the affine parameters of the layer normalization layer in the visual encoder based on the difference between the similarities of the visual features to the original text prompt and to the counterfactual text prompt comprises: calculating an original similarity difference based on the first similarity and the second similarity, the original similarity difference being the difference between the second similarity and the first similarity:
         Δ = s₂ − s₁
     wherein Δ is the original similarity difference, s₁ represents the first similarity of the visual feature to the original real text feature, and s₂ represents the second similarity of the visual feature to the original fake text feature; calculating a counterfactual similarity difference based on the third similarity and the fourth similarity, the counterfactual similarity difference being the difference between the fourth similarity and the third similarity:
         Δ_cf = s₄ − s₃
     wherein Δ_cf is the counterfactual similarity difference, s₃ represents the third similarity of the visual feature to the counterfactual real text feature, and s₄ represents the fourth similarity of the visual feature to the counterfactual fake text feature; building a counterfactual equidistant alignment loss based on the mean square error between the original similarity difference and the counterfactual similarity difference:
         L_align = (1/B) Σᵢ₌₁ᴮ (Δᵢ − Δ_cf,ᵢ)²
     wherein L_align is the counterfactual equidistant alignment loss, B is the batch size, and i is the index of a sample within the batch; combining the classification loss and the counterfactual equidistant alignment loss to construct a total loss function:
         L_total = L_cls + λ · L_align
     wherein L_total is the total loss function, L_cls is the classification loss, and λ is a weight coefficient balancing the constraint strength of the counterfactual equidistant alignment; and, based on the total loss function, updating the affine parameters of the layer normalization layer in the visual encoder by gradient descent.
  6. The deep forgery detection method based on a visual language model of claim 5, wherein the layer normalization layer operates as:
         LN_l(x_l) = γ_l ⊙ ((x_l − μ_l) / σ_l) + β_l
     wherein LN_l(·) represents the layer normalization operation of the l-th layer; x_l is the input of the l-th layer; μ_l is the mean of the input x_l; σ_l is the standard deviation of the input x_l; ⊙ represents element-wise multiplication; and γ_l and β_l are respectively the learnable scaling parameter and translation parameter of the l-th layer; during the updating process, all parameters of the visual encoder other than the affine parameters of the layer normalization layers, as well as all parameters of the text encoder, are kept frozen.
  7. The deep forgery detection method based on a visual language model of claim 1, wherein inputting the image to be detected into the adjusted visual encoder, extracting visual features, and outputting a deep forgery detection result based on the extracted visual features comprises: inputting the image to be detected into the adjusted visual encoder to obtain adjusted visual features; inputting the adjusted visual features into a linear classification head to obtain a forgery probability prediction; and judging whether the image to be detected is a deepfake image according to the forgery probability prediction.
  8. The deep forgery detection method based on a visual language model of claim 2, wherein the sensitive-attribute dimensions include a first sensitive-attribute dimension and a second sensitive-attribute dimension, and the single-hop perturbation randomly selects one dimension from the first sensitive-attribute dimension and the second sensitive-attribute dimension to change, changing the attribute value of only one dimension at a time.
  9. An electronic device comprising a memory and a processor, wherein the memory is configured to store a program supporting the processor in executing the deep forgery detection method based on a visual language model according to any one of claims 1 to 8, and the processor is configured to execute the program stored in the memory.
  10. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, performs the steps of the deep forgery detection method based on a visual language model according to any one of claims 1 to 8.

Description

Deep forgery detection method, device and medium based on a visual language model

Technical Field

The invention relates to the technical field of image processing and computer vision, and in particular to a deep forgery detection method, device and medium based on a visual language model.

Background

Deepfake technology uses artificial intelligence methods such as generative adversarial networks and diffusion models to tamper with or synthesize faces in images and videos; such operations include face swapping, expression reenactment, and attribute editing. Developing efficient and accurate deep forgery detection technology has therefore become an important research direction in digital forensics and information security. Early deep forgery detection methods were mainly based on hand-crafted features or traditional convolutional neural networks, distinguishing real content from forged content by analyzing cues such as spatial texture, frequency-domain artifacts, and temporal inconsistencies. However, such methods tend to be designed for specific types of forgery and have limited generalization ability when facing unknown generative models or cross-dataset scenarios. In recent years, visual-language models (VLMs) such as CLIP (Contrastive Language-Image Pre-training) have exhibited strong representation capabilities in multimodal learning. Research shows that deep forgery detection methods based on visual language models can outperform traditional dedicated detectors with only a small number of labeled samples, an advantage that has made this approach a leading technical direction in the field of deep forgery detection.
However, as research has deepened, researchers have found that such models suffer from serious fairness problems: different demographic groups exhibit systematic differences in detection results, and the forged content of some groups is misclassified with significantly higher probability than that of others. For example, prior studies have shown that advanced detectors exhibit asymmetric accuracy across different populations, and that models likewise produce inconsistent predictions for male and female samples. This identity-dependent detection bias means that the model attends more to "who appears in the content" than to "whether the content is forged", contradicting the underlying principle that deepfake detection should be independent of identity information. In view of the above fairness problem, the prior art proposes a series of mitigation methods. One class of methods attempts to balance error rates among different subgroups by introducing demographic labels for data re-weighting or distributionally robust optimization. Another class employs feature-decoupling techniques to separate forgery-related features from demographic features, or explicitly incorporates fairness constraints into the loss function. However, these methods all treat fairness as an externally imposed goal and intervene post hoc through feature operations or regularization; they fail to reveal the internal mechanism by which the bias arises, and therefore struggle to eliminate it at its root. In addition, existing debiasing techniques for vision-language models are mainly oriented to general visual tasks, such as prompt editing, representation projection, or adapter-based fine-tuning; these methods neither consider the complex interactions between forgery cues and demographic attributes in the multimodal space of deep forgery detection, nor identify where inside the model the bias originates.
Therefore, how to locate and correct the structural sources of demographic bias in a vision-language model based on the model's internal mechanisms, and to achieve cross-group fairness while maintaining high detection accuracy, is a technical problem to be solved by those skilled in the art.

Disclosure of Invention

The invention provides a deep forgery detection method, device and medium based on a visual language model, aiming to solve the fairness collapse caused by layer normalization layers amplifying demographic bias in existing visual-language-model-based deep forgery detection methods, so as to realize identity-independent, fair, and accurate detection. To achieve the above object, a first aspect of the present invention provides a deep forgery detection method based on a visual language model, comprising the following steps: obtaining a pre-trained visual-language model, wherein the visual-language model comprises a visual encoder and a text encoder; for an input image, constructing an original text prompt and at least one counterfactual text prompt, wherein the counterfactual text prompt is generated by applying a single-hop perturbation to the original text prompt in a sensitive-attribute dimension; the text
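The layer normalization operation central to the disclosure, and the restriction of gradient descent to its affine parameters, can be sketched numerically. This is a minimal pure-Python illustration, not the actual encoder: `layer_norm` and `sgd_step_affine` are hypothetical helper names, vectors stand in for transformer activations, and gradients are supplied externally rather than backpropagated.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """LN(x) = gamma ⊙ (x - mu) / sigma + beta, computed element-wise,
    where mu and sigma are the mean and standard deviation of the input."""
    mu = sum(x) / len(x)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x) + eps)
    return [g * (v - mu) / sigma + b for g, v, b in zip(gamma, x, beta)]

def sgd_step_affine(gamma, beta, grad_gamma, grad_beta, lr=0.01):
    """Gradient-descent update applied ONLY to the affine parameters
    (gamma, beta); every other visual-encoder weight and the entire
    text encoder are treated as frozen constants."""
    gamma = [g - lr * dg for g, dg in zip(gamma, grad_gamma)]
    beta = [b - lr * db for b, db in zip(beta, grad_beta)]
    return gamma, beta

# With identity affine parameters, the output is a standardized vector.
out = layer_norm([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0])

# One frozen-backbone update step on the affine parameters only.
g2, b2 = sgd_step_affine([1.0], [0.0], [0.5], [-0.5], lr=0.1)
```

Because only γ and β change, the update reshapes the feature geometry (scaling and shifting each normalized channel) without retraining the pre-trained representation itself, which is how the method retains the VLM's detection accuracy while removing the bias amplified at these layers.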