
CN-121982761-A - Unsupervised deepfake video detection method based on wavelet transform and Mamba state space

CN 121982761 A

Abstract

The application discloses an unsupervised deepfake video detection method based on wavelet transform and the Mamba state space model, belonging to the technical field of multimedia information security. The method comprises: detecting and cropping face images from an original video, decomposing the face images, and constructing a multi-channel frequency-domain input tensor; extracting spatial-domain and frequency-domain features of the face images and assigning an initial pseudo label to each face-image sample; constructing a deep neural network model comprising dynamic contour convolution and a visual state space module; constructing positive and negative sample pairs from the initial pseudo labels and training the deep neural network model through contrastive learning; and extracting feature vectors of all frames of a test video with the trained model, computing a temporal consistency feature, and comparing it with a decision threshold to judge whether the video is forged. The method requires no manual data labeling, is computationally efficient, and shows strong generalization and robustness in cross-dataset tests.

Inventors

  • WANG SHUAI
  • LIANG WENHAO
  • GUO JIA
  • LIU GONGPING

Assignees

  • Yangtze Delta Region Institute (Quzhou) of the University of Electronic Science and Technology of China (电子科技大学长三角研究院(衢州))

Dates

Publication Date
2026-05-05
Application Date
2026-01-26

Claims (8)

  1. An unsupervised deepfake video detection method based on wavelet transform and Mamba state space, characterized in that the method comprises: performing face detection and cropping on an original video, decomposing the face images, and constructing a multi-channel frequency-domain input tensor, wherein the input tensor comprises a low-frequency component, a horizontal high-frequency component, a vertical high-frequency component and a diagonal high-frequency component; extracting spatial-domain and frequency-domain features of the face images and assigning an initial pseudo label to each face-image sample, so as to capture spatial texture anomalies and frequency-domain forgery traces simultaneously; constructing a deep neural network model comprising dynamic contour convolution and a visual state space module; constructing positive and negative sample pairs with the initial pseudo labels and training the deep neural network model through contrastive learning; and extracting feature vectors of all frames of a test video with the trained model, computing a temporal consistency feature, and comparing it with a decision threshold to judge whether the video is forged.
  2. The method according to claim 1, wherein decomposing the face image to construct a multi-channel frequency-domain input tensor comprises: performing a discrete wavelet transform on the image to decompose it into a low-frequency component, a horizontal high-frequency component, a vertical high-frequency component and a diagonal high-frequency component, and stacking the four components along the channel dimension to obtain the input tensor.
  3. The method according to claim 1, wherein generating the initial pseudo labels comprises: extracting statistical features of the sub-bands corresponding to the high-frequency components, and computing the energy mean, variance and information entropy of each sub-band to obtain frequency-domain statistical features; extracting a micro-texture energy map of the face image and computing its statistics to obtain spatial texture features; and concatenating the frequency-domain statistical features and the spatial texture features along the channel dimension to form a multi-view integrated feature vector, which is standardized and clustered to obtain the initial pseudo labels.
  4. The unsupervised deepfake video detection method based on wavelet transform and Mamba state space according to claim 1, further comprising: during training of the deep neural network model, injecting noise into the high-frequency components of the input tensor with a preset probability, and pausing training at a preset period to extract features with the current deep neural network model, re-cluster, and update the pseudo labels of all samples.
  5. The unsupervised deepfake video detection method based on wavelet transform and Mamba state space according to claim 1, wherein the method comprises: taking the Rho value of the feature sequence as the temporal consistency feature; if the Rho value is greater than the decision threshold, the video is judged forged, and if it is smaller than the decision threshold, the video is judged real.
  6. The method for unsupervised deepfake video detection based on wavelet transform and Mamba state space according to claim 5, further comprising: selecting part of the data as a calibration set, computing the average Rho value of each of the two clusters, and automatically deriving an optimal decision threshold based on the two distributions.
  7. The unsupervised deepfake video detection method based on wavelet transform and Mamba state space according to claim 4, wherein injecting noise into the high-frequency components of the input tensor with a preset probability comprises: during data loading in the model training stage, superimposing random Gaussian noise on the channel corresponding to the diagonal high-frequency component with the preset probability.
  8. The method for unsupervised deepfake video detection based on wavelet transform and Mamba state space according to claim 1, wherein training the deep neural network model through contrastive learning comprises: computing feature similarities of input images in a projection space, constructing positive and negative sample pairs with the pseudo labels, and minimizing a contrastive loss that pulls together the features of samples sharing a pseudo label and pushes apart the features of samples with different pseudo labels.
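As an aid to reading the claims, the following sketches illustrate them in code; the patent publishes no implementation, so all function names, parameter values and library choices below are the editor's assumptions. Claim 2's decomposition can be sketched directly in NumPy; the patent does not name a wavelet basis, so the Haar wavelet is used here (a library routine such as PyWavelets' `pywt.dwt2` would serve equally):

```python
import numpy as np

def haar_dwt2(x: np.ndarray) -> np.ndarray:
    """Single-level 2-D Haar DWT of a grayscale face crop (H, W even).

    Returns a (4, H/2, W/2) tensor stacking [LL, LH, HL, HH] along the
    channel axis: the low-frequency, horizontal, vertical and diagonal
    high-frequency components of claim 2.
    """
    # Separable Haar filtering: pair columns first, then pair rows.
    lo_c = (x[:, 0::2] + x[:, 1::2]) / np.sqrt(2)      # column low-pass
    hi_c = (x[:, 0::2] - x[:, 1::2]) / np.sqrt(2)      # column high-pass
    ll = (lo_c[0::2, :] + lo_c[1::2, :]) / np.sqrt(2)  # low-frequency
    lh = (lo_c[0::2, :] - lo_c[1::2, :]) / np.sqrt(2)  # horizontal detail
    hl = (hi_c[0::2, :] + hi_c[1::2, :]) / np.sqrt(2)  # vertical detail
    hh = (hi_c[0::2, :] - hi_c[1::2, :]) / np.sqrt(2)  # diagonal detail
    return np.stack([ll, lh, hl, hh], axis=0)
```

For a constant image the three detail channels are exactly zero, which is why forgery artifacts, if present, concentrate in the high-frequency channels.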
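Claim 3's pseudo-label generation combines per-sub-band statistics with spatial texture statistics and clusters the result. A minimal sketch, assuming two pseudo classes and using a tiny hand-rolled 2-means in place of whatever clustering routine the patent intends:

```python
import numpy as np

def subband_stats(band: np.ndarray, bins: int = 32) -> list:
    """Energy mean, energy variance and histogram entropy of one sub-band."""
    energy = band ** 2
    hist, _ = np.histogram(band, bins=bins)
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return [energy.mean(), energy.var(), float(-(p * np.log2(p)).sum())]

def two_means(X: np.ndarray, iters: int = 20) -> np.ndarray:
    """Minimal 2-cluster k-means (stand-in for any clustering routine)."""
    centers = X[[np.argmin(X[:, 0]), np.argmax(X[:, 0])]].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in (0, 1):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(axis=0)
    return labels

def initial_pseudo_labels(freq_tensors, spatial_feats) -> np.ndarray:
    """Claim 3: concatenate per-sub-band frequency statistics with spatial
    texture statistics, standardize, and cluster into two pseudo classes."""
    rows = []
    for t, s in zip(freq_tensors, spatial_feats):
        row = []
        for band in t[1:]:            # the three high-frequency sub-bands
            row += subband_stats(band)
        rows.append(row + list(s))    # splice in spatial texture features
    F = np.asarray(rows)
    F = (F - F.mean(axis=0)) / (F.std(axis=0) + 1e-8)  # standardize
    return two_means(F)
```

The spatial features are passed in as precomputed statistics of the micro-texture energy map; how that map is extracted is not specified in the claim.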
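Claim 4's periodic re-clustering schedule can be sketched as a self-training loop; `extract_features` and `recluster` are hypothetical stand-ins for the current network and the clustering routine, and the epoch counts are illustrative:

```python
import numpy as np

def self_training_loop(samples, extract_features, recluster,
                       n_epochs: int = 6, refresh_every: int = 2):
    """Sketch of claim 4: every `refresh_every` epochs, pause training,
    re-extract features with the current model, re-cluster, and update
    the pseudo labels of all samples."""
    labels = recluster(extract_features(samples))
    for epoch in range(1, n_epochs + 1):
        # ... contrastive training steps on (samples, labels) go here ...
        if epoch % refresh_every == 0:
            # Features come from the updated model, so labels can improve.
            labels = recluster(extract_features(samples))
    return labels
```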
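The patent does not define how the "Rho value" of the feature sequence is computed; one plausible reading is a Spearman rank correlation averaged over consecutive frame feature vectors. Under that assumption, the decision rule of claim 5 and the calibration of claim 6 (shown here as a midpoint of the two clusters' mean Rho values; the claim says only "based on the two distributions") can be sketched as:

```python
import numpy as np

def spearman_rho(a: np.ndarray, b: np.ndarray) -> float:
    """Spearman rank correlation of two vectors (no tie handling)."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return float(np.corrcoef(ra, rb)[0, 1])

def video_rho(frame_feats: np.ndarray) -> float:
    """Mean Spearman rho between consecutive frame feature vectors --
    one plausible reading of the 'Rho value' of claims 5-6."""
    return float(np.mean([spearman_rho(frame_feats[i], frame_feats[i + 1])
                          for i in range(len(frame_feats) - 1)]))

def calibrate_threshold(rhos_a, rhos_b) -> float:
    """Claim 6: threshold derived from the two clusters' mean Rho values."""
    return 0.5 * (float(np.mean(rhos_a)) + float(np.mean(rhos_b)))

def is_forged(frame_feats: np.ndarray, threshold: float) -> bool:
    """Claim 5's wording: judged forged when Rho exceeds the threshold."""
    return video_rho(frame_feats) > threshold
```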
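Claim 7's augmentation touches only the diagonal high-frequency channel. A sketch, with `prob` and `sigma` as assumed values since the patent leaves both unspecified:

```python
import numpy as np

def inject_diagonal_noise(tensor: np.ndarray, prob: float = 0.5,
                          sigma: float = 0.1, rng=None) -> np.ndarray:
    """Claim 7: with probability `prob`, superimpose Gaussian noise on the
    diagonal high-frequency channel (index 3) of a (4, H, W) input tensor.
    The other three channels are left untouched."""
    rng = rng or np.random.default_rng()
    out = tensor.copy()
    if rng.random() < prob:
        out[3] += rng.normal(0.0, sigma, size=out[3].shape)
    return out
```

This would run inside the data-loading path during training only, matching the claim's "data loading process of the model training stage".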
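Claim 8 describes a supervised-contrastive-style objective driven by pseudo labels. A NumPy sketch of one such loss (the temperature `tau` is an assumed hyperparameter):

```python
import numpy as np

def contrastive_loss(z: np.ndarray, labels: np.ndarray,
                     tau: float = 0.1) -> float:
    """Claim 8 sketch: over projected features z (N, D), samples sharing a
    pseudo label are positives and all others negatives; minimizing the
    loss pulls positives together and pushes negatives apart."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity space
    sim = z @ z.T / tau
    n = len(z)
    loss, count = 0.0, 0
    for i in range(n):
        mask = np.arange(n) != i                       # exclude the anchor
        logits = sim[i, mask]
        pos = labels[mask] == labels[i]
        if not pos.any():
            continue
        log_den = np.log(np.exp(logits).sum())
        loss += -(logits[pos] - log_den).mean()        # InfoNCE-style term
        count += 1
    return loss / max(count, 1)
```

Features that cluster by pseudo label yield a near-zero loss, while mismatched labels yield a large one, which is the gradient signal the training in claim 1 relies on.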

Description

Unsupervised deepfake video detection method based on wavelet transform and Mamba state space

Technical Field

The application belongs to the technical field of multimedia information security, and particularly relates to an unsupervised deepfake video detection method based on wavelet transform and Mamba state space.

Background

With the rapid development of deep generative models (e.g., GANs and autoencoders), deepfake technology has become able to generate highly realistic forged face videos. If used to spread false information, commit fraud, or defame individuals, such forged videos pose a serious threat to social trust and public safety. It has therefore become urgent to develop efficient and robust deepfake detection techniques. Current deepfake detection methods mainly have the following limitations:

The existing mainstream detection methods (such as CNN- or Transformer-based classifiers) usually belong to the category of supervised learning and require massive amounts of labeled (real/fake) data for training. However, the cost of obtaining high-quality annotated data is extremely high, and once a model faces forgery types that did not appear in the training set (unseen forgeries), its detection performance tends to drop drastically, reflecting insufficient generalization.

It is difficult to capture frequency-domain microscopic traces: deepfake videos typically leave specific artifacts in the frequency domain during generation (e.g., checkerboard artifacts caused by upsampling). Conventional methods based on the RGB pixel domain often have difficulty capturing these high-frequency details sharply, resulting in limited detection accuracy.
The model architecture suffers from a trade-off: the traditional convolutional neural network (CNN) is good at extracting local features but lacks a global field of view, while the Transformer architecture has global modeling capability but its computational complexity grows quadratically with sequence length, making long video sequences hard to process efficiently. Although the recently proposed state space model (Mamba) offers linear computational complexity with global modeling capability, how to apply it effectively to capturing local forgery textures in computer vision tasks remains a technical difficulty.

The accuracy of unsupervised methods is limited: the few existing unsupervised detection methods generally rely on simple clustering or anomaly detection and lack deep mining of forgery features, so their false alarm rate in complex scenes is high.

Disclosure of Invention

The application aims to overcome the defects of the prior art and provide an unsupervised deepfake video detection method based on wavelet transform and Mamba state space, which can automatically identify deepfake videos through frequency-domain analysis and iterative self-training without any labels.
The aim of the application is achieved by the following technical scheme. An unsupervised deepfake video detection method based on wavelet transform and Mamba state space comprises: performing face detection and cropping on an original video, decomposing the face images, and constructing a multi-channel frequency-domain input tensor, wherein the input tensor comprises a low-frequency component, a horizontal high-frequency component, a vertical high-frequency component and a diagonal high-frequency component; extracting spatial-domain and frequency-domain features of the face images and assigning an initial pseudo label to each face-image sample, so as to capture spatial texture anomalies and frequency-domain forgery traces simultaneously; constructing a deep neural network model comprising dynamic contour convolution and a visual state space module; constructing positive and negative sample pairs with the initial pseudo labels and training the deep neural network model through contrastive learning; and extracting feature vectors of all frames of a test video with the trained model, computing a temporal consistency feature, and comparing it with a decision threshold to judge whether the video is forged.

Further, decomposing the face image to construct a multi-channel frequency-domain input tensor comprises: performing a discrete wavelet transform on the image to decompose it into a low-frequency component, a horizontal high-frequency component, a vertical high-frequency component and a diagonal high-frequency component, and stacking the four components along the channel dimension to obtain the input tensor. Further, the method fo