CN-116978103-B - Deformable face forgery detection network and spatiotemporally consistent face forgery detection model construction method

CN116978103B

Abstract

The invention relates to the technical field of face forgery detection, and discloses a deformable face forgery detection network and a method for constructing a spatiotemporally consistent face forgery detection model. The deformable face forgery detection network comprises a backbone network and a deformable temporal self-attention network: the backbone network extracts facial spatial features from an input image, and the deformable temporal self-attention network processes the spatial features output by the backbone to extract the temporal features of the face. The spatiotemporal-consistency-based model construction method, built on the deformable face forgery detection network and its temporal self-attention feature mechanism, mitigates the interference that semantic feature offsets between face video frames cause in the extraction of temporal forgery features, and thereby effectively improves the generalization capability of the detection model.

Inventors

  • GUO ZONGHUI
  • ZHANG JIE
  • WEI QIANG
  • CHEN LUYING
  • DOU YINAN
  • ZHANG TIAN
  • SHAN SHIGUANG

Assignees

  • 北京浩瀚深度信息技术股份有限公司
  • 中科视拓(北京)科技有限公司

Dates

Publication Date
2026-05-05
Application Date
2023-08-04

Claims (7)

  1. A deformable face forgery detection network, characterized by comprising a backbone network and a deformable temporal self-attention network, wherein the backbone network is used to extract facial spatial features from an input image, and the deformable temporal self-attention network processes the spatial features output by the backbone network to extract the temporal features of the face. The deformable temporal self-attention network computes the keypoint offset between frames from known facial keypoint coordinates, and uses this offset as a reference value to constrain the accuracy of predicting highly similar keypoint blocks across frames; at the same time, from the other positions of the current frame and the index of the most similar block in each other frame, it computes the positional offset between the block containing a keypoint and its most similar blocks in the other frames. The backbone network mainly performs the following operations: S01, extracting blocks from the input image and encoding them to obtain a block sequence; S02, processing the features of the block sequence. The deformable temporal self-attention network mainly performs the following operations: S11, taking the block-sequence features as input, computing the similarity between all blocks by dot product, and obtaining, for each block, the set of most similar blocks at the other positions of the current frame and in the other frames; S12, computing the cross-attention between each block and its block set to obtain updated spatiotemporal block feature codes, i.e., the temporal features of the face.
  2. The deformable face forgery detection network of claim 1, wherein the backbone network and the deformable temporal self-attention network are both Transformer-based models.
  3. A method for constructing a spatiotemporally consistent face forgery detection model, adopting the deformable face forgery detection network of any one of claims 1-2, characterized by comprising the following steps: S1, inputting T frames of a masked face video and obtaining a reconstructed face video through a first model; S2, modifying the first model to obtain a second model, and performing spatiotemporal feature extraction and reinforcement on the T input frames of the face video; and S3, attaching a multi-objective supervision component on top of the second model to form a face forgery detection model that discriminates whether the input face video is real or forged.
  4. The method of constructing a spatiotemporally consistent face forgery detection model according to claim 3, wherein said first model comprises a feature encoding module, a spatiotemporal self-attention component, and a decoder.
  5. The method of constructing a spatiotemporally consistent face forgery detection model according to claim 4, wherein in step S2, the decoder of the first model is removed, and the feature encoding module and the spatiotemporal self-attention component are retained to form a feature-extraction base module; a facial expression classification module is attached to the feature-extraction base module to predict expressions for the T input frames of the face video, constrained by an expression-category cross-entropy loss function; a spatial self-attention component is also attached to process the T input frames to obtain feature codes, and the similarity between the feature codes is further constrained by a cosine similarity loss function.
  6. The method according to claim 5, wherein in step S3, the multi-objective supervision component uses a fully connected multi-classification network.
  7. The method according to claim 6, wherein in step S3, the face forgery detection model takes a real or forged face video and its forgery-method label as input and outputs a forgery label, wherein the objective function is expressed as: , wherein i represents an index among the C categories, and the label term represents the real/forgery-method label of the input video.
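Steps S11-S12 of claim 1 can be sketched as a minimal NumPy example. This is an illustrative simplification, not the patent's implementation: it uses a single attention head, no learned query/key/value projections, and top-1 similarity per frame; all function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def deformable_temporal_attention(feats):
    """feats: (T, N, D) block features for T frames of N blocks each.
    S11: for every block, find its most similar block (dot-product
    similarity) at the other positions of the current frame and in each
    other frame.  S12: update the block by cross-attention over that
    gathered block set."""
    T, N, D = feats.shape
    flat = feats.reshape(T * N, D)
    sim = flat @ flat.T                      # (T*N, T*N) dot-product similarity
    out = np.empty_like(flat)
    for q in range(T * N):
        qt = q // N                          # frame index of the query block
        keys = []
        for t in range(T):
            seg = sim[q, t * N:(t + 1) * N].copy()
            if t == qt:
                seg[q - t * N] = -np.inf     # exclude the query block itself
            keys.append(t * N + int(seg.argmax()))
        K = flat[keys]                       # (T, D) gathered block set
        attn = softmax(flat[q] @ K.T / np.sqrt(D))
        out[q] = attn @ K                    # cross-attention update (S12)
    return out.reshape(T, N, D)
```

In the patent's network the similarity search would additionally be constrained by the inter-frame keypoint offsets described in claim 1; the sketch omits that constraint for brevity.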

Description

Deformable face forgery detection network and spatiotemporally consistent face forgery detection model construction method

Technical Field

The invention relates to the technical field of face forgery detection, in particular to a deformable face forgery detection network and a method for constructing a spatiotemporally consistent face forgery detection model.

Background

In recent years, image/video generation and synthesis algorithms have proliferated, especially for full face-image generation, face swapping, facial attribute editing, and expression/pose reenactment; the resulting forgeries are highly realistic, and deepfake face images/videos are difficult to screen effectively by eye or with traditional techniques. Thanks to the construction of forged-face datasets such as FaceForensics++, Celeb-DF, DFDC, and FFIW, various deep-learning-based face forgery detection methods (known as Deepfake Detection or Face Forgery Detection) perform remarkably within a single dataset, but because forgery traces differ across datasets, the generalization of detection models remains poor. Recent research therefore focuses on improving the generalization of detection algorithms: one class analyzes the forgery cues implied by synthesized images, mainly from the angles of color distortion, artifacts, GAN fingerprints, high-frequency information, and the like, and designs corresponding deep-learning algorithms; the other class directly extracts discriminative spatial features from forged-face image datasets using deep-learning strategies such as attention mechanisms, contrastive learning, and self-supervision. However, the spatial forgery signals of realistic forged face images are subtle, and forgery patterns differ greatly between algorithms, which severely restricts the generalization of detection models.
In practice, most forgery algorithms (DeepFakes, FaceSwap, StyleGAN, etc.) only support frame-by-frame face editing or generation and can hardly avoid inter-frame appearance differences in texture, illumination, and so on. In particular, the imaging mechanisms of real and forged face videos differ, and generating a forged face video that is temporally consistent is extremely difficult. Moreover, humans typically discriminate tampered faces from both the spatial and the temporal dimension. Therefore, mining the temporal feature differences between forged-video frames is key to improving the generalization of detection models. At present, only a few methods are designed from the perspective of face-video temporality; the main ones are as follows. 1) Deepfake Video Detection with Spatiotemporal Dropout Transformer (ACM MM, 2022). Its main technique is to randomly sample image blocks from all frames of a face video, feed them to a multi-layer self-attention Transformer, and use the feature vector of the class token as the decision for the input video. Its drawback is that the random sampling strategy at the input cannot fully exploit the complete information of the video; the unsampled image blocks may carry key features, which restricts the generalization of the model. 2) Exploring Temporal Coherence for More General Video Face Forgery Detection (ICCV, 2021). Its main technique is, based on ResNet-3D, to set the spatial convolution kernel size of the 3D convolutional network to 1 while keeping the temporal kernel size unchanged, thereby constraining the network to attend to the temporal feature differences of face videos.
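The kernel configuration in method 2) can be illustrated with a minimal NumPy sketch (a hypothetical helper, not the cited paper's code): a 3-D convolution with spatial kernel size 1 reduces to a purely temporal filter at each pixel, so the layer responds only to inter-frame change, never to spatial texture.

```python
import numpy as np

def temporal_conv(video, kernel):
    """video: (T, H, W) single-channel clip; kernel: (k,) 1-D temporal filter.
    Equivalent to a 3-D convolution with kernel size (k, 1, 1): each spatial
    location is filtered only along the time axis, so the output depends
    solely on how that pixel changes between frames."""
    T, H, W = video.shape
    k = len(kernel)
    out = np.zeros((T - k + 1, H, W))
    for t in range(T - k + 1):
        for j in range(k):
            out[t] += kernel[j] * video[t + j]
    return out

# With a temporal-difference kernel, a perfectly static clip yields zero
# response everywhere, while inter-frame flicker survives the filter.
static = np.ones((4, 2, 2))
diff = temporal_conv(static, np.array([-1.0, 1.0]))   # all zeros
```

In a framework like PyTorch the same restriction would be expressed by giving a 3-D convolution a kernel size of (k, 1, 1) instead of the usual (k, 3, 3).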
In practice, however, this method only extracts temporal features at the same spatial position across frames and does not account for the inter-frame semantic feature offset caused by face motion. The temporal features it extracts therefore mix per-position semantic-offset features with temporal forgery features, and since the facial offset features are more salient than the temporal forgery features, the extraction and discrimination of the latter are severely disturbed and model performance is limited. 3) Spatiotemporal Inconsistency Learning for DeepFake Video Detection (ACM MM, 2021). Its main technique is to use two attention modules, spatial inconsistency and temporal inconsistency, to extract the spatial and temporal forgery features of forged face videos. This method does not consider the influence of the temporal characteri