CN-121982758-A - Deep fake face image detection system and method
Abstract
The application provides a system and method for detecting deep fake face images, relates to the field of artificial intelligence security, and aims to solve the problems of poor generalization capability, single detection dimension, and insufficient robustness in the prior art. The method comprises the steps of: generating a three-dimensional geometric feature map from a face image to be detected; performing feature extraction and cross-modal fusion on the original image and the three-dimensional geometric feature map to generate a multi-modal fusion feature; carrying out frequency domain and spatial domain analysis in parallel to generate a frequency domain-spatial domain joint feature; inputting the two fused features into a multi-engine decision module comprising a plurality of detection engines; and determining a final detection result from the output of each engine through a decision fusion unit. The method combines multi-dimensional information with multi-engine decision-making, effectively improves the accuracy, generalization capability, and robustness of detection, and offers interpretability.
Inventors
- Gan Maozhao
Assignees
- 深圳艾钜思科技有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20260114
Claims (10)
- 1. A method for detecting a deep fake face image, comprising: generating a three-dimensional geometric feature map corresponding to a face image to be detected; adopting a first feature extraction module and a second feature extraction module to respectively perform feature extraction on the face image to be detected and the three-dimensional geometric feature map so as to obtain a first modal feature and a second modal feature, and performing interactive fusion of the first modal feature and the second modal feature through a cross-modal fusion module so as to generate a multi-modal fusion feature; carrying out frequency domain analysis and spatial domain analysis on the face image to be detected in parallel, and fusing the analysis results to generate a frequency domain-spatial domain joint feature; and inputting the multi-modal fusion feature and the frequency domain-spatial domain joint feature into a multi-engine decision module, and determining a final detection result through a decision fusion unit according to the output results of a plurality of detection engines based on different detection principles in the multi-engine decision module.
- 2. The method of claim 1, wherein the three-dimensional geometric feature map comprises at least one of a depth map, a surface normal map, and a simulated structured light projection pattern generated from a 3D face mesh reconstructed from the face image to be detected.
- 3. The method of claim 1, wherein the first feature extraction module and the second feature extraction module are both Swin Transformer-based encoders.
- 4. The method of claim 1, wherein the frequency domain analysis comprises at least one of a multi-band discrete cosine transform analysis, a phase consistency analysis, and a multi-scale wavelet decomposition, and wherein the spatial domain analysis comprises edge feature extraction using a Sobel operator or a Canny operator.
- 5. The method of claim 1, wherein the plurality of detection engines based on different detection principles comprises at least two of: a neural fingerprint recognition engine based on metric learning, a manifold anomaly detection engine based on manifold learning, a detection engine based on frequency domain feature classification, and a verification engine based on causal reasoning.
- 6. The method of claim 1, wherein the cross-modal fusion module employs a cross-modal attention mechanism to enable bi-directional feature interaction between the first modal feature and the second modal feature.
- 7. The method of claim 1 or 5, wherein the decision fusion unit uses a meta-learner to perform a two-level fusion of the output results of the plurality of detection engines together with the multi-modal fusion feature and the frequency domain-spatial domain joint feature to determine the final detection result.
- 8. The method according to claim 1, further comprising: after the model is deployed, adopting an online incremental learning mechanism and utilizing newly collected fake samples to continuously update the first feature extraction module, the second feature extraction module, the cross-modal fusion module, and the multi-engine decision module.
- 9. The method according to claim 1, further comprising: generating an interpretability report from the detection process, wherein the interpretability report comprises at least one of a pixel-level anomaly heatmap, a region-level anomaly analysis of face semantic regions, an analysis of the feature dimensions contributing most to the detection result, and a multi-engine consistency analysis.
- 10. A deep fake face image detection system, comprising: a three-dimensional geometric feature map generation module for generating a three-dimensional geometric feature map corresponding to a face image to be detected; a first feature extraction module for extracting features of the face image to be detected so as to obtain a first modal feature; a second feature extraction module for performing feature extraction on the three-dimensional geometric feature map so as to obtain a second modal feature; a cross-modal fusion module for performing interactive fusion of the first modal feature and the second modal feature so as to generate a multi-modal fusion feature; a frequency domain-spatial domain joint feature generation module for carrying out frequency domain analysis and spatial domain analysis on the face image to be detected in parallel and fusing the analysis results to generate a frequency domain-spatial domain joint feature; a multi-engine decision module comprising a plurality of detection engines based on different detection principles; and a decision fusion unit for determining a final detection result according to the output results of the plurality of detection engines in the multi-engine decision module.
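As an illustrative sketch (not part of the claims), the surface normal map named in claim 2 can be derived from a reconstructed depth map by finite differences; the function name and the toy ramp input below are hypothetical, and a real system would use normals from the reconstructed 3D face mesh:

```python
import numpy as np

def depth_to_normals(depth):
    """Surface normal map from a depth map via finite differences.

    For each pixel, the normal of the surface z = depth(y, x) is
    (-dz/dx, -dz/dy, 1), normalised to unit length.
    """
    dzdy, dzdx = np.gradient(depth)          # gradients along rows (y) then columns (x)
    normals = np.dstack([-dzdx, -dzdy, np.ones_like(depth)])
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)
    return normals

# Toy example: a planar ramp rising along y gives a constant tilted normal.
ramp = np.outer(np.arange(8.0), np.ones(8))
n = depth_to_normals(ramp)
```

For the ramp, every normal is (0, -1, 1)/sqrt(2), so the map is constant, as expected for a plane.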
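The frequency domain-spatial domain joint feature of claim 4 can be sketched as follows. This is a minimal assumption-laden stand-in: radial FFT band energies replace the claimed multi-band DCT / phase consistency / wavelet analyses, the Sobel operator is implemented with a plain loop, and fusion is simple concatenation; all names are hypothetical:

```python
import numpy as np

def sobel_edges(img):
    """Spatial-domain analysis: gradient magnitude via Sobel kernels."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    ky = kx.T
    pad = np.pad(img, 1, mode="edge")
    h, w = img.shape
    gx, gy = np.zeros((h, w)), np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = pad[i:i + 3, j:j + 3]
            gx[i, j] = (patch * kx).sum()
            gy[i, j] = (patch * ky).sum()
    return np.hypot(gx, gy)

def band_energies(img, n_bands=4):
    """Frequency-domain analysis: energy in concentric radial bands of the FFT magnitude."""
    f = np.abs(np.fft.fftshift(np.fft.fft2(img)))
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h / 2, xx - w / 2)
    r_max = r.max()
    return np.array([f[(r >= r_max * b / n_bands) & (r < r_max * (b + 1) / n_bands)].sum()
                     for b in range(n_bands)])

def joint_feature(img):
    edges = sobel_edges(img)
    spatial = np.array([edges.mean(), edges.std()])   # crude spatial summary
    freq = band_energies(img)
    freq = freq / (freq.sum() + 1e-8)                 # normalised band energies
    return np.concatenate([freq, spatial])            # concatenation as the fusion step

rng = np.random.default_rng(0)
feat = joint_feature(rng.random((32, 32)))
```

Upsampled or GAN-generated images tend to show atypical energy in the high-frequency bands, which is the kind of cue the joint feature is meant to expose.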
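The bi-directional cross-modal attention of claim 6 can be sketched as below. This is a simplified single-head version without the learned query/key/value projections a trained module would have; token counts, dimensions, and names are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query_feats, context_feats, d_k):
    """One direction of cross-modal attention: queries from one modality
    attend over keys/values taken directly from the other modality."""
    scores = query_feats @ context_feats.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ context_feats

def bidirectional_fusion(rgb_tokens, geom_tokens):
    """Each stream attends to the other, with a residual add, then the two
    pooled summaries are concatenated into one multi-modal fusion feature."""
    d = rgb_tokens.shape[1]
    rgb_out = rgb_tokens + cross_attend(rgb_tokens, geom_tokens, d)    # RGB -> geometry
    geom_out = geom_tokens + cross_attend(geom_tokens, rgb_tokens, d)  # geometry -> RGB
    return np.concatenate([rgb_out.mean(axis=0), geom_out.mean(axis=0)])

rng = np.random.default_rng(1)
fused = bidirectional_fusion(rng.random((5, 8)), rng.random((7, 8)))  # 5 RGB tokens, 7 geometry tokens
```

The two modalities may have different numbers of tokens; attention handles that naturally because the score matrix is rectangular.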
Description
Deep fake face image detection system and method
Technical Field
The application relates to the technical field of artificial intelligence security, in particular to a deep fake face image detection system and method for verifying the authenticity of digital content.
Background
In recent years, with the rapid development of deep learning techniques, deep forgery techniques typified by generative adversarial networks, diffusion models, and the like can generate images or videos that are almost visually indistinguishable from real faces. Such highly realistic forged content is abused to fabricate false news, commit financial fraud, damage personal reputations, and so on, posing a serious threat to social security and the public trust system. Existing deep-forgery detection techniques rely primarily on extracting features from a single two-dimensional image. For example, some methods use convolutional neural networks to extract local texture artifacts of an image, while others use vision transformers to capture global semantic features. In addition, there are methods that attempt to identify forgery marks by analyzing frequency domain features of an image, such as the discrete cosine transform or Fourier transform spectrum. Some advanced methods adopt a dual-stream network architecture, process the color-texture information stream and the frequency domain feature stream of the image in parallel, and then combine the two for classification, so as to detect forgery traces in both the spatial domain and the frequency domain. However, the prior art has common defects. First, the generalization capability is insufficient: prior methods often overfit to artifacts produced by a specific generation model, and their detection performance drops sharply when facing high-quality fake images generated by a brand-new architecture.
Secondly, the feature dimension is single: most methods depend only on the two-dimensional texture and color information of the image and ignore the geometric structural consistency of the face as a three-dimensional object; even if a fake image is seamless in two dimensions, its implied three-dimensional geometric structure may be unreasonable. Thirdly, the robustness is poor: existing detection models are susceptible to common image processing (e.g., compression, noise) and to interference from malicious adversarial attacks, resulting in significant reductions in detection accuracy. Finally, interpretability is lacking: most existing deep learning models are 'black boxes' that can only give a true-or-false judgment without providing the specific basis for that judgment, and thus struggle to meet the application requirements of scenarios demanding strong interpretability, such as judicial forensics and financial risk control.
Disclosure of Invention
The application aims to provide a deep fake face image detection system and method, so as to solve the technical problems in the prior art of poor generalization capability when facing a novel generation model, insufficient detection dimensions, insufficient robustness, and lack of interpretability caused by dependence on single-modal information.
In order to achieve the above object, the present application provides a deep fake face image detection method, which includes the following steps: generating a three-dimensional geometric feature map corresponding to a face image to be detected; adopting a first feature extraction module and a second feature extraction module to respectively perform feature extraction on the face image to be detected and the three-dimensional geometric feature map so as to obtain a first modal feature and a second modal feature, and performing interactive fusion of the first modal feature and the second modal feature through a cross-modal fusion module so as to generate a multi-modal fusion feature; carrying out frequency domain analysis and spatial domain analysis on the face image to be detected in parallel, and fusing the analysis results to generate a frequency domain-spatial domain joint feature; and inputting the multi-modal fusion feature and the frequency domain-spatial domain joint feature into a multi-engine decision module, and determining a final detection result through a decision fusion unit according to the output results of a plurality of detection engines based on different detection principles in the multi-engine decision module. Optionally, the three-dimensional geometric feature map includes at least one of a depth map, a surface normal map, and a simulated structured light projection pattern generated from a 3D face mesh reconstructed from the face image to be detected. Optionally, the first feature extraction module and the second feature extraction module are both Swin Transformer-based encoders. Optionally, the frequency domain analys