CN-121999330-A - Multi-modal detection method, system, device, and medium for AI-generated faces
Abstract
A multi-modal detection method, system, device, and medium for AI-generated faces, relating to the technical field of artificial intelligence. The method comprises: obtaining multi-modal data to be detected, the multi-modal data comprising at least visual data and text data; extracting visual features and text features respectively; fusing the visual features and the text features to obtain multi-modal features; training a preset detection model in combination with generated adversarial samples; and performing AI-generated face detection through the detection model and outputting a detection result. This scheme improves the accuracy and robustness of AI-generated face detection, addresses the insufficient generalization of static adversarial training and poor adaptation to multi-modal forgery scenarios, and meets the detection requirements of dynamic adversarial settings.
Inventors
- MAO XIUPING
- WANG YOUJIN
- GUAN JIYU
- CHEN SHUO
Assignees
- 苏州创旅天下信息技术有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20251224
Claims (10)
- 1. A multi-modal detection method for AI-generated faces, the method comprising: acquiring multi-modal data to be detected, wherein the multi-modal data comprises at least visual data and text data, and extracting visual features and text features respectively; fusing the visual features and the text features to obtain multi-modal features, and training a preset detection model in combination with generated adversarial samples; and performing AI-generated face detection through the detection model and outputting a detection result.
- 2. The multi-modal detection method for AI-generated faces of claim 1, wherein the visual data includes: first face data obtained from a public face dataset or a self-built database of real faces; and second face data obtained by locally editing the first face data, the local editing comprising at least hairstyle changing, expression adjustment, and face swapping; and the text data includes: first descriptive text that matches the corresponding visual data; and second descriptive text that contradicts the corresponding visual data, the contradiction comprising facial features in the descriptive text that are inconsistent with the actual facial features in the visual data.
- 3. The multi-modal detection method for AI-generated faces of claim 1, wherein the visual feature extraction step includes: extracting spatial-domain features and frequency-domain features of the visual data with a preset visual model, and computing a spatial-domain difference score and a frequency-domain difference score respectively; and performing a weighted summation of the spatial-domain difference score and the frequency-domain difference score according to preset weights to obtain a visual forgery probability, the visual forgery probability representing the visual features.
- 4. The multi-modal detection method for AI-generated faces of claim 3, wherein the text feature extraction step includes: performing semantic analysis on the text data with a preset language model, and obtaining semantic feature vectors of the text data as the text features; and the step of fusing the visual features and the text features includes: computing an association weight between the visual features and the text features, obtaining weighted visual and text feature vectors based on the association weight, and obtaining the multi-modal features.
- 5. The multi-modal detection method for AI-generated faces of claim 1, wherein the adversarial sample generation step includes: generating adversarial samples according to a preset period; and optimizing the adversarial sample generation strategy with a reinforcement learning agent, wherein the reward function of the reinforcement learning agent comprises: if an adversarial sample input to the current-stage detection model causes the model to output an erroneous result, giving the reinforcement learning agent a positive reward R1; if the perturbation amplitude of the adversarial sample is larger than a preset visual-imperceptibility threshold, giving the reinforcement learning agent a negative reward R2; and if the attack region of the adversarial sample is larger than the preset facial key-point range, giving the reinforcement learning agent a negative reward R3.
- 6. The multi-modal detection method for AI-generated faces of claim 1, wherein the training step of the preset detection model includes: inputting the multi-modal features to the detection model, outputting a classification result, and calculating a classification loss between the classification result and the label of the visual data; and updating the parameters of the detection model based on the classification loss.
- 7. The multi-modal detection method for AI-generated faces of claim 6, wherein the training step of the preset detection model further includes: constructing a joint loss function as a weighted summation of the classification loss and the adversarial loss, calculated as: loss = α × L_cls + β × L_adv; where loss is the joint loss function, L_cls is the classification loss, L_adv is the adversarial loss, and α and β are the weight coefficients of the classification loss and the adversarial loss respectively; and updating the parameters of the detection model based on the joint loss function, training until the detection accuracy of the detection model is greater than or equal to a preset accuracy threshold.
- 8. A multi-modal detection system for AI-generated faces, the system comprising: a feature extraction module for acquiring multi-modal data to be detected and extracting visual features and text features respectively; a feature fusion module for computing the association weight between the visual features and the text features, obtaining weighted feature vectors based on the association weight, fusing the feature vectors, and outputting multi-modal features; a model training module for inputting the multi-modal features into a preset detection model, outputting a classification result, calculating the loss, and updating the model parameters based on the loss; and an output module for inputting the multi-modal features of a target to be detected into the trained detection model, performing AI-generated face detection, and outputting a detection result comprising the classification result and a confidence score.
- 9. An electronic device comprising a processor, a memory, a user interface, and a network interface, the memory storing instructions, the user interface and the network interface communicating with other devices, and the processor executing the instructions stored in the memory to cause the electronic device to perform the method of any of claims 1-7.
- 10. A computer-readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the method of any of claims 1-7.
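The two quantitative pieces of the claims above, the claim-5 reward for the reinforcement-learning agent and the claim-7 joint loss, can be sketched in a few lines. This is a minimal illustration: the threshold values, reward magnitudes, and default weight coefficients below are assumptions for demonstration, not values specified in the patent.

```python
def adversarial_reward(model_fooled, perturbation_norm, attack_area,
                       visibility_threshold=8 / 255, keypoint_area=0.2,
                       r1=1.0, r2=0.5, r3=0.5):
    """Claim-5 reward: R1 for fooling the current-stage detector,
    minus R2 if the perturbation exceeds the visual-imperceptibility
    threshold, minus R3 if the attack region exceeds the facial
    key-point range. All numeric defaults are illustrative."""
    reward = 0.0
    if model_fooled:                              # R1: detector output wrong
        reward += r1
    if perturbation_norm > visibility_threshold:  # R2: perturbation visible
        reward -= r2
    if attack_area > keypoint_area:               # R3: outside key-point range
        reward -= r3
    return reward


def joint_loss(l_cls, l_adv, alpha=0.5, beta=0.5):
    """Claim-7 joint loss: loss = alpha * L_cls + beta * L_adv,
    with alpha and beta as tunable weight coefficients."""
    return alpha * l_cls + beta * l_adv
```

An agent maximizing this reward is pushed toward perturbations that fool the detector while staying imperceptible and confined to facial key points, which is exactly the trade-off the three reward terms encode.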
Description
Multi-modal detection method, system, device, and medium for AI-generated faces
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a multi-modal detection method, system, device, and medium for AI-generated faces.
Background
With the rapid development of image generation models, face forgery technology has gradually matured: deeply forged face photos can be generated or edited using AI generation models such as Generative Adversarial Networks (GANs) and Stable Diffusion. This technology has severe negative impacts in application and poses serious security risks and challenges to existing identity verification systems, so research on AI-generated face detection methods has become a necessary step in improving the protective capability of such systems. Current AI-generated face detection is mainly implemented in three ways: a single-modality analysis method, specifically a frequency-domain analysis method based on Convolutional Neural Networks (CNNs), which detects AI-generated faces by analyzing the frequency-domain features of images; a static adversarial training method, which generates adversarial samples in advance with the Fast Gradient Sign Method (FGSM) and trains the detection model on them to improve its ability to detect those specific adversarial samples; and a model-tracing method, which judges whether an image is AI-generated from fingerprint information such as the specific noise patterns present in GAN-generated images.
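The two background techniques that are named concretely, FGSM perturbation and frequency-domain analysis, can be sketched as follows. This is an illustrative sketch only: the epsilon value and the quarter-point band split are assumptions, not parameters from the patent, and `grad` is assumed to be the loss gradient with respect to the input image.

```python
import numpy as np

def fgsm_perturb(x, grad, eps=8 / 255):
    """Fast Gradient Sign Method (FGSM): step the input image in the
    sign of the loss gradient to craft an adversarial sample, then
    clip back to the valid pixel range [0, 1]."""
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

def high_freq_energy_ratio(image):
    """Frequency-domain cue of the kind CNN-based detectors analyze:
    the share of spectral energy outside the central low-frequency
    band. The band boundary at the quarter points is an illustrative
    choice."""
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(image)))
    h, w = spectrum.shape
    low = spectrum[h // 4: 3 * h // 4, w // 4: 3 * w // 4].sum()
    total = spectrum.sum()
    return float((total - low) / (total + 1e-8))
```

A static adversarial training pipeline of the kind criticized in the next paragraph would simply mix `fgsm_perturb` outputs into the training set once, which is why its coverage is fixed in advance.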
However, the prior art has obvious shortcomings. A single-modality analysis method cannot identify cross-modal contradictions and fails on edited images: if a real face is locally tampered with, for example by a hairstyle change or an expression adjustment, the frequency-domain features the method relies on may be destroyed, causing misjudgment. A static adversarial training method generalizes insufficiently and cannot adapt to dynamic attacks, since the coverage of adversarial samples generated in advance is limited, making diverse attack scenarios hard to handle. A model-tracing method has limited applicability: it fails on edited images and cannot handle novel AI generation models. Together these shortcomings cause a marked drop in the detection accuracy and robustness of existing methods in dynamic adversarial environments and complex multi-modal forgery scenarios.
Disclosure of Invention
The application provides a multi-modal detection method, system, device, and medium for AI-generated faces, which address the insufficient accuracy and robustness of existing AI-generated face detection methods. In a first aspect, the application provides a multi-modal detection method for AI-generated faces, the method comprising: obtaining multi-modal data to be detected, wherein the multi-modal data comprises at least visual data and text data, and extracting visual features and text features respectively; fusing the visual features and the text features to obtain multi-modal features, and training a preset detection model in combination with generated adversarial samples; and performing AI-generated face detection through the detection model and outputting a detection result.
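The fusion step of the method above can be sketched minimally. The patent does not specify the association measure, so cosine similarity is used here as a stand-in, and the visual and text feature vectors are assumed to share the same dimensionality (e.g. after projection); both are illustrative assumptions.

```python
import numpy as np

def fuse_features(visual_vec, text_vec):
    """Sketch of the fusion step: compute an association weight
    between the visual and text feature vectors, weight each
    modality by it, and concatenate into one multi-modal vector.
    Assumes equal-length inputs; cosine similarity stands in for
    the unspecified association measure."""
    v = np.asarray(visual_vec, dtype=float)
    t = np.asarray(text_vec, dtype=float)
    # Association weight: cosine similarity mapped from [-1, 1] to [0, 1].
    cos = v @ t / (np.linalg.norm(v) * np.linalg.norm(t) + 1e-8)
    alpha = (cos + 1.0) / 2.0
    # Weight each modality and concatenate into the multi-modal feature.
    return np.concatenate([alpha * v, (1.0 - alpha) * t])
```

Under this weighting, well-matched image/text pairs emphasize the visual features, while contradictory pairs (the claim-2 "second descriptive text") shift weight toward the text features, exposing the cross-modal inconsistency to the downstream classifier.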
By adopting this technical scheme, the accuracy and robustness of AI-generated face detection are improved, the problems of insufficient generalization of static adversarial training and poor adaptation to multi-modal forgery scenarios are solved, and the detection requirements of dynamic adversarial settings are met. In a specific embodiment, the visual data comprises: first face data obtained from a public face dataset or a self-built database of real faces; and second face data obtained by locally editing the first face data, the local editing comprising at least hairstyle changing, expression adjustment, and face swapping. The text data includes: first descriptive text that matches the corresponding visual data; and second descriptive text that contradicts the corresponding visual data, the contradiction including facial features in the descriptive text that are inconsistent with the actual facial features in the visual data. By adopting this technical scheme, detection sensitivity to locally tampered faces is improved, the gap in cross-modal contradiction recognition is filled, a more comprehensive detection data system is constructed, and a real,