US-12626532-B2 - Method for synthetic video/image detection
Abstract
A method of identifying synthetic media can include identifying a facial image in video or images, extracting a first set of features from the facial image, extracting a second set of features from the facial image, wherein the first set of features are different than the second set of features, inputting the first set of features into a first prediction model, generating a first output indicative of a nature of the facial image, inputting the second set of features into a second prediction model, generating a second output indicative of the nature of the facial image, and determining the nature of the facial image using the first output and the second output.
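The abstract describes a two-branch detection flow: two different feature sets are extracted from the same facial image, each set is fed to a different prediction model, and the two outputs are fused into a single decision. The following minimal Python sketch illustrates that flow under stated assumptions; the model objects, the feature-extraction callables, and the handling of equal confidence scores are illustrative placeholders, not the patented implementation.

```python
# Minimal sketch (not the patented implementation) of the two-branch decision
# flow described in the abstract. The model objects, feature extractors, and
# the tie-breaking behavior for equal confidences are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Prediction:
    label: str         # "real" or "synthetic"
    confidence: float  # confidence score in [0, 1]


def classify_face(face_image, cnn_model, ml_model,
                  extract_cnn_features, extract_texture_features) -> str:
    # Branch 1: first feature set -> first prediction model (e.g., a CNN).
    first = cnn_model.predict(extract_cnn_features(face_image))
    # Branch 2: a second, different feature set -> second prediction model.
    second = ml_model.predict(extract_texture_features(face_image))

    if first.label == second.label:
        # Both outputs agree on the nature of the facial image.
        return first.label
    # Outputs disagree: keep the label backed by the higher confidence score.
    # Claim 1 states the "real" case of this rule explicitly; treating the
    # "synthetic" case symmetrically is an assumption of this sketch.
    return first.label if first.confidence >= second.confidence else second.label
```

When the two outputs disagree, the label with the higher confidence score wins, mirroring the confidence comparison spelled out in claims 1 and 10.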
Inventors
- Saraju P. Mohanty
- Elias Kougianos
- Alakananda Mitra
Assignees
- UNIVERSITY OF NORTH TEXAS
Dates
- Publication Date: 2026-05-12
- Application Date: 2023-10-11
Claims (16)
- 1 . A method of identifying synthetic media, the method comprising: identifying a facial image in video or images; extracting a first set of features from the facial image; extracting a second set of features from the facial image, wherein the first set of features are different than the second set of features; inputting the first set of features into a first prediction model; generating a first output indicative of a nature of the facial image; inputting the second set of features into a second prediction model, wherein the first prediction model is a different type of model from the second prediction model; generating a second output indicative of the nature of the facial image; and determining the nature of the facial image using the first output and the second output, wherein determining the nature of the facial image comprises determining that the facial image is real in response to one of the first output or the second output indicating that the facial image is real and one of the first output or the second output indicating that the facial image is synthetic, wherein the one of the first output or the second output indicating that the facial image is real has a higher confidence score than the one of the first output or the second output indicating that the facial image is synthetic.
- 2 . The method of claim 1 , further comprising: extracting the facial image from the video or images.
- 3 . The method of claim 2 , wherein the facial image is extracted from a video, and wherein extracting the facial image comprises: extracting key frames from the video; detecting one or more frames containing the facial image; and extracting the facial images from the one or more frames.
- 4 . The method of claim 3 , further comprising: normalizing the extracted facial images prior to extracting the first set of features or the second set of features.
- 5 . The method of claim 1 , wherein the first prediction model comprises a convolutional neural network (CNN), and wherein the second prediction model comprises a machine learning (ML) model.
- 6 . The method of claim 1 , wherein the first set of features comprise textural features of the facial image, and wherein the second set of features comprise global textural features of a grayscale version of the facial image.
- 7 . The method of claim 6 , wherein the global textural features comprise at least one of contrast, homogeneity, correlation, energy, or dissimilarity.
- 8 . The method of claim 1 , wherein determining the nature of the facial image comprises: determining that the facial image is real in response to the first output indicating that the facial image is real and the second output indicating that the facial image is real.
- 9 . The method of claim 1 , wherein determining the nature of the facial image comprises: determining that the facial image is synthetic in response to the first output indicating that the facial image is synthetic and the second output indicating that the facial image is synthetic.
- 10 . A system of identifying synthetic media, the system comprising: a memory storing an analysis application; and a processor, wherein the analysis application, when executed on the processor, configures the processor to: identify a facial image in video or images; extract the facial image from the video or images; extract a first set of features from the facial image; extract a second set of features from the facial image, wherein the first set of features are different than the second set of features; input the first set of features into a first prediction model; generate a first output indicative of a nature of the facial image; input the second set of features into a second prediction model, wherein the first prediction model is a different type of model from the second prediction model; generate a second output indicative of the nature of the facial image; and determine the nature of the facial image using the first output and the second output, wherein the analysis application is configured to determine that the facial image is real in response to one of the first output or the second output indicating that the facial image is real and one of the first output or the second output indicating that the facial image is synthetic, wherein the one of the first output or the second output indicating that the facial image is real has a higher confidence score than the one of the first output or the second output indicating that the facial image is synthetic.
- 11 . The system of claim 10 , wherein the facial image is extracted from a video, and wherein the system is configured to: extract key frames from the video; detect one or more frames containing the facial image; and extract the facial images from the one or more frames.
- 12 . The system of claim 10 , wherein the first prediction model comprises a convolutional neural network (CNN), and wherein the second prediction model comprises a machine learning (ML) model.
- 13 . The system of claim 10 , wherein the first set of features comprise textural features of the facial image, and wherein the second set of features comprise global textural features of a grayscale version of the facial image.
- 14 . The system of claim 13 , wherein the global textural features comprise at least one of contrast, homogeneity, correlation, energy, or dissimilarity.
- 15 . The system of claim 10 , wherein the analysis application further configures the processor to: determine that the facial image is real in response to the first output indicating that the facial image is real and the second output indicating that the facial image is real.
- 16 . The system of claim 10 , wherein the analysis application further configures the processor to: determine that the facial image is synthetic in response to the first output indicating that the facial image is synthetic and the second output indicating that the facial image is synthetic.
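Claims 2-4 (and claim 11) describe the pre-processing chain: extract key frames from the video, detect the frames that contain a face, crop the facial images, and normalize them before feature extraction. The sketch below shows one plausible way to implement that chain with OpenCV; the fixed frame-sampling step, the Haar-cascade detector, and the 224x224 normalization size are assumptions for illustration, not details taken from the patent.

```python
# Illustrative pre-processing sketch for claims 2-4: key-frame extraction,
# face detection, cropping, and normalization. OpenCV is used purely as an
# example toolchain; frame_step and the target size are assumed values.
import cv2


def extract_normalized_faces(video_path, frame_step=30, size=(224, 224)):
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    capture = cv2.VideoCapture(video_path)
    faces, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % frame_step == 0:  # treat every Nth frame as a key frame
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            for (x, y, w, h) in detector.detectMultiScale(gray, 1.1, 5):
                crop = frame[y:y + h, x:x + w]
                # Normalization: fixed spatial size and [0, 1] pixel range.
                faces.append(cv2.resize(crop, size) / 255.0)
        index += 1
    capture.release()
    return faces
```

Claims 6-7 (and claims 13-14) identify the second feature set as global textural features (contrast, homogeneity, correlation, energy, dissimilarity) of a grayscale version of the facial image. These correspond to the standard gray-level co-occurrence matrix (GLCM) properties, which can be computed, for example, with scikit-image; the distances and angles chosen below are assumptions of this sketch.

```python
# Sketch of the second feature set in claims 6-7: global GLCM texture
# properties of the grayscale facial image, flattened into a feature vector
# suitable for a classical ML classifier (the second model in claim 5).
import numpy as np
from skimage.feature import graycomatrix, graycoprops

GLCM_PROPERTIES = ("contrast", "homogeneity", "correlation",
                   "energy", "dissimilarity")


def global_texture_features(gray_face: np.ndarray) -> np.ndarray:
    """gray_face: 2-D uint8 array holding the grayscale facial image."""
    glcm = graycomatrix(gray_face,
                        distances=[1],
                        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                        levels=256, symmetric=True, normed=True)
    # One value per (property, distance, angle), concatenated into one vector.
    return np.hstack([graycoprops(glcm, p).ravel() for p in GLCM_PROPERTIES])
```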
Description
CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 63/379,248, filed on Oct. 12, 2022, and entitled "METHOD FOR SYNTHETIC VIDEO/IMAGE DETECTION," and U.S. Provisional Application No. 63/382,034, filed on Nov. 2, 2022, and entitled "METHOD FOR SYNTHETIC VIDEO/IMAGE DETECTION," both of which are incorporated herein by reference in their entirety for all purposes.

STATEMENT REGARDING FEDERALLY FUNDED RESEARCH OR DEVELOPMENT

None.

BACKGROUND

Deepfake videos can be problematic and difficult to detect, and there is a general need for a mechanism to detect them. Deepfakes are a new type of threat that falls under the larger and more widespread umbrella of synthetic media. Deepfakes use a form of artificial intelligence and machine learning (AI/ML) to create videos, pictures, audio, and text of events that never happened. These deepfakes look, sound, and feel real. While some uses of synthetic media are purely for amusement, others carry a degree of risk. Because of people's inherent tendency to trust what they see, deepfakes and synthetic media can be effective tools for disseminating misinformation.

BRIEF DESCRIPTION OF THE DRAWINGS

The drawings aid in the explanation and understanding of the invention. Since it is not usually possible to illustrate every possible embodiment, the drawings depict only example embodiments and are not intended to limit the scope of the invention. Other embodiments may fall within the scope of the disclosure and claims.

FIG. 1 illustrates a schematic representation of a synthetic media detection system according to some embodiments.
FIG. 2 illustrates a schematic framework of a synthetic media detection system according to some embodiments.
FIG. 3 illustrates a schematic flowsheet of the modules in a synthetic media detection system according to some embodiments.
FIG. 4 illustrates a schematic representation of the dynamic link libraries for a synthetic media detection system according to some embodiments.
FIG. 5 illustrates the initial training procedure for the CNN model according to some embodiments.
FIG. 6 illustrates the initial training procedure for the ML model according to some embodiments.
FIG. 7 illustrates the partial training procedure for the CNN model according to some embodiments.
FIG. 8 illustrates the partial training procedure for the ML model according to some embodiments.
FIG. 9 illustrates the process to check for duplicate training videos and images used in the partial training processes according to some embodiments.
FIG. 10 illustrates a flowchart of the synthetic media detection method according to some embodiments.
FIG. 11 illustrates the results of the generation of key video frames for 20 seconds of video according to some embodiments.
FIG. 12 illustrates another flowchart of the synthetic media detection method according to some embodiments.
FIG. 13 illustrates a detection API workflow according to some embodiments.

DETAILED DESCRIPTION

It should be understood at the outset that although an illustrative implementation of one or more embodiments is provided below, the disclosed systems and/or methods may be implemented using any number of techniques, whether currently known or in existence.
The disclosure should in no way be limited to the illustrative implementations, drawings, and techniques below, including the exemplary designs and implementations illustrated and described herein, but may be modified within the scope of the appended claims along with their full scope of equivalents. As used herein, the term "and/or" can mean one, some, or all elements depicted in a list. As an example, "A and/or B" can mean A, B, or a combination of A and B. Likewise, the use of a "slash" between two elements, such as A/B, can mean A or B.

The term "deepfake" was coined from the words "deep learning" and "fake." Deepfake images/videos use AI/deep learning technology to alter a person's face, emotion, or speech to that of someone else. These deepfake images/videos/audios/texts are designed to be indistinguishable from real ones. Cloud computing, publicly available AI research algorithms, and copious data have created a perfect storm that enables the democratization of deepfakes and their large-scale distribution via social media platforms. Deepfakes and the inappropriate use of synthetic content pose a threat to the public that is undeniable, ongoing, and ever evolving in the areas of national security, law enforcement, the financial sector, and society. Deepfake technology can distort reality to an unbelievable degree and disrupt the truth. Deepfakes threaten individuals, businesses, society, and democracy, and they erode confidence in the media. This erosion of trust can foster factual relativism, unraveling democracy and civil society. Deepfakes can also help the least democratic and authoritarian governments prosper through the "liar's dividend," whereby unpalatable truths are swiftly rejected as fake.