
CN-121564810-B - AI generation video detection method, device, equipment and storage medium

CN121564810B

Abstract

The application discloses an AI-generated video detection method, device, equipment, and storage medium, which relate to the field of artificial intelligence and are applied to a terminal device loaded with a target video detection model. The method comprises: determining target pixel features corresponding to target video frames by utilizing a feature extraction layer of the target video detection model; inputting the target pixel features corresponding to each target video frame into a frame prediction layer of the target video detection model to determine a target prediction probability corresponding to the target video frame; determining a target confidence corresponding to the video to be detected by utilizing a decision fusion layer of the target video detection model, based on the target prediction probability corresponding to each target video frame in the video to be detected; and determining the video to be detected to be an AI-generated video when the target confidence is greater than a preset confidence threshold, so as to output a corresponding target detection result. The application realizes efficient detection of AI-generated video.
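The abstract describes a four-stage pipeline: frame extraction at a preset sampling frequency, per-frame feature extraction, per-frame prediction, and decision fusion against a confidence threshold. A minimal sketch of that flow, assuming mean-probability fusion (the source does not specify the fusion rule) and a caller-supplied per-frame classifier; all names and defaults here are illustrative:

```python
import numpy as np

def detect_ai_video(frames, frame_model, sample_every=5, conf_threshold=0.5):
    # Frame extraction layer: keep every `sample_every`-th frame
    # (a stand-in for the patent's preset sampling frequency).
    sampled = frames[::sample_every]
    # Frame prediction layer: probability that each sampled frame is AI-generated.
    probs = [frame_model(f) for f in sampled]
    # Decision fusion layer: here, a simple mean of per-frame probabilities
    # (an assumption; the source only says the layer yields a confidence).
    confidence = float(np.mean(probs))
    verdict = "AI-generated" if confidence > conf_threshold else "real"
    return verdict, confidence
```

With a dummy per-frame model that always reports 0.9, the fused confidence exceeds the default threshold and the video is flagged as AI-generated.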

Inventors

  • JU WENCONG

Assignees

  • 北京云一科技有限公司
  • 东营职业学院

Dates

Publication Date
2026-05-05
Application Date
2026-01-23

Claims (10)

  1. An AI-generated video detection method, applied to a terminal device carrying a target video detection model, comprising: acquiring a video to be detected, and processing the video to be detected based on a preset sampling frequency by utilizing a frame extraction layer of the target video detection model, so as to acquire target video frames; determining target pixel differences between adjacent pixels in each target video frame by utilizing a feature extraction layer of the target video detection model, so as to determine target pixel features corresponding to the target video frame based on the target pixel differences; inputting the target pixel features corresponding to each target video frame into a frame prediction layer of the target video detection model, and analyzing the target pixel features of the target video frame by utilizing a target neural network of the frame prediction layer to determine a target prediction probability corresponding to the target video frame, wherein the target prediction probability is the probability that the target video frame is an AI-generated image; determining a target confidence corresponding to the video to be detected, based on the target prediction probability corresponding to each target video frame in the video to be detected, by utilizing a decision fusion layer of the target video detection model, and determining the video to be detected to be an AI-generated video when the target confidence is greater than a preset confidence threshold, so as to output a corresponding target detection result; wherein the process of determining the target pixel differences between adjacent pixels in the target video frame by utilizing the feature extraction layer, so as to determine the target pixel features corresponding to the target video frame, includes: starting from the structure of the generator near the input end, which comprises an up-sampling layer and a convolution layer, processing the generator's input feature map through the up-sampling layer and the convolution layer to obtain an up-sampled feature map for each target video frame of the video to be detected, where the width, height, and number of channels of the frame image define the shape of the up-sampled feature map [the formula and its symbols are not reproduced in the source text]; dividing the feature map of each frame into a set of grids of a fixed size; calculating, for each grid, the adjacent pixel relationship among its elements [the formula is not reproduced in the source text]; and concatenating the adjacent pixel relationships of all grids in each frame image to construct the adjacent pixel relationship features of that frame image.
  2. The AI-generated video detection method of claim 1, wherein determining, with the feature extraction layer of the target video detection model, target pixel differences between adjacent pixels in the target video frame comprises: determining a feature map corresponding to the target video frame by utilizing the feature extraction layer of the target video detection model, and determining each target grid region corresponding to the feature map; and processing adjacent pixels in each of the target grid regions to determine the target pixel differences between adjacent pixels in the target video frame.
  3. The AI-generated video detection method of claim 1, wherein analyzing, with the target neural network of the frame prediction layer, the target pixel features of the target video frame to determine the target prediction probability corresponding to the target video frame comprises: analyzing the target pixel features of the target video frame based on a target difference pattern by utilizing the target neural network of the frame prediction layer, so as to determine the target prediction probability corresponding to the target video frame, wherein the target difference pattern is a pattern of pixel feature differences between real images and AI-generated images acquired in advance.
  4. The AI-generated video detection method of claim 1, further comprising: calling a target video generation large model via a first target interface of the target video detection model to acquire target AI-generated videos, and acquiring target real videos via a second target interface of the target video detection model, so as to construct a target video set from the target AI-generated videos and the target real videos and train the target neural network with the target video set.
  5. The AI-generated video detection method of claim 4, wherein acquiring the video to be detected comprises: acquiring the target video set, and determining each target video in the target video set to be a video to be detected; correspondingly, analyzing, by utilizing the target neural network of the frame prediction layer, the target pixel features of the target video frame to determine the target prediction probability corresponding to the target video frame comprises: analyzing the target pixel features of the target video frame by utilizing a target binary classifier in the target neural network of the frame prediction layer to determine the target prediction probability corresponding to the target video frame, and determining a prediction label corresponding to the target video frame based on a preset probability threshold; and determining a predicted-correct frame proportion for the video to be detected based on the prediction labels of all its target video frames, adjusting target parameters of the target binary classifier by using a target loss function when the predicted-correct frame proportion is lower than a target frame proportion, and jumping back to the step of analyzing the target pixel features of the target video frames with the target binary classifier to determine the target prediction probabilities.
  6. The AI-generated video detection method of claim 5, wherein the target loss function is a weighted cross-entropy loss function.
  7. The AI-generated video detection method of claim 5, further comprising: determining the proportion of correctly classified videos based on the target detection results corresponding to all target videos, and determining the area under a target curve of the target video detection model by adjusting the preset confidence threshold, wherein the area under the target curve comprises a first area under the curve and a second area under the curve, the first being the area under the true-positive-rate curve and the second the area under the false-positive-rate curve; and determining an accuracy evaluation result of the target video detection model based on the correctly classified video proportion and the area under the target curve, and updating the target video detection model based on the accuracy evaluation result, so as to perform AI-generated video detection with the updated target video detection model.
  8. An AI-generated video detection apparatus, applied to a terminal device equipped with a target video detection model, comprising: a video frame extraction module for acquiring a video to be detected and processing the video to be detected based on a preset sampling frequency by utilizing a frame extraction layer of the target video detection model, so as to acquire target video frames; a feature extraction module for determining target pixel differences between adjacent pixels in each target video frame by utilizing a feature extraction layer of the target video detection model, so as to determine target pixel features corresponding to the target video frame based on the target pixel differences; a frame prediction result acquisition module for inputting the target pixel features corresponding to each target video frame into a frame prediction layer of the target video detection model, so as to analyze the target pixel features of the target video frame with a target neural network of the frame prediction layer and determine a target prediction probability corresponding to the target video frame, wherein the target prediction probability is the probability that the target video frame is an AI-generated image; and a video detection result output module for determining a target confidence corresponding to the video to be detected, based on the target prediction probability corresponding to each target video frame in the video to be detected, by utilizing a decision fusion layer of the target video detection model, and determining the video to be detected to be an AI-generated video when the target confidence is greater than a preset confidence threshold, so as to output a corresponding target detection result; wherein the feature extraction module is specifically configured to: starting from the structure of the generator near the input end, which comprises an up-sampling layer and a convolution layer, process the generator's input feature map through the up-sampling layer and the convolution layer to obtain an up-sampled feature map for each target video frame of the video to be detected, where the width, height, and number of channels of the frame image define the shape of the up-sampled feature map [the formula and its symbols are not reproduced in the source text]; divide the feature map of each frame into a set of grids of a fixed size; calculate, for each grid, the adjacent pixel relationship among its elements [the formula is not reproduced in the source text]; and concatenate the adjacent pixel relationships of all grids in each frame image to construct the adjacent pixel relationship features of that frame image.
  9. An electronic device, comprising: a memory for storing a computer program; and a processor for executing the computer program to implement the AI-generated video detection method of any of claims 1-7.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the AI-generated video detection method of any of claims 1-7.
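Claims 5 through 7 name three concrete training and evaluation pieces: a per-frame correctness check against a probability threshold, a weighted cross-entropy loss, and an evaluation that sweeps the confidence threshold to obtain areas under the true-positive-rate and false-positive-rate curves. A minimal sketch of those pieces; every function name, default, and weighting choice below is an illustrative assumption, not the patent's specified implementation:

```python
import numpy as np

def weighted_cross_entropy(y_true, p_pred, w_pos=1.0, w_neg=1.0, eps=1e-12):
    # Weighted binary cross-entropy, as claim 6 names; the class weights
    # are hypothetical knobs for imbalanced real/AI-generated frame sets.
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-(w_pos * y * np.log(p) + w_neg * (1 - y) * np.log(1 - p)).mean())

def predicted_correct_ratio(probs, labels, prob_threshold=0.5):
    # Claim 5's training check: threshold per-frame probabilities into
    # prediction labels and measure the fraction predicted correctly.
    preds = [1 if p >= prob_threshold else 0 for p in probs]
    return sum(int(a == b) for a, b in zip(preds, labels)) / len(labels)

def roc_auc_by_sweep(labels, confidences, steps=101):
    # Claim 7's evaluation: sweep the confidence threshold from high to
    # low, recording the true-positive and false-positive rates, then
    # integrate the resulting ROC curve with the trapezoidal rule.
    labels = np.asarray(labels)
    confs = np.asarray(confidences)
    pos = max(int((labels == 1).sum()), 1)
    neg = max(int((labels == 0).sum()), 1)
    tprs, fprs = [], []
    for t in np.linspace(1.0, 0.0, steps):
        pred = confs >= t
        tprs.append((pred & (labels == 1)).sum() / pos)
        fprs.append((pred & (labels == 0)).sum() / neg)
    area = 0.0
    for k in range(1, steps):
        area += (fprs[k] - fprs[k - 1]) * (tprs[k] + tprs[k - 1]) / 2.0
    return area
```

A perfectly separating classifier (all AI-generated frames scored above all real ones) yields an area under the ROC curve of 1.0 under this sweep.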

Description

AI generation video detection method, device, equipment and storage medium

Technical Field

The present invention relates to the field of artificial intelligence, and in particular to a method, apparatus, device, and storage medium for detecting AI-generated video.

Background

Current AI-generated video detection techniques have significant limitations. Most existing detection means rely on traditional facial feature analysis; this approach struggles when facing complex dynamic scenes, such as frequent shot switching in news footage or dynamic interaction in lectures, and it ignores the generalization of real videos across multiple fields, so it is difficult to meet users' requirements for detecting the authenticity of videos in different scenes.

Disclosure of Invention

In view of the above, an object of the present invention is to provide a method, apparatus, device, and storage medium for detecting AI-generated video, which can realize efficient detection of AI-generated video.
The specific scheme is as follows. In a first aspect, the present application discloses an AI-generated video detection method applied to a terminal device carrying a target video detection model, including: acquiring a video to be detected, and processing the video to be detected based on a preset sampling frequency by utilizing a frame extraction layer of the target video detection model, so as to acquire target video frames; determining target pixel differences between adjacent pixels in each target video frame by utilizing a feature extraction layer of the target video detection model, so as to determine target pixel features corresponding to the target video frame based on the target pixel differences; inputting the target pixel features corresponding to each target video frame into a frame prediction layer of the target video detection model, and analyzing the target pixel features of the target video frame by utilizing a target neural network of the frame prediction layer to determine a target prediction probability corresponding to the target video frame, wherein the target prediction probability is the probability that the target video frame is an AI-generated image; and determining a target confidence corresponding to the video to be detected, based on the target prediction probability corresponding to each target video frame in the video to be detected, by utilizing a decision fusion layer of the target video detection model, and determining the video to be detected to be an AI-generated video when the target confidence is greater than a preset confidence threshold, so as to output a corresponding target detection result.
Optionally, determining, by using the feature extraction layer of the target video detection model, target pixel differences between adjacent pixels in the target video frame includes: determining a feature map corresponding to the target video frame by utilizing the feature extraction layer of the target video detection model, and determining each target grid region corresponding to the feature map; and processing adjacent pixels in each of the target grid regions to determine the target pixel differences between adjacent pixels in the target video frame. Optionally, analyzing, by using the target neural network of the frame prediction layer, the target pixel features of the target video frame to determine a target prediction probability corresponding to the target video frame includes: analyzing the target pixel features of the target video frame based on a target difference pattern by utilizing the target neural network of the frame prediction layer, so as to determine the target prediction probability corresponding to the target video frame, wherein the target difference pattern is a pattern of pixel feature differences between real images and AI-generated images acquired in advance. Optionally, the AI-generated video detection method further includes: calling a target video generation large model via a first target interface of the target video detection model to acquire target AI-generated videos, and acquiring target real videos via a second target interface of the target video detection model, so as to construct a target video set from the target AI-generated videos and the target real videos and train the target neural network with the target video set.
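The grid-based feature extraction described above (a frame's feature map split into target grid regions, with adjacent-pixel differences computed per region and gathered into a frame-level feature) can be sketched as follows. The patent's actual adjacent-pixel relationship formula is not reproduced in the source text, so a mean absolute difference between horizontal neighbours is assumed here purely as a placeholder:

```python
import numpy as np

def adjacent_pixel_features(feature_map, grid=8):
    # Split a (H, W) feature map into grid x grid regions and compute,
    # per region, the mean absolute difference between each element and
    # its right-hand neighbour (an assumed stand-in for the patent's
    # unspecified adjacent-pixel relationship formula).
    h, w = feature_map.shape
    gh, gw = h // grid, w // grid
    feats = np.empty((grid, grid))
    for i in range(grid):
        for j in range(grid):
            region = feature_map[i * gh:(i + 1) * gh, j * gw:(j + 1) * gw]
            feats[i, j] = np.abs(np.diff(region, axis=1)).mean()
    # Concatenating all per-grid values yields the frame's feature vector.
    return feats.ravel()
```

On a synthetic 16x16 map whose values increase by 1 per column, every region's mean horizontal difference is exactly 1, giving a flat 64-element feature vector.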
Optionally, acquiring the video to be detected includes: acquiring the target video set, and determining each target video in the target video set to be a video to be detected; correspondingly, analyzing, by using the target neural network of the frame prediction layer, the target pixel features of the target video frame to determine the target prediction probability corresponding to the target video frame includes: analyzing the target pixel features of the target video