CN-121982600-A - Digital human video detection method and device
Abstract
An embodiment of the invention discloses a digital human video detection method and device. The method comprises: extracting video frames from a digital human video generated by a generative model to form a first frame sequence and a second frame sequence of different lengths; inputting the first frame sequence into a first detection model comprising a first backbone network to obtain a first detection score; inputting the second frame sequence into a second detection model comprising a second backbone network to obtain a second detection score, wherein the first backbone network and the second backbone network have different model structures; and determining, at least based on the first detection score and the second detection score, a target detection score that characterizes the generation quality of the digital human video. By introducing structurally heterogeneous models to process video frames at different time scales, the method effectively fuses multi-view evaluation results and avoids the missed or erroneous judgments a single model may make on complex generation defects, thereby improving the comprehensiveness and robustness of the detection result and more accurately reflecting the true generation quality of the digital human video.
Inventors
- ZENG JISHEN
- CHEN BAOYING
- YANG RUI
Assignees
- Alibaba (China) Co., Ltd. (阿里巴巴(中国)有限公司)
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2025-12-29
Claims (14)
- 1. A digital human video detection method, the method comprising: acquiring a digital human video generated by a generative model; extracting a plurality of frames from the digital human video to form a first frame sequence; extracting a plurality of frames from the digital human video to form a second frame sequence, wherein the length of the first frame sequence is different from the length of the second frame sequence; inputting the first frame sequence into a first detection model to determine a first detection score, wherein the first detection model comprises a first backbone network and a first task head network; inputting the second frame sequence into a second detection model to determine a second detection score, wherein the second detection model comprises a second backbone network and a second task head network, and the second backbone network and the first backbone network have different model structures; and determining a target detection score of the digital human video at least according to the first detection score and the second detection score, wherein the target detection score is used to characterize the generation quality of the digital human video.
- 2. The method of claim 1, wherein the inputting the first frame sequence into a first detection model to determine a first detection score comprises: extracting, based on an image embedding network, image features of each image frame in the first frame sequence to determine a corresponding first feature sequence; performing multi-frame aggregation on the first feature sequence to determine a corresponding aggregated feature; and inputting the aggregated feature into the first task head network to determine the first detection score.
- 3. The method of claim 2, wherein the performing multi-frame aggregation on the first feature sequence to determine the corresponding aggregated feature comprises: concatenating the image features in the first feature sequence in temporal order to form a temporal feature matrix; and performing nonlinear aggregation processing on the temporal feature matrix to determine the corresponding aggregated feature.
- 4. The method of claim 1, wherein the inputting the second frame sequence into a second detection model to determine a second detection score comprises: extracting, based on a global video embedding network, a global video feature corresponding to the second frame sequence; and inputting the global video feature into the second task head network to determine the second detection score.
- 5. The method of claim 1, wherein the method further comprises: extracting a plurality of frames from the digital human video to form a third frame sequence; extracting a plurality of frames from the digital human video to form a fourth frame sequence; inputting the third frame sequence into a third detection model to determine a third detection score, wherein the third detection model comprises a third backbone network and a third task head network, and the third backbone network and the first backbone network have the same model structure; and inputting the fourth frame sequence into a fourth detection model to determine a fourth detection score, wherein the fourth detection model comprises a fourth backbone network and a fourth task head network, and the fourth backbone network and the second backbone network have the same model structure; wherein the lengths of the first frame sequence, the second frame sequence, the third frame sequence, and the fourth frame sequence are all different; and wherein the determining a target detection score of the digital human video based at least on the first detection score and the second detection score comprises: determining the target detection score according to the first, second, third, and fourth detection scores.
- 6. The method of claim 1, wherein the first backbone network comprises EVA and NeXtVLAD, and the second backbone network comprises XCLIP.
- 7. The method of claim 1, wherein the determining the target detection score of the digital human video based at least on the first detection score and the second detection score comprises: inputting the first detection score and the second detection score into a fusion task head to determine the target detection score, wherein the fusion task head is used to determine a first weight corresponding to the first detection score and a second weight corresponding to the second detection score, and to determine the target detection score according to the first detection score, the second detection score, the first weight, and the second weight.
- 8. The method of claim 1, wherein the first detection model and the second detection model are trained by: obtaining a model training set, wherein the model training set comprises a plurality of digital human video samples and corresponding manual scores; performing, based on the model training set, independent fine-tuning training on the first detection model and the second detection model respectively; and performing joint training on the first task head network and the second task head network in response to the independent fine-tuning training being completed.
- 9. The method of claim 8, wherein the independently fine-tuning the first and second detection models respectively comprises: extracting video frames from a digital human video sample to determine a training frame sequence of a corresponding length; inputting the training frame sequence into a to-be-trained model to obtain a corresponding detection score, wherein the to-be-trained model is the first detection model or the second detection model; calculating a multi-objective loss according to the detection score and the manual score, wherein the multi-objective loss comprises an L1 regression loss term and a ranking consistency loss term; updating parameters of the corresponding detection model according to the multi-objective loss; and, if the manual score of a first digital human video sample is higher than that of a second digital human video sample, applying a constraint such that the detection score of the first digital human video sample is not lower than that of the second digital human video sample.
- 10. The method of claim 8, wherein the joint training of the first task head network and the second task head network comprises: freezing parameters of the first backbone network and the second backbone network; extracting a plurality of frames from a digital human video sample to form a first training frame sequence; extracting a plurality of frames from the digital human video sample to form a second training frame sequence; inputting the first training frame sequence into the independently fine-tuned first detection model to determine a first detection score; inputting the second training frame sequence into the independently fine-tuned second detection model to determine a second detection score; inputting the first detection score and the second detection score into a fusion task head to obtain a fused detection score; calculating a multi-objective loss according to the fused detection score and the manual score, wherein the multi-objective loss comprises an L1 regression loss term and a ranking consistency loss term; and updating parameters of the fusion task head according to the multi-objective loss.
- 11. The method of claim 8, wherein the joint training of the first task head network and the second task head network comprises: extracting a plurality of frames from a digital human video sample to form a first training frame sequence; extracting a plurality of frames from the digital human video sample to form a second training frame sequence; inputting the first training frame sequence into the independently fine-tuned first detection model to determine a first detection score; inputting the second training frame sequence into the independently fine-tuned second detection model to determine a second detection score; inputting the first detection score and the second detection score into a shared calibration module and performing scale alignment through a monotonic mapping function to obtain a first calibration score and a second calibration score; inputting the first calibration score and the second calibration score into a fusion task head to obtain a fused detection score; calculating a multi-objective loss according to the fused detection score and the manual score, wherein the multi-objective loss comprises an L1 regression loss term and a ranking consistency loss term; and synchronously updating parameters of the first detection model, the second detection model, the shared calibration module, and the fusion task head according to the multi-objective loss.
- 12. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method of any of claims 1-11.
- 13. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when executed by a processor, implements the method according to any of claims 1-11.
- 14. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the method of any of claims 1-11.
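The multi-objective training loss recited in claims 9-11 (an L1 regression term plus a ranking consistency term enforcing that a sample with a higher manual score receives a detection score that is not lower) can be sketched as follows. This is an illustrative reconstruction, not the patent's implementation: the pairwise hinge formulation, the `margin` value, and the `rank_weight` combination coefficient are assumptions.

```python
def l1_regression_loss(pred_scores, human_scores):
    """Mean absolute error between predicted detection scores and manual scores."""
    n = len(pred_scores)
    return sum(abs(p - h) for p, h in zip(pred_scores, human_scores)) / n

def ranking_consistency_loss(pred_scores, human_scores, margin=0.0):
    """Pairwise hinge penalty: if sample i is manually scored higher than sample j,
    its predicted score should not be lower (the constraint in claim 9)."""
    total, pairs = 0.0, 0
    n = len(pred_scores)
    for i in range(n):
        for j in range(n):
            if human_scores[i] > human_scores[j]:
                total += max(0.0, margin + pred_scores[j] - pred_scores[i])
                pairs += 1
    return total / pairs if pairs else 0.0

def multi_objective_loss(pred_scores, human_scores, rank_weight=1.0):
    """L1 regression term plus weighted ranking consistency term."""
    return (l1_regression_loss(pred_scores, human_scores)
            + rank_weight * ranking_consistency_loss(pred_scores, human_scores))
```

Note that when the predicted ordering already agrees with the manual ordering, the ranking term vanishes and only the L1 regression term drives the parameter update.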
Description
Digital human video detection method and device

Technical Field

The invention relates to the technical field of visual data processing, and in particular to a digital human video detection method and device.

Background

With the rapid development of generative artificial intelligence technology, synthetic digital human videos (e.g., virtual anchors, on-demand videos, etc.) have been widely used in many kinds of scenes. Such videos are typically synthesized by deep generative models, and their visual quality is affected by a number of factors, such as facial detail distortion, desynchronization between mouth shape and speech, and temporal discontinuities. Current evaluation of digital human video quality still depends mainly on manual subjective scoring, which is inefficient and difficult to scale. Although some automatic detection methods attempt to introduce computer vision and deep learning techniques, existing methods generally suffer from insufficient generalization capability, insufficient sensitivity to fine-grained distortion, and difficulty in accounting for both short-term detail and long-term consistency, given the diversity of generated content, the complexity of artifact types, and the need to model human perceptual characteristics; as a result, their evaluation results deviate considerably from human subjective perception.

Disclosure of Invention

In view of this, an embodiment of the invention provides a digital human video detection method and device that process video clips at different time scales by introducing multi-branch detection models with heterogeneous structures, effectively fuse multi-view evaluation results, and avoid the missed or erroneous judgments a single model may make on complex generation defects, thereby improving the comprehensiveness and robustness of the detection result and more accurately reflecting the true generation quality of digital human video.
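The fusion of multi-view evaluation results described above corresponds to the fusion task head of claim 7, which assigns a weight to each branch score and combines them. A minimal sketch follows; the softmax-over-logits weighting is an assumption for illustration (the patent only states that per-score weights are determined), and `logits` stands in for whatever learnable parameters the fusion task head would hold.

```python
import math

def fusion_task_head(scores, logits):
    """Combine per-branch detection scores into a target detection score
    using softmax-normalized weights derived from learnable logits."""
    exps = [math.exp(l) for l in logits]
    z = sum(exps)
    weights = [e / z for e in exps]  # first weight, second weight, ...
    return sum(w * s for w, s in zip(weights, scores))
```

With equal logits this reduces to a plain average of the branch scores; training the logits lets the head emphasize the branch that correlates better with manual scores.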
In a first aspect, a digital human video detection method is provided, the method comprising: acquiring a digital human video generated by a generative model; extracting a plurality of frames from the digital human video to form a first frame sequence; extracting a plurality of frames from the digital human video to form a second frame sequence, wherein the length of the first frame sequence is different from the length of the second frame sequence; inputting the first frame sequence into a first detection model to determine a first detection score, wherein the first detection model comprises a first backbone network and a first task head network; inputting the second frame sequence into a second detection model to determine a second detection score, wherein the second detection model comprises a second backbone network and a second task head network, and the second backbone network and the first backbone network have different model structures; and determining a target detection score of the digital human video at least according to the first detection score and the second detection score, wherein the target detection score is used to characterize the generation quality of the digital human video.
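The first-aspect method above can be sketched end to end as follows. This is a hedged illustration only: the backbones are replaced by stub scoring callables (a real system would use heterogeneous networks such as EVA+NeXtVLAD and XCLIP per claim 6), the uniform sampling strategy and the sequence lengths 8 and 32 are assumptions, and the default averaging stands in for the fusion task head.

```python
def extract_frames(video, length):
    """Uniformly sample `length` frames from `video` (modeled as a list of frames)."""
    step = max(1, len(video) // length)
    return video[::step][:length]

def detect(video, model_a, model_b, len_a=8, len_b=32, fuse=None):
    """Score a digital human video with two branches operating on
    frame sequences of different lengths, then fuse the branch scores."""
    seq_a = extract_frames(video, len_a)   # first frame sequence (shorter)
    seq_b = extract_frames(video, len_b)   # second frame sequence (longer)
    score_a = model_a(seq_a)               # first detection score
    score_b = model_b(seq_b)               # second detection score
    if fuse is None:                       # default stand-in for the fusion head
        return 0.5 * (score_a + score_b)
    return fuse(score_a, score_b)
```

For example, `detect(list(range(64)), lambda s: 0.7, lambda s: 0.9)` averages the two stub branch scores into a single target detection score.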
In a second aspect, there is provided a digital human video detection apparatus, the apparatus comprising: an acquisition module, configured to acquire a digital human video generated by a generative model; a first extraction module, configured to extract a plurality of frames from the digital human video to form a first frame sequence; a second extraction module, configured to extract a plurality of frames from the digital human video to form a second frame sequence, where the length of the first frame sequence and the length of the second frame sequence are different; a first determining module, configured to input the first frame sequence into a first detection model to determine a first detection score, where the first detection model includes a first backbone network and a first task head network; a second determining module, configured to input the second frame sequence into a second detection model to determine a second detection score, where the second detection model includes a second backbone network and a second task head network, and the second backbone network and the first backbone network have different model structures; and a third determining module, configured to determine a target detection score of the digital human video according to at least the first detection score and the second detection score, where the target detection score is used to characterize the generation quality of the digital human video. In a third aspect, there is provided an electronic device comprising a memory for storing one or more computer program instructions, and a processor, wherein the one or more computer program instructions are executed by the processor to implement the method as described in the first aspect above. In a fourth aspect, a computer readable storage medium is provided, in which a computer program is stored, whi