
CN-121418573-B - Video processing method and related equipment based on multi-condition control diffusion model

CN121418573B

Abstract

The application discloses a video processing method and related equipment based on a multi-condition control diffusion model. The method comprises the steps of: performing frame extraction on an original video using a dual-criterion detection strategy to obtain a target key frame sequence; dividing the original video according to the target key frame sequence to obtain a plurality of original video segments; performing static feature extraction and compression on each original video segment to obtain static condition data; performing dynamic feature extraction and compression on each original video segment to obtain dynamic condition data; preprocessing the dynamic condition data to obtain a dynamic condition noise tensor; and inputting the static condition data and the dynamic condition noise tensor into a multi-condition control diffusion model for conditional diffusion reconstruction, outputting a plurality of reconstructed video segments. The application enables fine-grained control of perceived quality and compression ratio, improves perceived quality and compression efficiency, enhances detail restoration capability, and can be widely applied in the technical field of data processing.

Inventors

  • Li Xuelong
  • Yi Fangqiu
  • Xu Jingyu
  • Zhang Chi

Assignees

  • 中电信人工智能科技(北京)有限公司 (China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd.)

Dates

Publication Date
2026-05-12
Application Date
2025-12-26

Claims (9)

  1. A video processing method based on a multi-condition control diffusion model, the method comprising the steps of: acquiring an original video, and performing frame extraction processing on the original video using a dual-criterion detection strategy to obtain a target key frame sequence; dividing the original video according to the target key frame sequence using a segment dividing algorithm to obtain a plurality of original video segments; performing static feature extraction and compression on each original video segment to obtain static condition data, wherein the static condition data comprises key frame compression data and text description data; performing dynamic feature extraction and compression on each original video segment to obtain dynamic condition data, wherein the dynamic condition data comprises a segmentation sequence, human motion data, and optical flow field data; performing data preprocessing on the dynamic condition data to obtain a dynamic condition noise tensor; and inputting the static condition data and the dynamic condition noise tensor into a multi-condition control diffusion model for conditional diffusion reconstruction processing, outputting a plurality of reconstructed video segments, and generating a reconstructed video based on the plurality of reconstructed video segments; wherein the performing data preprocessing on the dynamic condition data to obtain a dynamic condition noise tensor comprises the following steps: converting the dynamic condition data into visual modality data, and inputting the visual modality data into a pre-trained encoder for encoding processing to obtain an initial low-dimensional latent variable; and converting the initial low-dimensional latent variable into a target low-dimensional latent variable using a dynamic condition control strategy, and splicing the target low-dimensional latent variable with noise to obtain the dynamic condition noise tensor, wherein the dynamic condition control strategy comprises a random dropping mechanism and a modal embedding mechanism, the modal embedding mechanism is used to distinguish the role states of the conditions, and the role states comprise a normal state and a dropped state.
  2. The method according to claim 1, wherein the dual-criterion detection strategy comprises a shot boundary detection method and a fixed interval sampling method, and the acquiring an original video and performing frame extraction processing on the original video using the dual-criterion detection strategy to obtain a target key frame sequence comprises: acquiring the original video; calculating an inter-frame shot switching probability in the original video using the shot boundary detection method, and determining a first key frame sequence according to the inter-frame shot switching probability and a preset switching probability threshold; marking key frames in the original video at preset frame intervals using the fixed interval sampling method to obtain a second key frame sequence; and constructing the target key frame sequence from the first key frame sequence and the second key frame sequence.
  3. The method according to claim 1, wherein the performing static feature extraction and compression on each original video segment to obtain static condition data comprises: performing frame feature extraction processing on each original video segment to obtain a head key frame and a tail key frame in each original video segment; compressing the head key frame and the tail key frame in each original video segment through an image encoder to obtain the key frame compression data; performing text extraction processing on each original video segment through a multi-modal model to obtain the text description data; and constructing the static condition data from the key frame compression data and the text description data.
  4. The method according to claim 1, wherein the performing dynamic feature extraction and compression on each original video segment to obtain dynamic condition data comprises: performing panoramic segmentation on the video frames in each original video segment to obtain a panoramic segmentation map of each video frame, and extracting an object contour from the panoramic segmentation map of each video frame; fitting each object contour with a Bezier curve to obtain the segmentation sequence; extracting a three-dimensional joint sequence from each original video segment, and projecting the three-dimensional joint sequence in each original video segment into two-dimensional coordinate data; filtering the two-dimensional coordinate data in each original video segment according to a preset action area threshold to obtain the human motion data; calculating dense optical flow in each original video segment to obtain the optical flow field data; and constructing the dynamic condition data from the segmentation sequence, the human motion data, and the optical flow field data.
  5. The method according to claim 1, wherein the converting the dynamic condition data into visual modality data and inputting the visual modality data into a pre-trained encoder for encoding processing to obtain an initial low-dimensional latent variable comprises: rendering the object contours fitted with Bezier curves in the segmentation sequence into a binary mask sequence; connecting the two-dimensional joint points in the human motion data to generate a skeleton sequence, and visualizing the skeleton sequence as a motion trajectory graph evolving over time; mapping the motion vectors in the optical flow field data into an RGB image sequence by a color coding method; constructing the visual modality data from the binary mask sequence, the motion trajectory graph, and the RGB image sequence; and inputting the visual modality data into the pre-trained encoder for encoding processing to obtain the initial low-dimensional latent variable.
  6. A video processing apparatus based on a multi-condition control diffusion model, the apparatus comprising: a key frame sequence extraction module, configured to acquire an original video and perform frame extraction processing on the original video using a dual-criterion detection strategy to obtain a target key frame sequence; a video segment division module, configured to divide the original video according to the target key frame sequence using a segment dividing algorithm to obtain a plurality of original video segments; a static condition data acquisition module, configured to perform static feature extraction and compression on each original video segment to obtain static condition data, wherein the static condition data comprises key frame compression data and text description data; a dynamic condition data acquisition module, configured to perform dynamic feature extraction and compression on each original video segment to obtain dynamic condition data, wherein the dynamic condition data comprises a segmentation sequence, human motion data, and optical flow field data; a conditional noise tensor acquisition module, configured to perform data preprocessing on the dynamic condition data to obtain a dynamic condition noise tensor; and a conditional diffusion reconstruction module, configured to input the static condition data and the dynamic condition noise tensor into a multi-condition control diffusion model for conditional diffusion reconstruction processing, output a plurality of reconstructed video segments, and generate a reconstructed video based on the plurality of reconstructed video segments; wherein the conditional noise tensor acquisition module is specifically configured to: convert the dynamic condition data into visual modality data, and input the visual modality data into a pre-trained encoder for encoding processing to obtain an initial low-dimensional latent variable; and convert the initial low-dimensional latent variable into a target low-dimensional latent variable using a dynamic condition control strategy, and splice the target low-dimensional latent variable with noise to obtain the dynamic condition noise tensor, wherein the dynamic condition control strategy comprises a random dropping mechanism and a modal embedding mechanism, the modal embedding mechanism is used to distinguish the role states of the conditions, and the role states comprise a normal state and a dropped state.
  7. An electronic device, comprising a memory storing a computer program and a processor, wherein the processor implements the method of any one of claims 1 to 5 when executing the computer program.
  8. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 5.
  9. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 5.
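The characterizing preprocessing step in claims 1 and 6 (random dropping of conditions, a modal embedding marking each condition's role state, and splicing the resulting latent with noise) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the patented implementation: the channel layout, the flag encoding of (modality id, role state), and the `drop_prob` value are all hypothetical choices for illustration.

```python
import numpy as np

def dynamic_condition_noise_tensor(latents, drop_prob=0.1, seed=0):
    """Sketch of the dynamic condition control strategy: each per-modality
    latent is randomly dropped (zeroed), a modal-embedding channel records
    the modality id and its role state (normal vs. dropped), and the result
    is spliced with Gaussian noise along the channel axis."""
    rng = np.random.default_rng(seed)
    controlled = []
    for modality_id, z in enumerate(latents):
        dropped = rng.random() < drop_prob
        z = np.zeros_like(z) if dropped else z
        # Modal embedding (hypothetical encoding): one extra channel whose
        # value combines the modality id with a 0.5 offset when dropped.
        flag = np.full(z.shape[:-1] + (1,),
                       float(modality_id) + (0.5 if dropped else 0.0))
        controlled.append(np.concatenate([z, flag], axis=-1))
    target = np.concatenate(controlled, axis=-1)  # target low-dimensional latent
    noise = rng.standard_normal(target.shape)     # noise spliced with the latent
    return np.concatenate([target, noise], axis=-1)
```

With three modality latents of shape (H, W, C), the output carries 2 × 3 × (C + 1) channels: one (C + 1)-channel controlled latent per modality, plus matching noise channels.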

Description

Video processing method and related equipment based on multi-condition control diffusion model

Technical Field

The application relates to the technical field of data processing, in particular to a video processing method and related equipment based on a multi-condition control diffusion model.

Background

Current video compression technology falls mainly into two categories: traditional coding standards and deep-learning-based neural compression methods. Both have clear limitations: traditional methods optimize distortion metrics at the expense of perceived quality, while neural methods focus on rate control and neglect detail generation. In addition, although generative models such as Generative Adversarial Networks (GANs) can improve visual fidelity, they are prone to mode collapse or artifacts and struggle to stably balance compression rate and reconstruction quality, so the reconstruction quality of dynamic scenes is limited. In recent years, diffusion models have shown potential for generating high-fidelity results in image compression, but their application in the video field still faces three major challenges: directly extending an image diffusion model yields poor temporal consistency, manifesting as inter-frame flicker or motion fracture; the iterative generation mechanism incurs high decoding latency, making practical deployment difficult; and controlling the diffusion process with a single condition leads to insufficient reconstruction fidelity. In summary, the technical problems in the related art remain to be improved.

Disclosure of Invention

Embodiments of the present application aim to solve, at least to some extent, one of the technical problems in the related art.
Therefore, the main purpose of the embodiments of the application is to provide a video processing method and related equipment based on a multi-condition control diffusion model, which can achieve fine-grained control of perceived quality and compression ratio and improve detail restoration capability, thereby improving perceived quality and compression efficiency.

To achieve the above object, one aspect of the embodiments of the present application provides a video processing method based on a multi-condition control diffusion model, the method comprising the steps of: acquiring an original video, and performing frame extraction processing on the original video using a dual-criterion detection strategy to obtain a target key frame sequence; dividing the original video according to the target key frame sequence using a segment dividing algorithm to obtain a plurality of original video segments; performing static feature extraction and compression on each original video segment to obtain static condition data, wherein the static condition data comprises key frame compression data and text description data; performing dynamic feature extraction and compression on each original video segment to obtain dynamic condition data, wherein the dynamic condition data comprises a segmentation sequence, human motion data, and optical flow field data; performing data preprocessing on the dynamic condition data to obtain a dynamic condition noise tensor; and inputting the static condition data and the dynamic condition noise tensor into a multi-condition control diffusion model for conditional diffusion reconstruction processing, outputting a plurality of reconstructed video segments, and generating a reconstructed video based on the plurality of reconstructed video segments.
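The first two steps above (dual-criterion key frame extraction, then segment division at the key frames) can be sketched as below. This is a hedged NumPy stand-in: a normalized mean absolute inter-frame difference serves as a crude proxy for the shot-switching probability of the shot boundary detection method, and the threshold and interval values are illustrative, not taken from the patent.

```python
import numpy as np

def dual_criterion_keyframes(frames, switch_threshold=0.5, interval=5):
    """Union of the two criteria: frames whose normalized inter-frame
    difference (proxy for the shot-switching probability) meets the
    preset threshold, plus fixed-interval samples."""
    diffs = [np.abs(frames[i] - frames[i - 1]).mean()
             for i in range(1, len(frames))]
    peak = max(diffs) if diffs and max(diffs) > 0 else 1.0
    first = {i for i, d in enumerate(diffs, start=1)
             if d / peak >= switch_threshold}      # first key frame sequence
    second = set(range(0, len(frames), interval))  # second key frame sequence
    return sorted(first | second)                  # target key frame sequence

def split_into_segments(num_frames, keyframes):
    """Divide frame indices [0, num_frames) into segments bounded by the
    target key frame sequence (boundary handling is an assumption)."""
    bounds = sorted(set(keyframes) | {0, num_frames})
    return [list(range(a, b)) for a, b in zip(bounds, bounds[1:])]
```

For an 8-frame clip with one hard cut at frame 4, the key frames are the cut plus the interval samples, and each segment runs up to (but not including) the next key frame.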
In some embodiments, the dual-criterion detection strategy comprises a shot boundary detection method and a fixed interval sampling method, and the acquiring the original video and performing frame extraction processing on the original video using the dual-criterion detection strategy to obtain a target key frame sequence comprises: acquiring the original video; calculating an inter-frame shot switching probability in the original video using the shot boundary detection method, and determining a first key frame sequence according to the inter-frame shot switching probability and a preset switching probability threshold; marking key frames in the original video at preset frame intervals using the fixed interval sampling method to obtain a second key frame sequence; and constructing the target key frame sequence from the first key frame sequence and the second key frame sequence. In some embodiments, the performing static feature extraction and compression on each original video segment to obtain static condition data comprises: performing frame feature extraction processing on each original video segment to obtain a head key frame and a tail key frame in each original video segment; compressing the head key frame and the tail key frame in each original video segment through an image