CN-121998828-A - High-resolution video generation method based on diffusion model
Abstract
The invention discloses a high-resolution video generation method based on a diffusion model, comprising the following steps: S1, obtaining an input image sequence and extracting multi-scale spatial feature maps using a ConvNeXt V model; S2, performing gradient analysis and region segmentation on the multi-scale feature maps to form a feature subdomain set; S3, performing boundary fusion and region reconstruction on the feature subdomains to generate a reconstructed feature map sequence; S4, encoding the reconstructed feature map sequence with an LDM and applying forward diffusion noise to generate perturbed latent variables; S5, having the generating branch predict latent variables while the discrimination branch outputs difference feedback; S6, constructing an error signal from the discrimination feedback and updating the generating-branch parameters to obtain a final latent variable sequence; and S7, decoding the final latent variable sequence to generate a high-resolution video frame sequence. The method achieves accurate generation and detail enhancement of high-resolution video and improves structural fidelity and texture reconstruction quality.
Inventors
- YANG CHUANLONG
- WANG CHENGLI
- WANG XUCHAO
Assignees
- 国研能汇(北京)技术有限公司
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-03-23
Claims (10)
- 1. A high-resolution video generation method based on a diffusion model, characterized by comprising the following steps: S1, acquiring an input image sequence, performing feature extraction on the image sequence using the spatial feature enhancement model ConvNeXt V, and outputting a plurality of multi-scale spatial feature maps; S2, performing a watershed splitting operation on the multi-scale spatial feature maps, dividing them into a plurality of feature sub-regions according to gradient changes and feature boundaries, and constructing a feature subdomain set; S3, performing structural reconstruction on the feature subdomain set, modeling the features of each subdomain and fusing their context, and outputting a reconstructed feature map sequence; S4, constructing a latent-space diffusion model LDM, encoding the reconstructed feature map sequence into a latent variable sequence, and performing a forward diffusion process to generate a perturbed latent variable sequence; S5, introducing a discrimination-collaborative diffusion mechanism: the perturbed latent variable sequence is input into a generating branch and a discrimination branch, the generating branch predicts reconstructed latent variables through a reverse diffusion process, and the discrimination branch discriminates and optimizes the predicted latent variables against the original latent variables and outputs discrimination feedback; S6, updating the generating-branch parameters according to the discrimination feedback, minimizing the difference between the generated latent variables and the original latent variables through collaborative optimization, and outputting a final latent variable sequence; S7, reconstructing the final latent variable sequence with a decoder and outputting a high-resolution video frame sequence.
- 2. The high-resolution video generation method based on the diffusion model according to claim 1, wherein S2 specifically comprises: S21, for each scale spatial feature map in the multi-scale feature map set, performing a first-order difference operation in the horizontal and vertical directions respectively to obtain a horizontal gradient map and a vertical gradient map, and obtaining the corresponding gradient magnitude map by squaring the two gradient maps, summing them pixel by pixel, and taking the square root of the sum; S22, setting a fixed threshold for each gradient magnitude map, marking pixels whose gradient magnitude is greater than or equal to the threshold as boundary pixels and pixels whose gradient magnitude is below the threshold as non-boundary pixels, forming a boundary marking map in one-to-one correspondence with the gradient magnitude map; S23, grouping adjacent boundary pixels into the same boundary connected region and adjacent non-boundary pixels into the same non-boundary connected region, and, taking the spatial extents of the boundary and non-boundary connected regions as borders, marking out a plurality of feature blocks on each scale spatial feature map, each feature block corresponding to a contiguous pixel region; S24, calculating four statistics (mean, variance, maximum and minimum) over all pixel channel values in each feature block, and classifying the feature block into one of texture, flat and edge feature subdomains according to the preset intervals into which the four statistics fall, obtaining a plurality of texture feature subdomains, flat feature subdomains and edge feature subdomains; S25, arranging the texture, flat and edge feature subdomains obtained from all scale spatial feature maps in order of scale and spatial position to form the feature subdomain set.
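Steps S21 and S22 of claim 2 can be sketched as follows. This is a minimal NumPy illustration, not the patented implementation: forward differences stand in for the unspecified difference operator, and the threshold value is an arbitrary assumption.

```python
import numpy as np

def gradient_magnitude(fmap):
    """Per-pixel gradient magnitude of a 2-D feature map (S21).

    First-order differences in the horizontal and vertical directions
    are squared, summed pixel by pixel, and square-rooted.
    """
    gx = np.zeros_like(fmap, dtype=float)
    gy = np.zeros_like(fmap, dtype=float)
    gx[:, :-1] = np.diff(fmap, axis=1)   # horizontal first-order difference
    gy[:-1, :] = np.diff(fmap, axis=0)   # vertical first-order difference
    return np.sqrt(gx ** 2 + gy ** 2)

def boundary_map(fmap, threshold):
    """Mark pixels whose gradient magnitude >= threshold as boundary (S22)."""
    return gradient_magnitude(fmap) >= threshold
```

Connected-region labeling of the resulting boundary/non-boundary masks (S23) could then be done with any standard connected-components routine.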
- 3. The high-resolution video generation method based on the diffusion model according to claim 2, wherein S3 specifically comprises: S31, sequentially reading each feature subdomain in the feature subdomain set according to its arrangement order, and mapping each feature subdomain back to its original spatial position in the corresponding scale spatial feature map to form a subdomain position map for that scale; S32, constructing a subdomain adjacency list for all feature subdomains in each subdomain position map according to their spatial adjacency, and establishing a subdomain pairing relation for any two feature subdomains in the adjacency list that share a common boundary; S33, extracting boundary pixels from the feature subdomains of each subdomain pair, and arranging them along the common boundary direction in pixel-index order to form a boundary pixel sequence; S34, performing weighted fusion on the corresponding pixel channel values of the feature subdomains on both sides of the boundary pixel sequence to form a reconstructed boundary pixel sequence; S35, overwriting the reconstructed boundary pixel sequence into the subdomain boundary region at the corresponding position, and applying the same overwriting to all subdomain boundary regions of each scale spatial feature map; S36, in each scale spatial feature map, combining all feature subdomains and all reconstructed boundary regions in spatial order to generate a reconstructed feature map; S37, arranging the reconstructed feature maps in order from large scale to small scale to form the reconstructed feature map sequence.
- 4. The high-resolution video generation method according to claim 3, wherein the weighted fusion in S34 specifically comprises: for each subdomain pair, calculating the per-channel pixel variance of the two feature subdomains on either side of the corresponding boundary pixels, normalizing the inverses of the two sides' channel variances along the channel dimension to obtain channel weighting coefficients, and performing element-by-element weighted summation of the channel weighting coefficients with the boundary pixel channel values of the corresponding channels on both sides to obtain the reconstructed boundary pixel sequence.
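The inverse-variance fusion of claim 4 can be sketched as below. This is a hedged illustration: the array shapes (`(N, C)` subdomain samples, `(C,)` boundary pixels) and the variance-stabilizing `eps` are assumptions not stated in the claim.

```python
import numpy as np

def fuse_boundary(px_a, px_b, sub_a, sub_b):
    """Inverse-variance weighted fusion of boundary pixels (claim 4 sketch).

    sub_a, sub_b: (N, C) pixel channel values of the two subdomains.
    px_a, px_b:   (C,) boundary pixel values on each side.
    The side with lower channel variance (more stable statistics) receives
    the larger weight; weights are normalized per channel to sum to 1.
    """
    eps = 1e-8  # assumed guard against zero variance
    inv_a = 1.0 / (sub_a.var(axis=0) + eps)
    inv_b = 1.0 / (sub_b.var(axis=0) + eps)
    w_a = inv_a / (inv_a + inv_b)
    w_b = inv_b / (inv_a + inv_b)
    return w_a * px_a + w_b * px_b
```

The design intuition: a flat (low-variance) subdomain is a more reliable witness of the boundary value than a noisy textured one, so its boundary pixel dominates the fused result.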
- 5. The high-resolution video generation method based on the diffusion model according to claim 4, wherein the ConvNeXt V model specifically comprises: an input layer that receives the processed image frames and outputs an input tensor; a plurality of stages, each comprising a depthwise separable convolution unit, a normalization unit and an activation unit, which sequentially generate the multi-scale spatial feature maps through depthwise separable convolutions of different strides; and structural reconstruction performed on the feature subdomain set to generate the reconstructed feature map sequence.
- 6. The high-resolution video generation method based on the diffusion model according to claim 5, wherein the LDM model specifically comprises: an encoder that performs downsampling and feature compression on the reconstructed feature map sequence to generate a latent variable sequence; a diffusion-noise applying unit that applies noise perturbation to the latent variable sequence according to a preset number of diffusion steps to generate a perturbed latent variable sequence; a generating branch, composed of a plurality of residual blocks and self-attention units, that performs the reverse diffusion process on the perturbed latent variable sequence and outputs predicted latent variables; a discrimination branch, composed of convolutional feature extraction layers and a discrimination unit, that receives the predicted and original latent variables and outputs a discrimination result; a collaborative optimization unit that updates the parameters of the generating branch according to the discrimination result; and a latent variable reconstruction decoder that performs upsampling and feature restoration on the collaboratively optimized latent variables and outputs the final latent variable sequence.
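The diffusion-noise applying unit of claim 6 is only described as applying noise "according to a preset number of diffusion steps". A minimal sketch, assuming the standard closed-form DDPM-style forward process with a cumulative noise schedule `alpha_bar` (this specific schedule is my assumption, not stated in the patent):

```python
import numpy as np

def forward_diffuse(z0, t, alpha_bar, rng):
    """Closed-form forward diffusion q(z_t | z_0) for a latent z0 (sketch).

    alpha_bar[t] is the cumulative product of (1 - beta_s) up to step t;
    the perturbed latent is
        z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps,
    with eps drawn from a standard normal distribution.
    """
    eps = rng.standard_normal(z0.shape)
    a = alpha_bar[t]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps
```

With `alpha_bar[0] = 1.0` the latent is returned unperturbed; as `alpha_bar[t]` approaches 0 the output approaches pure noise.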
- 7. The high-resolution video generation method based on the diffusion model according to claim 6, wherein the generating branch in S5 specifically operates as follows: in each diffusion step, the current perturbed latent variable passes sequentially through a preset number of residual blocks in the generating branch; each residual block performs convolution, normalization and activation operations on the current perturbed latent variable and adds the result to the current perturbed latent variable to obtain the intermediate latent variable of the diffusion step; the intermediate latent variable is input into a self-attention unit, which performs correlation calculation and weighted aggregation over the features at each position of the intermediate latent variable and outputs the predicted latent variable of the diffusion step; after the prediction of a diffusion step is completed, the predicted latent variable of the current diffusion step and the corresponding original latent variable are input into the discrimination branch; and after all diffusion steps have been predicted, the predicted latent variable obtained in the last diffusion step is taken as the predicted latent variable sequence output by the generating branch.
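The per-step computation of claim 7 (residual block, then self-attention aggregation) can be illustrated in miniature. This is a toy sketch on a flattened `(C, N)` latent: the linear map `w` stands in for the convolution, and the normalization and single-head attention are simplifications of whatever the patent's network actually uses.

```python
import numpy as np

def residual_step(z, w):
    """One residual block on a (C, N) latent: 'conv', norm, ReLU, skip add."""
    h = w @ z                               # stand-in for the convolution
    h = (h - h.mean()) / (h.std() + 1e-6)   # normalization
    h = np.maximum(h, 0.0)                  # activation
    return z + h                            # residual addition (claim 7)

def self_attention(z):
    """Correlation-weighted aggregation over the N positions of a (C, N) latent."""
    scores = z.T @ z / np.sqrt(z.shape[0])       # pairwise position correlation
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # softmax over positions
    return z @ attn.T                            # weighted aggregation
```

In the claimed pipeline the output of `self_attention` would be the step's predicted latent variable, handed to the discrimination branch.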
- 8. The high-resolution video generation method based on the diffusion model according to claim 7, wherein the discrimination branch in S5 specifically operates as follows: in each diffusion step, performing element-by-element subtraction between the predicted latent variable and the original latent variable to obtain a difference feature tensor, and combining the difference feature tensor with the concatenated feature tensor along the channel dimension to form a discrimination input tensor; sequentially passing the discrimination input tensor through a plurality of convolutional feature extraction layers, which perform convolution, normalization and activation operations on it and output a discrimination feature map; and inputting the discrimination feature map into a discrimination unit, which performs a linear transformation on the discrimination feature map and outputs the discrimination feedback.
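The claim specifies that the difference tensor is joined along the channel dimension, but not exactly which tensors are concatenated with it. A sketch under the assumption that the predicted and original latents themselves form the "concatenated feature tensor":

```python
import numpy as np

def discriminator_input(z_pred, z_orig):
    """Build the discrimination input tensor (claim 8, sketch).

    The difference tensor z_pred - z_orig is concatenated with the two
    latents along the channel axis (axis 0 for a (C, H, W) latent).
    Which tensors accompany the difference tensor is an assumption here.
    """
    diff = z_pred - z_orig
    return np.concatenate([diff, z_pred, z_orig], axis=0)
```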
- 9. The high-resolution video generation method based on the diffusion model according to claim 8, wherein S6 specifically comprises: S61, in each diffusion step, performing element-by-element subtraction between the predicted latent variable output by the generating branch and the corresponding original latent variable to obtain a difference feature tensor; S62, taking the element-by-element absolute value of the discrimination feedback output by the discrimination branch and clipping elements whose absolute value is greater than 1 to 1, obtaining a first weight map; S63, performing element-by-element sign extraction on the discrimination feedback, marking elements greater than 0 as 1 and elements less than or equal to 0 as -1, obtaining a sign map; S64, multiplying the difference feature tensor element by element with the sign map to obtain a direction-correction tensor; S65, multiplying the direction-correction tensor element by element with the first weight map to obtain an error signal tensor; S66, back-propagating the error signal tensor and updating the parameters of the residual blocks and self-attention units in the generating branch.
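Steps S61 through S65 of claim 9 are purely element-wise and map directly onto array operations. A minimal sketch (the tensors here are 1-D for clarity; the claim places no constraint on shape):

```python
import numpy as np

def error_signal(z_pred, z_orig, feedback):
    """Error-signal tensor from discrimination feedback (claim 9, S61-S65)."""
    diff = z_pred - z_orig                          # S61: difference tensor
    weight = np.clip(np.abs(feedback), None, 1.0)   # S62: |feedback|, clipped at 1
    sign = np.where(feedback > 0, 1.0, -1.0)        # S63: sign map (<= 0 -> -1)
    corrected = diff * sign                         # S64: direction correction
    return corrected * weight                       # S65: error signal tensor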
- 10. The high-resolution video generation method based on the diffusion model according to claim 9, wherein S7 specifically comprises: sequentially executing upsampling and layer-by-layer convolutional feature restoration operations in the latent variable reconstruction decoder to generate reconstructed feature frames at the original resolution, and arranging all reconstructed feature frames in temporal order to output the high-resolution video frame sequence.
Description
High-resolution video generation method based on diffusion model

Technical Field
The invention relates to the field of computer vision, and in particular to a high-resolution video generation method based on a diffusion model.

Background
Existing diffusion models generally generate complex content by performing gradual denoising inference on images or videos in a latent space. High-resolution video frames have large spatial dimensions and complex texture structures; during multi-scale feature extraction, conventional convolutional networks are prone to detail loss, boundary blurring and excessive smoothing of local structures, and therefore cannot provide a stable, high-fidelity latent-space representation for a diffusion model. Existing diffusion models also depend on a single chained denoising process, whose generative capacity is limited mainly by the quality of the original latent variables and the expressive power of the denoising network; during diffusion inference, the difference between predicted and target latent variables cannot be constrained at fine granularity, so the reconstruction of local textures, dynamic boundaries and detail regions in the generated result is insufficient. For feature reconstruction, the prior art generally processes multi-scale features with global convolutional enhancement or uniform interpolation; it lacks fine-grained structural modeling of spatial region differences and cannot optimize features per region according to texture complexity, edge strength or spatial flatness, leading to structural distortion and cross-domain information interference in the multi-scale feature fusion stage.
In the diffusion process, existing methods lack a collaborative constraint mechanism between the generating branch and the discrimination branch, and cannot use discrimination feedback to guide the generating branch toward directional correction in each diffusion step, so the latent variables are insufficiently updated and convergence is slow. Therefore, how to provide a high-resolution video generation method based on a diffusion model is a problem that needs to be solved by those skilled in the art.

Disclosure of Invention
The invention aims to provide a high-resolution video generation method based on a diffusion model that performs fine-grained modeling and latent-space optimization of the multi-scale features of a video sequence through a spatial feature enhancement and discrimination-collaborative diffusion mechanism, realizing high-quality video content generation and structure restoration with strong detail retention and high generation precision.
According to an embodiment of the invention, the high-resolution video generation method based on the diffusion model comprises the following steps: S1, acquiring an input image sequence, performing feature extraction on the image sequence using the spatial feature enhancement model ConvNeXt V, and outputting a plurality of multi-scale spatial feature maps; S2, performing a watershed splitting operation on the multi-scale spatial feature maps, dividing them into a plurality of feature sub-regions according to gradient changes and feature boundaries, and constructing a feature subdomain set; S3, performing structural reconstruction on the feature subdomain set, modeling the features of each subdomain and fusing their context, and outputting a reconstructed feature map sequence; S4, constructing a latent-space diffusion model LDM, encoding the reconstructed feature map sequence into a latent variable sequence, and performing a forward diffusion process to generate a perturbed latent variable sequence; S5, introducing a discrimination-collaborative diffusion mechanism: the perturbed latent variable sequence is input into a generating branch and a discrimination branch, the generating branch predicts reconstructed latent variables through a reverse diffusion process, and the discrimination branch discriminates and optimizes the predicted latent variables against the original latent variables and outputs discrimination feedback; S6, updating the generating-branch parameters according to the discrimination feedback, minimizing the difference between the generated latent variables and the original latent variables through collaborative optimization, and outputting a final latent variable sequence; S7, reconstructing the final latent variable sequence with a decoder and outputting a high-resolution video frame sequence.
Optionally, the S2 specifically includes: S21, in a plurality of scale space feature map sets, performing first-order difference operation on each scale space feature map in the horizontal direction and the vertical direction respectively to obtain a horizontal direction gradient map and a vertical direction gradien