
CN-121985199-A - Diffusion model video local editing method and system based on mask guidance

CN121985199A

Abstract

The invention provides a diffusion model video local editing method and system based on mask guidance. The method comprises: encoding a video sequence of a target video to obtain a first video frame latent space feature; performing coarse-granularity mask labeling on a target editing region in the target video to obtain mask information; determining a spatial attention weight according to the first video frame latent space feature; determining a second video frame latent space feature according to the first video frame latent space feature, the mask information, and the spatial attention weight; determining a temporal attention weight according to the second video frame latent space feature; determining a third video frame latent space feature according to the mask information, the second video frame latent space feature, and the temporal attention weight; and decoding the third video frame latent space feature to generate an edited video. The invention effectively enhances the spatio-temporal consistency of local video editing without requiring an accurate frame-by-frame mask, ensures that unedited regions remain unchanged, and achieves a high-quality, stable editing effect.

Inventors

  • CAO LI
  • DONG CHAO
  • ZHAO YANG
  • LI XINJIE
  • LIU XIAOPING

Assignees

  • Hefei University of Technology (合肥工业大学)

Dates

Publication Date
2026-05-05
Application Date
2026-02-10

Claims (10)

  1. A diffusion model video local editing method based on mask guidance, characterized by comprising the following steps: performing VAE encoding on a video sequence of a target video to obtain a video frame latent space feature of the target video as a first video frame latent space feature; performing coarse-granularity mask labeling on a target editing region in the target video to obtain mask information; determining a spatial attention weight according to the first video frame latent space feature; determining a spatially enhanced first video frame latent space feature, to serve as a second video frame latent space feature, according to the first video frame latent space feature, the mask information, and the spatial attention weight; determining a temporal attention weight according to the second video frame latent space feature; determining a spatio-temporally enhanced second video frame latent space feature, to serve as a third video frame latent space feature, according to the mask information, the second video frame latent space feature, and the temporal attention weight; and decoding the third video frame latent space feature to generate an edited video, wherein the content of the target editing region is modified and the unedited region remains unchanged.
  2. The diffusion model video local editing method of claim 1, wherein said determining a spatial attention weight according to the first video frame latent space feature comprises: calculating the spatial attention weight W_s according to the following formula: W_s = σ(Conv_{7×7}(combined[max(F), avg(F)])); wherein F is the first video frame latent space feature; max(·) denotes taking the maximum value along the channel dimension; avg(·) denotes taking the average value along the channel dimension; combined[a, b] denotes concatenating the results a and b along the channel dimension; Conv_{7×7}(·) denotes feeding the multi-channel feature map into a 7×7 convolution layer that outputs a single-channel feature map; and σ(·) denotes the sigmoid activation function (an illustrative sketch of this computation follows the claims).
  3. The diffusion model video local editing method of claim 1, wherein said determining the spatially enhanced first video frame latent space feature as the second video frame latent space feature according to the first video frame latent space feature, the mask information, and the spatial attention weight comprises: calculating the second video frame latent space feature F_s according to the following formula: F_s = F ⊙ (W_s ⊙ M); wherein F is the first video frame latent space feature, ⊙ denotes the element-wise product, W_s is the spatial attention weight, and M is the mask information.
  4. The diffusion model video local editing method of claim 1, wherein said determining a temporal attention weight according to the second video frame latent space feature comprises: calculating the temporal attention weight W_t according to the following formula: W_t = σ(f(AvgPool3D_{H,W}(F_s)) + f(MaxPool3D_{H,W}(F_s))); wherein σ(·) denotes the sigmoid activation function; f(·) denotes a feature transformation module comprising two 3D convolution layers for processing video frames and one activation function layer; AvgPool3D_{H,W}(·) denotes a 3D average pooling operation; F_s is the second video frame latent space feature; MaxPool3D_{H,W}(·) denotes a 3D max pooling operation; H is the height dimension of the feature; and W is the width dimension of the feature (a sketch of this module also follows the claims).
  5. The diffusion model video local editing method of claim 1, wherein said determining the spatio-temporally enhanced second video frame latent space feature as the third video frame latent space feature according to the mask information, the second video frame latent space feature, and the temporal attention weight comprises: calculating the third video frame latent space feature F_st according to the following formula: F_st = F_s ⊙ (W_t ⊙ M); wherein F_s is the second video frame latent space feature, ⊙ denotes the element-wise product, W_t is the temporal attention weight, and M is the mask information.
  6. The diffusion model video local editing method of claim 1, further comprising: constructing a noise prediction loss function to optimize the second video frame latent space feature and the third video frame latent space feature, wherein the noise prediction loss function is based on the noise prediction task of the diffusion model and minimizes the Euclidean distance between the predicted noise and the real noise.
  7. The diffusion model video local editing method of claim 6, wherein said constructing a noise prediction loss function to optimize the second video frame latent space feature and the third video frame latent space feature comprises: constructing the expression of the noise prediction loss function Loss: Loss = E_{z, ε∼N(0,1), t}[‖ε − ε_θ(z_t, t, c′)‖₂²]; wherein E denotes the expectation; z, ε∼N(0,1) denotes that the video frame latent space feature z and the real added noise ε obey the standard normal distribution; t is the diffusion time step; z_t is the noisy latent feature at time step t; c′ is the text condition; ε_θ(·) is the noise predicted by the diffusion model; and ‖·‖₂² is the squared Euclidean norm (a sketch of this loss follows the claims).
  8. A diffusion model video local editing system based on mask guidance, comprising: an encoding module, configured to perform VAE encoding on a video sequence of a target video to obtain a video frame latent space feature of the target video as a first video frame latent space feature; a mask labeling module, configured to perform coarse-granularity mask labeling on a target editing region in the target video to obtain mask information; a first determining module, configured to determine a spatial attention weight according to the first video frame latent space feature; a second determining module, configured to determine a spatially enhanced first video frame latent space feature, as a second video frame latent space feature, according to the first video frame latent space feature, the mask information, and the spatial attention weight; a third determining module, configured to determine a temporal attention weight according to the second video frame latent space feature; a fourth determining module, configured to determine a spatio-temporally enhanced second video frame latent space feature, as a third video frame latent space feature, according to the mask information, the second video frame latent space feature, and the temporal attention weight; and a decoding module, configured to decode the third video frame latent space feature to generate an edited video, wherein the content of the target editing region is modified and the unedited region remains unchanged.
  9. A computer device comprising a processor and a memory, wherein the processor, when executing a computer program stored in the memory, implements the steps of the mask-guided diffusion model video local editing method according to any one of claims 1-7.
  10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the mask-guided diffusion model video local editing method according to any one of claims 1-7.
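For illustration, a minimal PyTorch sketch of the spatial attention of claims 2-3 follows. The patent specifies no implementation; the names (SpatialAttention, spatially_enhance) and the (B, C, H, W) tensor layout are assumptions.

```python
# Minimal sketch of claims 2-3, assuming PyTorch and (B, C, H, W) latent features.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """W_s = sigma(Conv_7x7(combined[max(F), avg(F)])), as in claim 2."""
    def __init__(self):
        super().__init__()
        # Two input channels (channel-wise max and average maps) -> one weight map.
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, feat):                          # feat: (B, C, H, W)
        max_map, _ = feat.max(dim=1, keepdim=True)    # maximum along the channel dimension
        avg_map = feat.mean(dim=1, keepdim=True)      # average along the channel dimension
        return torch.sigmoid(self.conv(torch.cat([max_map, avg_map], dim=1)))

def spatially_enhance(feat, w_s, mask):
    """F_s = F ⊙ (W_s ⊙ M), as in claim 3; mask is the coarse mask, broadcastable."""
    return feat * (w_s * mask)
```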
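The temporal weight of claims 4-5 can be sketched analogously. Two points the claim leaves open are assumed here: the pooling reduces the spatial axes H, W while keeping the frame axis D, and the transformation f is shared between the average and max branches; the hidden width is an arbitrary choice.

```python
# Minimal sketch of claims 4-5, assuming video latents shaped (B, C, D, H, W),
# with D the number of frames.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """W_t = sigma(f(AvgPool3D_{H,W}(F_s)) + f(MaxPool3D_{H,W}(F_s))), as in claim 4."""
    def __init__(self, channels, hidden=16):
        super().__init__()
        # f: two 3D convolution layers with one activation layer (claim 4).
        self.f = nn.Sequential(
            nn.Conv3d(channels, hidden, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
            nn.ReLU(inplace=True),
            nn.Conv3d(hidden, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
        )
        self.avg = nn.AdaptiveAvgPool3d((None, 1, 1))  # pool H, W; keep the frame axis D
        self.max = nn.AdaptiveMaxPool3d((None, 1, 1))

    def forward(self, f_s):                            # f_s: (B, C, D, H, W)
        return torch.sigmoid(self.f(self.avg(f_s)) + self.f(self.max(f_s)))

def spatiotemporally_enhance(f_s, w_t, mask):
    """F_st = F_s ⊙ (W_t ⊙ M), as in claim 5."""
    return f_s * (w_t * mask)
```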
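Finally, the loss of claims 6-7 is the standard diffusion noise-prediction MSE. In this sketch, `model` and `alphas_cumprod` are hypothetical stand-ins for the noise predictor ε_θ and the cumulative noise schedule, neither of which the patent names.

```python
# Minimal sketch of claim 7: Loss = E_{z, eps~N(0,1), t} ||eps - eps_theta(z_t, t, c')||_2^2.
import torch

def noise_prediction_loss(model, z0, text_cond, alphas_cumprod):
    """`model(z_t, t, c)` plays the role of eps_theta; `z0` is the clean latent."""
    b = z0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=z0.device)
    eps = torch.randn_like(z0)                             # real added noise, eps ~ N(0, 1)
    a_bar = alphas_cumprod[t].view(b, *([1] * (z0.dim() - 1)))
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps   # noisy latent at time step t
    eps_pred = model(z_t, t, text_cond)                    # predicted noise
    return ((eps - eps_pred) ** 2).mean()                  # squared Euclidean distance
```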

Description

Diffusion model video local editing method and system based on mask guidance

Technical Field

The invention belongs to the technical field of image processing and computer vision, and in particular relates to a diffusion model video local editing method and system based on mask guidance.

Background

With the development of computer vision and deep learning, video content generation and editing methods based on diffusion models are becoming mainstream. In video editing tasks, users often want to replace content only in local regions of the video, such as changing the appearance, texture, or shape of an object, without affecting the unedited regions. Traditional video editing relies mainly on manual frame-by-frame modification or optical-flow-based content propagation, but such methods are complex to operate, time-consuming, and labor-intensive, and are prone to obvious visual artifacts and structural distortions in videos with large motion or occlusion changes. Realizing high-quality local video editing without frame-by-frame manual processing therefore has important application value.

Diffusion models excel at image generation and editing, and some text-based editing methods achieve semantic replacement of video content by fine-tuning an image diffusion model. However, such methods generally assume that the edited region is the main semantic object in the video; it is often difficult to achieve precise control when editing local, non-salient regions in complex scenes, and problems such as editing-region drift, semantic mismatch, and erroneous modification of unedited regions easily arise. In addition, when the editing region changes significantly over time, the lack of cross-frame consistency modeling easily produces temporal defects such as texture flicker and shape deformation in the generated result.

To improve the controllability of local-region editing, some methods introduce mask models to delimit the editing region. However, these methods generally require a precise mask for every frame, are costly to operate, and are difficult to use at scale. At the same time, most current methods do not explicitly model the spatio-temporal feature correlations of the mask region inside the diffusion model, so that although local content can be replaced, it remains difficult to keep it consistent across the video sequence.

Disclosure of Invention

The invention provides a diffusion model video local editing method and system based on mask guidance, which enhance the spatio-temporal consistency and semantic expressiveness of the locally edited region and ensure that unedited regions in the video are not damaged, without requiring an accurate frame-by-frame mask.
In a first aspect, the invention provides a diffusion model video local editing method based on mask guidance, comprising: performing VAE encoding on a video sequence of a target video to obtain a video frame latent space feature of the target video as a first video frame latent space feature; performing coarse-granularity mask labeling on a target editing region in the target video to obtain mask information; determining a spatial attention weight according to the first video frame latent space feature; determining a spatially enhanced first video frame latent space feature, to serve as a second video frame latent space feature, according to the first video frame latent space feature, the mask information, and the spatial attention weight; determining a temporal attention weight according to the second video frame latent space feature; determining a spatio-temporally enhanced second video frame latent space feature, to serve as a third video frame latent space feature, according to the mask information, the second video frame latent space feature, and the temporal attention weight; and decoding the third video frame latent space feature to generate an edited video, wherein the content of the target editing region is modified and the unedited region remains unchanged.

Optionally, the determining the spatial attention weight according to the first video frame latent space feature includes calculating the spatial attention weight W_s according to the following formula: W_s = σ(Conv_{7×7}(combined[max(F), avg(F)])); wherein F is the first video frame latent space feature; max(·) denotes taking the maximum value along the channel dimension; avg(·) denotes taking the average value along the channel dimension; combined[a, b] denotes concatenating the results a and b along the channel dimension; Conv_{7×7}(·) denotes feeding the multi-channel feature map into a 7×7 convolution layer that outputs a single-channel feature map; and σ(·) denotes the sigmoid activation function.

Optionally, the determ
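To make the data flow of the first aspect concrete, the fragment below chains the SpatialAttention and TemporalAttention sketches given after the claims. All shapes, the per-frame application of the spatial stage, and the mask layout are assumptions rather than details taken from the patent.

```python
# Hypothetical end-to-end pass over the latent features (reuses the sketches above).
import torch

B, C, D, H, W = 1, 4, 8, 32, 32               # batch, channels, frames, height, width
f1 = torch.randn(B, C, D, H, W)               # first latent feature (VAE-encoded video)
mask = torch.ones(B, 1, D, H, W)              # coarse-granularity mask, broadcastable

spatial, temporal = SpatialAttention(), TemporalAttention(C)

# The spatial stage operates per frame: fold the frame axis D into the batch.
f1_2d = f1.permute(0, 2, 1, 3, 4).reshape(B * D, C, H, W)
w_s = spatial(f1_2d).reshape(B, D, 1, H, W).permute(0, 2, 1, 3, 4)
f2 = spatially_enhance(f1, w_s, mask)         # second latent feature (claim 3)

w_t = temporal(f2)                            # (B, C, D, 1, 1) temporal weights
f3 = spatiotemporally_enhance(f2, w_t, mask)  # third latent feature (claim 5)
# f3 would then be decoded by the VAE to yield the edited video.
```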