
CN-120050481-B - Method for generating graphically generated video based on training-free strategy and multi-subject attention alignment

CN 120050481 B

Abstract

The invention discloses a method for generating graphically generated video based on a training-free strategy and multi-subject attention alignment, and belongs to the technical field of image-to-video generation. Different subjects are separated through subject-aware and attention-unlocking processing: when attention is computed, the information of one subject is aligned with its own text while the other subjects are masked out. The frame features and the text features of each subject must be aligned in the attention computation; the unlocked attention extracts each subject and aligns it with the text features corresponding to that subject, while the other subjects do not participate in the computation, so that in the final result each subject performs only the action described by its own text.

Inventors

  • LIU HENG
  • YAN YI

Assignees

  • 安徽工业大学 (Anhui University of Technology)

Dates

Publication Date
2026-05-05
Application Date
2025-02-21

Claims (4)

  1. A method for generating graphically generated video based on a training-free strategy and multi-subject attention alignment, characterized by comprising the following steps: S1, acquiring an input text and an input picture; S2, separating the input text, segmenting the input picture to obtain a mask of each subject, and performing perception processing on the masks; S3, inputting the extracted masks, the separated texts, and the corresponding input picture into a diffusion-model-based pre-trained image-to-video (I2V) model; S4, encoding the picture, copying the encoded result into a plurality of copies, and adding noise to them; S5, sending the separated texts to a CLIP text encoder to extract the features of each subject, copying each subject feature as many times as the number of frames of the target generated video, and sequentially storing the copied subject features in a list to form a set $C$; S6, inputting the obtained set $C$, the mask of each subject region, and the noise into a U-Net network; S7, when computing cross attention in the U-Net network, using the mask $M_i$ to select the subject features $Q_i$ of each subject from the frame features and matching them with the corresponding text features $K_i$ and $V_i$; first computing the dot product of $Q_i$ and $K_i$ to obtain attention weights representing the correlation between each input and the other inputs, normalizing the attention weights, and applying the normalized attention weights to the corresponding value vectors to generate an attention output, wherein $K_i$ and $V_i$ are obtained from the set $C$ and $i$ denotes the $i$-th subject; S8, splicing the attention outputs of all subjects and normalizing them by the sum of all masks to obtain the attention DeAttention matched with the diffusion-model-based pre-trained image-to-video (I2V) model, replacing the attention in the original attention module of the U-Net network with it, and then obtaining the latent variable after one denoising step; S9, generating the final latent variable of the one-step denoising by using a classifier-free guidance method, thereby obtaining the latent representation; S10, repeating the denoising T times to gradually eliminate the noise and obtain the target generated video (illustrative sketches of steps S4, S7-S8 and S9 appear after the claims); in said step S2, the specific processing procedure is as follows: S21, separating, through input-text separation processing, a descriptive text containing a plurality of subjects, and storing the description of each subject separately in a set; S22, segmenting, through visual segmentation processing, each subject in the picture using the separated text descriptions to generate a corresponding subject-region mask; and S23, performing perception processing on each mask, namely setting the pixel values of the region corresponding to the subject to 1 and the pixel values of the regions of the other subjects to 0, so that each mask accurately represents the region of its corresponding subject.
  2. The method for generating graphically generated video based on a training-free strategy and multi-subject attention alignment according to claim 1, wherein in said step S4 the specific processing procedure is as follows (see the latent-preparation sketch after the claims): S41, converting the input picture into a latent variable through a pre-trained encoder, and then performing noise-adding processing on the latent variable; S42, copying the latent variable into a plurality of copies so that the number of latent variables is consistent with the number of frames of the target generated video; S43, adding noise of one frame-number dimension to each latent variable; and S44, splicing the noised latent variables along the frame-number dimension to obtain a latent representation suitable for video generation.
  3. The method for generating graphically generated video based on a training-free strategy and multi-subject attention alignment according to claim 2, wherein in said step S7 the subject features $Q_i$ are calculated as $Q_i = M_i \odot Z$, where $M_i$ is the mask of the $i$-th subject and $Z$ denotes the frame features; the attention output is calculated as $O_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i$; wherein $O_i$ is the attention output of the $i$-th subject with its text features aligned to the subject features, and $d_k$ is the dimension of $K_i$ and $Q_i$.
  4. The method for generating graphically generated video based on a training-free strategy and multi-subject attention alignment according to claim 3, wherein in said step S9 the latent representation is acquired as $\tilde{z} = z_{uc} + w\,(z_{c} - z_{uc})$ (see the guidance sketch after the claims); wherein $w$ is the guidance weight used in sampling, $\epsilon_\theta$ is the U-Net network, $z_{c}$ is the conditional latent variable predicted by $\epsilon_\theta$, and $z_{uc}$ is the unconditional latent variable predicted by $\epsilon_\theta$.
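
The following is a minimal PyTorch sketch of the latent preparation in claim 2 (steps S41-S44), not the patented implementation. The encoder name `encoder`, the DDPM-style schedule `alphas_cumprod`, and the exact ordering of replication and noising are illustrative assumptions.

```python
import torch

def prepare_video_latents(image, encoder, alphas_cumprod, t, num_frames):
    """Encode an image, replicate the latent per target frame, and add per-frame noise
    (a condensed sketch of steps S41-S44; the noise schedule is an assumed DDPM-style one)."""
    with torch.no_grad():
        z0 = encoder(image)                                  # (B, C, H, W) latent of the input picture (S41)
    z0 = z0.unsqueeze(2).repeat(1, 1, num_frames, 1, 1)      # (B, C, F, H, W): one copy per target frame (S42)
    noise = torch.randn_like(z0)                             # independent noise for every frame (S43)
    a_t = alphas_cumprod[t]                                  # cumulative signal level at step t
    z_t = a_t.sqrt() * z0 + (1.0 - a_t).sqrt() * noise       # noised latents stacked along the frame axis (S44)
    return z_t
```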
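Below is a minimal PyTorch sketch of the subject-masked cross attention ("DeAttention") described in steps S7-S8 and claim 3. The linear Q/K/V projections of a real U-Net attention block are omitted, and the argument names (`frame_feats`, `subject_text_feats`, `subject_masks`) are hypothetical; the masks are assumed to be flattened to the same token layout as the frame features.

```python
import torch

def subject_masked_cross_attention(frame_feats, subject_text_feats, subject_masks):
    """
    frame_feats:        (B, N, d) flattened frame/latent features (N spatial tokens)
    subject_text_feats: list of (B, L, d) per-subject text features (the set C from S5)
    subject_masks:      list of (B, N, 1) binary masks, 1 inside the corresponding subject
    Returns the combined attention output of S7-S8: each subject attends only to its
    own text features, and the per-subject outputs are merged and normalized by the mask sum.
    """
    d_k = frame_feats.size(-1)
    outputs = torch.zeros_like(frame_feats)
    mask_sum = torch.zeros_like(subject_masks[0])
    for c_i, m_i in zip(subject_text_feats, subject_masks):
        q_i = frame_feats * m_i                                        # Q_i = M_i ⊙ Z (claim 3)
        k_i, v_i = c_i, c_i                                            # K_i, V_i from subject i's text features
        attn = torch.softmax(q_i @ k_i.transpose(-1, -2) / d_k ** 0.5, dim=-1)
        outputs = outputs + (attn @ v_i) * m_i                         # other subjects do not participate
        mask_sum = mask_sum + m_i
    return outputs / mask_sum.clamp(min=1.0)                           # normalize by the sum of all masks (S8)
```

A drop-in of this kind would replace the cross-attention call inside each U-Net block (S8) without any retraining, which is what makes the strategy training-free.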
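Finally, a minimal sketch of the classifier-free-guidance combination of step S9 / claim 4, assuming a U-Net callable `unet(z_t, t, cond)` that returns a latent prediction; the call signature and the use of `None` for the unconditional branch are illustrative assumptions.

```python
def classifier_free_guidance_step(unet, z_t, t, cond, w):
    """Combine conditional and unconditional predictions as in claim 4:
    z = z_uc + w * (z_c - z_uc), where w is the guidance weight."""
    z_c = unet(z_t, t, cond)       # conditional latent prediction
    z_uc = unet(z_t, t, None)      # unconditional latent prediction (empty condition)
    return z_uc + w * (z_c - z_uc)
```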

Description

Method for generating graphically generated video based on training-free strategy and multi-subject attention alignment

Technical Field

The invention relates to the technical field of video generation, and in particular to a method for generating graphically generated video based on a training-free strategy and multi-subject attention alignment.

Background

Diffusion models have achieved significant results in the field of generation, particularly image generation, even surpassing generative adversarial networks (GANs). With advances in text-to-image generation technology, video generation tasks have also evolved rapidly. The Video Diffusion Model (VDM) was the first to expand the 2D U-Net into a 3D U-Net structure, realizing joint training on images and videos. In addition, AnimateDiff trains a motion module that adapts to different personalized text-to-image (T2I) models and generates high-quality video in combination with other specialized content models. Text2VideoZero proposes a sampling method that requires no additional training and enhances motion dynamics while maintaining frame consistency, thereby generating video content that meets expectations.

Customized video generation aims to generate highly personalized video content according to the specific needs or input conditions of users, mainly in the text-to-video field. This is achieved by fine-tuning the generated video subject using text with a specific token, and fine-tuning a specific motion using text representing the motion in the relevant video. The video generated from the fine-tuned text then has the required subject and motion, enabling customized video generation. Existing customized video generation relies on fine-tuning. For example, CustomVideo uses the DreamBooth fine-tuning method to adapt its model with specific text representing the image. LAMP fine-tunes specific action texts, binding them to the motions in the training video to achieve motion-specific fine-tuning. DreamVideo fine-tunes the spatial subject and the temporal motion separately to obtain a video model with a specific subject and motion. Recently, multi-subject customization has also been studied: DisenStudio feeds image data containing multiple subjects into Stable Diffusion, fine-tunes specific text prompts, and distinguishes the different subjects with masks to achieve text-driven motion for each of them. Customized video generation thus advances multi-target control over subjects and motion and enriches the field of video generation.

Recently, progress has also been made in research on multi-subject generation. In the field of image generation, Be Yourself significantly alleviates, in a training-free manner, the semantic confusion and misplacement among multiple subjects described in the text during Stable Diffusion generation. Another text-to-image diffusion approach utilizes a large language model to automatically separate the subjects in the text into independent prompts; each prompt enters the U-Net simultaneously, and the final latent results are concatenated along specific dimensions so that the subjects do not interfere with one another, realizing multi-subject image generation. In the video field, multi-subject generation is also developing rapidly, mainly focusing on personalized customized video generation. For example, DisenStudio and CustomVideo each train a model by stitching multiple subjects into a single image.
They use special placeholder tokens to bind the subjects and use masks to distinguish between the different subjects. The former uses a stitching method to combine the different subjects, bases the cross attention on the different texts instead of the original attention, and uses masks to distinguish the subjects one by one; the latter uses masks to distinguish the subjects within the cross attention. Recent research in the image-to-video field has also made progress on the controlled motion of multiple subjects, such as Follow-Your-Pose v2, which animates multiple objects in an image according to multi-person poses. Image-to-video generation typically animates the input image according to given conditions to obtain dynamic content from the image. With its development, several branches have appeared. For example, text-guided image-to-video generation aims to generate video that matches both the text semantics and the image content. Human pose-guided models convert a human image into a video under additional controls such as dense poses, depth maps, and the like. Image animation under optical-flow conditions is similar to pose-controlled animation and focuses mainly on character animation. Models that control motion within a mask region do so by controlling the motion of selected regions in the image, whereas trajectory-control methods control the direction of motion. These models typically combine clean images with or inject image information i