CN-122023613-A - Virtual video try-on method and system based on whole-body human mask guidance


Abstract

A virtual video try-on method and system guided by whole-body human masks. A whole-body mask sequence, a skeleton key-point sequence, and scene prompt words describing the video scene are extracted from the source video, and a video generation network based on the Diffusion Transformer (DiT) is constructed. The designated person's whole-body mask sequence and skeleton key-point sequence are concatenated with the Noisy Latent along the channel dimension as the input of the DiT network; the scene prompt words and the visual feature encodings of the target garment are processed through a cross-attention mechanism and injected into the transformer blocks of the DiT network as content and style guidance for the generated video; the DiT network then performs the reverse diffusion process under these multiple guidance signals to generate the try-on video, completing the person's virtual try-on task. The invention markedly improves the quality, diversity, and controllability of video try-on generation and provides a new technique and paradigm for try-on models in practical applications.

Inventors

  • Shao Dingbao
  • Yi Zili
  • Tai Ying
  • Wu Song
  • Wang Qian

Assignees

  • Nanjing University
  • China Mobile Jiutian Artificial Intelligence Technology (Beijing) Co., Ltd.
  • China Mobile Communications Group Co., Ltd.
  • China Mobile Group Jiangsu Co., Ltd.

Dates

Publication Date
2026-05-12
Application Date
2026-02-02

Claims (8)

  1. A virtual video try-on method based on whole-body human mask guidance, characterized in that a target garment is rendered onto a designated person in a video to complete the virtual try-on task, the method comprising the following steps: 1) Guidance-signal extraction: extracting from the source video a whole-body mask sequence, a skeleton key-point sequence, and scene prompt words describing the video scene; 2) Video generation: constructing a video generation network based on the Diffusion Transformer (DiT); concatenating the designated person's whole-body mask sequence and skeleton key-point sequence with the noisy latent video representation (Noisy Latent) along the channel dimension as the input of the DiT network; processing the scene prompt words and the visual feature encodings of the target garment through a cross-attention mechanism and injecting them into the transformer blocks of the DiT network as content and style guidance for the generated video; the DiT network then performs the reverse diffusion process under these multiple guidance signals to generate the try-on video (minimal illustrative sketches of the individual steps are given after the claims).
  2. The virtual video try-on method based on whole-body human mask guidance according to claim 1, wherein the whole-body mask sequence is extracted as follows: the SAM2 segmentation model segments the whole-body mask of the designated person in each video frame, a YOLOX-seg model generates masks of all persons, and finally the IoU between the two sets of masks is computed to determine the final whole-body mask.
  3. The virtual video try-on method based on whole-body human mask guidance according to claim 1, wherein the skeleton key-point sequence is extracted as follows: from the video frames and the whole-body mask of the designated person, the DWPose model generates the skeleton key-point sequences of all persons in the frame, and the whole-body mask of the designated person is then used to screen out the corresponding skeleton key-point sequence.
  4. The virtual video try-on method based on whole-body human mask guidance according to claim 1, wherein extracting the prompt words describing the scene specifically comprises inputting the source video into a vision-language model to generate a text prompt describing the overall scene of the video.
  5. The virtual video try-on method based on whole-body human mask guidance according to claim 1, wherein concatenating the designated person's whole-body mask sequence and skeleton key-point sequence with the Noisy Latent along the channel dimension specifically comprises: passing the skeleton key-point sequence and the corresponding image sequence of the designated person through a variational autoencoder and concatenating them channel-wise with the whole-body mask; adding the Noisy Latent to this concatenation; generating Input Tokens through a patchify operation; meanwhile generating Garment Tokens from the flat-lay image of the target garment with a Garment Encoder; and feeding the Garment Tokens together with the Input Tokens into the DiT network as its input.
  6. The virtual video try-on method based on whole-body human mask guidance according to claim 1, wherein the scene prompt words and the feature encodings of the target garment are processed through the cross-attention mechanism as follows: a line-art image of the garment is derived from its flat-lay image; the scene prompt words are encoded by a Text Encoder to obtain the prompt-word feature encoding; the flat-lay image is encoded by a contrastive language-image pre-training encoder (CLIP Encoder) and a Garment Encoder to obtain the garment visual feature encoding; the line-art image is encoded by a Line Encoder to obtain the garment line feature encoding; the prompt-word feature encoding and the garment visual and line feature encodings serve as the Keys, the network input serves as the Query, and the output of the cross-attention mechanism is fed into the transformer blocks of the DiT network.
  7. The virtual video try-on method based on whole-body human mask guidance according to claim 1, wherein the training data set of the video generation network is constructed as follows: 1) Selecting video segments suitable for the virtual try-on task from real-world video data and preprocessing them, the preprocessing comprising removing watermarks and special effects, keeping segments whose character motion speed lies within a set range, keeping segments whose character-to-frame proportion lies within a set range, and rejecting segments of low visual quality as judged by a pre-trained aesthetic scoring model; 2) Generating paired video triplets, each consisting of a source video, a target garment image, and the generated try-on video, as the data units of the training set; specifically, a multi-modal video content replacement model takes the source video from step 1) and a target garment image as input and replaces only the garment region of the video while keeping the person's identity, motion, and background fully consistent, thereby generating a try-on video paired with the source segment; 3) Checking the paired triplets obtained in step 2) and screening them to obtain the final data units: person masks are extracted with a human segmentation model to locate the background region, the pixel mean squared error (MSE) of the background region between the source video and the generated video is computed, and triplets whose error exceeds a preset threshold are discarded to filter out samples with contaminated backgrounds; samples that pass this automatic background-consistency check are then systematically verified by hand, the manual inspection covering garment realism and detail, preservation of person identity, temporal consistency, and naturalness of edge handling.
  8. A virtual video try-on system based on whole-body human mask guidance, characterized by comprising a guidance-signal extraction module and a video generation module: the guidance-signal extraction module extracts the three complementary guidance signals, namely the whole-body mask sequence, the skeleton key-point sequence, and the scene prompt words, from the source video and the target garment; the video generation module generates the try-on video based on the extracted guidance signals and the image of the target garment.
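
The reverse diffusion step of claim 1 can be pictured with the following minimal PyTorch sketch. Everything here is an illustrative assumption rather than the patented network: TinyDiT is a stand-in for the DiT backbone, channel counts are arbitrary, and the update rule is a toy replacement for a real diffusion scheduler.

```python
import torch
import torch.nn as nn

class TinyDiT(nn.Module):
    """Stand-in for the DiT video backbone (not the patented network)."""
    def __init__(self, in_ch=12, out_ch=4):
        super().__init__()
        self.net = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x, t, context):
        # A real DiT would run transformer blocks and inject `context`
        # via cross-attention (claim 6); only shapes are mimicked here.
        return self.net(x)

def reverse_diffusion(model, mask_seq, pose_seq, context=None, steps=50):
    """Toy denoising loop under mask + pose + context guidance."""
    b, _, t, h, w = mask_seq.shape
    latent = torch.randn(b, 4, t, h, w)              # Noisy Latent
    for step in reversed(range(steps)):
        # Channel-wise concatenation of guidance with the noisy latent.
        x = torch.cat([latent, mask_seq, pose_seq], dim=1)
        eps = model(x, step, context)                # predicted noise
        latent = latent - eps / steps                # toy update, not a real scheduler
    return latent

model = TinyDiT()
out = reverse_diffusion(model,
                        torch.randn(1, 4, 8, 32, 32),   # mask latents (B, C, T, H, W)
                        torch.randn(1, 4, 8, 32, 32))   # skeleton latents
```

The essential structure matches the claim: the guidance sequences are concatenated channel-wise with the noisy latent at every denoising step.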
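A hedged sketch of the IoU matching in claim 2: given the SAM2 mask of the designated person and YOLOX-seg masks of all persons, one plausible reading is that the candidate with the highest IoU against the SAM2 mask is kept as the final whole-body mask. Model inference itself is omitted, and `select_whole_body_mask` is a hypothetical helper name.

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean masks of equal shape."""
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum()) / float(union)

def select_whole_body_mask(sam2_mask, candidate_masks):
    """Keep the YOLOX-seg candidate that best overlaps the SAM2 mask."""
    ious = [mask_iou(sam2_mask, m) for m in candidate_masks]
    best = int(np.argmax(ious))
    return candidate_masks[best], ious[best]

# Toy usage with synthetic masks.
person = np.zeros((64, 64), dtype=bool)
person[10:40, 10:40] = True
candidates = [np.roll(person, 2, axis=0), np.zeros((64, 64), dtype=bool)]
mask, score = select_whole_body_mask(person, candidates)
```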
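Claim 3's screening step could look like the following sketch: DWPose returns skeletons for all persons, and the skeleton whose joints fall mostly inside the designated person's whole-body mask is kept. The (N, 2) pixel-coordinate keypoint format is an assumption for illustration.

```python
import numpy as np

def keypoints_in_mask_ratio(keypoints: np.ndarray, mask: np.ndarray) -> float:
    """Fraction of (x, y) keypoints that land inside a boolean mask."""
    h, w = mask.shape
    xs = np.clip(keypoints[:, 0].astype(int), 0, w - 1)
    ys = np.clip(keypoints[:, 1].astype(int), 0, h - 1)
    return float(mask[ys, xs].mean())

def select_skeleton(all_skeletons, person_mask):
    """Keep the DWPose skeleton best covered by the person's mask."""
    scores = [keypoints_in_mask_ratio(s, person_mask) for s in all_skeletons]
    return all_skeletons[int(np.argmax(scores))]

# Toy usage: two 17-joint skeletons, one inside the mask region.
mask = np.zeros((64, 64), dtype=bool)
mask[10:40, 10:40] = True
inside = np.random.uniform(12, 38, size=(17, 2))
outside = np.random.uniform(45, 60, size=(17, 2))
chosen = select_skeleton([outside, inside], mask)
```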
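Claim 4 only requires feeding the source video to a vision-language model; a minimal sketch under an assumed, hypothetical VLM interface:

```python
import numpy as np
from typing import Protocol, Sequence

class VisionLanguageModel(Protocol):
    """Hypothetical captioning interface; any VLM could stand behind it."""
    def caption(self, frames: Sequence[np.ndarray], prompt: str) -> str: ...

def extract_scene_prompt(frames: Sequence[np.ndarray],
                         vlm: VisionLanguageModel) -> str:
    # Uniformly subsample a handful of frames so the VLM sees the whole clip.
    stride = max(1, len(frames) // 4)
    return vlm.caption(frames[::stride],
                       prompt="Describe the overall scene of this video.")
```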
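The token assembly of claim 5, sketched with illustrative shapes: VAE-encoded person frames, the mask sequence, and the skeleton sequence are concatenated channel-wise with the Noisy Latent and then patchified into Input Tokens. Channel counts and the patch size are assumptions, and sequence-wise concatenation of Garment Tokens with Input Tokens is one plausible reading of the claim.

```python
import torch

def build_input_tokens(person_lat, mask_lat, pose_lat, patch=2):
    """Channel-concat guidance with a Noisy Latent, then patchify to tokens."""
    # person_lat / mask_lat / pose_lat: (B, C, T, H, W) latent sequences.
    noisy = torch.randn_like(person_lat)                     # Noisy Latent
    x = torch.cat([noisy, person_lat, mask_lat, pose_lat], dim=1)
    b, c, t, h, w = x.shape
    # Patchify: split each frame into patch x patch blocks -> one token each.
    x = x.reshape(b, c, t, h // patch, patch, w // patch, patch)
    x = x.permute(0, 2, 3, 5, 1, 4, 6)
    return x.reshape(b, t * (h // patch) * (w // patch), c * patch * patch)

tokens = build_input_tokens(torch.randn(1, 4, 8, 32, 32),
                            torch.randn(1, 4, 8, 32, 32),
                            torch.randn(1, 4, 8, 32, 32))    # (1, 2048, 64)

# Garment Tokens would come from a Garment Encoder applied to the garment's
# flat-lay image; sequence-wise concatenation is one plausible reading of
# "the Garment Tokens together with the Input Tokens" forming the DiT input.
garment_tokens = torch.randn(1, 77, tokens.shape[-1])        # placeholder
dit_input = torch.cat([garment_tokens, tokens], dim=1)
```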
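Claim 6's multi-condition cross-attention, sketched with PyTorch's built-in scaled dot-product attention: the text, garment-visual, and garment-line feature encodings are concatenated to form the Key/Value sequence while the DiT input tokens act as the Query. The learned Q/K/V projections and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def multi_guidance_cross_attention(query, text_feat, garment_feat, line_feat,
                                   num_heads=8):
    """Cross-attention with all condition encodings concatenated as Key/Value.

    query: (B, Nq, D) DiT input tokens; condition feats: (B, Nk_i, D).
    Learned Q/K/V projection layers are omitted for brevity.
    """
    context = torch.cat([text_feat, garment_feat, line_feat], dim=1)

    def split_heads(x):  # (B, N, D) -> (B, heads, N, D/heads)
        return x.unflatten(-1, (num_heads, -1)).transpose(1, 2)

    out = F.scaled_dot_product_attention(split_heads(query),
                                         split_heads(context),
                                         split_heads(context))
    return out.transpose(1, 2).flatten(-2)            # back to (B, Nq, D)

out = multi_guidance_cross_attention(torch.randn(1, 256, 64),  # Query: input tokens
                                     torch.randn(1, 77, 64),   # text encoding
                                     torch.randn(1, 64, 64),   # garment visual encoding
                                     torch.randn(1, 64, 64))   # garment line encoding
```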
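The automatic background-consistency check of claim 7, step 3), as a sketch: background pixels (everything outside the person mask) of the source and generated videos are compared by mean squared error, and triplets above a threshold are rejected. The threshold value here is an assumption.

```python
import numpy as np

def background_mse(src_frames, gen_frames, person_masks):
    """Mean per-pixel MSE over background (non-person) pixels across frames."""
    errs = []
    for src, gen, m in zip(src_frames, gen_frames, person_masks):
        bg = ~m                      # background = outside the person mask
        if not bg.any():
            continue
        diff = src[bg].astype(np.float64) - gen[bg].astype(np.float64)
        errs.append(np.mean(diff ** 2))
    return float(np.mean(errs)) if errs else 0.0

def keep_triplet(src_frames, gen_frames, person_masks, threshold=25.0):
    """Reject triplets whose background error exceeds the (assumed) threshold."""
    return background_mse(src_frames, gen_frames, person_masks) < threshold
```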
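Finally, the two-module structure of claim 8 as a minimal, purely structural sketch; the interfaces are hypothetical and each module would wrap the components of claims 2 to 6.

```python
from dataclasses import dataclass

@dataclass
class GuidanceSignals:
    mask_sequence: object        # whole-body mask per frame (claim 2)
    skeleton_sequence: object    # skeleton key points per frame (claim 3)
    scene_prompt: str            # VLM-generated scene description (claim 4)

class GuidanceExtractionModule:
    def extract(self, source_video) -> GuidanceSignals:
        raise NotImplementedError  # SAM2 + YOLOX-seg + DWPose + VLM

class VideoGenerationModule:
    def generate(self, signals: GuidanceSignals, garment_image):
        raise NotImplementedError  # DiT reverse diffusion of claim 1
```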

Description

Virtual video try-on method and system based on whole-body human mask guidance

Technical Field

The invention belongs to the technical field of artificial intelligence, relates to video processing technology, and discloses a virtual video try-on method and system guided by whole-body human masks.

Background

Video Virtual Try-on (VVT) techniques aim to realistically render a target garment onto a given person in a video while maintaining consistency of person identity, motion, and background. In recent years, generation methods based on the Diffusion Model (DM) have become the mainstream. The prior art falls into two major forms:

1. Garment-mask-based (Mask-based) schemes. This approach is currently the mainstream; its core idea is to treat video try-on as a video inpainting task. A mask corresponding to the try-on region constrains where replacement may occur, thereby protecting the person's features and the background while the garment is replaced. Most schemes perform controllable garment replacement with a diffusion model conditioned on the target garment, the person's pose, and other signals; such schemes preserve garment details well. Some schemes use text descriptions to define garment patterns, which is highly flexible, but fine details of the garment such as textures and patterns are then difficult to control precisely.

2. Mask-free schemes. To avoid the difficulty of obtaining accurate garment masks, these methods do not rely on a precise garment mask. A strong Mask-based video try-on model is typically used to generate a large amount of pseudo-paired data, and the model acquires the ability to identify the try-on region through training. Although such schemes require no mask at inference time, they must be built on top of a Mask-based scheme and therefore still inherit its drawbacks.

Among existing Mask-based techniques, the video try-on scheme guided by a garment mask and a garment image is chiefly represented by MagicTryOn, which is implemented as follows. 1) Input: receive a person video with the corresponding garment region erased, a target garment image, the corresponding pose-information video, and a descriptive text. 2) Encoding: encode the videos and images into the latent space with a VAE model, and encode the text and the reference garment with text_encoder and clip_encoder. 3) Video generation: feed the encoded information into a DiT-based video generation network, which predicts noise and denoises through a scheduler to obtain the final try-on video. There are also methods such as ViViD and CatVTON; some use a U-Net framework, some a DiT framework, and some add different control conditions, but at their core all are guided by a garment mask that delimits the try-on region.

In summary, both the mask-based and the mask-free video try-on paradigms of the prior art have their respective drawbacks:

1. Contamination of the try-on region. This is a fundamental problem of all Mask-based methods. The effectiveness of such video try-on models depends largely on the accuracy and temporal stability of the garment masks.
The usual way to produce garment masks is to apply a human parsing model frame by frame to the input video and take the mask of the corresponding body part. In real scenes such as outdoor settings, illumination changes, shadows, complex poses, and object occlusion all cause the garment segmentation model to output low-quality masks with flickering edges and frame-to-frame inconsistency. Inaccurate masks directly lead to errors in the try-on region or produce noticeable visual artifacts.

2. Destruction of human-body information. Existing video try-on tasks are generally treated as video inpainting tasks, which follow an "erase, then repair" paradigm. The "erase" step, however, physically destroys the original person's body-structure information and the background immediately adjacent to the person. The model must therefore rely on pose information to compensate for the lost identity information, and inaccurate pose information often introduces deviations, resulting in distorted posture and an unnatural fit between garment and body.

3. Background consistency. The existing mask-free approach avoids the problems above but introduces new ones. Lacking explicit spatial guidance, mask-free models are prone to "flooding" generation when dealing with complex backgrounds, so that background regions unrelated to the person are modified without constraint. In addition, in a multi-