CN-122002099-A - SV2V method and device based on Wan2.2
Abstract
The disclosure belongs to the technical field of multimedia content generation and editing and provides a Wan2.2-based SV2V method and device. The method comprises: extracting a face bounding box for each frame of an input video segment, cropping and scaling to a preset size to generate first face video frames, and combining the first face video frames in time order to obtain a face video stream; converting the face video stream into a corresponding latent representation with the Wan2.2 VAE encoder; based on the Wan2.2 model, adding noise to a trained specific region of the latent representation and performing denoising inference conditioned on the input audio and a reference latent, so as to generate denoised latent data whose mouth shape is aligned with the input audio; and restoring the denoised latent data into second face video frames with the VAE decoder, the second face video frames corresponding one-to-one with the first face video frames and being pasted back into the input video segment according to the corresponding face bounding boxes.
Inventors
- SU QINGCHAO
- XING DONGJIN
- YANG HONGJIN
Assignees
- 厦门蝉镜科技有限公司
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2026-01-26
Claims (10)
- 1. A Wan2.2-based SV2V method, comprising: extracting a face bounding box for each frame of an input video segment, cropping and scaling to a preset size to generate first face video frames, and combining the first face video frames in time order to obtain a face video stream; converting the face video stream into a corresponding latent representation using a Wan2.2 VAE encoder; based on a Wan2.2 model, adding noise to a trained specific region of the latent representation and performing denoising inference conditioned on input audio and a reference latent, so as to generate denoised latent data, wherein at the inference stage the reference latent uses the latent representation itself, and the denoised latent data is mouth-shape aligned with the input audio; and restoring the denoised latent data into second face video frames using the VAE decoder, wherein the second face video frames correspond one-to-one with the first face video frames and are pasted back into the input video segment according to the corresponding face bounding boxes.
- 2. The method of claim 1, wherein the latent representation has dimensions (48, 31, 30, 30) and the face video stream has dimensions (3, 121, 480, 480).
- 3. The method of claim 1, wherein adding noise to the trained specific region of the latent representation comprises: the index range of the specific region in the latent-space feature map is [:, 3:-2, 2:-2].
- 4. The method of claim 1, characterized in that a face noise-prediction loss function is introduced during the training phase, which computes the mean squared error between the predicted noise and the true noise over the specific region of the latent representation.
- 5. The method of claim 1, wherein, during the LoRA training phase, the reference latent is encoded from a different time segment taken from the same video.
- 6. The method of claim 1, wherein extracting the face bounding box for each frame of the input video segment, cropping and scaling to a preset size, and generating the first face video frames comprises: smoothing the face bounding boxes; cropping the face frame by frame according to the smoothed face bounding boxes to obtain face regions; and scaling the face regions to the preset size to generate the first face video frames.
- 7. The method of claim 1, further comprising: dividing the same video into a plurality of input video segments, wherein each input video segment has a fixed number of frames, and the input video segments are connected end to end and fused over overlapping regions; and wherein restoring the denoised latent data into second face video frames using the VAE decoder comprises: taking the leading second face video frames generated for the current input video segment, via reverse inference, as corrected versions of the second face video frames at the tail of the previous input video segment, and performing weighted fusion or replacement.
- 8. An SV2V apparatus based on Wan2.2, the apparatus comprising: a preprocessing module for extracting a face bounding box for each frame of an input video segment, cropping and scaling to a preset size to generate first face video frames, and combining the first face video frames in time order to obtain a face video stream; an encoding module for converting the face video stream into a corresponding latent representation using a Wan2.2 VAE encoder; a model inference module for adding noise to a trained specific region of the latent representation and performing denoising inference conditioned on input audio and a reference latent to generate denoised latent data, wherein at the inference stage the reference latent uses the latent representation itself, and the denoised latent data is mouth-shape aligned with the input audio; and a post-processing module for restoring the denoised latent data into second face video frames using the VAE decoder, wherein the second face video frames correspond one-to-one with the first face video frames and are pasted back into the input video segment according to the corresponding face bounding boxes.
- 9. An electronic device, comprising: a memory storing execution instructions; and a processor executing the execution instructions stored in the memory, causing the processor to perform the method of any one of claims 1-7.
- 10. A readable storage medium having stored therein execution instructions which, when executed by a processor, carry out the method of any one of claims 1-7.
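Claims 2-4 can be illustrated with a small sketch. The latent dimensions (48, 31, 30, 30) and the slice come from the claims themselves; the (channels, frames, height, width) axis layout, the reading of the claim-3 index range as the spatial slice [:, 3:-2, 2:-2], and the single-step noising formula are all assumptions made for illustration, not the patent's implementation.

```python
import numpy as np

# Latent dimensions from claim 2: (48, 31, 30, 30); the (C, T, H, W)
# axis layout is an assumption.
C, T, H, W = 48, 31, 30, 30
rng = np.random.default_rng(0)
latents = rng.standard_normal((C, T, H, W))   # stand-in for VAE latents
noise = rng.standard_normal((C, T, H, W))

# Claim 3: noise is added only inside a specific trained region, read
# here as the spatial slice [:, 3:-2, 2:-2] of the latent feature map.
region = (slice(None), slice(None), slice(3, -2), slice(2, -2))

t = 0.5                                       # illustrative noise level
noised = latents.copy()
noised[region] = (1 - t) * latents[region] + t * noise[region]

# Claim 4: face noise-prediction loss = MSE between predicted and true
# noise, restricted to the specific region. Real predictions come from
# the diffusion model; a random placeholder stands in here.
predicted = rng.standard_normal((C, T, H, W))
loss = np.mean((predicted[region] - noise[region]) ** 2)
```

Outside the region the latents are left untouched, so both the noising and the loss act only on the trained face area.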
Description
SV2V method and device based on Wan2.2
Technical Field
The disclosure belongs to the technical field of multimedia content generation and editing, and particularly relates to a Wan2.2-based SV2V method and device.
Background
Currently, mainstream SV2V technologies are mainly built on the generative adversarial network (GAN) framework, such as Wav2Lip and DINet. These models focus on feature extraction and generation in the spatial dimension when processing video, but lack deep learning capabilities in the video time domain (the temporal sequence). This results in poor stability of the generated mouth shapes between successive frames, with problems such as jitter, unnatural transitions, or expression distortion. In contrast, the DiT (Diffusion Transformer) network architecture has strong learning capabilities in both the temporal and spatial dimensions, enabling better capture of dynamic changes in video sequences. Among such models, the Wan2.2 multi-modal diffusion model compresses video into a latent space through a variational autoencoder (VAE 2.2) and, combined with text/audio-conditioned generation, stands out in identity preservation and motion consistency. However, applying it to the SV2V task still faces an adaptation challenge: Wan2.2 natively supports text-driven video generation but lacks a dedicated 'audio → mouth' training target, so phoneme-level synchronization is difficult to achieve. Based on the above, the present method aims at improving the lip-synchronization accuracy of an SV2V system based on the Wan2.2 model.
Disclosure of Invention
The disclosure provides an SV2V method and device based on Wan2.2, which can effectively solve the above problems.
The present disclosure is implemented as follows. In a first aspect, the present disclosure provides a Wan2.2-based SV2V method comprising: extracting a face bounding box for each frame of an input video segment, cropping and scaling to a preset size to generate first face video frames, and combining the first face video frames in time order to obtain a face video stream; converting the face video stream into a corresponding latent representation using a Wan2.2 VAE encoder; based on a Wan2.2 model, adding noise to a trained specific region of the latent representation and performing denoising inference conditioned on input audio and a reference latent, so as to generate denoised latent data, wherein at the inference stage the reference latent uses the latent representation itself, and the denoised latent data is mouth-shape aligned with the input audio; and restoring the denoised latent data into second face video frames using the VAE decoder, wherein the second face video frames correspond one-to-one with the first face video frames and are pasted back into the input video segment according to the corresponding face bounding boxes.
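The first-aspect pipeline can be sketched end to end. Everything below is illustrative: nearest-neighbour resizing stands in for the real preprocessing, and the encoder, denoiser, and decoder are passed in as callables because Wan2.2 exposes no public API under these names.

```python
import numpy as np

def crop_and_resize(frame, box, size):
    """Crop box=(x0, y0, x1, y1) and nearest-neighbour resize to size x size."""
    x0, y0, x1, y1 = box
    face = frame[y0:y1, x0:x1]
    ys = np.arange(size) * face.shape[0] // size
    xs = np.arange(size) * face.shape[1] // size
    return face[ys][:, xs]

def paste_back(frame, box, face):
    """Resize the face back to its box and write it into the frame in place."""
    x0, y0, x1, y1 = box
    ys = np.arange(y1 - y0) * face.shape[0] // (y1 - y0)
    xs = np.arange(x1 - x0) * face.shape[1] // (x1 - x0)
    frame[y0:y1, x0:x1] = face[ys][:, xs]

def sv2v(frames, boxes, audio, vae_encode, denoise, vae_decode, size=480):
    # 1. Crop each frame's face and scale to the preset size.
    faces = np.stack([crop_and_resize(f, b, size)
                      for f, b in zip(frames, boxes)])
    # 2. Encode the face stream to a latent representation.
    latents = vae_encode(faces)
    # 3. Denoise conditioned on the audio; at inference the clean
    #    latents double as the reference latent.
    denoised = denoise(latents, audio, reference=latents)
    # 4. Decode and paste each face back into its source frame.
    for frame, box, face in zip(frames, boxes, vae_decode(denoised)):
        paste_back(frame, box, face)
    return frames
```

With identity encode/decode and a pass-through denoiser, the pipeline round-trips the input frames, which makes for a quick sanity check of the crop/paste geometry.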
In a second aspect, the present disclosure provides a Wan2.2-based SV2V device comprising: a preprocessing module for extracting a face bounding box for each frame of an input video segment, cropping and scaling to a preset size to generate first face video frames, and combining the first face video frames in time order to obtain a face video stream; an encoding module for converting the face video stream into a corresponding latent representation using a Wan2.2 VAE encoder; a model inference module for adding noise to a trained specific region of the latent representation and performing denoising inference conditioned on input audio and a reference latent to generate denoised latent data, wherein at the inference stage the reference latent uses the latent representation itself, and the denoised latent data is mouth-shape aligned with the input audio; and a post-processing module for restoring the denoised latent data into second face video frames using the VAE decoder, wherein the second face video frames correspond one-to-one with the first face video frames and are pasted back into the input video segment according to the corresponding face bounding boxes. In a third aspect, the present disclosure provides an electronic device comprising: a memory storing execution instructions; and a processor executing the execution instructions stored in the memory, causing the processor to perform the method of the first aspect. In a fourth aspect, the present disclosure provides a readable storage medium having stored therein execution instructions which, when executed by a processor, carry out the method of the first aspect. Compared with the prior art, the beneficial effects of the present disclosure are as follows. The disclosure provides an SV2V method and device based on Wan2.2, which adopts a LoRA adapter trained by mout