
CN-122002104-A - Video generation method, device, electronic equipment and computer storage medium

CN122002104A

Abstract

The disclosure provides a video generation method, a video generation apparatus, an electronic device, and a computer storage medium, which can be applied to the technical field of artificial intelligence. The method includes: obtaining an input text for indicating video generation requirements and an input video serving as a video generation reference; compressing the input video into a first latent code in a latent space using a video encoder; denoising the first latent code added with random noise, with the input text as a denoising condition, to obtain a second latent code; reconstructing the second latent code into a first video according to the input text using a first video decoder symmetrical to the video encoder; and upsampling the first video using a shared upsampling module to obtain a second video, wherein the resolutions of the first video and the input video are the same and smaller than that of the second video, and the shared upsampling module shares training parameters of a video generation task and a resolution optimization task.

Inventors

  • LIU KUN
  • LIU XINCHEN

Assignees

  • JINGDONG TECHNOLOGY HOLDING CO., LTD. (京东科技控股股份有限公司)

Dates

Publication Date
2026-05-08
Application Date
2026-03-10

Claims (12)

  1. A video generation method, comprising: acquiring an input text for indicating video generation requirements and an input video serving as a video generation reference; compressing the input video into a first latent code in a latent space using a video encoder; denoising the first latent code added with random noise by taking the input text as a denoising condition to obtain a second latent code; reconstructing the second latent code into a first video according to the input text using a first video decoder symmetrical to the video encoder; and upsampling the first video using a shared upsampling module to obtain a second video, wherein the resolutions of the first video and the input video are the same and smaller than the resolution of the second video, and the shared upsampling module shares training parameters of a video generation task and a resolution optimization task.
  2. The method of claim 1, wherein reconstructing the second latent code into a first video according to the input text using a first video decoder symmetrical to the video encoder comprises: performing, with the first video decoder, three-dimensional transposed convolution processing on the second latent code to lift the feature dimension of the second latent code in the latent space to the feature dimension of the input video, so as to obtain first intermediate information; fusing semantic feature information of the input text with the first intermediate information to obtain fused information; and reconstructing the fused information into the first video using an activation function.
  3. The method of claim 1, wherein denoising the first latent code added with random noise by taking the input text as a denoising condition to obtain a second latent code comprises: taking the video generation requirement indicated by the input text as semantic guidance, and iteratively denoising the first latent code added with random noise using a diffusion model to obtain a second latent code matching the semantics of the input text.
  4. The method according to claim 1 or 2, wherein the video generation task is used for generating the first video; the resolution optimization task is performed in a training process of the shared upsampling module and is used for reconstructing a third latent code with a pre-trained second video decoder to obtain a third video, the third latent code being obtained by compressing a first tag video, the first tag video being a tag video having the same resolution as a sample input video; and the shared upsampling module is trained using a sample first video and the third video, the sample first video being obtained by reconstructing a sample second latent code.
  5. The method of claim 4, wherein the pre-trained second video decoder reconstructs the third latent code into the third video by: performing three-dimensional transposed convolution processing on the third latent code to lift the feature dimension of the third latent code in the latent space to the feature dimension of the first tag video, so as to obtain second intermediate information, wherein the pre-trained second video decoder and the first video decoder obtain the second intermediate information and the first intermediate information, respectively, using convolution layers of the same structure, and the parameters of the convolution layers of the pre-trained second video decoder and the first video decoder are different; and reconstructing the second intermediate information into the third video using an activation function.
  6. 6. The method of claim 4, wherein the shared upsampling module is trained by: respectively upsampling the first video and the third video of the sample by using a pre-trained shared upsampling module to obtain a second video and a fourth video of the sample; Determining an objective loss function according to at least two of the sample input video, the sample second video, the fourth video, the first tag video, a second tag video, the sample first video and the third video, wherein the resolution of the second tag video is the same as that of the sample second video, downsampling the second tag video to obtain the first tag video, and And under the constraint of the target loss function, performing joint fine tuning on the pre-trained shared upsampling module, the pre-trained video encoder, the pre-trained first video decoder and the pre-trained second video decoder until the target loss function converges to obtain the shared upsampling module, the video encoder, the first video decoder and the second video decoder.
  7. The method of claim 6, wherein the target loss function comprises a video generation task loss and a resolution optimization task loss; the video generation task loss comprises at least one of: a first loss for indicating a pixel reconstruction difference between the sample second video and the second tag video, a second loss for indicating a pixel reconstruction difference between the sample first video and the first tag video, and a third loss for indicating an image semantic difference between the sample second video and the second tag video; and the resolution optimization task loss comprises at least one of: a fourth loss for indicating a pixel reconstruction difference between the fourth video and the second tag video, a fifth loss for indicating a pixel reconstruction difference between the third video and the first tag video, a sixth loss for indicating an image semantic difference between the fourth video and the second tag video, and a seventh loss for indicating a pixel difference between adjacent video frames in the fourth video.
  8. The method of claim 6, wherein the video encoder shares training parameters of the video generation task and the resolution optimization task; the target loss function further comprises a shared constraint loss for indicating a distribution difference between the third latent code and a fourth latent code in the latent space, wherein the fourth latent code is obtained by compressing the sample first video with the pre-trained video encoder; and the video generation task loss further comprises a regularization loss for the pre-trained video encoder.
  9. A video generation apparatus, comprising: an acquisition module for acquiring an input text for indicating video generation requirements and an input video serving as a video generation reference; a compression module for compressing the input video into a first latent code in a latent space using a video encoder; a denoising module for denoising the first latent code added with random noise by taking the input text as a denoising condition to obtain a second latent code; a reconstruction module for reconstructing the second latent code into a first video according to the input text using a first video decoder symmetrical to the video encoder; and a generation module for upsampling the first video using a shared upsampling module to obtain a second video, wherein the resolutions of the first video and the input video are the same and smaller than the resolution of the second video, and the shared upsampling module shares training parameters of a video generation task and a resolution optimization task.
  10. An electronic device, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1 to 8.
  11. A computer-readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to implement the method of any one of claims 1 to 8.
  12. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1 to 8.
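The composite loss enumerated in claims 6 to 8 can be sketched schematically. The following is a minimal illustration, not the patented implementation: the weights, the L1 pixel distance, and the mean-feature proxy for the semantic term are all assumptions introduced here for concreteness.

```python
import numpy as np

def pixel_loss(a, b):
    # pixel reconstruction difference between two videos (L1 distance)
    return float(np.mean(np.abs(a - b)))

def semantic_loss(a, b):
    # stand-in for an image semantic difference; a real system would
    # compare deep features rather than raw pixel statistics
    return float(np.mean((a.mean(axis=0) - b.mean(axis=0)) ** 2))

def temporal_loss(v):
    # pixel difference between adjacent video frames (seventh loss, claim 7)
    return float(np.mean(np.abs(v[1:] - v[:-1])))

def target_loss(sample_second, fourth, second_tag, w=(1.0, 1.0, 1.0, 0.1)):
    # weighted sum of generation-task and resolution-optimization terms;
    # the weights are illustrative, not taken from the patent
    return (w[0] * pixel_loss(sample_second, second_tag)       # first loss
            + w[1] * semantic_loss(sample_second, second_tag)  # third loss
            + w[2] * pixel_loss(fourth, second_tag)            # fourth loss
            + w[3] * temporal_loss(fourth))                    # seventh loss

rng = np.random.default_rng(0)
second_tag = rng.random((4, 8, 8))     # (frames, H, W) label video
sample_second = second_tag + 0.05      # imperfect generated video
fourth = second_tag + 0.02             # imperfect upsampled third video
loss = target_loss(sample_second, fourth, second_tag)
assert loss > 0.0
```

Under the constraint of such a function, the shared upsampling module, encoder, and both decoders would be jointly fine-tuned until convergence, as recited in claim 6.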

Description

Video generation method, device, electronic equipment and computer storage medium

Technical Field

The present disclosure relates to the field of artificial intelligence, and more particularly, to a video generation method, apparatus, electronic device, and computer storage medium.

Background

With the rapid development of artificial intelligence and computer technology, video generation using artificial intelligence has become increasingly widespread. Video generation is a multimedia technology aimed at generating high-definition video, such as 4K or 8K video, with text as input. In the process of realizing the disclosed conception, the inventors found that the related art has at least the following problems: the overall generation framework of a single-stage video generation approach is huge and its computational complexity is high; and the video generation pipeline of a multi-stage approach is long, so that an error at a given stage can be gradually amplified in subsequent stages, affecting generation quality, while the overall computational complexity remains high.

Disclosure of Invention

In view of this, the present disclosure provides a video generation method, apparatus, electronic device, and computer storage medium.
One aspect of the disclosure provides a video generation method, including: obtaining an input text indicating a video generation requirement and an input video serving as a video generation reference; compressing the input video into a first latent code in a latent space using a video encoder; denoising the first latent code added with random noise, with the input text as a denoising condition, to obtain a second latent code; reconstructing the second latent code into a first video according to the input text using a first video decoder symmetrical to the video encoder; and upsampling the first video using a shared upsampling module to obtain a second video, wherein the resolutions of the first video and the input video are the same and smaller than that of the second video, and the shared upsampling module shares training parameters of a video generation task and a resolution optimization task.

According to an embodiment of the disclosure, reconstructing the second latent code into the first video according to the input text using the first video decoder symmetrical to the video encoder includes: performing three-dimensional transposed convolution processing on the second latent code using the first video decoder, to lift the feature dimension of the second latent code in the latent space to the feature dimension of the input video and obtain first intermediate information; fusing semantic feature information of the input text with the first intermediate information to obtain fused information; and reconstructing the fused information into the first video using an activation function.
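The decoder path described above (transposed-convolution upsampling of the latent code, fusion with text semantics, and an activation producing the first video) can be sketched as follows. This is a schematic stand-in, assuming simple nearest-neighbor repetition in place of a learned 3D transposed convolution and broadcast addition in place of a learned fusion layer.

```python
import numpy as np

def upsample_3d(latent, factor=2):
    # stand-in for 3D transposed convolution: lifts the latent code's
    # temporal and spatial dimensions toward the input video's dimensions
    return (latent.repeat(factor, axis=0)
                  .repeat(factor, axis=1)
                  .repeat(factor, axis=2))

def fuse(intermediate, text_features):
    # fuse text semantic features into the intermediate video features
    # (broadcast addition as a schematic fusion; the real fusion is learned)
    return intermediate + text_features

def decode(latent, text_features):
    intermediate = upsample_3d(latent)         # first intermediate information
    fused = fuse(intermediate, text_features)  # fused information
    return 1.0 / (1.0 + np.exp(-fused))        # sigmoid activation -> first video

latent = np.zeros((2, 4, 4))   # (frames, H, W) second latent code
text = np.zeros((4, 8, 8))     # toy text-feature map, shaped to broadcast
video = decode(latent, text)
assert video.shape == (4, 8, 8)
assert np.allclose(video, 0.5)  # sigmoid(0) == 0.5 everywhere
```

The symmetry with the encoder lies in reversing its dimension reduction: where the encoder compresses frames and pixels into the latent space, the decoder's transposed convolutions expand them back.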
According to an embodiment of the disclosure, denoising the first latent code added with random noise, with the input text as a denoising condition, to obtain the second latent code includes: taking the video generation requirement indicated by the input text as semantic guidance, and iteratively denoising the first latent code added with random noise using a diffusion model to obtain a second latent code matching the semantics of the input text.

According to an embodiment of the disclosure, the video generation task is used for generating the first video; the resolution optimization task is performed in the training process of the shared upsampling module and is used for reconstructing a third latent code with a pre-trained second video decoder to obtain a third video, the third latent code being obtained by compressing a first tag video, the first tag video being a tag video with the same resolution as a sample input video; the shared upsampling module is trained using a sample first video and the third video, the sample first video being obtained by reconstructing a sample second latent code.
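The iterative denoising step can be illustrated with a toy loop. Here `predict_noise` is a placeholder for the diffusion model's text-conditioned noise predictor, and the fixed shrinkage update is an assumption; a real sampler would follow a noise schedule (e.g. DDPM or DDIM) rather than this simplification.

```python
import numpy as np

def predict_noise(x, t, text_embedding):
    # placeholder for the diffusion model's noise prediction, conditioned
    # on the text embedding (the semantic guidance of claim 3)
    return 0.1 * x + 0.0 * text_embedding.mean()

def iterative_denoise(noisy_latent, text_embedding, steps=50):
    # run the reverse diffusion loop from the noised first latent code
    # toward a second latent code matching the text semantics
    x = noisy_latent
    for t in reversed(range(steps)):
        eps = predict_noise(x, t, text_embedding)
        x = x - eps  # schematic update; a real sampler rescales per the schedule
    return x

rng = np.random.default_rng(0)
first_latent = rng.standard_normal((2, 4, 4))
noisy = first_latent + rng.standard_normal((2, 4, 4))  # latent plus random noise
second_latent = iterative_denoise(noisy, np.zeros(16))
assert second_latent.shape == first_latent.shape
assert np.abs(second_latent).mean() < np.abs(noisy).mean()
```

Each iteration removes a fraction of the estimated noise, so the magnitude of the residual noise shrinks monotonically over the loop.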
According to an embodiment of the disclosure, the pre-trained second video decoder reconstructs the third latent code into the third video by: performing three-dimensional transposed convolution processing on the third latent code to lift the feature dimension of the third latent code in the latent space to the feature dimension of the first tag video, so as to obtain second intermediate information, wherein the second video decoder and the first video decoder obtain the second intermediate information and the first intermediate information, respectively, through convolution layers of the same structure, while the parameters of those convolution layers differ between the two decoders; and reconstructing the second intermediate information into the third video using an activation function. According to the embodim
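The parameter sharing that distinguishes the shared upsampling module from the two decoders can be sketched as a single object invoked by both task paths. In this illustration `gain` stands in for the shared learnable weights, and nearest-neighbor repetition stands in for the learned upsampling; both are assumptions introduced here.

```python
import numpy as np

class SharedUpsampler:
    # one set of parameters serves both the video generation path
    # (sample first video -> sample second video) and the resolution
    # optimization path (third video -> fourth video)
    def __init__(self, scale=2, gain=1.0):
        self.scale = scale
        self.gain = gain  # stands in for the shared learnable weights

    def __call__(self, video):
        # spatial upsampling of a (frames, H, W) video
        up = video.repeat(self.scale, axis=1).repeat(self.scale, axis=2)
        return self.gain * up

shared = SharedUpsampler()
sample_first = np.ones((4, 8, 8))  # low-resolution generated video
third = np.ones((4, 8, 8))         # low-resolution video from the second decoder
sample_second = shared(sample_first)
fourth = shared(third)
assert sample_second.shape == (4, 16, 16)
assert fourth.shape == (4, 16, 16)
```

Because the same instance (hence the same parameters) produces both outputs, gradients from the generation-task losses and the resolution-optimization losses both update the one module, which is the sharing recited in claim 1.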