CN-122002101-A - Image-to-video generation method, apparatus, computer device and storage medium

Abstract

The application relates to the technical field of computer vision and image processing, and in particular discloses an image-to-video generation method and apparatus, a computer device, and a storage medium. The method comprises: upon receiving an image to be processed, encoding the image using at least one computing device to obtain a conditional hidden variable sequence; obtaining a first noise hidden variable sequence, and performing iterative denoising based on the conditional hidden variable sequence and the first noise hidden variable sequence to obtain a video hidden variable sequence; decoding the video hidden variable sequence in parallel on the computing devices to obtain a video frame sequence; and generating a target video corresponding to the image to be processed based on the video frame sequence. Because the distributed parallel encoding and decoding strategy is executed across a plurality of computing devices, video generation time is reduced and video generation efficiency is improved.
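The split, parallel decode, and aggregate flow summarized above can be illustrated with a minimal sketch. It is not the applicant's implementation: worker threads stand in for computing devices, `toy_decode` is a hypothetical placeholder for a real decoder, and the names `split_into_chunks` and `parallel_decode` are assumptions introduced only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(sequence, num_devices):
    """Split a sequence into num_devices contiguous chunks of near-equal size."""
    base, extra = divmod(len(sequence), num_devices)
    chunks, start = [], 0
    for rank in range(num_devices):
        # The first `extra` chunks receive one additional element.
        size = base + (1 if rank < extra else 0)
        chunks.append(sequence[start:start + size])
        start += size
    return chunks


def toy_decode(hidden_chunk):
    """Hypothetical stand-in for a decoder: maps each hidden variable to a 'frame'."""
    return [f"frame({h})" for h in hidden_chunk]


def parallel_decode(hidden_sequence, num_devices):
    """Decode each chunk on its own worker, then concatenate in original order."""
    chunks = split_into_chunks(hidden_sequence, num_devices)
    with ThreadPoolExecutor(max_workers=num_devices) as pool:
        # Executor.map preserves input order, so aggregation is a plain concatenation.
        decoded_blocks = list(pool.map(toy_decode, chunks))
    return [frame for block in decoded_blocks for frame in block]
```

Calling `parallel_decode(list(range(5)), 2)` decodes the chunks `[0, 1, 2]` and `[3, 4]` on separate workers and returns the five frames in their original order. Contiguous chunks are used here so that aggregation needs no reordering; the source does not specify how the subsequences are formed.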

Inventors

Lin Congbin
Shi Tengfei
Min Shiwei

Assignees

深圳元智信息技术开发有限公司

Dates

Publication Date: 2026-05-08
Application Date: 2026-02-04

Claims (10)

1. An image-to-video generation method, comprising: when an image to be processed is received, encoding the image to be processed based on at least one computing device to obtain a conditional hidden variable sequence; acquiring a first noise hidden variable sequence, and performing iterative denoising based on the conditional hidden variable sequence and the first noise hidden variable sequence to obtain a video hidden variable sequence; and decoding the video hidden variable sequence in parallel based on each computing device to obtain a video frame sequence, and generating a target video corresponding to the image to be processed based on the video frame sequence.
2. The image-to-video generation method according to claim 1, wherein the performing iterative denoising based on the conditional hidden variable sequence and the first noise hidden variable sequence to obtain a video hidden variable sequence comprises: splitting the first noise hidden variable sequence and the conditional hidden variable sequence corresponding to the current time step to obtain at least one sequence block; performing parallel attention computation on each sequence block based on each computing device to obtain local attention computation results, and aggregating the local attention computation results to obtain a global attention feature; performing parallel forward propagation on the global attention feature to obtain a second noise hidden variable sequence; and updating a high-dimensional embedded vector corresponding to the next time step based on the second noise hidden variable sequence, and denoising the next time step based on the high-dimensional embedded vector until all time steps have been denoised, so as to obtain the video hidden variable sequence.
3. The image-to-video generation method according to claim 2, wherein the computing devices are organized in a logical ring topology, and the splitting the first noise hidden variable sequence and the conditional hidden variable sequence corresponding to the current time step to obtain at least one sequence block comprises: dividing the first noise hidden variable sequence and the conditional hidden variable sequence into sequence blocks equal in number to the computing devices in the logical ring topology, and distributing the sequence blocks to the respective computing devices.
4. The image-to-video generation method according to claim 2, wherein the performing parallel attention computation on each sequence block based on each computing device to obtain local attention computation results comprises: performing local non-attention processing on each sequence block based on each computing device to obtain local data; performing at least one collective communication operation on the local data to exchange data among the computing devices; and performing, in parallel on each computing device, local attention computation on the received local data and the device's own local data to obtain the local attention computation result output by each computing device.
5. The image-to-video generation method according to claim 2, wherein the splitting the first noise hidden variable sequence and the conditional hidden variable sequence corresponding to the current time step to obtain at least one sequence block further comprises: obtaining a current embedded vector corresponding to the current time step, performing hash matching on the current embedded vector, and obtaining a similar embedded vector and a similar output vector corresponding to the similar embedded vector; analyzing a first difference degree between the current embedded vector and the similar embedded vector, fitting the first difference degree, and predicting a predicted output vector corresponding to the current embedded vector; analyzing a second difference degree between the similar output vector and the predicted output vector; and when the second difference degree is smaller than a preset difference threshold, taking the similar output vector as the second noise hidden variable sequence obtained by denoising the current time step.
6. The image-to-video generation method according to claim 1, wherein the encoding the image to be processed based on at least one computing device when the image to be processed is received, to obtain a conditional hidden variable sequence, comprises: when the image to be processed is received, segmenting the image to be processed to obtain at least one sub-image; encoding each sub-image in parallel based on each computing device to obtain a local hidden variable block corresponding to each sub-image; and aggregating the local hidden variable blocks to generate the conditional hidden variable sequence.
7. The image-to-video generation method according to any one of claims 1 to 6, wherein the decoding the video hidden variable sequence in parallel based on each computing device to obtain a video frame sequence comprises: dividing the video hidden variable sequence to obtain at least one subsequence; decoding each subsequence in parallel based on each computing device to obtain local video frame blocks; and aggregating the local video frame blocks to obtain the video frame sequence.
8. An image-to-video generation apparatus, comprising: an image encoding module, configured to encode, when an image to be processed is received, the image to be processed based on at least one computing device to obtain a conditional hidden variable sequence; an iterative denoising module, configured to acquire a first noise hidden variable sequence and perform iterative denoising based on the conditional hidden variable sequence and the first noise hidden variable sequence to obtain a video hidden variable sequence; and a video obtaining module, configured to decode the video hidden variable sequence in parallel based on each computing device to obtain a video frame sequence, and to generate a target video corresponding to the image to be processed based on the video frame sequence.
9. A computer device, comprising a memory and a processor, wherein the memory is configured to store a computer program, and the processor is configured to execute the computer program and, when executing the computer program, to implement the image-to-video generation method according to any one of claims 1 to 7.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement the image-to-video generation method according to any one of claims 1 to 7.
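The block-wise attention of claims 2 to 4 can be illustrated with a small single-process simulation. This is a sketch under stated assumptions, not the claimed implementation: it assumes a ring-attention style scheme in which each simulated device keeps its own query block while key/value blocks arrive from its ring neighbours, and partial results are merged with a running-maximum softmax. The claims do not specify the exchange pattern or the aggregation rule, and the function names are hypothetical.

```python
import math


def full_attention(q, k, v):
    """Reference single-device softmax attention over lists of vectors."""
    out = []
    for qi in q:
        scale = math.sqrt(len(qi))
        scores = [sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) / total
                    for d in range(len(v[0]))])
    return out


def ring_attention(q_blocks, k_blocks, v_blocks):
    """Simulate n ring devices; assumes every block is non-empty."""
    n = len(q_blocks)
    dim_v = len(v_blocks[0][0])
    outputs = []
    for rank in range(n):
        q = q_blocks[rank]
        run_max = [-math.inf] * len(q)   # running max score per query
        run_sum = [0.0] * len(q)         # running softmax denominator
        run_out = [[0.0] * dim_v for _ in q]  # running unnormalized output
        for step in range(n):
            src = (rank - step) % n  # K/V block seen at this ring step
            k, v = k_blocks[src], v_blocks[src]
            for i, qi in enumerate(q):
                scale = math.sqrt(len(qi))
                scores = [sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]
                new_max = max(run_max[i], max(scores))
                # Rescale earlier partial results to the new maximum.
                correction = math.exp(run_max[i] - new_max)
                weights = [math.exp(s - new_max) for s in scores]
                run_sum[i] = run_sum[i] * correction + sum(weights)
                run_out[i] = [o * correction + sum(w * vj[d] for w, vj in zip(weights, v))
                              for d, o in enumerate(run_out[i])]
                run_max[i] = new_max
        # Normalize and append this device's rows to the global result.
        outputs.extend([[o / run_sum[i] for o in run_out[i]] for i in range(len(q))])
    return outputs
```

Splitting a four-token sequence into two blocks and passing them to `ring_attention` yields the same rows as `full_attention` on the unsplit sequence, up to floating-point error. That equivalence is what allows the local results to be aggregated into a global attention feature without any device holding the full sequence.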

Description

Image-to-video generation method, apparatus, computer device and storage medium

Technical Field

The present application relates to the field of computer vision and image processing technologies, and in particular to an image-to-video generation method and apparatus, a computer device, and a storage medium.

Background

In recent years, generative AI has made remarkable breakthroughs in visual content creation. Image-to-video technology can generate a coherent and realistic short video sequence from an input static image, and has broad application prospects in film and television special effects, digital entertainment, virtual reality, content creation, and other fields. FramePack, a high-performance image-to-video model, is often used for inference in such generative scenarios. However, to generate high-quality, highly consistent video sequences, existing FramePack-based image-to-video models require strong spatio-temporal modeling capability, and their computation involves a large amount of serial dependence, so image-to-video generation is inefficient. How to improve the efficiency of image-to-video generation has therefore become an urgent problem to be solved.

Disclosure of Invention

The application provides an image-to-video generation method and apparatus, a computer device, and a storage medium, which are used for improving the efficiency of image-to-video generation.

In a first aspect, the present application provides an image-to-video generation method, the method comprising: when an image to be processed is received, encoding the image to be processed based on at least one computing device to obtain a conditional hidden variable sequence; acquiring a first noise hidden variable sequence, and performing iterative denoising based on the conditional hidden variable sequence and the first noise hidden variable sequence to obtain a video hidden variable sequence; and decoding the video hidden variable sequence in parallel based on each computing device to obtain a video frame sequence, and generating a target video corresponding to the image to be processed based on the video frame sequence.

In a second aspect, the present application also provides an image-to-video generation apparatus, the apparatus comprising: an image encoding module, configured to encode, when an image to be processed is received, the image to be processed based on at least one computing device to obtain a conditional hidden variable sequence; an iterative denoising module, configured to acquire a first noise hidden variable sequence and perform iterative denoising based on the conditional hidden variable sequence and the first noise hidden variable sequence to obtain a video hidden variable sequence; and a video obtaining module, configured to decode the video hidden variable sequence in parallel based on each computing device to obtain a video frame sequence, and to generate a target video corresponding to the image to be processed based on the video frame sequence.

In a third aspect, the present application also provides a computer device comprising a memory and a processor, the memory being configured to store a computer program, and the processor being configured to execute the computer program and, when executing the computer program, to implement the image-to-video generation method described above.

In a fourth aspect, the present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the image-to-video generation method described above.

The application discloses an image-to-video generation method and apparatus, a computer device, and a storage medium. When an image to be processed is received, the image is encoded based on at least one computing device to obtain a conditional hidden variable sequence; a first noise hidden variable sequence is obtained, and iterative denoising is performed based on the conditional hidden variable sequence and the first noise hidden variable sequence to obtain a video hidden variable sequence; the video hidden variable sequence is decoded in parallel based on each computing device to obtain a video frame sequence; and a target video corresponding to the image to be processed is generated based on the video frame sequence. Because the distributed parallel encoding and decoding strategy is executed across a plurality of computing devices, video generation time is reduced and video generation efficiency is improved.

Drawings

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required for describing the embodiments are briefly introduced below. The drawings in the following description show some embodiments of the present application, and other drawings may be obtained from these drawings without inventive effort