
EP-4222710-B1 - VIDEO SYNTHESIS WITHIN A MESSAGING SYSTEM


Inventors

  • CHAI, Menglei
  • OLSZEWSKI, Kyle
  • REN, Jian
  • TIAN, Yu
  • TULYAKOV, Sergey

Dates

Publication Date
2026-05-13
Application Date
2021-09-30

Claims (15)

  1. A video synthesis method comprising: accessing a system comprising: a primary generative adversarial network, GAN, generator (302) for generating videos (320), the generator (302) comprising a pre-trained image generator (306) and a motion generator (308) comprising a plurality of neural networks; and a video discriminator (304) arranged to receive: the generated videos (320) from the generator (302); and real videos (322); generating an updated GAN generator based on the primary GAN generator (302), by performing operations comprising: identifying input data of the updated GAN generator, the input data comprising an initial latent code and a motion domain dataset, which motion domain dataset corresponds to a motion trajectory vector (318) that is a noise vector used to model motion diversity, and training the motion generator (308) based on the input data, the training being based on: information received from the video discriminator (304); and an adversarial loss function; and generating a synthesized video based on the primary GAN generator (302) and the input data.
  2. The video synthesis method of claim 1, wherein generating the updated GAN generator based on the primary GAN generator (302) further comprises: adjusting weights of the plurality of neural networks of the primary GAN generator (302) based on an output of the video discriminator (304).
  3. The video synthesis method of claim 1, wherein the motion trajectory vector (318) is a noise vector sampled from a normal distribution.
  4. The video synthesis method of claim 1, wherein the pre-trained image generator (306) is configured to receive the initial latent code and output from the motion generator (308), to generate the synthesized video.
  5. The video synthesis method of claim 1, wherein the pre-trained image generator (306) is pre-trained with a primary dataset comprising at least one of real images or a content dataset.
  6. The video synthesis method of claim 5, wherein a generator corresponding to the motion generator (308) and the pre-trained image generator (306) is trained with a secondary dataset that is different than the primary dataset.
  7. The video synthesis method of claim 1, wherein the motion generator (308) is configured to receive the initial latent code to predict consecutive latent codes.
  8. The video synthesis method of claim 1, wherein the motion generator (308) is implemented with two long short-term memory neural networks.
  9. A system comprising: a processor (804); and a memory (806) storing instructions that, when executed by the processor (804), configure the system to perform operations comprising: access a system comprising: a primary generative adversarial network, GAN, generator (302) for generating videos (320), the generator (302) comprising a pre-trained image generator (306) and a motion generator (308) comprising a plurality of neural networks; and a video discriminator (304) arranged to receive: the generated videos (320) from the generator (302); and real videos (322); generate an updated GAN generator based on the primary GAN generator (302), by performing operations comprising: identifying input data of the updated GAN generator, the input data comprising an initial latent code and a motion domain dataset, which motion domain dataset corresponds to a motion trajectory vector (318) that is a noise vector used to model motion diversity, and training the motion generator (308) based on the input data, the training being based on: information received from the video discriminator (304); and an adversarial loss function; and generate a synthesized video based on the primary GAN generator (302) and the input data.
  10. The system of claim 9, wherein generating the updated GAN generator based on the primary GAN generator (302) further comprises: adjusting weights of the plurality of neural networks of the primary GAN generator (302) based on an output of the video discriminator (304).
  11. The system of claim 9, wherein the motion trajectory vector (318) is a noise vector sampled from a normal distribution.
  12. The system of claim 9, wherein the pre-trained image generator (306) is pre-trained with a primary dataset comprising at least one of real images or a content dataset.
  13. The system of claim 12, wherein a generator corresponding to the motion generator (308) and the pre-trained image generator (306) is trained with a secondary dataset that is different than the primary dataset.
  14. The system of claim 9, wherein the motion generator (308) is implemented with two long short-term memory neural networks.
  15. A non-transitory computer-readable storage medium, the computer-readable storage medium including instructions that, when executed by a computer, cause the computer to perform operations comprising: access a system comprising: a primary generative adversarial network, GAN, generator (302) for generating videos, the generator (302) comprising a pre-trained image generator (306) and a motion generator (308) comprising a plurality of neural networks; and a video discriminator (304) arranged to receive: the generated videos (320) from the generator (302); and real videos (322); generate an updated GAN generator based on the primary GAN generator (302), by performing operations comprising: identifying input data of the updated GAN generator, the input data comprising an initial latent code and a motion domain dataset, which motion domain dataset corresponds to a motion trajectory vector (318) that is a noise vector used to model motion diversity, and training the motion generator (308) based on the input data, the training being based on: information received from the video discriminator (304); and an adversarial loss function; and generate a synthesized video based on the primary GAN generator (302) and the input data.
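The independent claims above describe a generator assembled from a frozen, pre-trained image generator (306) and a trainable motion generator (308) that predicts consecutive latent codes, trained adversarially against a video discriminator (304). The following is a minimal PyTorch sketch of that arrangement; the module layout, dimensions, and hinge-style losses are illustrative assumptions drawn from claims 1, 7, and 8, not the patent's actual implementation.

```python
# Hypothetical sketch of the claimed architecture; names, sizes, and the
# hinge losses are assumptions -- the claims only require "an adversarial
# loss function" and a motion generator built from two LSTMs.
import torch
import torch.nn as nn

LATENT_DIM = 512  # assumed size of the image generator's latent space
NOISE_DIM = 128   # assumed size of the motion trajectory (noise) vector

class MotionGenerator(nn.Module):
    """Predicts consecutive latent codes from an initial latent code and
    per-step noise (two LSTMs, per claims 7 and 8)."""
    def __init__(self):
        super().__init__()
        self.lstm1 = nn.LSTMCell(NOISE_DIM, LATENT_DIM)
        self.lstm2 = nn.LSTMCell(LATENT_DIM, LATENT_DIM)

    def forward(self, z0, num_frames):
        h1 = torch.zeros_like(z0)
        c1 = torch.zeros_like(z0)
        h2, c2 = z0, torch.zeros_like(z0)   # seed the state with the initial code
        codes = [z0]
        for _ in range(num_frames - 1):
            eps = torch.randn(z0.size(0), NOISE_DIM, device=z0.device)  # motion noise
            h1, c1 = self.lstm1(eps, (h1, c1))
            h2, c2 = self.lstm2(h1, (h2, c2))
            codes.append(h2)                # next latent code on the trajectory
        return torch.stack(codes, dim=1)    # (batch, num_frames, LATENT_DIM)

class VideoGenerator(nn.Module):
    """Pre-trained image generator + motion generator; only the latter trains."""
    def __init__(self, image_generator):
        super().__init__()
        self.image_generator = image_generator
        for p in self.image_generator.parameters():
            p.requires_grad_(False)         # the image generator stays frozen
        self.motion_generator = MotionGenerator()

    def forward(self, z0, num_frames):
        codes = self.motion_generator(z0, num_frames)
        frames = [self.image_generator(codes[:, t]) for t in range(num_frames)]
        return torch.stack(frames, dim=1)   # (batch, num_frames, C, H, W)

def adversarial_step(gen, disc, g_opt, d_opt, real_videos, num_frames=8):
    """One update of the video discriminator and the motion generator."""
    z0 = torch.randn(real_videos.size(0), LATENT_DIM)
    fake = gen(z0, num_frames)
    # The video discriminator receives generated videos and real videos.
    d_loss = (torch.relu(1.0 - disc(real_videos)).mean()
              + torch.relu(1.0 + disc(fake.detach())).mean())
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    # Feedback from the discriminator trains only the motion generator.
    g_loss = -disc(fake).mean()
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```

In this sketch, g_opt would be constructed over gen.motion_generator.parameters() only, matching the claim language in which the pre-trained image generator is reused as-is while the motion generator is trained on discriminator feedback.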

Description

CROSS-REFERENCE TO RELATED APPLICATION

This application claims the benefit of priority to U.S. Provisional Application No. 63/198,151, filed on September 30, 2020.

TECHNICAL FIELD

The present disclosure relates generally to image and video processing, including video synthesis within a messaging system.

BACKGROUND

Image and video synthesis are related areas aiming to generate content from noise. Areas of focus include image synthesis methods leading to image-based models capable of achieving improved resolutions and renderings, and wider variations in image content.

Sergey Tulyakov et al.: "MoCoGAN: Decomposing Motion and Content for Video Generation" describes a framework that generates a video by mapping a sequence of random vectors to a sequence of video frames. Each random vector consists of a content part and a motion part. While the content part is kept fixed, the motion part is realized as a stochastic process.

Tero Karras et al.: "Training Generative Adversarial Networks with Limited Data" describes an adaptive discriminator augmentation mechanism that significantly stabilizes training in limited data regimes.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

In the drawings, which are not necessarily drawn to scale, like numerals may describe similar components in different views. To easily identify the discussion of any particular element or act, the most significant digit or digits in a reference number refer to the figure number in which that element is first introduced. Some nonlimiting examples are illustrated in the figures of the accompanying drawings, in which:

FIG. 1 is a diagrammatic representation of a networked environment in which the present disclosure may be deployed, in accordance with some examples.
FIG. 2 is an illustration of a generative adversarial network architecture, according to some examples.
FIG. 3 shows a flow diagram of a video synthesis technique for generating videos using a pre-trained image generator and a motion generator, according to some examples.
FIG. 4 shows a flow diagram of an image discrimination technique, according to some examples.
FIG. 5 shows a flow diagram of a feature extractor including a contrastive image discriminator, according to some examples.
FIG. 6 illustrates an example output sequence of cross-domain video generation, according to some examples.
FIG. 7A and FIG. 7B illustrate another set of example output sequences of video synthesis, according to examples described herein.
FIG. 8 is a diagrammatic representation of a machine in the form of a computer system within which a set of instructions may be executed for causing the machine to perform any one or more of the methodologies discussed herein, in accordance with some examples.
FIG. 9 is a block diagram showing a software architecture within which examples may be implemented.

DETAILED DESCRIPTION

The description that follows includes systems, methods, techniques, instruction sequences, and computing machine program products that embody illustrative embodiments of the disclosure. In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide an understanding of various embodiments. It will be evident, however, to those skilled in the art, that embodiments may be practiced without these specific details. In general, well-known instruction instances, protocols, structures, and techniques are not necessarily shown in detail.

Image and video synthesis are related areas aiming at generating content from noise.
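As context for what follows, the MoCoGAN framework cited in the background generates a video by decomposing each frame's latent vector into a content part, held fixed across the clip, and a motion part realized as a stochastic process. Below is a minimal sketch of that decomposition; the dimensions and the GRU-based motion model are illustrative readings of the cited paper, not the method claimed here.

```python
# Minimal sketch of the content/motion latent decomposition described in
# the cited MoCoGAN paper; all names and sizes are illustrative.
import torch
import torch.nn as nn

CONTENT_DIM, MOTION_DIM = 256, 64  # illustrative sizes

class LatentSampler(nn.Module):
    """One content code kept fixed across the clip; a stochastic,
    recurrently generated motion code for each frame."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRUCell(MOTION_DIM, MOTION_DIM)  # motion as a stochastic process

    def forward(self, batch, num_frames):
        z_content = torch.randn(batch, CONTENT_DIM)          # fixed for the whole video
        h = torch.zeros(batch, MOTION_DIM)
        latents = []
        for _ in range(num_frames):
            h = self.rnn(torch.randn(batch, MOTION_DIM), h)  # fresh noise per step
            latents.append(torch.cat([z_content, h], dim=1)) # content part + motion part
        return torch.stack(latents, dim=1)  # (batch, num_frames, CONTENT_DIM + MOTION_DIM)

# Each per-frame vector would then be decoded by an image generator into one frame.
latents = LatentSampler()(batch=2, num_frames=16)
```

The approach claimed above differs in that it reuses a pre-trained image generator and trains only the motion generator against the video discriminator.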
Advancements have focused on improving image synthesis methods, leading to image-based models capable of achieving large resolutions, high-quality renderings, and wide variations in image content. Image synthesis models may be capable of rendering images that are often indistinguishable (or virtually indistinguishable) from real ones. However, developments in the area of video synthesis may achieve comparatively modest improvements. The statistical complexity of videos and larger model sizes mean that current video synthesis methods produce relatively low-resolution videos while requiring longer training times and more computational resources. This is particularly relevant on low-resource computers, such as mobile devices with limited memory and processing power. For example, using a contemporary image generator to generate videos with a target resolution of 256 × 256 pixels may require a substantial computational budget, resulting in monetary training costs in the tens of thousands of dollars. In addition, such a task imposes its own hardware requirements.

There are two main, but not necessarily exclusive, desirable properties for synthesized videos: (i) high quality (e.g., resolution) for each individual frame, and (ii) temporal consistency throughout the frame sequence (e.g., depicting the same subject matter or content with plausible motion). Prior efforts attempt to achieve both goals with a single framework, making such methods computationa