EP-4736116-A1 - GENERATIVE IMAGE DYNAMICS

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating, from an input image of a scene at a current time point, a motion output. The motion output includes, for each of a plurality of pixels of the input image, motion data that characterizes predicted motion of the pixel over a plurality of future time points that are after the current time point. Thus, the system generates the dynamics of a video from a single, still image using a motion prediction neural network.
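
As a minimal sketch of the interface such a system exposes, the stub below maps a single input frame to per-pixel displacements over future time points; the function name, tensor shapes, and zero-motion placeholder are hypothetical, not taken from the patent:

```python
import numpy as np

def predict_motion(image: np.ndarray, num_future_steps: int = 16) -> np.ndarray:
    """Stand-in for the motion prediction neural network.

    image: (H, W, 3) array, the single input frame at the current time point.
    Returns per-pixel displacements (dx, dy) for each future time point,
    shape (num_future_steps, H, W, 2). A real system would run a trained
    network here; this stub just predicts zero motion everywhere.
    """
    h, w, _ = image.shape
    return np.zeros((num_future_steps, h, w, 2), dtype=np.float32)

motion = predict_motion(np.zeros((256, 256, 3), dtype=np.float32))
print(motion.shape)  # (16, 256, 256, 2)
```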

Inventors

  • LI, ZHENGQI
  • TUCKER, RICHARD
  • HOLYNSKI, ALEKSANDER KARIM
  • SNAVELY, KEITH NOAH

Assignees

  • Google LLC

Dates

Publication Date
2026-05-06
Application Date
2024-09-13

Claims (20)

  1. A method performed by one or more computers, the method comprising: receiving an input image of a scene at a current time point, wherein the input image comprises a plurality of pixels; and processing the input image using a motion prediction neural network to generate a motion output that comprises, for each of the plurality of pixels, motion data that characterizes predicted motion of the pixel over a plurality of future time points that are after the current time point.
  2. The method of claim 1, wherein the motion prediction neural network processes only the input image at the current time point and not any other images of the scene at any other time points.
  3. The method of any preceding claim, wherein, for each pixel, the motion data characterizes, for each of the plurality of future time points, a predicted displacement between coordinates of the pixel in a future image at the future time point relative to coordinates of the pixel in the input image.
  4. The method of any preceding claim, wherein, for each pixel, the motion data represents the predicted motion of the pixel over the plurality of future time points in a frequency domain.
  5. The method of claim 4, wherein, for each pixel, the motion data characterizes a motion spectrum over a plurality of output frequencies for the predicted motion of the pixel over the plurality of future time points.
  6. The method of claim 5, wherein, for each pixel, the motion data comprises, for each of the plurality of output frequencies, a set of coefficients of a motion basis for the output frequency.
  7. The method of claim 6, wherein the set of coefficients are a set of complex Fourier coefficients for the x and y dimensions.
  8. The method of any preceding claim, wherein the motion prediction neural network comprises a diffusion neural network, and wherein processing the input image using a motion prediction neural network to generate a motion output comprises: initializing a representation of the motion output; and updating the representation of the motion output at each of a plurality of update iterations, the updating comprising: generating a denoising output, the generating comprising processing a diffusion input comprising (i) the representation of the motion output and (ii) a representation of the input image using the diffusion neural network to generate a first denoising output; and updating the representation using the denoising output.
  9. The method of claim 8, wherein initializing a representation of the motion output comprises sampling at least some of the values in the representation from a noise distribution.
  10. The method of claim 8 or 9, wherein the denoising output is an estimate of a noise component of the representation of the motion output.
  11. The method of any one of claims 7-10, wherein the motion prediction neural network comprises a decoder neural network, and wherein the representation of the motion output is in a latent space and wherein generating the motion output further comprises: processing the representation of the motion output after the last update iteration using the decoder neural network to generate the motion output.
  12. The method of any one of claims 7-11, wherein the diffusion neural network comprises one or more frequency attention layers.
  13. The method of any preceding claim, when dependent on claim 6, wherein the output of the motion prediction neural network comprises, for each pixel and for each of the plurality of output frequencies, a set of adaptively normalized coefficients of the motion basis for the output frequency.
  14. The method of any preceding claim, further comprising: generating, from the motion output, a respective future image of the scene at each of one or more of the future time points.
  15. The method of claim 14, wherein generating, from the motion output, a respective future image of the scene at each of one or more of the future time points comprises: generating a respective motion trajectory for each of the plurality of pixels from the motion output that specifies, for each of the one or more future time points, coordinates of the pixel in the respective future image at the time point.
  16. The method of claim 15 when dependent on claim 4, wherein generating a respective motion trajectory for each of the plurality of pixels comprises applying a transform from the frequency domain to a time domain to the motion data for the pixel.
  17. The method of claim 16, wherein the transform is an inverse temporal Fast Fourier Transform.
  18. The method of any one of claims 15-17, further comprising: receiving a force input specifying a force applied to an object depicted in the input image, wherein generating a respective motion trajectory for each of the plurality of pixels comprises generating the respective motion trajectory from the force input and the motion data for the pixel.
  19. The method of any one of claims 15-18, wherein generating the future image at each of the one or more future time points comprises: generating features of the input image; for each future time point: splatting the features of the input image to the future time point to generate splatted features using a predicted motion field for the future time point that specifies the coordinates of the plurality of pixels in the respective future image at the future time point; and generating the future image at the future time point from the splatted features for the future time point.
  20. The method of claim 19, wherein generating the future image at the future time point from the splatted features for the future time point comprises generating the future image using an image synthesis neural network that is conditioned on the splatted features for the future time point.
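
To make claims 5-7 and 16-17 concrete: the motion data can be read as a per-pixel motion spectrum of complex Fourier coefficients for the x and y dimensions, and trajectories are recovered with an inverse temporal FFT. The numpy sketch below shows one plausible decoding; the coefficient layout, number of frequencies, and zero-padding of unpredicted high frequencies are assumptions made for illustration:

```python
import numpy as np

H, W, K, T = 64, 64, 16, 60  # image size, output frequencies, video length (hypothetical)

# Hypothetical motion spectrum: one complex Fourier coefficient per pixel,
# per output frequency, for each of the x and y dimensions (cf. claims 5-7).
rng = np.random.default_rng(0)
spectrum = rng.standard_normal((H, W, K, 2)) + 1j * rng.standard_normal((H, W, K, 2))

# Embed the K predicted low-frequency terms in a full one-sided spectrum
# (assumed zero elsewhere) and apply an inverse temporal FFT (cf. claims
# 16-17) to recover a real-valued displacement trajectory for every pixel.
full = np.zeros((H, W, T // 2 + 1, 2), dtype=np.complex128)
full[:, :, :K, :] = spectrum
trajectories = np.fft.irfft(full, n=T, axis=2)  # (H, W, T, 2) displacements

# Pixel coordinates at future time t: input-image coordinates plus displacement.
ys, xs = np.mgrid[0:H, 0:W]
coords_t = np.stack([xs, ys], axis=-1)[:, :, None, :] + trajectories
print(coords_t.shape)  # (64, 64, 60, 2)
```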

Description

GENERATIVE IMAGE DYNAMICS

BACKGROUND

This specification relates to generating images using neural networks. Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to one or more other layers in the network, i.e., one or more other hidden layers, the output layer, or both. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.

SUMMARY

This specification describes a system implemented as computer programs on one or more computers that performs a generative image dynamics task on an input image. That is, the system generates, from the input image and using a generative neural network (also referred to as a "motion prediction" neural network), predicted future dynamics of the pixels in the input image, i.e., generates an output that characterizes the predicted future motion of the pixels in the input image.

Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages. Given a single image, the described system can use a neural network, e.g., a diffusion neural network, to generate a prediction of per-pixel long-term motion of the pixels in the image. For example, the system can make the prediction in the frequency domain, e.g., in the Fourier domain. This representation can be converted into dense motion trajectories that span an entire video and, along with an image-based rendering engine, can be used for a number of downstream applications, such as turning still images into seamlessly looping dynamic videos, or allowing users to realistically interact with objects in real pictures.

In other words, as one example, by generating motion outputs as described in this specification, the described techniques can effectively generate a realistic looping video that captures realistic motion of one or more objects in a scene from a single, still image of the scene. As another example, the described techniques can effectively generate a realistic video that captures realistic motion of one or more objects in a scene in response to a force being applied at a specified point in the scene, given a single, still image of the scene and an input that identifies the applied force. More generally, the described techniques require only a single image to generate coherent long-term motion that realistically models the motion of real objects. For example, when the prediction is made in the frequency domain, the predictions capture the essence of pixel movements more efficiently in a lower-dimensional space, which leads to more coherent long-term generation and more fine-grained control over animations relative to other approaches.

The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
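
Claims 19-20 above describe rendering a future frame by splatting features of the input image along the predicted motion field and feeding the result to an image synthesis network. The following is a minimal sketch of such a forward splat under assumed shapes; the nearest-pixel write and all names are illustrative simplifications (practical systems typically use softmax-weighted splatting and a learned synthesis network on top):

```python
import numpy as np

def splat_features(features: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Forward-splat per-pixel features to a future frame (cf. claims 19-20).

    features: (H, W, C) feature map of the input image.
    flow: (H, W, 2) predicted displacement (dx, dy) for one future time point.
    Returns the splatted (H, W, C) feature map.
    """
    h, w, _ = features.shape
    out = np.zeros_like(features)
    ys, xs = np.mgrid[0:h, 0:w]
    # Destination coordinates, rounded to the nearest pixel and clipped.
    tx = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    ty = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    # Later writes win collisions here; real systems blend contributions.
    out[ty, tx] = features[ys, xs]
    return out

feats = np.random.default_rng(0).standard_normal((64, 64, 8))
flow = np.ones((64, 64, 2))  # shift everything one pixel down and right
print(splat_features(feats, flow).shape)  # (64, 64, 8)
```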
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram of an example image dynamics system. FIG. 2 is an example of the operation of the image dynamics system. FIG. 3 is a flow diagram of an example process for processing an input image. FIG. 4 is an example of generating a future image. FIG. 5 is an example of generating a motion output when the motion generation neural network is a diffusion neural network. FIG. 6 shows an example of the performance of the described techniques. FIG. 7 shows another example of the performance of the described techniques. Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

FIG. 1 is a diagram of an example image dynamics system 100. The image dynamics system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below can be implemented. The system 100 is a system that performs a generative image dynamics task on an input image 102. That is, the system 100 generates, from the input image 102 and using a generative neural network 110 (also referred to as a "motion prediction" neural network), predicted future dynamics of the pixels in the input image 102, i.e., generates an output that characterizes the predicted future motion of the pixels in the input image 102. Thus, the task is referred to as a "generative" image dynamics task because the system 100 predicts the dynamics of the pixels in the image using a generative neural network, i.e., rather than determining the dynamics from actual changes between multiple images taken at different time points.
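
Claims 8-11 above describe generating the motion output with a diffusion neural network: a latent representation is initialized from noise and iteratively denoised, conditioned on a representation of the input image, before a decoder maps it to the motion output. The sketch below mirrors that loop with stand-in functions; the update rule, step count, and every name here are illustrative assumptions rather than the trained system:

```python
import numpy as np

def denoiser(latent, image_features, step):
    # Stand-in for the trained diffusion neural network: estimates the
    # noise component of the current latent (cf. claim 10). A real model
    # would condition on the representation of the input image.
    return 0.1 * latent

def decode(latent):
    # Stand-in for the decoder neural network mapping the latent motion
    # representation to the motion output (cf. claim 11).
    return latent

def generate_motion_output(image_features, latent_shape=(32, 32, 8), num_steps=50):
    rng = np.random.default_rng(0)
    # Initialize the latent representation from a noise distribution (claim 9).
    latent = rng.standard_normal(latent_shape)
    for step in range(num_steps):
        # Estimate the noise component and update the representation (claim 8).
        predicted_noise = denoiser(latent, image_features, step)
        latent = latent - predicted_noise
    # Decode the final latent into the motion output (claim 11).
    return decode(latent)

motion_output = generate_motion_output(image_features=None)
print(motion_output.shape)  # (32, 32, 8)
```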