US-20260127779-A1 - COMPOSITIONAL TEXT-TO-VIDEO GENERATION WITH DENSE BLOB VIDEO REPRESENTATIONS
Abstract
Systems and methods are disclosed that generate blob video representations such as blob video parameters and blob video descriptions and use the blob video representations to generate videos. For example, embodiments of the present disclosure may decompose videos into visual primitives (e.g., blob video representations, which may be general representations for controllable video generation). Based on the blob video representations, a blob-grounded text-to-video diffusion model that includes masked three-dimensional (3D) self-attention layers and/or masked spatial cross-attention layers may be developed. The masked 3D self-attention layers and/or masked spatial cross-attention layers may effectively improve regional consistency across frames. Additionally, and/or alternatively, embodiments of the present disclosure may utilize context interpolation that may interpolate text embeddings. Additionally, and/or alternatively, the blob-grounded text-to-video diffusion model may be model-agnostic and may include and/or be associated with a U-Net and/or a diffusion transformer.
Inventors
- Weixi Feng
- Weili Nie
- Chao Liu
- Sifei Liu
- Arash Vahdat
Assignees
- NVIDIA CORPORATION
Dates
- Publication Date: 2026-05-07
- Application Date: 2025-02-26
Claims (20)
- 1 . A computer-implemented method for using a blob-grounded text-to-video diffusion model to generate an output video, comprising: obtaining a blob video representation for an object to be generated within the output video, wherein the blob video representation comprises a plurality of blob video parameters and a plurality of blob video descriptions, wherein each of the plurality of blob video parameters indicates a plurality of variables that define an ellipse for the object and each of the plurality of blob video descriptions indicates a textual description of the object; and processing the blob video representation using the blob-grounded text-to-video diffusion model to generate the output video, wherein the blob-grounded text-to-video diffusion model comprises one or more blob-grounded attention layers that use the blob video representation for the object as a grounding input to generate the output video.
- 2 . The computer-implemented method of claim 1 , further comprising: training the blob-grounded text-to-video diffusion model using training data comprising a training input video.
- 3 . The computer-implemented method of claim 2 , wherein training the blob-grounded text-to-video diffusion model comprises: processing the training input video using an open vocabulary video segmentation model to generate a plurality of training blob parameters; processing a subset of the plurality of training blob parameters using a vision language model to generate first video descriptions for the subset of the plurality of training blob parameters; generating a training output video based on the blob-grounded text-to-video diffusion model, the plurality of training blob parameters, and the first video descriptions; and training the blob-grounded text-to-video diffusion model based on comparing the training output video with the training input video.
- 4 . The computer-implemented method of claim 3 , wherein training the blob-grounded text-to-video diffusion model further comprises: determining a subset of a plurality of frames within the training input video; separating the plurality of training blob parameters into the subset of the plurality of training blob parameters and a second subset of the plurality of training blob parameters based on the subset of the plurality of frames; and performing context interpolation to generate second video descriptions for the second subset of the plurality of training blob parameters based on the first video descriptions, and wherein generating the training output video is further based on the second video descriptions.
- 5 . The computer-implemented method of claim 4 , wherein determining the subset of the plurality of frames within the training input video comprises: identifying a first anchor frame associated with a first frame from the plurality of frames within the training input video; identifying a second anchor frame associated with a second frame from the plurality of frames within the training input video, wherein one or more intermediate frames from the plurality of frames within the training input video are in-between the first frame and the second frame; and populating the subset of the plurality of frames with the first anchor frame and the second anchor frame, wherein the first video descriptions comprise video descriptions associated with the first anchor frame and the second anchor frame.
- 6 . The computer-implemented method of claim 5 , wherein performing the context interpolation comprises: generating the second video descriptions for the one or more intermediate frames based on linearly interpolating between the video descriptions associated with the first anchor frame and the second anchor frame.
- 7 . The computer-implemented method of claim 5 , wherein performing the context interpolation comprises: processing the video descriptions associated with the first anchor frame and the second anchor frame using a Perceiver-based model to generate the second video descriptions for the one or more intermediate frames.
- 8 . The computer-implemented method of claim 1 , wherein the blob-grounded text-to-video diffusion model comprises an encoder, a decoder, and a blob-grounded backbone that comprises a U-Net backbone, wherein the U-Net backbone comprises a plurality of blob-grounded attention layers.
- 9 . The computer-implemented method of claim 8 , wherein the plurality of blob-grounded attention layers comprises a masked spatial cross-attention layer, and wherein processing the blob video representation using the blob-grounded text-to-video diffusion model to generate the output video comprises: processing embeddings associated with the blob video representation and a spatial cross-attention output from a spatial cross-attention layer using the masked spatial cross-attention layer to generate a masked spatial cross-attention output; and generating the output video based on the masked spatial cross-attention output.
- 10 . The computer-implemented method of claim 9 , wherein the plurality of blob-grounded attention layers further comprises a masked three-dimensional (3D) self-attention layer, and wherein processing the blob video representation using the blob-grounded text-to-video diffusion model to generate the output video further comprises: processing the masked spatial cross-attention output using a temporal self-attention layer to generate a temporal output; and processing the temporal output with the masked 3D self-attention layer to generate a masked 3D self-attention output, wherein generating the output video is further based on the masked 3D self-attention output.
- 11 . The computer-implemented method of claim 1 , wherein the blob-grounded text-to-video diffusion model comprises a blob-grounded backbone that comprises a diffusion transformer (DiT) backbone, wherein the DiT backbone comprises a plurality of blob-grounded attention layers.
- 12 . The computer-implemented method of claim 11 , wherein the plurality of blob-grounded attention layers comprises a masked spatial cross-attention layer and a masked three-dimensional (3D) self-attention layer, and wherein processing the blob video representation using the blob-grounded text-to-video diffusion model to generate the output video comprises: processing embeddings associated with the blob video representation and a 3D self-attention layer output from a 3D self-attention layer using the masked spatial cross-attention layer to generate a masked spatial cross-attention output; processing the masked spatial cross-attention output using the masked 3D self-attention layer to generate a masked 3D self-attention output; and generating the output video based on the masked 3D self-attention output.
- 13 . The computer-implemented method of claim 1 , further comprising: obtaining a request to generate the output video, wherein the request comprises a user prompt, and wherein obtaining the blob video representation comprises: generating the plurality of blob video parameters and the plurality of blob video descriptions based on processing the user prompt using one or more large language models (LLMs).
- 14 . The computer-implemented method of claim 13 , wherein generating the plurality of blob video parameters and the plurality of blob video descriptions comprises: generating the plurality of blob video parameters and a first subset of the plurality of blob video descriptions using the user prompt and the one or more LLMs; and generating a second subset of the plurality of blob video descriptions based on performing context interpolation of the first subset of the plurality of blob video descriptions.
- 15 . The computer-implemented method of claim 1 , wherein at least one of the steps of obtaining and processing is performed on a server or in a data center to generate the output video, and the output video is streamed to a user device.
- 16 . The computer-implemented method of claim 1 , wherein at least one of the steps of obtaining and processing is performed within a cloud computing environment.
- 17 . The computer-implemented method of claim 1 , wherein at least one of the steps of obtaining and processing is performed for training, testing, or certifying a neural network employed in a machine, robot, or autonomous vehicle.
- 18 . The computer-implemented method of claim 1 , wherein at least one of the steps of obtaining and processing is performed on a virtual machine comprising a portion of a graphics processing unit.
- 19 . A system for using a blob-grounded text-to-video diffusion model to generate an output video, comprising: one or more processors; and a non-transitory computer-readable medium having processor-executable instructions stored thereon, wherein the processor-executable instructions, when executed by the one or more processors, facilitate: obtaining a blob video representation for an object to be generated within the output video, wherein the blob video representation comprises a plurality of blob video parameters and a plurality of blob video descriptions, wherein each of the plurality of blob video parameters indicates a plurality of variables that define an ellipse for the object and each of the plurality of blob video descriptions indicates a textual description of the object; and processing the blob video representation using the blob-grounded text-to-video diffusion model to generate the output video, wherein the blob-grounded text-to-video diffusion model comprises one or more blob-grounded attention layers that use the blob video representation for the object as a grounding input to generate the output video.
- 20 . The system of claim 19 , wherein the processor-executable instructions, when executed by the one or more processors, further facilitate: training the blob-grounded text-to-video diffusion model using training data comprising a training input video.
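By way of a non-limiting illustration of the context interpolation recited in claims 5 and 6 (and recited for inference in claim 14), the following is a minimal sketch that assumes the anchor-frame blob video descriptions have first been encoded into fixed-size text embeddings; the encoding step, array shapes, and function name are assumptions rather than elements of the claims, and the Perceiver-based alternative of claim 7 is not shown.

```python
# Minimal sketch of linear context interpolation between two anchor frames
# (claims 5-6). Assumes the anchor-frame blob video descriptions have already
# been encoded into fixed-size text embeddings; the encoder is not shown and
# is not specified by the claims.
import numpy as np


def interpolate_context(anchor_a: np.ndarray,
                        anchor_b: np.ndarray,
                        num_intermediate: int) -> list[np.ndarray]:
    """Generate description embeddings for the frames between two anchor frames.

    anchor_a, anchor_b: embeddings of a blob's descriptions at the first and
    second anchor frames; num_intermediate: number of in-between frames.
    """
    embeddings = []
    for i in range(1, num_intermediate + 1):
        t = i / (num_intermediate + 1)  # fractional position of the intermediate frame
        embeddings.append((1.0 - t) * anchor_a + t * anchor_b)
    return embeddings


# Example: three intermediate frames between two 4-dimensional anchor embeddings.
frames = interpolate_context(np.zeros(4), np.ones(4), num_intermediate=3)
# frames[0] == 0.25, frames[1] == 0.5, frames[2] == 0.75 (element-wise)
```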
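Similarly, to make the blob-grounded attention recited in claims 9 and 12 more concrete, the following is a minimal single-frame sketch of one masked spatial cross-attention step. It assumes boolean per-blob masks rasterized from each blob's ellipse and one text embedding per blob; the projection weights, tensor shapes, and residual update are illustrative assumptions and not the claimed implementation.

```python
# Minimal single-frame sketch of masked spatial cross-attention in the spirit
# of claims 9 and 12: each spatial location attends only to the blob
# descriptions whose ellipse covers that location.
import torch


def masked_spatial_cross_attention(x: torch.Tensor,          # (hw, d) visual features for one frame
                                   blob_emb: torch.Tensor,   # (n, d) blob description embeddings
                                   blob_masks: torch.Tensor, # (n, hw) boolean ellipse masks
                                   w_q: torch.Tensor,        # (d, d) query projection (illustrative)
                                   w_k: torch.Tensor,        # (d, d) key projection (illustrative)
                                   w_v: torch.Tensor         # (d, d) value projection (illustrative)
                                   ) -> torch.Tensor:
    q = x @ w_q                                    # queries from visual features
    k = blob_emb @ w_k                             # keys from blob description embeddings
    v = blob_emb @ w_v                             # values from blob description embeddings
    logits = (q @ k.T) / (q.shape[-1] ** 0.5)      # (hw, n) attention logits
    # Mask out blobs whose ellipse does not cover the spatial location.
    logits = logits.masked_fill(~blob_masks.T, float("-inf"))
    attn = torch.softmax(logits, dim=-1).nan_to_num(0.0)   # rows with no blob become all-zero
    out = attn @ v                                 # (hw, d) blob-grounded update
    covered = blob_masks.any(dim=0, keepdim=True).T        # (hw, 1) locations inside any blob
    return torch.where(covered, x + out, x)        # residual update only where a blob is present
```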
Description
CLAIM OF PRIORITY

This application claims the benefit of U.S. Provisional Application No. 63/715,087 (Attorney Docket No. 515138), titled “Blobgen-Vid: Compositional Text-To-Video Generation With Blob Representations,” filed Nov. 1, 2024, and U.S. Provisional Application No. 63/742,553 (Attorney Docket No. 515220), titled “Blobgen-Vid: Compositional Text-To-Video Generation With Blob Representations,” filed Jan. 7, 2025, the entire contents of which are incorporated herein by reference.

BACKGROUND

Conventional text-to-video generation models have enabled the generation of more realistic videos with high visual quality and intricate motions. Despite this progress, these conventional text-to-video models struggle to follow complex prompts, often neglecting key objects or conflating multiple objects into a single concept. In addition, users cannot control semantic transitions or camera motions with text descriptions alone in these conventional models. It therefore remains an open challenge to enhance the compositionality and controllability of video generators with layout guidance in the diffusion process. To address these challenges, newer text-to-video models have been proposed that condition video diffusion models on visual layouts (e.g., bounding boxes that move across the frames of the videos). Compared to other modalities such as depth or semantic maps, bounding boxes may be easier for users to create and manipulate while providing coarse-grained information about local objects. However, two-dimensional (2D) bounding boxes lack perspective invariance (e.g., a three-dimensional (3D) counterpart of a 2D bounding box on an image is not a 3D bounding box, and vice versa). Accordingly, there is a need for addressing these issues and/or other issues associated with the prior art.

SUMMARY

Embodiments of the present disclosure relate to compositional text-to-video generation with dense blob video representations. For example, conventional video generation models may struggle to follow complex text prompts and synthesize multiple objects, raising the need for additional grounding input for improved controllability. As such, embodiments of the present disclosure describe decomposing videos into visual primitives. For instance, embodiments of the present disclosure introduce blob video representations that serve as grounding conditions for generating videos using text-to-video diffusion models (e.g., a blob-grounded text-to-video diffusion model). Each blob video representation may correspond to an object instance and may be automatically extracted from videos (and/or three-dimensional (3D) scenes), making it a more general and robust representation across visual domains. Specifically, a blob video representation may have two components: blob video parameters and blob video descriptions. The blob video representation may assist in enabling both motion and semantic control of visual compositions. In other words, and as will be described in further detail below, during training, embodiments of the present disclosure may decompose videos into visual primitives such as blob video representations, which may be general representations for controllable video generation. Based on the blob video representations (e.g., the blob video parameters and descriptions), a blob-grounded text-to-video diffusion model may be developed. In some examples, the blob-grounded text-to-video diffusion model may permit users to control object motions and fine-grained object appearance.
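By way of a non-limiting illustration, the following is a minimal data-structure sketch of how a per-object blob video representation might be organized, assuming a five-variable ellipse parameterization (normalized center, axis lengths, and orientation angle); the disclosure only requires a plurality of variables that define an ellipse, so the exact fields, names, and types here are assumptions.

```python
# Minimal sketch of a blob video representation: per-frame ellipse parameters
# plus per-frame textual descriptions for a single object. The five-variable
# ellipse parameterization is an assumption.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BlobParams:
    """Blob video parameters for one object in one frame."""
    cx: float     # ellipse center x, normalized to [0, 1]
    cy: float     # ellipse center y, normalized to [0, 1]
    a: float      # semi-major axis length, normalized
    b: float      # semi-minor axis length, normalized
    theta: float  # orientation angle in radians


@dataclass
class BlobVideoRepresentation:
    """Per-object grounding input: one ellipse and one description per frame.

    Descriptions may be supplied only at anchor frames and generated for the
    intermediate frames by context interpolation.
    """
    object_name: str
    params: List[BlobParams]            # one entry per video frame
    descriptions: List[Optional[str]]   # None where a description is to be interpolated
```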
Additionally, and/or alternatively, the blob-grounded text-to-video diffusion model may include a masked 3D attention module (e.g., masked 3D self-attention layers and/or masked spatial cross-attention layers) that effectively improves regional consistency across frames. Additionally, and/or alternatively, embodiments of the present disclosure may utilize context interpolation (e.g., a context interpolation block) that may interpolate text embeddings such that users may control semantics in specific frames and obtain smooth object transitions. Additionally, and/or alternatively, the blob-grounded text-to-video diffusion model may be model-agnostic. For instance, the blob-grounded text-to-video diffusion model may include a backbone that is and/or includes a U-Net and/or a diffusion transformer (DiT). Extensive experiments showed that the blob-grounded text-to-video diffusion model described by embodiments of the present disclosure achieves superior zero-shot video generation ability and state-of-the-art layout controllability on multiple benchmarks. Furthermore, when combined with a large language model (LLM) for layout planning, the blob-grounded text-to-video diffusion model was shown to outperform even proprietary text-to-video generators in terms of compositional accuracy. In an embodiment, a computer-implemented method for using a blob-grounded text-to-video diffusion model to generate an