
US-20260127820-A1 - NEURAL RENDERING FOR INVERSE GRAPHICS GENERATION

US 20260127820 A1

Abstract

Approaches are presented for training an inverse graphics network. An image synthesis network can generate training data for an inverse graphics network. In turn, the inverse graphics network can teach the synthesis network about the physical three-dimensional (3D) controls. Such an approach can provide for accurate 3D reconstruction of objects from 2D images using the trained inverse graphics network, while requiring little annotation of the provided training data. Such an approach can extract and disentangle 3D knowledge learned by generative models by utilizing differentiable renderers, enabling a disentangled generative model to function as a controllable 3D “neural renderer,” complementing traditional graphics renderers.
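The abstract describes a loop in which a generative network synthesizes multi-view training data, an inverse graphics network recovers 3D properties from it, and a differentiable renderer ties the two together under a common loss. The patent text does not include code; the sketch below is a toy PyTorch illustration of that loop's structure only, where ToyGenerator, ToyInverseGraphics, soft_render, and all dimensions are hypothetical stand-ins (a real system would use a StyleGAN-class generator and a mesh-based differentiable renderer):

```python
# Toy sketch of the joint training loop described in the abstract; every name,
# dimension, and module here is a hypothetical stand-in, not the patent's code.
import torch
import torch.nn as nn

LATENT, CAM, IMG = 128, 6, 64 * 64 * 3  # illustrative sizes

class ToyGenerator(nn.Module):
    """Stand-in for a StyleGAN-like 'neural renderer': (latent, camera) -> image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT + CAM, 256), nn.ReLU(),
                                 nn.Linear(256, IMG))

    def forward(self, z, cam):
        return torch.tanh(self.net(torch.cat([z, cam], dim=-1)))

class ToyInverseGraphics(nn.Module):
    """Stand-in inverse graphics network: image -> (shape, texture, light) codes."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(IMG, 96)

    def forward(self, img):
        return self.net(img).split(32, dim=-1)  # three 32-d property codes

# Fixed differentiable projection standing in for a differentiable renderer;
# a real system would rasterize a predicted mesh with predicted texture/light.
render_proj = torch.randn(96 + CAM, IMG) / (96 + CAM) ** 0.5

def soft_render(shape, texture, light, cam):
    return torch.tanh(torch.cat([shape, texture, light, cam], dim=-1) @ render_proj)

gen, inv = ToyGenerator(), ToyInverseGraphics()
opt = torch.optim.Adam(list(gen.parameters()) + list(inv.parameters()), lr=1e-3)

for step in range(100):
    z = torch.randn(8, LATENT)     # content code for a batch of objects
    cam = torch.randn(8, CAM)      # sampled viewpoints double as free annotations
    views = gen(z, cam)            # synthesized multi-view training images
    shape, texture, light = inv(views)                # inferred 3D properties
    rerendered = soft_render(shape, texture, light, cam)
    loss = nn.functional.mse_loss(rerendered, views)  # one loss trains both nets
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The single reconstruction loss mirrors the "common loss function" recited in the claims: gradients reach the generator through the synthesized views and the inverse graphics network through the re-rendered images.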

Inventors

  • Wenzheng Chen
  • Yuxuan ZHANG
  • Sanja Fidler
  • Huan Ling
  • Jun Gao
  • Antonio Torralba Barriuso

Assignees

  • NVIDIA CORPORATION

Dates

Publication Date
2026-05-07
Application Date
2025-12-29

Claims (20)

  1. (canceled)
  2. A computer-implemented method, comprising: generating, using a generative network and based on a first dataset, a second dataset at least partially depicting different views of one or more features of the first dataset and annotation data for the second dataset; and generating, using at least one other network and based on the second dataset, one or more multi-dimensional models or representations of at least one feature of the first dataset.
  3. The computer-implemented method of claim 2, further comprising: using a two-dimensional image of an object as at least part of the first dataset and as input to the generative network; generating, using the generative network, a set of view images of the object as at least part of the second dataset; and using an inverse graphics network as at least part of the at least one other network.
  4. The computer-implemented method of claim 3, further comprising: training the inverse graphics network and the generative network together using a common loss function, wherein additional representations of the object, rendered by the inverse graphics network, are useable as training data to train the generative network; determining, for individual different views and using the inverse graphics network, a set of three-dimensional information; and rendering a representation of the object using the set of three-dimensional information and the annotation data.
  5. The computer-implemented method of claim 4, further comprising: using a selection matrix to reduce a dimensionality of features to be included in a latent code, the latent code to be used to render the representation of the object.
  6. The computer-implemented method of claim 5, further comprising: rendering the representation of the object based, at least in part, upon the latent code and using a differentiable renderer.
  7. The computer-implemented method of claim 5, wherein the latent code includes camera features for a corresponding view of the different views.
  8. The computer-implemented method of claim 4, wherein the three-dimensional information for the object includes at least one of a shape, texture, lighting, or background for the object.
  9. The computer-implemented method of claim 2, further comprising: using a style generative adversarial network as the generative network; and enabling only camera view-related features to be adjusted for generating the different views.
  10. A system, comprising: at least one processor; and memory including instructions that, when executed by the at least one processor, cause the system to: generate, using a generative network and based on a first dataset, a second dataset at least partially depicting different views of one or more features of the first dataset and annotation data for the second dataset; and generate, using at least one other network and based on the second dataset, one or more multi-dimensional models or representations of at least one feature of the first dataset.
  11. The system of claim 10, wherein the instructions when executed further cause the system to: use a two-dimensional image of an object as at least part of the first dataset and as input to the generative network; generate, using the generative network, a set of view images of the object as at least part of the second dataset; and use an inverse graphics network as at least part of the at least one other network.
  12. The system of claim 11, wherein the instructions when executed further cause the system to: train the inverse graphics network and the generative network together using a common loss function, wherein additional representations of the object, rendered by the inverse graphics network, are useable as training data to train the generative network; determine, for individual different views and using the inverse graphics network, a set of three-dimensional information; and render a representation of the object using the set of three-dimensional information and the annotation data.
  13. The system of claim 12, wherein the instructions when executed further cause the system to: use a selection matrix to reduce a dimensionality of features to be included in a latent code; and render the representation of the object based, at least in part, upon the latent code and using a differentiable renderer, wherein the latent code includes camera features for a corresponding view of the different views.
  14. The system of claim 10, wherein the instructions when executed further cause the system to: use a style generative adversarial network as the generative network; and enable only camera view-related features to be adjusted for generating the different views.
  15. The system of claim 10, wherein the system comprises at least one of: a system for performing graphical rendering operations; a system for performing simulation operations; a system for performing simulation operations to test or validate autonomous machine applications; a system for performing deep learning operations; a system implemented using an edge device; a system incorporating one or more Virtual Machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.
  16. A system, comprising: one or more processors to generate, using a first network and based on a dataset, one or more multi-dimensional models or representations of at least one feature of an input dataset, the dataset generated by a generative network along with annotation data and generated based on the input dataset, wherein the dataset at least partially depicts different views of one or more features of the input dataset.
  17. The system of claim 16, wherein the one or more processors are further to: use a two-dimensional image of an object as at least part of the input dataset and as input to the generative network; generate, using the generative network, a set of view images of the object as at least part of the dataset; and use an inverse graphics network as at least part of the first network.
  18. The system of claim 17, wherein the one or more processors are further to: train the inverse graphics network and the generative network together using a common loss function, wherein additional representations of the object, rendered by the inverse graphics network, are useable as training data to train the generative network; determine, for individual different views and using the inverse graphics network, a set of three-dimensional information; and render a representation of the object using the set of three-dimensional information and the annotation data.
  19. The system of claim 18, wherein the one or more processors are further to: use a selection matrix to reduce a dimensionality of features to be included in a latent code, the latent code to be used to render the representation of the object.
  20. The system of claim 19, wherein the one or more processors are further to: render the representation of the object based, at least in part, upon the latent code and using a differentiable renderer.
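Claims 5-7, 13, and 19-20 recite using a selection matrix to reduce the dimensionality of the features included in a latent code, with camera features for the corresponding view included in that code. The claims do not specify an implementation; the following is a minimal PyTorch sketch of one plausible reading, in which a fixed binary selection matrix keeps a subset of latent dimensions and the camera features are appended (the kept indices, sizes, and names are all hypothetical):

```python
import torch

LATENT, KEEP, CAM = 512, 64, 6          # illustrative sizes

# Binary selection matrix S (KEEP x LATENT): row i has a single 1 at the index
# of a latent dimension to keep; which indices to keep is a hypothetical choice.
keep_idx = torch.arange(KEEP)           # e.g., keep the first 64 dimensions
S = torch.zeros(KEEP, LATENT)
S[torch.arange(KEEP), keep_idx] = 1.0

w = torch.randn(1, LATENT)              # full latent code from the generator
cam = torch.randn(1, CAM)               # camera features for this view

# Reduced latent code to be handed to a differentiable renderer:
code = torch.cat([w @ S.T, cam], dim=-1)    # shape (1, KEEP + CAM)
```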

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 17/981,770, filed on Nov. 7, 2022, and entitled "Neural Rendering For Inverse Graphics Generation," which claims priority to U.S. patent application Ser. No. 17/193,405 (now U.S. Pat. No. 11,494,976), filed on Mar. 5, 2021, and entitled "Neural Rendering For Inverse Graphics Generation," which claims priority to U.S. Provisional Patent Application Ser. No. 62/986,618, filed Mar. 6, 2020, and entitled "3D Neural Rendering and Inverse Graphics with StyleGAN Renderer," each of which is hereby incorporated by reference herein in its entirety and for all purposes. This application is also related to co-pending U.S. patent application Ser. No. 17/019,120, filed Sep. 11, 2020, and entitled "Labeling Images Using a Neural Network," as well as co-pending U.S. patent application Ser. No. 17/020,649, filed Sep. 14, 2020, and entitled "Generating Labels for Synthetic Images Using One or More Neural Networks," each of which is hereby incorporated herein in its entirety and for all purposes.

BACKGROUND

A variety of different industries rely upon three-dimensional (3D) modeling for various purposes, including those that require the generation of representations of 3D environments. In order to provide realistic, complex environments, it is necessary to have a variety of different types of objects, or similar objects with different appearances, to avoid unrealistic repetition or omissions. Unfortunately, obtaining a large number of three-dimensional models can be a complex, expensive, and time- and resource-intensive process. It may be desirable to generate 3D environments from a large amount of available two-dimensional (2D) data, but many existing approaches do not provide for adequate 3D model generation based on 2D data. For approaches that prove promising, such as those involving machine learning, it is still necessary to have a sufficiently large number and variety of labeled training data instances in order to train the model. An insufficient number and variety of annotated training data instances can prevent a model from being sufficiently trained to produce acceptably accurate results.

BRIEF DESCRIPTION OF THE DRAWINGS

Various embodiments in accordance with the present disclosure will be described with reference to the drawings, in which:

FIGS. 1A, 1B, and 1C illustrate image data that can be utilized, according to at least one embodiment;
FIG. 2 illustrates a neural rendering and inverse graphics pipeline, according to at least one embodiment;
FIG. 3 illustrates components of an image generation system, according to at least one embodiment;
FIG. 4 illustrates representative points used to determine view or pose information for an object, according to at least one embodiment;
FIG. 5 illustrates a process for training an inverse graphics network, according to at least one embodiment;
FIG. 6 illustrates components of a system for training and utilizing an inverse graphics network, according to at least one embodiment;
FIG. 7A illustrates inference and/or training logic, according to at least one embodiment;
FIG. 7B illustrates inference and/or training logic, according to at least one embodiment;
FIG. 8 illustrates an example data center system, according to at least one embodiment;
FIG. 9 illustrates a computer system, according to at least one embodiment;
FIG. 10 illustrates a computer system, according to at least one embodiment;
FIG. 11 illustrates at least portions of a graphics processor, according to one or more embodiments;
FIG. 12 illustrates at least portions of a graphics processor, according to one or more embodiments;
FIG. 13 is an example data flow diagram for an advanced computing pipeline, in accordance with at least one embodiment;
FIG. 14 is a system diagram for an example system for training, adapting, instantiating, and deploying machine learning models in an advanced computing pipeline, in accordance with at least one embodiment; and
FIGS. 15A and 15B illustrate a data flow diagram for a process to train a machine learning model, as well as client-server architecture to enhance annotation tools with pre-trained annotation models, in accordance with at least one embodiment.

DETAILED DESCRIPTION

Approaches in accordance with various embodiments can provide for the generation of three-dimensional (3D) models, or at least the inferencing of one or more 3D properties (e.g., shape, texture, or light), from two-dimensional (2D) input, such as images or video frames. In particular, various embodiments provide approaches to train an inverse graphics model to generate accurate 3D models or representations from 2D images. In at least some embodiments, a generative model can be used to generate multiple views of an object from different viewpoints or with different poses, with other image features or aspects being kept fixed. The generated images can include pose, view, or camera information that was u
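The detailed description notes that a generative model can produce multiple views of an object by adjusting only viewpoint- or camera-related features while holding the rest of the latent code fixed, with the sampled camera serving as annotation. A minimal sketch of that sampling step, under the same toy assumptions as the earlier sketch (the frozen linear "generator," the latent sizes, and the azimuth sweep are all illustrative, not the patent's method):

```python
import torch
import torch.nn as nn

# Hypothetical frozen stand-in for a pretrained StyleGAN-class generator.
gen = nn.Linear(128 + 6, 64 * 64 * 3).eval()

z = torch.randn(1, 128)                 # content latent: one object, held fixed

views, cams = [], []
with torch.no_grad():
    for azimuth in torch.linspace(-1.0, 1.0, 8):   # illustrative camera sweep
        cam = torch.zeros(1, 6)
        cam[0, 0] = azimuth             # adjust only the view-related feature
        views.append(gen(torch.cat([z, cam], dim=-1)))
        cams.append(cam)                # the sampled camera is the annotation

multi_view_images = torch.cat(views)    # the generated "second dataset"
camera_annotations = torch.cat(cams)    # annotation data for that dataset
```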