US-12626461-B2 - Complete 3D object reconstruction from an incomplete image

US12626461B2US 12626461 B2US12626461 B2US 12626461B2US-12626461-B2

Abstract

A modeling system accesses a two-dimensional (2D) input image displayed via a user interface, the 2D input image depicting, at a first view, a first object. At least one region of the first object is not represented by pixel values of the 2D input image. The modeling system generates, by applying a 3D representation generation model to the 2D input image, a three-dimensional (3D) representation of the first object that depicts an entirety of the first object, including the at least one region. The modeling system displays, via the user interface, the 3D representation, wherein the 3D representation is viewable via the user interface from a plurality of views including the first view.
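To make the flow described in the abstract concrete, here is a minimal Python sketch of the overall interface, assuming a `model` callable that maps one RGB image to a complete 3D surface and a `render` helper that rasterizes that surface at a chosen viewpoint. Both names are hypothetical placeholders for illustration; the patent does not disclose this API.

```python
# Minimal sketch of the abstract's flow: a 2D image with missing object
# regions goes in, a complete 3D representation comes out, and it can be
# rendered from several viewpoints, including the original input view.
# `model` and `render` are assumed, illustrative callables.
import numpy as np

def view_reconstruction(image: np.ndarray, model, render, angles_deg):
    """Reconstruct a complete 3D surface from one partial 2D view, then
    render it at several yaw angles (0 corresponds to the input view)."""
    mesh = model(image)  # fills in occluded / out-of-frame regions
    return {a: render(mesh, yaw_deg=a) for a in angles_deg}

# e.g.: views = view_reconstruction(img, model, render, [0, 90, 180, 270])
```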

Inventors

  • Jae Shin Yoon
  • Yangtuanfeng Wang
  • Krishna Kumar Singh
  • Junying Wang
  • Jingwan Lu

Assignees

  • ADOBE INC.

Dates

Publication Date
2026-05-12
Application Date
2023-09-05

Claims (20)

  1. A method performed by one or more computing devices associated with a modeling system, comprising: accessing a two-dimensional (2D) input image displayed via a user interface, the 2D input image depicting, at a first view, a first object, wherein at least one region of the first object is not represented by pixel values of the 2D input image; training a three-dimensional (3D) convolutional neural network of a 3D representation generation model using a 3D discriminator to produce generative volumetric features in the at least one region of the first object in the 2D input image; combining fine-detailed surface normals in a multiview normal fusion framework to produce multiview surface normals based on pixel-aligned normal features; combining, using the 3D representation generation model applied to the 2D input image, the multiview surface normals of the first object with the generative volumetric features to produce a 3D representation of the first object that depicts an entirety of the first object including the at least one region; and displaying, via the user interface, the 3D representation, wherein the 3D representation is viewable via the user interface from a plurality of views including the first view.
  2. The method of claim 1, wherein at the first view the at least one region is outside of an area of the 2D input image.
  3. The method of claim 1, wherein at the first view the at least one region is occluded by a second object depicted in the 2D input image.
  4. The method of claim 1, further comprising: generating, based on the generative volumetric features determined based on the 2D input image, a coarse geometry for the first object using a coarse multilayer perceptron (MLP); and generating, based on the coarse geometry and intermediate features generated by the coarse MLP, a fine geometry for the first object.
  5. The method of claim 4, wherein the multiview surface normals are based on the coarse geometry.
  6. The method of claim 4, further comprising: generating an image feature volume for the 2D input image by extracting features of the 2D input image in a depth direction; and determining concatenated image features by concatenating the image feature volume with a 3D pose of the first object recorded on the image feature volume, the 3D pose determined from a 3D object model.
  7. The method of claim 4, further comprising applying, based on the fine geometry generated for the first object and the 2D input image, a progressive texture inpainting process to generate the 3D representation.
  8. The method of claim 1, wherein the 3D representation is displayed at the first view and depicts the at least one region of the first object, and further comprising: responsive to receiving an input via the user interface, displaying the 3D representation at a second view of the plurality of views that is different from the first view, wherein the 3D representation displayed at the second view depicts the at least one region of the first object.
  9. A system comprising: a memory component; and a processing device coupled to the memory component, the processing device configured to perform operations comprising: accessing a two-dimensional (2D) input image displayed via a user interface, the 2D input image depicting, at a first view, a first object, wherein at least one region of the first object is not represented by pixel values of the 2D input image; training a three-dimensional (3D) convolutional neural network of a 3D representation generation model using a 3D discriminator to produce generative volumetric features in the at least one region of the first object in the 2D input image; combining fine-detailed surface normals in a multiview normal fusion framework to produce multiview surface normals based on pixel-aligned normal features; combining, using the 3D representation generation model applied to the 2D input image, the multiview surface normals of the first object with the generative volumetric features to produce a 3D representation of the first object that depicts an entirety of the first object including the at least one region; and displaying, via the user interface, the 3D representation, wherein the 3D representation is viewable via the user interface from a plurality of views including the first view, wherein the 3D representation is displayed at the first view and depicts the at least one region of the first object.
  10. The system of claim 9, the operations further comprising: responsive to receiving an input via the user interface, displaying the 3D representation at a second view of the plurality of views that is different from the first view, wherein the 3D representation displayed at the second view depicts the at least one region of the first object.
  11. The system of claim 9, wherein at the first view the at least one region is outside of an area of the 2D input image or is occluded by a second object depicted in the 2D input image.
  12. The system of claim 9, the operations further comprising: generating, based on the generative volumetric features determined based on the 2D input image, a coarse geometry for the first object using a coarse multilayer perceptron (MLP); and generating, based on the coarse geometry and intermediate features generated by the coarse MLP, a fine geometry for the first object.
  13. The system of claim 12, wherein the multiview surface normals are based on the coarse geometry.
  14. The system of claim 12, the operations further comprising: generating an image feature volume for the 2D input image by extracting features of the 2D input image in a depth direction; and determining concatenated image features by concatenating the image feature volume with a 3D pose of the first object recorded on the image feature volume, the 3D pose determined from a 3D object model.
  15. The system of claim 12, the operations further comprising applying, based on the fine geometry generated for the first object and the 2D input image, a progressive texture inpainting process to generate the 3D representation.
  16. The system of claim 12, wherein the 3D representation is displayed at the first view and depicts the at least one region of the first object, the operations further comprising: responsive to receiving an input via the user interface, displaying the 3D representation at a second view of the plurality of views that is different from the first view, wherein the 3D representation displayed at the second view depicts the at least one region of the first object.
  17. A non-transitory computer-readable medium storing executable instructions, which when executed by a processing device, cause the processing device to perform operations comprising: accessing a two-dimensional (2D) input image displayed via a user interface, the 2D input image depicting, at a first view, a first object, wherein at least one region of the first object is not represented by pixel values of the 2D input image, wherein the at least one region is outside of an area of the 2D input image or is occluded by a second object depicted in the 2D input image; combining fine-detailed surface normals in a multiview normal fusion framework to produce multiview surface normals based on pixel-aligned normal features; training a three-dimensional (3D) convolutional neural network of a 3D representation generation model using a 3D discriminator to produce generative volumetric features in the at least one region of the first object in the 2D input image; combining, using the 3D representation generation model applied to the 2D input image, the multiview surface normals of the first object with the generative volumetric features to produce a 3D representation of the first object that depicts an entirety of the first object including the at least one region; and displaying, via the user interface, the 3D representation, wherein the 3D representation is viewable via the user interface from a plurality of views including the first view, wherein the 3D representation is displayed at the first view and depicts the at least one region of the first object.
  18. The non-transitory computer-readable medium of claim 17, the operations further comprising: generating, based on the generative volumetric features determined based on the 2D input image, a coarse geometry for the first object using a coarse multilayer perceptron (MLP); generating, based on the coarse geometry and intermediate features generated by the coarse MLP, a fine geometry for the first object; and applying, based on the fine geometry generated for the first object and the 2D input image, a progressive texture inpainting process to generate the 3D representation.
  19. The non-transitory computer-readable medium of claim 18, the operations further comprising: generating an image feature volume for the 2D input image by extracting features of the 2D input image in a depth direction; and determining concatenated image features by concatenating the image feature volume with a 3D pose of the first object recorded on the image feature volume, the 3D pose determined from a 3D object model.
  20. The system of claim 12, the operations further comprising: responsive to receiving an input via the user interface, displaying the 3D representation at a second view of the plurality of views that is different from the first view, wherein the 3D representation displayed at the second view depicts the at least one region of the first object.
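Claims 4, 12, and 18 above recite a coarse-to-fine geometry step: a coarse MLP maps generative volumetric features at a query point to an initial occupancy estimate, and a fine stage refines that estimate using the coarse MLP's intermediate features together with the multiview (pixel-aligned) normal features. The PyTorch sketch below illustrates one plausible reading of that step; the layer widths, the fusion by concatenation, and all identifiers are illustrative assumptions, not the architecture disclosed in the patent.

```python
# Hedged sketch of a coarse-to-fine implicit-geometry stage: the coarse
# MLP produces an occupancy logit plus intermediate features, and the
# fine MLP refines the logit using those intermediates together with
# multiview normal features. All sizes are assumed for illustration.
import torch
import torch.nn as nn

class CoarseMLP(nn.Module):
    def __init__(self, feat_dim: int = 32, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, 1)  # coarse occupancy logit

    def forward(self, vol_feat, xyz):
        # vol_feat: (N, feat_dim) generative volumetric features sampled
        # at query points xyz: (N, 3)
        inter = self.body(torch.cat([vol_feat, xyz], dim=-1))
        return self.head(inter), inter  # keep intermediates for refinement

class FineMLP(nn.Module):
    def __init__(self, inter_dim: int = 128, normal_dim: int = 16,
                 hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(inter_dim + normal_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # refined occupancy logit
        )

    def forward(self, coarse_logit, inter, normal_feat):
        x = torch.cat([coarse_logit, inter, normal_feat], dim=-1)
        return self.body(x)

# Usage on a batch of 4096 query points:
coarse, fine = CoarseMLP(), FineMLP()
xyz = torch.rand(4096, 3)
vol_feat = torch.randn(4096, 32)     # stand-in for the 3D CNN's output
normal_feat = torch.randn(4096, 16)  # stand-in for fused normal features
c_logit, inter = coarse(vol_feat, xyz)
f_logit = fine(c_logit, inter, normal_feat)  # (4096, 1)
```

Passing the coarse MLP's intermediate features to the fine MLP, rather than only its occupancy output, lets the refinement stage reuse context the coarse stage has already computed, which is the usual motivation for coarse-to-fine implicit-surface designs.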

Description

TECHNICAL FIELD

This disclosure generally relates to techniques for using machine learning models to generate a three-dimensional (3D) representation of an object from a two-dimensional (2D) image of the object. More specifically, but not by way of limitation, this disclosure relates to generating a 3D representation of an object from an incomplete 2D image of the object.

BACKGROUND

Conventional scene generation systems can generate a full 3D representation (e.g., a 3D model) of an object (e.g., a human person, an animal, or another object) from a 2D image of the object. Conventional approaches can use neural networks to learn image features at each pixel (e.g., pixel-aligned features) of the 2D image, which enable continual classification of a position in 3D along a camera ray, to generate 3D representations with high-quality local details.

SUMMARY

The present disclosure describes techniques for applying a 3D representation generation model to a 2D input image of an object to generate a 3D model of the object. A modeling system accesses a two-dimensional (2D) input image displayed via a user interface, the 2D input image depicting, at a first view, a first object. At least one region of the first object is not represented by pixel values of the 2D input image. The modeling system generates, by applying a 3D representation generation model to the 2D input image, a three-dimensional (3D) representation of the first object that depicts an entirety of the first object, including the at least one region. The modeling system displays, via the user interface, the 3D representation, wherein the 3D representation is viewable via the user interface from a plurality of views including the first view.

Various embodiments are described herein, including methods, systems, non-transitory computer-readable storage media storing programs, code, or instructions executable by one or more processing devices, and the like. These illustrative embodiments are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof. Additional embodiments and further description are provided in the Detailed Description.

BRIEF DESCRIPTION OF THE DRAWINGS

Features, embodiments, and advantages of the present disclosure are better understood when the following Detailed Description is read with reference to the accompanying drawings.

FIG. 1 depicts an example of a computing environment for using a 3D representation generation model to generate a 3D representation of an object from a 2D input image of an incompletely depicted object, according to certain embodiments disclosed herein.
FIG. 2 depicts a method for using a 3D representation generation model to generate a 3D representation of an object from a 2D input image of an incompletely depicted object, according to certain embodiments disclosed herein.
FIG. 3 depicts a 3D representation generation model, according to certain embodiments disclosed herein.
FIG. 4 depicts a multiview normal surface fusion pipeline of the 3D representation generation model of FIG. 3, according to certain embodiments described herein.
FIG. 5 depicts a multiview normal enhancement framework of the 3D representation generation model of FIG. 3, according to certain embodiments described herein.
FIG. 6 illustrates an example 3D representation generated using the 3D representation generation model of FIG. 3, according to certain embodiments described herein.
FIG. 7 illustrates example 3D representations generated using the 3D representation generation model described herein compared to conventionally generated 3D representations, according to certain embodiments described herein.
FIG. 8 depicts an example of a computing system that performs certain operations described herein, according to certain embodiments disclosed herein.
FIG. 9 depicts an example of a cloud computing system that performs certain operations described herein, according to certain embodiments disclosed herein.

DETAILED DESCRIPTION

In the following description, for the purposes of explanation, specific details are set forth in order to provide a thorough understanding of certain embodiments. However, it will be apparent that various embodiments may be practiced without these specific details. The figures and description are not intended to be restrictive. The words “exemplary” or “example” are used herein to mean “serving as an example, instance, or illustration.” Any embodiment or design described herein as “exemplary” or “example” is not necessarily to be construed as preferred or advantageous over other embodiments or designs.

Conventional modeling systems can generate a full 3D representation (e.g., a 3D model) of an object (e.g., a human person, an animal, or another object) from a 2D image of the object. Conventional approaches can use neural networks to learn image features at each pixel (e.g., pixel-aligned features) of the 2D image, which enable continual classification of a position in 3D along a camera ray, to generate 3D representations with high-quality local details.
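The pixel-aligned features mentioned above can be made concrete with a short sketch: project each 3D query point into the image and bilinearly sample the 2D feature map at that pixel, so that any position along a camera ray can be classified against the object's surface. The orthographic projection, tensor shapes, and function name below are assumptions chosen for brevity, not the patent's implementation.

```python
# Minimal sketch of pixel-aligned feature sampling: each 3D query point
# is projected into the image plane and the 2D feature map is bilinearly
# sampled at that location. An orthographic camera is assumed.
import torch
import torch.nn.functional as F

def pixel_aligned_features(feat_map: torch.Tensor,
                           points: torch.Tensor) -> torch.Tensor:
    """feat_map: (1, C, H, W) image features; points: (N, 3) in a
    normalized camera frame with x, y in [-1, 1]. Returns (N, C)."""
    # Orthographic projection: drop z, keep the normalized x, y coords.
    grid = points[None, None, :, :2]              # (1, 1, N, 2)
    sampled = F.grid_sample(feat_map, grid,
                            mode="bilinear", align_corners=True)
    return sampled[0, :, 0, :].transpose(0, 1)    # (N, C)

feats = torch.randn(1, 64, 128, 128)
pts = torch.rand(1000, 3) * 2 - 1
print(pixel_aligned_features(feats, pts).shape)   # torch.Size([1000, 64])
```

Because the same pixel feature is shared by every depth along a ray, a network conditioned on these features (plus the point's depth) can classify each sampled 3D position as inside or outside the surface, which is what gives pixel-aligned approaches their high-quality local detail.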