
JP-2026514445-A - 3D mesh generation using signed distance function

JP 2026514445 A

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating a mesh of a scene from a text description of the scene using a signed distance function.

Inventors

  • Christina Nektaria Tsalikoglou
  • Fabian Manhardt
  • Alessio Tonioni
  • Michael Niemeyer
  • Federico Tombari

Assignees

  • Google LLC

Dates

Publication Date
2026-05-11
Application Date
2024-04-19
Priority Date
2023-04-19

Claims (20)

  1. A method performed by one or more computers, the method comprising: obtaining a text prompt; and generating a final three-dimensional (3D) mesh of a scene described by the text prompt, the generating comprising: optimizing parameters of a 3D model of the scene using a pre-trained image diffusion neural network conditioned on the text prompt, wherein the 3D model of the scene maps points in the scene and view directions from the points to signed distances for the points and colors for the point-view direction pairs; and generating a first 3D mesh in accordance with the optimized parameters of the 3D model of the scene.
  2. The method of claim 1, wherein generating the final 3D mesh further comprises: for each view in a first set of one or more views, rendering, from the first 3D mesh using a differentiable renderer, a respective image of the scene from the view and a respective depth map of the scene from the view; and updating the first 3D mesh using the respective images and depth maps for the views in the first set of one or more views to generate a second 3D mesh.
  3. The method of claim 2, wherein updating the first 3D mesh using the respective images and depth maps for the views in the first set of one or more views to generate the second 3D mesh comprises: generating, using the respective images and depth maps for the views in the first set of one or more views, one or more pseudo-ground truth images for each view in the first set; and updating the first 3D mesh by optimizing an objective that measures, for each view in the first set and for each pseudo-ground truth image for the view, an error between the pseudo-ground truth image and an image from the view rendered from the second 3D mesh by the differentiable renderer.
  4. The method of claim 3, wherein generating, using the respective images and depth maps for the views in the first set of one or more views, one or more pseudo-ground truth images for each view in the first set comprises: for each view, generating the one or more pseudo-ground truth images for the view using a generative neural network conditioned on at least the respective image and depth map for the view.
  5. The method of claim 4, wherein the generative neural network is a depth-conditioned diffusion neural network.
  6. The method of claim 4 or claim 5, wherein the first set includes a plurality of views, and wherein generating, for each view, the one or more pseudo-ground truth images for the view using the generative neural network comprises: generating a tiled input that includes (i) a tiled image comprising the respective images for the plurality of views in the first set and (ii) a tiled depth map comprising the respective depth maps for the plurality of views in the first set; generating a tiled pseudo-ground truth image using the generative neural network conditioned on the tiled input; and, for each view, extracting the pseudo-ground truth image for the view from the tiled pseudo-ground truth image.
  7. The method of any one of claims 2 to 6, wherein generating the final 3D mesh further comprises: updating the second 3D mesh using gradients of a refinement loss to generate a third 3D mesh, wherein the gradients of the refinement loss include, for each view in a second set of a plurality of views, a score distillation sampling (SDS) gradient generated from a denoising output generated by a trained diffusion neural network for a noisy image generated from an image of the scene from the view rendered from the third 3D mesh using the differentiable renderer.
  8. The method of claim 7, wherein updating the second 3D mesh comprises: for each view in the second set of views, generating a pseudo-ground truth image of the scene from the view by rendering a respective image of the scene from the view from the second 3D mesh using the differentiable renderer, and wherein the gradients of the refinement loss include, for each view in the second set of views, a gradient of a photometric loss between the pseudo-ground truth image of the scene from the view and an image of the scene from the view rendered from the third 3D mesh using the differentiable renderer.
  9. The method of claim 8, wherein the photometric loss is a mean squared error loss between the pseudo-ground truth image of the scene from the view and the image of the scene from the view rendered from the third 3D mesh using the differentiable renderer.
  10. The method of any one of claims 1 to 9, wherein generating the first 3D mesh in accordance with the optimized parameters of the 3D model of the scene comprises: extracting the first 3D mesh as a surface at the zero-level set of the signed distances generated by the 3D model of the scene in accordance with the optimized parameters of the 3D model of the scene.
  11. The method of claim 10, wherein extracting the first 3D mesh as a surface at the zero-level set of the signed distances comprises: extracting the first 3D mesh using marching cubes.
  12. The method of claim 10 or claim 11, wherein extracting the first 3D mesh as a surface at the zero-level set of the signed distances comprises: selecting a largest mesh component closest to a center of a volume defined by the signed distances generated by the 3D model of the scene in accordance with the optimized parameters of the 3D model of the scene.
  13. The method of any one of claims 1 to 12, wherein optimizing the parameters of the 3D model of the scene using the pre-trained image diffusion neural network conditioned on the text prompt comprises: optimizing the parameters using score distillation sampling (SDS) gradients generated using denoising outputs generated by the trained image diffusion neural network for noisy images generated from images of the scene generated from the 3D model of the scene.
  14. The method of claim 13, wherein optimizing the parameters comprises repeatedly performing operations comprising: sampling a set of one or more camera poses; for each camera pose: generating, using the 3D model of the scene and in accordance with the parameters, an image of the scene from the camera pose; generating one or more noisy images from the image of the scene; generating, using the trained image diffusion neural network, a respective denoising output for each noisy image; and determining an SDS gradient for the camera pose from the respective denoising outputs for the noisy images; and updating the parameters using the SDS gradients for the one or more camera poses in the set.
  15. The method of claim 14, wherein generating the one or more noisy images from the image of the scene comprises, for each noisy image: sampling random normal noise and a time step; and generating the noisy image by combining the random normal noise and the image of the scene in accordance with the time step.
  16. The method of claim 14 or 15, wherein generating, using the trained image diffusion neural network, a respective denoising output for each noisy image comprises: processing, using the trained diffusion neural network, a first diffusion input that includes the noisy image and the text prompt to generate an initial denoising output for the noisy image.
  17. The method of claim 16, wherein generating, using the trained image diffusion neural network, the respective denoising output for each noisy image further comprises: processing, using the trained diffusion neural network, a second diffusion input that includes the noisy image but does not include the text prompt to generate an initial unconditional denoising output for the noisy image; and combining the initial denoising output and the initial unconditional denoising output in accordance with a guidance weight to generate the denoising output.
  18. The method of any one of claims 14 to 17, wherein determining the SDS gradient for the camera pose from the respective denoising outputs for the noisy images comprises: determining a gradient, with respect to the parameters of the 3D model, of the image of the scene from the camera pose; and determining the SDS gradient using the respective denoising outputs for the noisy images, the target image, and the gradient of the image of the scene from the camera pose.
  19. A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the respective operations of the method of any one of claims 1 to 18.
  20. One or more computer-readable storage media storing instructions that, when executed by one or more computers, cause the one or more computers to perform the respective operations of the method of any one of claims 1 to 18.
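
The sketches that follow are illustrative editorial additions; they are not part of the claims or the original specification. The first is a minimal sketch of the kind of 3D scene model recited in claim 1: a coordinate network that maps a point and a view direction to a signed distance and a color. The PyTorch framing, the layer sizes, and the split into geometry and appearance branches are assumptions, not details taken from the patent.

    import torch
    import torch.nn as nn

    class SDFSceneModel(nn.Module):
        """Maps (point, view direction) -> (signed distance, RGB color).

        A minimal stand-in for the 3D scene model of claim 1; a real
        model would typically add positional encodings and deeper MLPs.
        """

        def __init__(self, hidden: int = 256):
            super().__init__()
            # Geometry branch: the signed distance depends only on the point.
            self.sdf_mlp = nn.Sequential(
                nn.Linear(3, hidden), nn.SiLU(),
                nn.Linear(hidden, hidden), nn.SiLU(),
                nn.Linear(hidden, 1 + hidden),  # SDF value + geometry feature
            )
            # Appearance branch: color depends on point, direction, feature.
            self.color_mlp = nn.Sequential(
                nn.Linear(3 + 3 + hidden, hidden), nn.SiLU(),
                nn.Linear(hidden, 3), nn.Sigmoid(),  # RGB in [0, 1]
            )

        def forward(self, points: torch.Tensor, dirs: torch.Tensor):
            out = self.sdf_mlp(points)
            sdf, feat = out[..., :1], out[..., 1:]
            rgb = self.color_mlp(torch.cat([points, dirs, feat], dim=-1))
            return sdf, rgb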
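Claims 2 and 3 update the first mesh by optimizing an objective against pseudo-ground truth images through a differentiable renderer. A minimal sketch under stated assumptions: render_rgbd is a hypothetical differentiable renderer returning an image and a depth map for a view, make_pseudo_gt stands in for the generative model of claim 4, and mesh_params is assumed to be a single learnable tensor.

    import torch
    import torch.nn.functional as F

    def update_mesh(mesh_params, views, render_rgbd, make_pseudo_gt,
                    steps=400, lr=1e-2):
        """Update the first mesh toward pseudo-ground truth images
        (claims 2-3). All helper callables here are hypothetical."""
        # Render each view of the first mesh once, then derive one
        # pseudo-ground truth image per view.
        with torch.no_grad():
            pseudo_gt = []
            for view in views:
                image, depth = render_rgbd(mesh_params, view)
                pseudo_gt.append(make_pseudo_gt(image, depth))
        opt = torch.optim.Adam([mesh_params], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = 0.0
            for view, target in zip(views, pseudo_gt):
                image, _ = render_rgbd(mesh_params, view)
                # Objective: error between the pseudo-ground truth and
                # the re-rendered view of the updated mesh (claim 3).
                loss = loss + F.mse_loss(image, target)
            loss.backward()
            opt.step()
        return mesh_params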
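Claims 4 to 6 generate the pseudo-ground truth images with a depth-conditioned generative model, optionally over a tiled grid of views so that all views are generated in a single, mutually consistent pass. A sketch assuming four views tiled into a 2x2 grid; depth_diffusion is a hypothetical depth-conditioned diffusion sampler standing in for the network of claim 5.

    import torch

    def tiled_pseudo_ground_truth(images, depths, depth_diffusion, text_emb):
        """Tiled pseudo-ground truth generation (claim 6).

        images: (4, 3, H, W) renders of the views in the first set.
        depths: (4, 1, H, W) matching depth maps.
        depth_diffusion(image, depth, cond) is a hypothetical
        depth-conditioned diffusion sampler (claim 5).
        """
        _, _, h, w = images.shape
        # (i) Tile the four views into a 2x2 image and a 2x2 depth map.
        tile = lambda x: torch.cat(
            [torch.cat([x[0], x[1]], dim=-1),
             torch.cat([x[2], x[3]], dim=-1)], dim=-2)
        tiled_image, tiled_depth = tile(images), tile(depths)
        # (ii) One diffusion pass conditioned on the tiled input.
        tiled_pgt = depth_diffusion(tiled_image, tiled_depth, text_emb)
        # Extract the per-view pseudo-ground truth images back out.
        return torch.stack([
            tiled_pgt[..., :h, :w], tiled_pgt[..., :h, w:],
            tiled_pgt[..., h:, :w], tiled_pgt[..., h:, w:]])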
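Claims 7 to 9 refine the mesh with a loss whose gradients combine an SDS term with a photometric term against renders of the frozen second mesh. A sketch reusing the hypothetical render_rgbd renderer; sds_backward stands in for an SDS step such as the sds_step sketch further below, and the lambda_photo weighting is an assumption.

    import torch
    import torch.nn.functional as F

    def refinement_gradients(third_params, second_params, views,
                             render_rgbd, sds_backward, lambda_photo=1.0):
        """Accumulate the refinement-loss gradients of claims 7-9 on
        third_params."""
        for view in views:
            with torch.no_grad():
                # Pseudo ground truth: the view rendered from the frozen
                # second mesh (claim 8).
                pgt, _ = render_rgbd(second_params, view)
            image, _ = render_rgbd(third_params, view)
            # Photometric term: mean squared error (claim 9). retain_graph
            # lets the SDS term reuse the same rendered image below.
            photo = lambda_photo * F.mse_loss(image, pgt)
            photo.backward(retain_graph=True)
            # SDS term (claim 7), e.g. image.backward(gradient=eps_hat - eps).
            sds_backward(image)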
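For the mesh extraction of claims 10 to 12, a sketch using skimage.measure.marching_cubes to pull out the zero-level set of a sampled SDF volume, then keeping the component nearest the volume center (claim 12). The grid resolution, the use of trimesh for component splitting, and the size-versus-centrality scoring heuristic are all assumptions.

    import numpy as np
    import torch
    import trimesh
    from skimage import measure

    def extract_mesh(sdf_fn, resolution=256, bound=1.0):
        """Extract the zero-level set of the signed distance field
        (claims 10-11) and keep the large component nearest the center
        of the volume (claim 12)."""
        # Sample the SDF on a dense grid over [-bound, bound]^3.
        xs = torch.linspace(-bound, bound, resolution)
        grid = torch.stack(torch.meshgrid(xs, xs, xs, indexing="ij"), dim=-1)
        with torch.no_grad():
            sdf = sdf_fn(grid.reshape(-1, 3)).reshape(
                resolution, resolution, resolution)
        # Marching cubes at level 0 gives the SDF's surface.
        verts, faces, _, _ = measure.marching_cubes(sdf.cpu().numpy(),
                                                    level=0.0)
        # Map voxel indices back to world coordinates.
        verts = verts / (resolution - 1) * 2 * bound - bound
        mesh = trimesh.Trimesh(vertices=verts, faces=faces)
        # Heuristic: favor components that are both large and central.
        components = mesh.split(only_watertight=False)
        return min(components,
                   key=lambda m: np.linalg.norm(m.centroid) / (m.area + 1e-8))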
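Finally, the SDS optimization of claims 13 to 18 can be sketched as follows. This is a hedged illustration, not the patented implementation: diffusion is a hypothetical wrapper returning the frozen model's noise prediction, render_from_pose stands in for rendering an image from the 3D model (claim 14), alphas_cumprod is the standard cumulative-product noise schedule, and the classifier-free guidance combination corresponds to the conditional and unconditional denoising outputs of claims 16 and 17.

    import torch

    @torch.no_grad()
    def guided_epsilon(diffusion, x_t, t, text_emb, guidance_weight):
        """Denoising output with classifier-free guidance (claims 16-17).

        diffusion(x_t, t, cond) is hypothetical; cond=None denotes the
        second diffusion input that omits the text prompt.
        """
        eps_cond = diffusion(x_t, t, text_emb)   # first diffusion input
        eps_uncond = diffusion(x_t, t, None)     # second diffusion input
        # Combine the two outputs according to the guidance weight.
        return eps_uncond + guidance_weight * (eps_cond - eps_uncond)

    def sds_step(render_from_pose, diffusion, alphas_cumprod, camera_pose,
                 text_emb, guidance_weight=100.0):
        """One SDS gradient accumulation for one camera pose
        (claims 13-15 and 18)."""
        # Render an image of the scene from the pose; this must be
        # differentiable in the 3D model parameters.
        image = render_from_pose(camera_pose)            # (1, 3, H, W)
        # Sample random normal noise and a time step (claim 15).
        t = torch.randint(0, alphas_cumprod.shape[0], (1,))
        eps = torch.randn_like(image)
        a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
        # Combine noise and image according to the time step.
        x_t = a_bar.sqrt() * image + (1.0 - a_bar).sqrt() * eps
        # Denoising output from the frozen diffusion model.
        eps_hat = guided_epsilon(diffusion, x_t, t, text_emb, guidance_weight)
        # SDS: backpropagate (eps_hat - eps) through d(image)/d(params),
        # the gradient of the rendered image (claim 18).
        image.backward(gradient=(eps_hat - eps))

An outer loop would sample a set of camera poses, call sds_step for each, and then apply an optimizer step to the model parameters, matching the repeated operations of claim 14.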

Description

Cross-Reference to Related Applications

This application claims priority to Greek Patent Application No. 20230100331, filed on 19 April 2023, the entire contents of which are incorporated herein by reference.

This specification relates to generating scene representations using machine learning models. As an example, a neural network is a machine learning model that uses one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to another layer in the network, for example, the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with the current values of a respective set of weights.

This specification describes a system, implemented as computer programs on one or more computers, that generates a three-dimensional (3D) mesh of a scene from a text prompt.

The subject matter described in this specification can be implemented in particular embodiments so as to realize one or more of the following advantages. The ability to generate highly realistic 2D images from simple text prompts has recently advanced significantly in both speed and quality thanks to the emergence of image diffusion models. However, using a diffusion model trained on 2D images to supervise the generation of a 3D model with view-dependent prompts has significant drawbacks. For example, such methods generate neural radiance fields (NeRFs) instead of the 3D meshes that are commonly used in practice, making them impractical for most real-world applications. As another example, such methods tend to produce oversaturated models, giving the output a cartoonish appearance. To address these issues, this specification describes techniques for generating highly photorealistic 3D meshes from text prompts. To this end, the described techniques extend NeRF to employ a signed distance function (SDF) backbone, which improves 3D mesh extraction. Specifically, the system uses a 3D model of the scene that predicts signed distances for input points. Once the 3D model of the scene has been optimized for a given input text prompt, the quality of the extracted mesh can therefore be significantly improved. Furthermore, this specification describes techniques for fine-tuning the texture of the extracted mesh to remove oversaturation effects and improve the detail of the output 3D mesh.

In particular, in some cases the one or more computers performing the mesh generation may have constrained memory, for example, because of other processes running on the computers or because of the underlying computer hardware. As described below, however, optimizing a 3D model using, for example, SDS gradients can require repeatedly rendering images by submitting a large number of queries to the neural-network-based 3D model and performing inference using a pixel-level diffusion model. As a result, this optimization can be memory-intensive. Therefore, to stay within the memory constraints of the hardware on which it is deployed, the system may need to use relatively low-resolution images and, in some cases, relatively high guidance weights to perform the optimization. When the optimization is performed at low resolution, the first 3D mesh can lose high-resolution detail.
With this in mind, the techniques described in this specification can fine-tune the texture of the extracted mesh to, for example, remove oversaturation effects and improve the detail of the output 3D mesh, thereby generating a high-quality final mesh while remaining within memory constraints.

The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

  • Figure 1 is a block diagram of an example mesh generation system.
  • Figure 2 is a flowchart of an example process for generating the final mesh.
  • Figure 3 is a flowchart of an example process for optimizing the parameters of a 3D model.
  • Figure 4 is a flowchart of an example process for generating a second mesh from a first mesh.
  • Figure 5 is a flowchart of an example process for generating a third mesh from a second mesh.
  • Figure 6 shows an example of the operation of the system.

Like reference numbers and designations in the various drawings indicate like elements.

Figure 1 is a block diagram of an example mesh generation system 100. The mesh generation system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below are implemented. The system 100 receives a text prompt 102, processes the text prompt 102, and generates a t