
DE-102025143584-A1 - Generating simulation-ready virtual characters from natural language input

DE 102025143584 A1

Abstract

The disclosed method for training machine learning models for object generation includes performing one or more operations based on object data to train an untrained machine learning model to generate a trained machine learning model comprising a trained encoder and a trained decoder, wherein the trained machine learning model is trained to generate an object surface representation; performing one or more operations based on the object data and natural language data to train an untrained diffusion model to generate a trained diffusion model, wherein the trained diffusion model is trained to generate an object geometry embedding; and wherein the trained diffusion model and the trained decoder are used to generate a virtual object based on natural language input.

Inventors

  • Xueting Li
  • Umar Iqbal
  • Ye Yuan
  • Jan Kautz
  • Shalini De Mello
  • Miles Macklin
  • Jonathan Christian Leaf
  • Gilles Daviet

Assignees

  • NVIDIA CORPORATION

Dates

Publication Date
2026-05-13
Application Date
2025-10-24
Priority Date
2025-09-22

Claims (20)

  1. A computer-implemented method for training machine learning models to generate objects, wherein the method comprises: performing one or more operations based on object data to train an untrained machine learning model to generate a trained machine learning model comprising a trained encoder and a trained decoder, wherein the trained machine learning model is trained to generate an object surface representation; and performing one or more operations based on the object data and natural language data to train an untrained diffusion model to generate a trained diffusion model, wherein the trained diffusion model is trained to generate an object geometry embedding; and wherein the trained diffusion model and the trained decoder are used to generate a virtual object based on a natural language input.
  2. The computer-implemented method according to Claim 1, wherein performing one or more operations to train the untrained machine learning model to generate the trained machine learning model includes: generating an object geometry and an initial object surface representation based on the object data; generating an initial object geometry embedding based on the object geometry using an untrained encoder; generating a reconstruction of the initial object surface representation based on the initial object geometry embedding using an untrained decoder; calculating a loss based on the initial object geometry embedding, the reconstruction of the initial object surface representation, and the initial object surface representation; and updating one or more parameters of the untrained encoder and the untrained decoder based on the loss.
  3. The computer-implemented method according to Claim 2, wherein the loss comprises at least one of: a binary cross-entropy loss based on a predicted unsigned distance field (UDF) included in the reconstruction of the initial object surface representation and a ground-truth UDF included in the initial object surface representation; an L2 gradient loss between one or more spatial gradients of the predicted UDF and the ground-truth UDF at one or more query points; or a Kullback-Leibler (KL) divergence loss based on one or more latent variables included in the initial object geometry embedding.
  4. The computer-implemented method according to any one of the preceding claims, wherein performing one or more operations to generate the trained diffusion model comprises: generating a language embedding based on the natural language data; generating an object geometry based on the object data; generating a first object geometry embedding based on the object geometry using the trained encoder; adding noise to the first object geometry embedding to generate a noisy object geometry embedding; performing one or more denoising steps using an untrained diffusion model to generate a predicted object geometry embedding based on the noisy object geometry embedding; calculating a loss based on the first object geometry embedding and the predicted object geometry embedding; and updating one or more parameters of the untrained diffusion model based on the loss.
  5. The computer-implemented method according to Claim 4, wherein the loss comprises a mean squared error loss between the predicted object geometry embedding and the first object geometry embedding.
  6. The computer-implemented method according to any one of the preceding claims, wherein performing one or more operations to train the untrained diffusion model to generate the trained diffusion model comprises performing one or more layer-wise training operations to disentangle one or more objects from one or more other components.
  7. The computer-implemented method according to Claim 6, wherein performing one or more layer-wise training operations comprises training one or more separate visual layers of the untrained diffusion model.
  8. The computer-implemented method according to Claim 6 or 7, wherein performing one or more layer-wise training operations includes: rendering one or more magnified views of objects; and combining the one or more magnified views of objects with one or more object-specific prompts contained in the natural language data.
  9. The computer-implemented method according to any one of the preceding claims, wherein generating the virtual object comprises: generating a language embedding based on the natural language input; and generating an object geometry based on the language embedding using the trained diffusion model and the trained decoder.
  10. The computer-implemented method according to Claim 9, further comprising: generating a body geometry based on the language embedding; generating a hair geometry based on the language embedding; performing one or more optimization steps based on the body geometry, the hair geometry, the object geometry, and the natural language input to generate an optimized character appearance; and generating a virtual character based on the optimized character appearance.
  11. One or more non-transitory computer-readable media storing instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of: performing one or more operations based on object data to train an untrained machine learning model to generate a trained machine learning model comprising a trained encoder and a trained decoder, wherein the trained machine learning model is trained to generate an object surface representation; and performing one or more operations based on the object data and natural language data to train an untrained diffusion model to generate a trained diffusion model, wherein the trained diffusion model is trained to generate an object geometry embedding, wherein the trained diffusion model and the trained decoder are used to generate a virtual object based on natural language input.
  12. The one or more non-transitory computer-readable media according to Claim 11, wherein performing one or more operations to train the untrained machine learning model to generate the trained machine learning model includes: generating an object geometry and an initial object surface representation based on the object data; generating an initial object geometry embedding based on the object geometry using an untrained encoder; generating a reconstruction of the initial object surface representation based on the initial object geometry embedding using an untrained decoder; calculating a loss based on the initial object geometry embedding, the reconstruction of the initial object surface representation, and the initial object surface representation; and updating one or more parameters of the untrained encoder and the untrained decoder based on the loss.
  13. The one or more non-transitory computer-readable media according to Claim 11 or 12, wherein performing one or more operations to generate the trained diffusion model includes: generating a language embedding based on the natural language data; generating an object geometry based on the object data; generating an initial object geometry embedding based on the object geometry using the trained encoder; adding noise to the initial object geometry embedding to generate a noisy object geometry embedding; performing one or more denoising steps using an untrained diffusion model to generate a predicted object geometry embedding based on the noisy object geometry embedding; calculating a loss based on the initial object geometry embedding and the predicted object geometry embedding; and updating one or more parameters of the untrained diffusion model based on the loss.
  14. The one or more non-transitory computer-readable media according to any one of Claims 11 to 13, wherein performing one or more operations to train the untrained diffusion model to generate the trained diffusion model comprises performing one or more layer-wise training operations to disentangle one or more objects from one or more other components.
  15. The one or more non-transitory computer-readable media according to Claim 14, wherein performing one or more layer-wise training operations includes generating one or more object-related prompts that avoid entangling an object geometry with one or more non-object geometries.
  16. The one or more non-transitory computer-readable media according to any one of Claims 11 to 15, wherein the trained diffusion model comprises an elucidated diffusion model.
  17. The one or more non-transitory computer-readable media according to any one of Claims 11 to 16, wherein generating the virtual object includes: generating a language embedding based on the natural language input; and generating an object geometry based on the language embedding using the trained diffusion model and the trained decoder.
  18. The one or more non-transitory computer-readable media according to Claim 17, wherein generating the object geometry includes: generating a predicted object geometry embedding based on the language embedding using the trained diffusion model; generating an initial object surface representation based on the predicted object geometry embedding; and generating the object geometry based on the initial object surface representation.
  19. The one or more non-transitory computer-readable media according to Claim 17, wherein the instructions, when executed by the one or more processors, further cause the one or more processors to perform the steps of: generating a body geometry based on the language embedding; generating a hair geometry based on the language embedding; performing one or more optimization steps based on the body geometry, the hair geometry, the object geometry, and the natural language input to generate an optimized character appearance; and generating a virtual character based on the optimized character appearance.
  20. A system comprising: one or more memories storing instructions; and one or more processors coupled to the one or more memories that, when executing the instructions, are configured to: perform one or more operations based on object data to train an untrained machine learning model to generate a trained machine learning model comprising a trained encoder and a trained decoder, wherein the trained machine learning model is trained to generate an object surface representation; and perform one or more operations based on the object data and natural language data to train an untrained diffusion model to generate a trained diffusion model, wherein the trained diffusion model is trained to generate an object geometry embedding, wherein the trained diffusion model and the trained decoder are used to generate a virtual object based on natural language input.
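Claims 2, 3, and 12 describe training the encoder/decoder pair as a variational-autoencoder-style model over unsigned distance fields (UDFs), with a binary cross-entropy loss on the predicted UDF, an L2 loss on its spatial gradients at query points, and a KL-divergence penalty on the latent. The following is a minimal PyTorch sketch of such a training step, not the patent's actual architecture: the network shapes, the point-cloud input, and the exp(-d) squashing used to feed distances into the BCE are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometryEncoder(nn.Module):
    """Maps an object geometry (here: a point cloud) to a latent Gaussian."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.point_net = nn.Sequential(nn.Linear(3, 128), nn.ReLU(),
                                       nn.Linear(128, 256))
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)

    def forward(self, points):                           # points: (B, N, 3)
        feat = self.point_net(points).max(dim=1).values  # order-invariant pooling
        return self.to_mu(feat), self.to_logvar(feat)

class UDFDecoder(nn.Module):
    """Predicts an unsigned distance for each query point, given a latent."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(latent_dim + 3, 256), nn.ReLU(),
                                 nn.Linear(256, 256), nn.ReLU(),
                                 nn.Linear(256, 1))

    def forward(self, z, queries):                       # queries: (B, Q, 3)
        z_exp = z.unsqueeze(1).expand(-1, queries.shape[1], -1)
        return self.mlp(torch.cat([z_exp, queries], dim=-1)).squeeze(-1).abs()

def autoencoder_step(encoder, decoder, optimizer, points, queries,
                     gt_udf, gt_grad, kl_weight=1e-4):
    queries = queries.clone().requires_grad_(True)       # for the gradient loss
    mu, logvar = encoder(points)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp() # reparameterization
    pred_udf = decoder(z, queries)

    # BCE between predicted and ground-truth UDFs squashed into (0, 1];
    # the exp(-d) squashing is an assumption, not taken from the patent.
    bce = F.binary_cross_entropy((-pred_udf).exp(), (-gt_udf).exp())

    # L2 loss between spatial gradients of the predicted UDF and the
    # ground-truth gradients at the query points (claim 3).
    pred_grad = torch.autograd.grad(pred_udf.sum(), queries,
                                    create_graph=True)[0]
    grad_loss = F.mse_loss(pred_grad, gt_grad)

    # KL divergence of the latent toward a standard normal (claim 3).
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()

    loss = bce + grad_loss + kl_weight * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Here `gt_udf` and `gt_grad` would come from the preprocessing that claim 2 calls generating "an object geometry and an initial object surface representation" from the object data.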
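Claims 4, 5, and 13 describe the second training stage: the trained (now frozen) encoder maps each object geometry to a clean embedding, noise is added, and an untrained diffusion model learns to predict the clean embedding conditioned on a language embedding, with a mean squared error loss. A sketch under DDPM-style forward noising and x0-prediction follows; the timestep encoding, the conditioning scheme, and the `text_encoder` callable are assumptions rather than details from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentDenoiser(nn.Module):
    """Predicts the clean geometry embedding from a noisy one, conditioned
    on a language embedding and a (crudely encoded) timestep."""
    def __init__(self, latent_dim=256, text_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + text_dim + 1, 512), nn.SiLU(),
            nn.Linear(512, 512), nn.SiLU(),
            nn.Linear(512, latent_dim))

    def forward(self, z_noisy, t, text_emb):
        t_feat = t.float().unsqueeze(-1) / 1000.0  # placeholder timestep encoding
        return self.net(torch.cat([z_noisy, t_feat, text_emb], dim=-1))

def diffusion_step(denoiser, frozen_encoder, text_encoder, optimizer,
                   points, captions, alphas_cumprod):
    with torch.no_grad():
        z0, _ = frozen_encoder(points)        # clean embedding; encoder is frozen
        text_emb = text_encoder(captions)     # language embedding (claim 4)

    # DDPM-style forward process: sample a timestep and add Gaussian noise.
    t = torch.randint(0, alphas_cumprod.shape[0], (z0.shape[0],),
                      device=z0.device)
    a = alphas_cumprod[t].unsqueeze(-1)
    z_noisy = a.sqrt() * z0 + (1.0 - a).sqrt() * torch.randn_like(z0)

    # The model predicts the clean embedding; the loss is a plain MSE between
    # predicted and initial embeddings, matching claims 5 and 13.
    z_pred = denoiser(z_noisy, t, text_emb)
    loss = F.mse_loss(z_pred, z0)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```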
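Claim 16 states that the trained diffusion model can be an elucidated diffusion model (EDM, Karras et al., 2022). For reference, here is a minimal sketch of EDM's noise-level schedule and deterministic Heun sampler; the hyperparameter values are EDM's published defaults, not values disclosed in the patent, and `denoise(x, sigma, cond)` is assumed to return a denoised estimate of its input.

```python
import torch

def edm_sigmas(n_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Karras et al. (2022) noise-level schedule, from sigma_max down to 0."""
    ramp = torch.linspace(0, 1, n_steps)
    sigmas = (sigma_max ** (1 / rho)
              + ramp * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    return torch.cat([sigmas, torch.zeros(1)])

@torch.no_grad()
def edm_sample(denoise, shape, cond, n_steps=32, device="cpu"):
    """Deterministic 2nd-order (Heun) sampler over the EDM probability-flow ODE."""
    sigmas = edm_sigmas(n_steps).to(device)
    x = torch.randn(shape, device=device) * sigmas[0]  # start at max noise
    for i in range(n_steps):
        s, s_next = sigmas[i], sigmas[i + 1]
        d = (x - denoise(x, s, cond)) / s              # ODE derivative
        x_euler = x + (s_next - s) * d                 # Euler step
        if s_next > 0:                                 # Heun correction
            d_next = (x_euler - denoise(x_euler, s_next, cond)) / s_next
            x = x + (s_next - s) * 0.5 * (d + d_next)
        else:
            x = x_euler
    return x
```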
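Claims 9, 17, and 18 describe inference: embed the natural language input, sample a predicted object geometry embedding with the trained diffusion model, decode it to an object surface representation, and recover the object geometry. A sketch that chains the components above follows; the dense-grid evaluation, the marching-cubes extraction, and the small positive iso-level (needed because the field is unsigned) are implementation assumptions.

```python
import torch
from skimage import measure  # marching cubes; an illustrative choice

@torch.no_grad()
def text_to_geometry(prompt, text_encoder, denoiser, decoder,
                     latent_dim=256, grid_res=64, iso=0.02, device="cpu"):
    # 1. Language embedding from the natural language input (claims 9/17).
    text_emb = text_encoder([prompt]).to(device)

    # 2. Sample a predicted object geometry embedding with the trained
    #    diffusion model (claim 18); `denoiser` is assumed to have the
    #    (x, sigma, cond) signature expected by edm_sample above.
    z = edm_sample(denoiser, (1, latent_dim), text_emb, device=device)

    # 3. Decode the embedding into an object surface representation by
    #    evaluating the UDF decoder on a dense grid (claim 18).
    axis = torch.linspace(-1.0, 1.0, grid_res, device=device)
    grid = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1)
    udf = decoder(z, grid.reshape(1, -1, 3)).reshape(grid_res, grid_res, grid_res)

    # 4. Extract the object geometry as a mesh at a small positive iso-level
    #    of the unsigned field.
    verts, faces, _, _ = measure.marching_cubes(udf.cpu().numpy(), level=iso)
    return verts, faces
```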
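Finally, claims 10 and 19 assemble a full character: body, hair, and object (e.g., clothing) geometries are each generated from the language embedding, and one or more optimization steps refine the character's appearance against the natural language input. The excerpt does not specify the optimization objective, so the sketch below is only a hypothetical orchestration in which `generators` and `score_fn` stand in for components the text leaves open.

```python
import torch

def generate_character(prompt, generators, appearance_init, score_fn,
                       n_steps=100, lr=1e-2):
    """Hypothetical assembly per claims 10/19. `generators` maps part names
    ("body", "hair", "clothing") to text-to-geometry callables; `score_fn`
    is a differentiable measure of how well the rendered appearance matches
    the prompt. Both stand in for components the patent does not detail."""
    parts = {name: gen(prompt) for name, gen in generators.items()}

    # One or more optimization steps over appearance parameters (e.g.,
    # texture or material latents), driven by the natural language input.
    params = appearance_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([params], lr=lr)
    for _ in range(n_steps):
        loss = -score_fn(parts, params, prompt)  # maximize prompt agreement
        opt.zero_grad()
        loss.backward()
        opt.step()

    return {**parts, "appearance": params.detach()}
```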

Description

BACKGROUND

Technical Field

Embodiments of the present disclosure relate generally to computer science, artificial intelligence, and machine learning, and more specifically to techniques for generating simulation-ready virtual characters from natural language input.

Description of the Related Art

Virtual character generation refers to the use of computational algorithms to create digital representations of characters for use in interactive or rendered environments, such as games, simulations, animated media, virtual reality, and/or the like. Virtual characters can include, without limitation, virtual humans, animals, fantastical creatures, humanoid robots, and other stylized or realistic entities. Virtual character generation systems are often integrated into real-time applications, such as video games, augmented reality (AR)/virtual reality (VR) experiences, and/or the like, or are used in offline pipelines for film production, digital twin simulation, synthetic data generation, and/or the like.

Traditional approaches to virtual character creation often use template-based pipelines and manually defined asset hierarchies to construct virtual characters from a set of predefined components. In such approaches, the character creation process is typically divided into separate modules for modeling the base geometry, adding surface features such as clothing or hair, and assigning textures or materials. The base geometry module defines the underlying skeletal or mesh structure, often derived from parametric body models or scanned templates. The clothing and hair modules then add geometry that conforms to the base mesh, using predefined binding rules or mesh deformation techniques. Texture mapping and material assignment modules apply visual properties to each surface, either procedurally or using artist-defined templates. For example, traditional virtual character creation approaches can use standard skinning and rigging techniques to animate characters and procedural tools to generate clothing layers based on user-selected parameters.

One drawback of the aforementioned approaches to virtual character generation is their reliance on manually defined asset hierarchies and predefined geometry templates, which limits the ability to generalize across different character types, poses, and appearances. In flexible content-generation settings, a virtual world can require virtual characters that vary significantly in body shape, clothing style, or surface complexity, or that respond dynamically to user input or physical simulations. For example, a video game might feature a wide variety of non-human characters, each with distinct anatomy and outer covering, while a virtual production pipeline might require a single character to appear in different outfits or hairstyles across different scenes. Virtual character generation systems that rely on fixed mesh topologies or template-driven pipelines often require extensive manual adjustment or redesign to support such diversity and are less suitable for large-scale generation or dynamic simulation. Another drawback of the aforementioned approaches is that rigid binding and deformation schemes can complicate the integration of advanced rendering or physics models, especially when clothing or hair needs to react independently to environment- or character-specific movements.
For example, rigidly bound clothing may stretch unnaturally or remain static, failing to exhibit realistic secondary motion, when a character performs dynamic actions such as jumping or turning. In more extreme cases, rigid binding and deformation schemes can even create artifacts, such as tearing clothing, floating patches of fabric, or stiff, unresponsive strands of hair, all of which diminish the visual realism and physical plausibility of the character's appearance.

As the foregoing illustrates, what is needed in the art are more effective techniques for generating virtual characters.

SUMMARY

According to some embodiments, a computer-implemented method for training machine learning models to generate objects comprises performing one or more operations based on object data to train an untrained machine learning model to generate a trained machine learning model comprising a trained encoder and a trained decoder, wherein the trained machine learning model is trained to generate an object surface representation. The method further comprises performing one or more operations based on the object data and natural language data to train an untrained diffusion model to generate a trained diffusion model, wherein the trained diffusion model is trained to generate an object geometry embedding. The trained diffusion model and the trained decoder are used to generate a virtual object based on natural language input. According to some embodiments, a computer-implemented method for generating a virtual object comprises processing a language embedding associated with a natural language descript