US-20260127804-A1 - AUTOMATED CREATION OF ANIMATION CONTROLLERS

Abstract

Some implementations relate to methods, systems, and computer readable media for automated creation of animation controllers. According to one aspect, a computer-implemented method includes obtaining one or more motion descriptions comprising natural-language prompts describing motions of an avatar and encoding the descriptions into a motion embedding vector. A noisy motion representation comprising sparse keyframes representing skeletal poses is generated and iteratively refined using a pre-trained diffusion model over a plurality of timesteps. At each timestep, a denoised motion representation is estimated, a keyframe mask is dynamically updated, and an updated motion representation is obtained. A final motion representation corresponding to a last timestep is used to generate one or more motion clips defining avatar movements of the avatar, which are assembled into an animation controller represented by a motion graph specifying transitions between animation states.

Inventors

  • Jinseok Bae
  • Mubbasir Turab Kapadia
  • Young Yoon Lee
  • Joseph Liu

Assignees

  • ROBLOX CORPORATION

Dates

Publication Date
2026-05-07
Application Date
2025-11-05

Claims (20)

  1. A computer-implemented method comprising:
     obtaining one or more motion descriptions comprising one or more natural-language prompts describing avatar motions;
     encoding the one or more motion descriptions into a motion embedding vector;
     generating a noisy motion representation based on the motion embedding vector, wherein the noisy motion representation comprises a sparse set of keyframes representing skeletal poses of an avatar over time;
     iteratively refining the noisy motion representation using a pre-trained diffusion model over a plurality of timesteps, wherein the iterative refinement comprises, at each timestep:
       estimating, by the pre-trained diffusion model configured to represent motion sequences using sparse keyframes, a denoised motion representation based on the noisy motion representation and the current timestep;
       dynamically updating a keyframe mask that identifies salient keyframes of the denoised motion representation based on variation in joint position or motion velocity over time;
       obtaining an updated motion representation as the noisy motion representation based on the denoised motion representation and a motion representation obtained at a previous timestep; and
       updating the timestep;
     obtaining a final motion representation corresponding to a last timestep of the iterative refinement;
     generating, based on the final motion representation, one or more motion clips defining avatar movements of the avatar; and
     assembling the one or more motion clips into an animation controller represented by a motion graph specifying one or more transitions between animation states. (An illustrative sketch of this refinement loop appears after the claims.)
  2. The computer-implemented method of claim 1, wherein dynamically updating the keyframe mask comprises weighting keyframes based on magnitude of joint displacement or temporal motion energy.
  3. The computer-implemented method of claim 1, wherein the pre-trained diffusion model comprises a neural network trained to jointly process sparse keyframes and corresponding temporal indices to maintain timing consistency across motion sequences.
  4. The computer-implemented method of claim 1, wherein the noisy motion representation and the denoised motion representation each comprise skeletal pose data expressed as three-dimensional (3D) joint coordinates for a plurality of joints of the avatar.
  5. The computer-implemented method of claim 1, further comprising: generating, using at least one secondary pre-trained diffusion model, one or more additional motion clips that are temporally aligned with the one or more motion clips generated by the pre-trained diffusion model to form a combined set of motion clips.
  6. The computer-implemented method of claim 5, further comprising: generating, based on identified transition points between the combined set of motion clips, one or more intermediate motion frames using one or more interpolation techniques.
  7. The computer-implemented method of claim 6, further comprising: synchronizing the combined set of motion clips using the one or more intermediate motion frames to obtain synchronized motion clips; and assembling the synchronized motion clips into the animation controller represented by the motion graph specifying one or more transitions between animation states. (A sketch of transition blending and motion-graph assembly appears after the claims.)
  8. The computer-implemented method of claim 1, wherein the pre-trained diffusion model is regularized using a Lipschitz-constrained loss to maintain bounded continuity of interpolated joint positions across two or more timesteps of the plurality of timesteps. (A sketch of such a regularizer appears after the claims.)
  9. The computer-implemented method of claim 1, further comprising: augmenting a motion dataset used to train the pre-trained diffusion model by automatically assigning labels to unlabeled motion sequences with natural-language descriptions; and refining the labels using a language model.
  10. The computer-implemented method of claim 9, wherein augmenting the motion dataset further comprises generating a plurality of varied motion sequences by procedurally modifying the unlabeled motion sequences to create additional sequences having variations in motion parameters. (A sketch of such augmentation appears after the claims.)
  11. A computing device comprising:
     one or more processors; and
     memory coupled to the one or more processors with instructions stored thereon that, when executed by the one or more processors, cause the one or more processors to perform operations comprising:
       obtaining one or more motion descriptions comprising one or more natural-language prompts describing avatar motions;
       encoding the one or more motion descriptions into a motion embedding vector;
       generating a noisy motion representation based on the motion embedding vector, wherein the noisy motion representation comprises a sparse set of keyframes representing skeletal poses of an avatar over time;
       iteratively refining the noisy motion representation using a pre-trained diffusion model over a plurality of timesteps, wherein the iterative refinement comprises, at each timestep:
         estimating, by the pre-trained diffusion model configured to represent motion sequences using sparse keyframes, a denoised motion representation based on the noisy motion representation and the current timestep;
         dynamically updating a keyframe mask that identifies salient keyframes of the denoised motion representation based on variation in joint position or motion velocity over time;
         obtaining an updated motion representation as the noisy motion representation based on the denoised motion representation and a motion representation obtained at a previous timestep; and
         updating the timestep;
       obtaining a final motion representation corresponding to a last timestep of the iterative refinement;
       generating, based on the final motion representation, one or more motion clips defining avatar movements of the avatar; and
       assembling the one or more motion clips into an animation controller represented by a motion graph specifying one or more transitions between animation states.
  12. The computing device of claim 11, wherein dynamically updating the keyframe mask comprises weighting keyframes based on magnitude of joint displacement or temporal motion energy.
  13. The computing device of claim 11, wherein the pre-trained diffusion model comprises a neural network trained to jointly process sparse keyframes and corresponding temporal indices to maintain timing consistency across motion sequences.
  14. The computing device of claim 11, wherein the noisy motion representation and the denoised motion representation each comprise skeletal pose data expressed as three-dimensional (3D) joint coordinates for a plurality of joints of the avatar.
  15. The computing device of claim 11, wherein the instructions cause the one or more processors to perform a further operation comprising: generating, using at least one secondary pre-trained diffusion model, one or more additional motion clips that are temporally aligned with the one or more motion clips generated by the pre-trained diffusion model to form a combined set of motion clips.
  16. A non-transitory computer-readable medium with instructions stored thereon that, when executed by a processor, cause the processor to perform operations comprising:
     obtaining one or more motion descriptions comprising one or more natural-language prompts describing avatar motions;
     encoding the one or more motion descriptions into a motion embedding vector;
     generating a noisy motion representation based on the motion embedding vector, wherein the noisy motion representation comprises a sparse set of keyframes representing skeletal poses of an avatar over time;
     iteratively refining the noisy motion representation using a pre-trained diffusion model over a plurality of timesteps, wherein the iterative refinement comprises, at each timestep:
       estimating, by the pre-trained diffusion model configured to represent motion sequences using sparse keyframes, a denoised motion representation based on the noisy motion representation and the current timestep;
       dynamically updating a keyframe mask that identifies salient keyframes of the denoised motion representation based on variation in joint position or motion velocity over time;
       obtaining an updated motion representation as the noisy motion representation based on the denoised motion representation and a motion representation obtained at a previous timestep; and
       updating the timestep;
     obtaining a final motion representation corresponding to a last timestep of the iterative refinement;
     generating, based on the final motion representation, one or more motion clips defining avatar movements of the avatar; and
     assembling the one or more motion clips into an animation controller represented by a motion graph specifying one or more transitions between animation states.
  17. The non-transitory computer-readable medium of claim 16, wherein dynamically updating the keyframe mask comprises weighting keyframes based on magnitude of joint displacement or temporal motion energy.
  18. The non-transitory computer-readable medium of claim 16, wherein the pre-trained diffusion model comprises a neural network trained to jointly process sparse keyframes and corresponding temporal indices to maintain timing consistency across motion sequences.
  19. The non-transitory computer-readable medium of claim 16, wherein the noisy motion representation and the denoised motion representation each comprise skeletal pose data expressed as three-dimensional (3D) joint coordinates for a plurality of joints of the avatar.
  20. The non-transitory computer-readable medium of claim 16, wherein the instructions cause the processor to perform a further operation comprising: generating, using at least one secondary pre-trained diffusion model, one or more additional motion clips that are temporally aligned with the one or more motion clips generated by the pre-trained diffusion model to form a combined set of motion clips.
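
For illustration, the following is a minimal NumPy sketch of the refinement loop recited in claims 1-2: a DDPM-style reverse process that estimates a denoised motion from sparse keyframes, re-derives a keyframe mask from per-frame motion velocity at each timestep, and combines the estimate with the representation from the previous timestep. The toy model, linear noise schedule, and median-based mask rule are assumptions for the sketch; the claims do not fix these details.

```python
import numpy as np

rng = np.random.default_rng(0)

K, J = 16, 24                                  # sparse keyframes, skeleton joints
T = 50                                         # diffusion timesteps
betas = np.linspace(1e-4, 0.02, T)             # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def denoise_model(x_t, t, cond):
    """Stand-in for the pre-trained diffusion model: predicts the clean
    motion x_0 from the noisy keyframes x_t, the timestep t, and the
    motion embedding `cond`. A real model would be a trained network."""
    return 0.9 * x_t + 0.1 * cond              # toy estimate: shrink toward cond

def update_keyframe_mask(x0_hat):
    """Dynamically flag salient keyframes by per-frame motion velocity
    (claim 2's 'temporal motion energy'), keeping high-variation frames."""
    vel = np.linalg.norm(np.diff(x0_hat, axis=0), axis=(1, 2))   # (K-1,)
    vel = np.concatenate([[vel[0]], vel])                        # pad to K
    return vel > np.median(vel)                                  # boolean (K,)

def refine(cond):
    x_t = rng.standard_normal((K, J, 3))       # noisy motion representation
    for t in reversed(range(T)):
        x0_hat = denoise_model(x_t, t, cond)   # estimate denoised motion
        mask = update_keyframe_mask(x0_hat)    # dynamic keyframe mask
        if t > 0:
            # DDPM posterior mean: combine the estimate with the previous x_t.
            coef0 = np.sqrt(alpha_bar[t-1]) * betas[t] / (1 - alpha_bar[t])
            coeft = np.sqrt(alphas[t]) * (1 - alpha_bar[t-1]) / (1 - alpha_bar[t])
            mean = coef0 * x0_hat + coeft * x_t
            x_t = mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
            x_t[~mask] = mean[~mask]           # pin non-salient frames to the mean
        else:
            x_t = x0_hat                       # final motion representation
    return x_t

motion_embedding = rng.standard_normal((K, J, 3)) * 0.1   # stand-in text embedding
final_motion = refine(motion_embedding)
print(final_motion.shape)                      # (16, 24, 3): keyframes x joints x xyz
```

In this sketch the mask only modulates noise injection (non-salient keyframes are pinned to the posterior mean); a trained system could equally use it to select which keyframes are retained in the sparse representation.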
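
A hedged sketch of the transition handling of claims 6-7: intermediate frames are interpolated between the boundary poses of two clips, and the clips are assembled into a motion graph whose edges carry the blended transitions. The dictionary layout and linear interpolation are illustrative choices only; a production system might blend joint rotations with slerp instead of lerping positions.

```python
import numpy as np

def blend_transition(clip_a, clip_b, n_frames=5):
    """Linearly interpolate intermediate frames between the last pose of
    clip_a and the first pose of clip_b (one possible 'interpolation
    technique' in the sense of claim 6)."""
    a, b = clip_a[-1], clip_b[0]                       # (J, 3) boundary poses
    ts = np.linspace(0.0, 1.0, n_frames + 2)[1:-1]     # exclude the endpoints
    return np.stack([(1 - t) * a + t * b for t in ts])

def build_motion_graph(clips):
    """Assemble clips into a motion graph: nodes are animation states,
    edges carry the blended intermediate frames for each transition."""
    graph = {name: {"frames": f, "edges": {}} for name, f in clips.items()}
    for src in clips:
        for dst in clips:
            if src != dst:
                graph[src]["edges"][dst] = blend_transition(clips[src], clips[dst])
    return graph

J = 24
rng = np.random.default_rng(1)
clips = {name: rng.standard_normal((30, J, 3)) for name in ("idle", "walk", "run")}
controller = build_motion_graph(clips)
print(sorted(controller["walk"]["edges"]))             # ['idle', 'run']
```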
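
Claim 8's Lipschitz-constrained loss could take many forms; the following sketch penalizes frame-to-frame joint displacements that exceed a bound L, which keeps interpolated joint trajectories boundedly continuous. The hinge form and the constant L are assumptions, not taken from the patent.

```python
import numpy as np

def lipschitz_loss(motion, L=0.1):
    """motion: (frames, joints, 3). Penalizes per-joint displacements
    ||x_{k+1} - x_k|| that exceed the Lipschitz bound L."""
    disp = np.linalg.norm(np.diff(motion, axis=0), axis=-1)  # (frames-1, joints)
    return np.maximum(disp - L, 0.0).mean()

# A random-walk trajectory as stand-in training output.
motion = np.cumsum(np.random.default_rng(2).normal(0, 0.05, (16, 24, 3)), axis=0)
print(f"regularization term: {lipschitz_loss(motion):.4f}")
```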
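
A sketch of the dataset augmentation of claims 9-10: unlabeled sequences are procedurally varied (mirroring, time scaling) and paired with draft natural-language labels, with a stand-in hook where a language model would refine the labels. All function names here are hypothetical; the patent leaves the language model and the augmentation operations unspecified.

```python
import numpy as np

rng = np.random.default_rng(3)

def augment(seq):
    """Yield procedurally varied copies of a (frames, joints, 3) sequence:
    mirrored left/right and time-scaled (claim 10's 'variations in motion
    parameters')."""
    mirrored = seq * np.array([-1.0, 1.0, 1.0])        # flip the x axis
    slowed = np.repeat(seq, 2, axis=0)                 # half playback speed
    return [seq, mirrored, slowed]

def refine_with_lm(draft_label):
    """Hypothetical hook where a language model would rewrite the draft
    label; here it merely normalizes whitespace and casing."""
    return draft_label.strip().lower()

unlabeled = [rng.standard_normal((20, 24, 3))]
dataset = [(variant, refine_with_lm("A character Walks Forward "))
           for seq in unlabeled for variant in augment(seq)]
print(len(dataset), dataset[0][1])                     # 3 'a character walks forward'
```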

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Provisional Patent Application No. 63/716,641, filed Nov. 5, 2024, and titled "AUTOMATED CREATION OF ANIMATION CONTROLLERS," the entire contents of which are incorporated by reference herein.

TECHNICAL FIELD

Various implementations described herein relate generally to computer-generated animation, and more particularly, but not exclusively, to methods, systems, and computer-readable media for automating the creation of animation controllers.

BACKGROUND

The development of computer-generated animation systems, which can be used in interactive virtual experiences such as gaming experiences, has introduced numerous challenges in the creation and management of avatar animations. Responsive and temporally consistent animations are important for maintaining immersion in real-time or near-real-time applications, where user commands must be reflected promptly and accurately in avatar movements.

Traditionally, creating an animation controller (a framework that coordinates and blends multiple animation clips based on user input or automated behavior) requires extensive manual configuration: generating, editing, and aligning large sets of animation clips to achieve realistic transitions between motion states such as walking, running, or jumping. Current techniques for building animation controllers rely on sequential manual steps, including collecting motion data, editing animation clips to satisfy physical or logical constraints (such as loop continuity or ground-contact alignment), and synchronizing transitions between clips. These procedures must also account for real-time blending and runtime variability in user input or predefined system behavior. As the number of animation states increases, the associated control logic and synchronization complexity grow exponentially, making the construction of animation controllers resource-intensive and prone to configuration errors.

Generative artificial intelligence (AI) models, including those developed for motion synthesis, have been applied to automate portions of animation creation. However, such generative AI models lack sufficient control and consistency for production use. Outputs may include visual defects such as unstable joint motion, foot sliding, or discontinuities across frames. Training datasets can be limited in scope and diversity, restricting model generalization to the broad range of motions in interactive environments. Some generative motion models depend on dense temporal sampling of pose data, which can obscure the key poses that define a motion sequence. The dense representation can reduce temporal interpretability and make it difficult to generate transitions that align properly across multiple animation clips.

The background description provided herein is for the purpose of presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, is neither expressly nor impliedly admitted as prior art against the present disclosure.

SUMMARY

Various implementations described herein relate to methods, systems, and computer-readable media to automate the creation of animation controllers. According to one aspect, a computer-implemented method includes obtaining one or more motion descriptions including one or more natural-language prompts describing avatar motions.
The computer-implemented method further includes encoding the one or more motion descriptions into a motion embedding vector. The computer-implemented method further includes generating a noisy motion representation based on the motion embedding vector, where the noisy motion representation includes a sparse set of keyframes representing skeletal poses of an avatar over time. The computer-implemented method further includes iteratively refining the noisy motion representation using a pre-trained diffusion model over a number of timesteps, where the iterative refinement includes, at each timestep: estimating, by the pre-trained diffusion model configured to represent motion sequences using sparse keyframes, a denoised motion representation based on the noisy motion representation and the current timestep; dynamically updating a keyframe mask that identifies salient keyframes of the denoised motion representation based on variation in joint position or motion velocity over time; obtaining an updated motion representation as the noisy motion representation based on the denoised motion representation and a motion representation obtained at a previous timestep; and updating the timestep. The computer-implemented method further includes obtaining a final motion representation corresponding to a last timestep of the iterative refinement. The computer-implemented method further includes generating, based on the final motion representation, one or more motion clips defining avatar movements of the avatar. The computer-implemented method further includes assembling the one or more motion clips into an animation controller represented by a motion graph specifying one or more transitions between animation states.
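
To make the summarized flow concrete, the following is a self-contained, stubbed sketch of the end-to-end pipeline (prompt to embedding to refined keyframes to clips to controller). Every function is a hypothetical stand-in for the components sketched after the claims, not the patent's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(4)

def encode_prompt(prompt: str) -> np.ndarray:
    """Stand-in text encoder; a real system would embed the prompt with a
    trained language/motion encoder."""
    return rng.standard_normal((16, 24, 3)) * 0.1       # keyframes x joints x xyz

def refine(cond: np.ndarray) -> np.ndarray:
    """Stand-in for the iterative diffusion refinement of claim 1."""
    return cond + 0.01 * rng.standard_normal(cond.shape)

def to_clip(motion: np.ndarray) -> np.ndarray:
    """Densify sparse keyframes into playable frames by repetition; a real
    system would interpolate between keyframes."""
    return np.repeat(motion, 4, axis=0)

def assemble_controller(clips: dict) -> dict:
    """Minimal animation controller: a motion graph with named states."""
    return {name: {"frames": frames, "edges": {}} for name, frames in clips.items()}

keyframes = refine(encode_prompt("an avatar waves, then walks"))
controller = assemble_controller({"wave_walk": to_clip(keyframes)})
print(list(controller), controller["wave_walk"]["frames"].shape)  # ['wave_walk'] (64, 24, 3)
```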