
US-20260126805-A1 - BIPEDAL ACTION MODEL FOR HUMANOID ROBOT

US 20260126805 A1

Abstract

The present disclosure provides a humanoid robot system comprising: a mechanical structure including a torso, two arms, and two legs providing at least 30 degrees of freedom; actuators coupled to the degrees of freedom; a sensor suite comprising at least one camera and proprioceptive sensors including joint encoders and an inertial measurement unit; a computing system comprising at least one processor and memory storing instructions which, when executed, implement a hierarchical bipedal action model including a Beta model configured to receive multimodal input data and generate a token sequence indicative of task intent and environmental state, and an Alpha model configured to condition on the token sequence and current robot pose data to output continuous action chunks comprising sequences of future target joint states over a finite horizon; and a low-level controller configured to convert the continuous action chunks into actuator control signals for execution.
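The abstract describes a two-rate hierarchy: a slow Beta model that turns multimodal input into task-intent tokens, and a fast Alpha model that conditions on those tokens and the current pose to emit chunks of future joint targets. A minimal Python sketch of that control flow follows; all class names, shapes, and behaviors here are hypothetical placeholders, since the patent does not specify an implementation.

```python
import numpy as np

# Hypothetical sketch of the two-model hierarchy from the abstract.
# Names, shapes, and behaviors are illustrative, not from the patent.

class BetaModel:
    """Slow model: multimodal input -> token sequence (task intent, environment)."""
    def __call__(self, image, command):
        return np.zeros(64, dtype=np.int64)  # stand-in for a large VLM's tokens

class AlphaModel:
    """Fast model: tokens + current pose -> chunk of future target joint states."""
    def __init__(self, num_joints=30, chunk_len=16):
        self.num_joints, self.chunk_len = num_joints, chunk_len

    def __call__(self, tokens, pose):
        # Stand-in policy: hold the current pose over the chunk horizon.
        return np.tile(pose, (self.chunk_len, 1))

def control_step(beta, alpha, image, command, pose):
    tokens = beta(image, command)   # low-frequency inference in practice
    chunk = alpha(tokens, pose)     # (chunk_len, num_joints) joint targets
    return chunk                    # handed to the low-level controller

beta, alpha = BetaModel(), AlphaModel()
chunk = control_step(beta, alpha, image=None, command="pick up the cup",
                     pose=np.zeros(30))
```

In a real deployment the Beta model would run at a much lower rate than the Alpha model, with the low-level controller consuming each chunk at the actuator rate.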

Inventors

  • Corey Lynch
  • Toki Migimatsu
  • Michael Ahn

Assignees

  • FIGURE AI INC.

Dates

Publication Date
May 7, 2026
Application Date
Oct. 20, 2025

Claims (20)

  1-15. (canceled)
  16. A method for controlling a humanoid robot using a hierarchical bipedal action model (BAM), comprising: obtaining a hierarchical BAM that is generated by: collecting training data comprising: (i) obtaining video data from a third-party database, and (ii) obtaining real-world robot demonstrations; preprocessing the training data to form distinct segments that have natural language descriptions; training, using the preprocessed training data, both a Beta model and an Alpha model; and deploying the trained hierarchical BAM on a humanoid robot; and controlling the humanoid robot to perform an autonomous task using the trained hierarchical BAM.
  17. (canceled)
  18. (canceled)
  19. The method of claim 16, wherein deploying the hierarchical BAM comprises loading the Alpha model on a GPU that is positioned within the humanoid robot.
  20. The method of claim 16, wherein at least one of the Alpha model and the Beta model is a diffusion model.
  21-25. (canceled)
  26. The method of claim 16, wherein the Beta model operates at a frequency that is less than 50 Hz.
  27. The method of claim 26, wherein the Beta model is trained using a cross-entropy loss function and has more than 500 million parameters.
  28. The method of claim 16, wherein the Alpha model operates at a frequency that is greater than 50 Hz.
  29. The method of claim 28, wherein the Alpha model is trained using a regression-based loss function and has fewer than 500 million parameters.
  30. The method of claim 16, wherein the Alpha and Beta models are trained end-to-end, and wherein said training includes allowing error gradients from the output of the Alpha model to be backpropagated through the Beta model.
  31. The method of claim 16, further comprising the step of executing a safety verification that rejects or truncates outputs from the Alpha model that violate a predetermined constraint.
  32. The method of claim 16, wherein the step of collecting training data includes capturing first-person video data synchronized with head and hand positions using a virtual reality or augmented reality headset.
  33. The method of claim 16, wherein the natural language descriptions are generated using an AI model.
  34. The method of claim 16, wherein obtaining a hierarchical BAM includes the step of obtaining at least one pre-trained model having a set of parameters; and wherein training the hierarchical BAM includes using supervised learning to modify the set of parameters based in part on the preprocessed training data.
  35. The method of claim 19, wherein deploying the trained hierarchical BAM includes loading the Beta model on a GPU that is not positioned within the humanoid robot.
  36. A method for controlling a robot using a hierarchical action model, comprising: obtaining a hierarchical action model by: collecting training data comprising obtaining: (i) video data from a third-party database, (ii) first-person video data, and (iii) data from a robot demonstration; training, using the training data, a hierarchical action model that includes an Alpha model and a Beta model, and wherein the Alpha model has a first number of parameters and the Beta model has a second number of parameters that is larger than the first number of parameters; and deploying the trained hierarchical action model on a robot; and controlling the robot to perform an autonomous task using the trained hierarchical action model.
  37. The method of claim 36, wherein controlling the robot includes the step of using a retrieval-augmented generation technique to obtain additional real-time knowledge from external sources.
  38. The method of claim 36, wherein at least one of the Alpha and Beta models is a diffusion model.
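The disclosure's summary mentions fusing overlapping action-chunk predictions by weighted averaging to produce the continuous action stream. A minimal sketch of one such fusion is below; the exponential-decay weighting that favors more recent chunks is an assumption for illustration, as the patent does not specify a weighting scheme.

```python
import numpy as np

# Hypothetical fusion of overlapping action chunks by weighted averaging.
# The decay weighting is an assumed scheme, not taken from the patent.

def fuse_chunks(chunks, starts, t, decay=0.5):
    """Fuse predictions for timestep t from chunks issued at earlier timesteps.

    chunks: list of (chunk_len, num_joints) arrays of target joint states
    starts: timestep at which each chunk's first prediction applies
    """
    preds, weights = [], []
    for chunk, s in zip(chunks, starts):
        idx = t - s
        if 0 <= idx < len(chunk):          # chunk covers timestep t
            preds.append(chunk[idx])
            weights.append(decay ** (t - s))  # newer chunks weigh more
    return np.average(preds, axis=0, weights=weights)

# Two overlapping chunks, each predicting 4 timesteps for 2 joints.
c0 = np.ones((4, 2)) * 1.0   # chunk issued at t=0
c1 = np.ones((4, 2)) * 2.0   # chunk issued at t=2
fused = fuse_chunks([c0, c1], starts=[0, 2], t=3)  # -> [1.8, 1.8]
```

At t=3 the older chunk contributes with weight 0.5^3 = 0.125 and the newer with 0.5^1 = 0.5, so the fused target leans toward the more recent prediction.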

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application (i) is a continuation-in-part of U.S. patent application Ser. No. 19/325,486, filed Oct. 2, 2025, and (ii) claims the benefit of and priority to U.S. Provisional Patent Application Nos. 63/715,270, filed Nov. 1, 2024; 63/722,057, filed Nov. 18, 2024; 63/725,279, filed Nov. 26, 2024; 63/760,617, filed Feb. 19, 2025; 63/776,429, filed Mar. 24, 2025; 63/819,533, filed Jun. 6, 2025; and 63/883,647, filed Sep. 17, 2025, each of which is fully incorporated herein by reference.

TECHNICAL FIELD

This disclosure relates to system designs, architectures, methods, and techniques for training and using a hierarchical bipedal action model (BAM) to control a humanoid robot.

BACKGROUND

The field of robotics has long pursued the goal of creating humanoid robots capable of performing complex tasks in unstructured, human-centric environments. A significant challenge in this pursuit is the development of control systems that can manage the vast number of degrees of freedom (DoF) inherent in a humanoid form.

Conventional robotic control systems have traditionally been limited in scope and capability. Many existing models are narrowly focused, designed to control only a specific part of the robot, such as a 7-DoF end-effector or arm. This approach effectively treats the robot as a disembodied limb, failing to coordinate the entire body. As a result, such systems cannot perform actions that require dynamic balance, postural adjustments, or the use of the torso and legs to extend reach and navigate obstacles. The movements produced are often rigid and limited to a constrained set of pre-programmed motions.

Furthermore, a common deficiency in conventional systems is their reliance on generating discrete, or "binned," action outputs. This method breaks continuous motion into a finite set of poses or commands. The result is often jerky, imprecise, and unnatural movement, akin to a video with a low frame rate. This discretization introduces compounding errors over time, causing the robot to deviate from its intended path and struggle with tasks requiring fluid, continuous adjustments. Such systems lack the temporal consistency needed for smooth, long-horizon tasks and are not robust enough to adapt to the unpredictable nature of real-world environments.

Therefore, a significant need exists for a more advanced control architecture that overcomes these fundamental limitations: a system that provides comprehensive, whole-body control over a high-degree-of-freedom humanoid robot and generates continuous, real-time control outputs to produce fluid, human-like motion, thereby enabling more effective and reliable performance in complex, dynamic settings.

SUMMARY

According to an aspect of the present disclosure, a humanoid robot system is provided. The humanoid robot system comprises a mechanical structure including a torso, two arms, and two legs providing at least 30 degrees of freedom; a plurality of actuators coupled to the degrees of freedom; and a sensor suite comprising at least one camera and one or more proprioceptive sensors including joint encoders and an inertial measurement unit (IMU). The system comprises a computing system comprising at least one processor and memory storing instructions which, when executed, cause the computing system to implement a hierarchical bipedal action model (BAM) including a Beta model configured to receive multimodal input data and to generate a token sequence indicative of task intent and environmental state, and an Alpha model configured to condition on the token sequence and current robot pose data to output continuous action chunks each comprising a sequence of future target joint states over a finite horizon. The system comprises a low-level controller configured to convert the continuous action chunks into actuator control signals for execution by the plurality of actuators.

According to other aspects of the present disclosure, the humanoid robot system may include one or more of the following features. The Beta model may comprise a vision-language model having between 1 billion and 50 billion trainable parameters and operating at 1-10 Hz, and the Alpha model may comprise a transformer-based architecture having between 50 million and 1 billion trainable parameters and operating at 50-350 Hz. Each continuous action chunk may span 8-32 future timesteps over a horizon of 50-200 ms, each timestep specifying a target joint position and at least one of a target velocity, acceleration, or torque. The Alpha model may comprise a sequence model that cross-attends to the task-conditioning representation and is configured to iteratively denoise latent action proposals, and overlapping action chunk predictions may be fused by weighted averaging to produce the continuous action chunks. The multimodal input data may comprise visual frames from the at least one camera, proprioceptive