US-20260126804-A1 - BIPEDAL ACTION MODEL FOR HUMANOID ROBOT
Abstract
The present disclosure provides a humanoid robot system comprising a mechanical structure with at least 30 degrees of freedom across torso, arms, and legs, actuators driving the degrees of freedom, sensors including cameras and proprioceptive sensors, and a computing system implementing a hierarchical bipedal action model (BAM). The BAM includes: a Delta model processing sensor data and user input to generate latent representations at a first frequency; a Gamma model receiving latent representations to generate human task actions at a higher second frequency; a Beta model translating task actions into joint configurations at a higher third frequency; and an Alpha model converting joint configurations into actuator control signals at a higher fourth frequency.
Inventors
- Corey Lynch
- Toki Migimatsu
- Michael Ahn
Assignees
- FIGURE AI INC.
Dates
- Publication Date
- 20260507
- Application Date
- 20251020
Claims (20)
- 1 . A humanoid robot system for whole-body continuous control, comprising: a mechanical structure providing at least 30 degrees of freedom across a torso, arms, and legs; a plurality of actuators configured to drive the degrees of freedom; a sensor suite comprising at least one camera and proprioceptive sensors; and a distributed computing system implementing a hierarchical bipedal action model (BAM) to decouple high-level reasoning from low-level reactive control, the distributed computing system including: a remote computing system configured to execute a Delta model that processes robot sensor data and user input to generate latent space representations, the Delta model operating at a first frequency; a computing system configured to execute a Gamma model that receives the latent space representations and generates human task space actions, the Gamma model operating at a second frequency higher than the first frequency; and an onboard computing system integrated into the humanoid robot, the onboard computing system configured to execute: a Beta model that translates the human task space actions into robot joint space configurations, the Beta model operating at a third frequency higher than the second frequency; and an Alpha model that converts the robot joint space configurations into continuous control signals for the plurality of actuators to effectuate whole-body motion, the Alpha model operating at a fourth frequency higher than the third frequency.
- 2 . The humanoid robot system of claim 1 , wherein the human task space actions generated by the Gamma model are defined in a generalized, robot-agnostic coordinate system derived from human demonstration data.
- 3 . The humanoid robot system of claim 1 , wherein the Beta model is configured to generate the robot joint space configurations as sequential overlapping action chunks, each chunk defining a trajectory over a future time horizon.
- 4 . The humanoid robot system of claim 3 , wherein a subsequent action chunk is predicted asynchronously while a current chunk is executing, and wherein a suffix of the currently executing chunk is used to condition the generation of the subsequent chunk to ensure a smooth transition.
- 5 . The humanoid robot system of claim 1 , wherein the Beta model is trained using reinforcement learning to refine motions to be dynamically feasible and self-collision-free for the mechanical structure.
- 6 . The humanoid robot system of claim 1 , wherein the onboard computing system is further configured to perform a safety validation process on the continuous control signals, the process including kinematic-limit checks and self-collision checks, prior to execution by the plurality of actuators.
- 7 . The humanoid robot system of claim 1 , wherein the Gamma model is trained on a dataset comprising virtual reality demonstration data.
- 8 . The humanoid robot system of claim 1 , wherein the robot joint space configurations comprise target joint positions for the plurality of actuators.
- 9 . The humanoid robot system of claim 1 , wherein the Gamma model executes on a local edge server or docking station in wireless communication with the humanoid robot, the Beta model and the Alpha model execute on-board the humanoid robot, and the Delta model executes remotely, to reduce control-loop latency while preserving high-level reasoning capability.
- 10 . The humanoid robot system of claim 1 , wherein the BAM is reusable across different robot embodiments by sharing the Delta and Gamma models across platforms and retraining only the Beta and Alpha models to account for embodiment-specific kinematics.
- 11 . The humanoid robot system of claim 1 , wherein each action chunk specifies per-degree-of-freedom targets for at least one of position, velocity, and torque over a horizon of 0.1 to 150 seconds.
- 12 . A method for controlling a humanoid robot using a hierarchical bipedal action model (BAM), the method comprising: receiving robot sensor data from a sensor suite of the humanoid robot and a user input; on a remote computing system, processing the robot sensor data and user input with a Delta model to generate a continuous latent vector, the Delta model operating at a first frequency; generating human task space actions from the continuous latent vector using a Gamma model operating at a second frequency higher than the first frequency; onboard the humanoid robot, translating the human task space actions into robot joint space configurations using a Beta model operating at a third frequency higher than the second frequency; onboard the humanoid robot, converting the robot joint space configurations into continuous control signals using an Alpha model operating at a fourth frequency higher than the third frequency; and executing the continuous control signals to control a plurality of actuators of the humanoid robot having at least 30 degrees of freedom across a torso, arms, and legs.
- 13 . The method of claim 12 , wherein translating the human task space actions comprises generating sequential overlapping action chunks of whole-body control conditioned on the human-task-space actions and a current robot state.
- 14 . The method of claim 13 , further comprising fusing predictions from overlapping action chunks for the same future timestep using a recency-weighted averaging method before execution.
- 15 . The method of claim 12 , further comprising: performing a safety validation on the continuous control signals, wherein the validation includes kinematic-limit and collision checks; and in response to the validation failing, repeating the steps of generating, translating, and converting to determine a new, valid set of continuous control signals.
- 16 . The method of claim 12 , wherein the Gamma model is trained on a dataset of human demonstration data using a diffusion-style or flow-matching denoising objective.
- 17 . The method of claim 12 , wherein the Beta model is trained using reinforcement learning with a reward function that penalizes dynamic infeasibility and self-collisions.
- 18 . The method of claim 12 , wherein processing with the Delta model comprises temporally aligning language and image embeddings with proprioceptive histories using a cross-attention mechanism.
- 19 . The method of claim 12 , wherein fewer than all degrees of freedom are specified by the robot joint space configurations, and wherein missing degrees of freedom are synthesized by a whole-body controller using inverse kinematics.
- 20 . The method of claim 12 , wherein generating human task space actions using the Gamma model is performed on a local server in communication with the humanoid robot and the remote computing system.
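Claims 3, 4, 13, and 14 recite overlapping joint-space action chunks, asynchronous prediction conditioned on a suffix of the currently executing chunk, and recency-weighted fusion of overlapping predictions for the same timestep. The following is a minimal sketch of one way such fusion could be realized; the chunk layout, suffix length, decay constant, and function names are illustrative assumptions, not the disclosed implementation.

```python
import numpy as np

def fuse_overlapping_chunks(chunks, chunk_start_steps, timestep, decay=0.9):
    """Fuse predictions from overlapping action chunks for one future timestep.

    Each chunk is an array of shape (horizon, num_dof) of joint targets, and
    chunk_start_steps[i] is the absolute timestep at which chunk i begins.
    More recently generated chunks receive exponentially larger weights
    (a recency-weighted average, per claim 14). All constants are illustrative.
    """
    predictions, weights = [], []
    for age, (chunk, start) in enumerate(reversed(list(zip(chunks, chunk_start_steps)))):
        offset = timestep - start
        if 0 <= offset < len(chunk):          # this chunk covers the timestep
            predictions.append(chunk[offset])
            weights.append(decay ** age)      # newest chunk has age 0
    if not predictions:
        raise ValueError("no chunk covers the requested timestep")
    w = np.asarray(weights)[:, None]
    return (w * np.asarray(predictions)).sum(axis=0) / w.sum()

def condition_on_suffix(executing_chunk, suffix_len=8):
    """Return the trailing suffix of the executing chunk (claim 4); a Beta
    model would consume it as conditioning when generating the next chunk."""
    return executing_chunk[-suffix_len:]
```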
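Claims 6 and 15 describe a safety validation process with kinematic-limit and self-collision checks, with the pipeline re-run when validation fails. A minimal sketch of such a gate follows, assuming simple per-joint position and velocity limits and a caller-supplied collision predicate; both hooks and all limit values are hypothetical.

```python
import numpy as np

def validate_control_signals(q_targets, q_prev, dt, q_min, q_max, qd_max,
                             in_self_collision):
    """Kinematic-limit and self-collision checks (claims 6 and 15).

    q_targets: (num_dof,) commanded joint positions for the next control tick.
    in_self_collision: caller-supplied predicate returning True if any link
    pair would interpenetrate at the given configuration.
    Returns True if the command is safe to execute.
    """
    if np.any(q_targets < q_min) or np.any(q_targets > q_max):
        return False                      # joint position limit violated
    if np.any(np.abs(q_targets - q_prev) / dt > qd_max):
        return False                      # implied joint velocity too high
    return not in_self_collision(q_targets)

def safe_step(generate_command, validate, max_retries=3):
    """Repeat the generate/translate/convert steps until a command passes
    validation (claim 15), up to a bounded number of retries."""
    for _ in range(max_retries):
        cmd = generate_command()
        if validate(cmd):
            return cmd
    raise RuntimeError("no valid control signal found; hold safe posture")
```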
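Claim 16 recites training the Gamma model with a diffusion-style or flow-matching denoising objective. As one illustrative instance, a standard flow-matching loss over human task-space action trajectories could be written as below; the model interface, tensor shapes, and the linear interpolation path are assumptions, and the disclosure does not limit the objective to this form.

```python
import torch

def flow_matching_loss(model, actions, context):
    """Flow-matching denoising objective for a Gamma-style model (claim 16).

    actions: (batch, horizon, action_dim) human task-space demonstrations.
    The model predicts the velocity field carrying noise to data; under a
    linear interpolation path the target velocity is (actions - noise).
    """
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1, 1, device=actions.device)
    x_t = (1 - t) * noise + t * actions        # linear interpolation path
    v_target = actions - noise                 # constant target velocity
    v_pred = model(x_t, t.squeeze(-1).squeeze(-1), context)
    return torch.nn.functional.mse_loss(v_pred, v_target)
```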
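Claim 18 recites temporally aligning language and image embeddings with proprioceptive histories using a cross-attention mechanism. One minimal sketch, with the proprioceptive history as the query stream and language/image tokens as keys and values, is shown below; the module name, dimensions, and residual fusion are invented for the example.

```python
import torch
import torch.nn as nn

class TemporalAligner(nn.Module):
    """Cross-attention letting a proprioceptive history query language and
    image embeddings (claim 18). Dimensions are illustrative only."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, proprio_history, lang_img_tokens):
        # proprio_history: (batch, T_proprio, d_model) joint/IMU history
        # lang_img_tokens: (batch, T_ctx, d_model) language + image embeddings
        aligned, _ = self.attn(query=proprio_history,
                               key=lang_img_tokens,
                               value=lang_img_tokens)
        return self.norm(proprio_history + aligned)  # residual fusion
```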
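Claim 19 recites synthesizing unspecified degrees of freedom with a whole-body controller using inverse kinematics. A damped least-squares sketch follows, iterating only the free joints while pinning the commanded ones; the jacobian and task_error hooks, gains, and tolerances are all hypothetical.

```python
import numpy as np

def synthesize_missing_dofs(q_spec, specified, q_init, jacobian, task_error,
                            damping=0.05, iters=50, alpha=0.5):
    """Fill in unspecified joints with damped least-squares IK (claim 19).

    q_spec: (n,) targets, valid only where `specified` is True.
    jacobian(q) -> (m, n) task Jacobian; task_error(q) -> (m,) residual.
    Both are caller-supplied hooks. Specified joints stay pinned to their
    commanded values; only the free joints are iterated.
    """
    q = q_init.copy()
    q[specified] = q_spec[specified]
    free = ~specified
    for _ in range(iters):
        J = jacobian(q)[:, free]               # restrict to free joints
        e = task_error(q)
        # Damped least-squares step: dq = J^T (J J^T + lambda^2 I)^-1 e
        dq = J.T @ np.linalg.solve(J @ J.T + damping**2 * np.eye(len(e)), e)
        q[free] -= alpha * dq
        if np.linalg.norm(e) < 1e-4:
            break
    return q
```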
Description
CROSS-REFERENCE TO RELATED APPLICATIONS

This application is: (i) a continuation-in-part of U.S. patent application Ser. No. 19/325,486, filed Oct. 2, 2025, and (ii) claims the benefit of and priority to U.S. Provisional Patent Application Nos. 63/715,270, filed Nov. 1, 2024; 63/722,057, filed Nov. 18, 2024; 63/725,279, filed Nov. 26, 2024; 63/760,617, filed Feb. 19, 2025; 63/776,429, filed Mar. 24, 2025; 63/819,533, filed Jun. 6, 2025; and 63/883,647, filed Sep. 17, 2025, each of which is fully incorporated herein by reference.

TECHNICAL FIELD

This disclosure relates to system designs, architectures, methods, and techniques for training and using a hierarchical bipedal action model (BAM) to control a humanoid robot.

BACKGROUND

The field of robotics has long pursued the goal of creating humanoid robots capable of performing complex tasks in unstructured, human-centric environments. A significant challenge in this pursuit is the development of control systems that can manage the vast number of degrees of freedom (DoF) inherent in a humanoid form.

Conventional robotic control systems have traditionally been limited in scope and capability. Many existing models are narrowly focused, designed to control only a specific part of the robot, such as a 7-DoF end-effector or arm. This approach effectively treats the robot as a disembodied limb, failing to coordinate the entire body. As a result, such systems cannot perform actions that require dynamic balance, postural adjustments, or use of the torso and legs to extend reach and navigate obstacles. The movements produced are often rigid and limited to a constrained set of pre-programmed motions.

Furthermore, a common deficiency of conventional systems is their reliance on generating discrete, or "binned," action outputs. This method breaks continuous motion into a finite set of poses or commands. The result is often jerky, imprecise, and unnatural movement, akin to a video with a low frame rate. This discretization introduces compounding errors over time, causing the robot to deviate from its intended path and struggle with tasks requiring fluid, continuous adjustments. Such systems lack the temporal consistency needed for smooth, long-horizon tasks and are not robust enough to adapt to the unpredictable nature of real-world environments.

Therefore, a significant need exists for a more advanced control architecture that overcomes these fundamental limitations: a system that provides comprehensive, whole-body control over a high-degree-of-freedom humanoid robot and generates continuous, real-time control outputs to produce fluid, human-like motion, thereby enabling more effective and reliable performance in complex, dynamic settings.

SUMMARY

The presently disclosed subject matter is directed to a humanoid robot system. Particularly, the system comprises a mechanical structure providing at least 30 degrees of freedom across a torso, arms, and legs. The system comprises a plurality of actuators configured to drive the degrees of freedom. The system comprises a sensor suite comprising at least one camera and proprioceptive sensors.
The system comprises a computing system comprising at least one processor and memory storing instructions which, when executed, cause the computing system to implement a hierarchical bipedal action model (BAM) including: a Delta model configured to process robot sensor data and user input to generate latent space representations, the Delta model operating at a first frequency; a Gamma model configured to receive the latent space representations and generate human task space actions, the Gamma model operating at a second frequency higher than the first frequency; a Beta model configured to translate the human task space actions into robot joint space configurations, the Beta model operating at a third frequency higher than the second frequency; and an Alpha model configured to convert the robot joint space configurations into continuous control signals for the plurality of actuators, the Alpha model operating at a fourth frequency higher than the third frequency.

The presently disclosed subject matter is also directed to a method for controlling a humanoid robot using a hierarchical bipedal action model (BAM). Particularly, the method comprises receiving robot sensor data from a sensor suite of the humanoid robot and user input. The method comprises processing the robot sensor data and user input with a Delta model to generate latent space representations, the Delta model operating at a first frequency. The method comprises generating human task space actions from the latent space representations using a Gamma model operating at a second frequency higher than the first frequency. The method comprises translating the human task space actions into robot joint space configurations using a Beta model operating at a third frequency higher than the second frequency. The method comprises converting the robot joint space configurations into continuous control signals for the plurality of actuators using an Alpha model operating at a fourth frequency higher than the third frequency.
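As a concrete illustration of the four-rate hierarchy recited in claims 1 and 12 and restated in this Summary, the following sketch schedules each model at a nested, higher frequency. The specific rates (1 Hz / 10 Hz / 50 Hz / 200 Hz), the single-threaded latching scheme, and the robot and model interfaces are assumptions for illustration only; the claims require only that each successive model run at a higher frequency than the one above it.

```python
import time

# Illustrative rates only; each layer runs faster than the layer above it.
DELTA_HZ, GAMMA_HZ, BETA_HZ, ALPHA_HZ = 1, 10, 50, 200

def run_bam(delta, gamma, beta, alpha, robot, duration_s=10.0):
    """Single-threaded sketch of the Delta -> Gamma -> Beta -> Alpha hierarchy.

    Each model's output is latched and reused by the faster layer below it
    until the slower layer produces a fresh update (claims 1 and 12).
    """
    latent = task_actions = joint_cfg = None
    t0 = time.monotonic()
    tick = 0
    while time.monotonic() - t0 < duration_s:
        if tick % (ALPHA_HZ // DELTA_HZ) == 0:     # slowest: reasoning
            latent = delta(robot.sensor_data(), robot.user_input())
        if tick % (ALPHA_HZ // GAMMA_HZ) == 0:     # task-space actions
            task_actions = gamma(latent)
        if tick % (ALPHA_HZ // BETA_HZ) == 0:      # joint-space configurations
            joint_cfg = beta(task_actions, robot.state())
        control = alpha(joint_cfg, robot.state())  # fastest: actuator loop
        robot.apply(control)
        tick += 1
        time.sleep(1.0 / ALPHA_HZ)
```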