
US-20260124746-A1 - TRAINING AND USE OF A BIPEDAL ACTION MODEL FOR HUMANOID ROBOT

US 20260124746 A1

Abstract

The present disclosure provides a method for controlling a humanoid robot using a hierarchical bipedal action model (BAM), the method comprising obtaining a base controller by training in simulation with reinforcement learning, instantiating an initial BAM including a Gamma model configured to generate intermediate goals, a Beta model configured to translate the intermediate goals into task-space actions, and an Alpha model configured to translate the task-space actions and robot state into motor commands, deploying the initial BAM such that at least the Alpha model executes on-board the humanoid robot, causing the humanoid robot to perform an initial task and logging sensor and control data to form a first dataset, based on the first dataset, training at least one policy of the BAM to generate a refined BAM, and deploying the refined BAM to control the humanoid robot autonomously.

Inventors

  • Corey Lynch
  • Toki Migimatsu
  • Yevgen Chebotar
  • Michael Ahn
  • Ivan Babushkin

Assignees

  • FIGURE AI INC.

Dates

Publication Date
2026-05-07
Application Date
2025-11-03

Claims (19)

  1. A method for controlling a humanoid robot using a hierarchical bipedal action model (BAM), the method comprising: obtaining a base controller by training in simulation with reinforcement learning; instantiating an initial BAM including a Gamma model configured to generate intermediate goals, a Beta model configured to translate the intermediate goals into task-space actions, and an Alpha model configured to translate the task-space actions and robot state into motor commands; deploying the initial BAM such that at least the Alpha model executes on-board the humanoid robot; causing the humanoid robot to perform an initial task and logging sensor and control data to form a first dataset; based on the first dataset, training at least one policy of the BAM to generate a refined BAM; and deploying the refined BAM to control the humanoid robot autonomously.
  2. The method of claim 1, wherein the simulation training employs domain randomization of robot and environment parameters including one or more of geometry, mass distribution, actuator limits or friction, contact properties, and exogenous perturbations, and wherein a reward function includes at least a joint-pose accuracy term.
  3. The method of claim 1, wherein the high-level, mid-level, and low-level policies operate at different update rates, with the Alpha model updating more frequently than the Beta model and the Beta model updating more frequently than the Gamma model.
  4. The method of claim 1, wherein the Gamma model is a multimodal model configured to decompose natural-language instructions into intermediate goals, and the Beta model is a vision-language-action model configured to output task-space actions including one or more of target end-effector poses, base velocity commands, and gaze targets.
  5. The method of claim 1, further comprising: detecting a failure to complete the initial task; autonomously generating one or more self-correcting tasks based on historical memory of the initial task and a current robot state; and collecting additional datasets while executing the one or more self-correcting tasks until the initial task is completed.
  6. The method of claim 5, wherein generating the one or more self-correcting tasks comprises: (i) analyzing stored imagery and proprioception using object detection or state estimation to determine task status, (ii) comparing current robot and object poses to target endpoints, and (iii) producing corrective action sequences to reach the target endpoints.
  7. The method of claim 1, wherein training the refined BAM comprises reward-based learning using feedback selected from: human preference or ranking of actions, automated labeling-model annotations identifying visual features, and scalar or binary task-success signals, and further comprises deploying the refined BAM upon satisfying an operator-specified success criterion.
  8. The method of claim 2, wherein the reward function further includes one or more of: (i) angular-velocity consistency terms, (ii) momentum-management terms comparing centroidal linear and angular momentum to reference values, (iii) center-of-mass trajectory alignment, and (iv) energy-efficiency penalties based on torque-velocity power across joints.
  9. The method of claim 1, wherein deploying the refined BAM comprises retaining the refined BAM in on-board memory for local execution and transmitting the refined BAM over a network for deployment on additional humanoid robots to enable fleet-wide updates.
  10. A system for controlling a humanoid robot using a hierarchical bipedal action model (BAM), comprising: a Gamma model configured to receive multimodal inputs and generate intermediate goals; a Beta model configured to translate the intermediate goals into task-space actions; and an Alpha model configured to translate the task-space actions and robot state into motor commands; wherein the Alpha model is trained with a reward function including at least one of joint-pose accuracy, angular-velocity consistency, momentum management, center-of-mass trajectory alignment, and energy-efficiency terms.
  11. The system of claim 10, wherein the Gamma model comprises a vision-language model operable on image sequences and proprioceptive inputs and outputs the intermediate goals as natural-language descriptions or latent goal representations.
  12. The system of claim 10, wherein the task-space actions include target end-effector poses and base locomotion commands, and the motor commands include at least one of joint positions, joint velocities, and joint torques.
  13. The system of claim 10, wherein the policies execute at different update rates with a relative ordering of high-level < mid-level < low-level.
  14. The system of claim 10, wherein the reward function includes: an exponentially-weighted pose-error term, a velocity-consistency term based on an L2 norm, and a centroidal-momentum term comparing linear and angular momentum to reference trajectories.
  15. The system of claim 10, wherein the Beta model receives the intermediate goals as abstract latent representations and outputs probability distributions over discrete action tokens representing manipulation or loco-manipulation primitives.
  16. The system of claim 10, further comprising a safety monitor configured to override or gate lower-level commands when safety thresholds for joint-torque limits, contact forces, or stability margins are exceeded.
  17. The system of claim 16, wherein the joint-torque limits are enforced with soft constraints that increase resistance as torques approach a configurable fraction of a rated maximum.
  18. The system of claim 10, wherein the Gamma model includes a vision encoder for RGB sequences with temporal attention, a language encoder with bidirectional transformer layers, and a fusion module combining visual and linguistic representations.
  19. The system of claim 10, wherein at least one policy is trained or fine-tuned using imitation data from one or more of motion capture, teleoperation demonstrations, model-predictive-control trajectories, or trajectory-optimization outputs.
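The three-rate hierarchy recited in claims 1, 3, and 13 (Alpha updating more frequently than Beta, Beta more frequently than Gamma) can be illustrated with a minimal control-loop sketch. This is not the patented implementation: the class name, the tick-counter scheduling, the specific update periods, and the placeholder model outputs are all assumptions chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class BAMController:
    """Illustrative three-rate hierarchy: Gamma (slow) -> Beta (mid) -> Alpha (fast)."""
    gamma_period: int = 50   # hypothetical: new intermediate goal every 50 ticks
    beta_period: int = 5     # hypothetical: new task-space action every 5 ticks
    tick: int = 0
    goal: str = "idle"
    task_action: tuple = (0.0, 0.0)

    def gamma(self, observation):
        # Stand-in for the multimodal goal model (claim 4): emits an
        # intermediate goal from high-level observations.
        return "walk-to-target"

    def beta(self, goal):
        # Stand-in for the vision-language-action model: emits a
        # task-space action, here (base forward velocity, yaw rate).
        return (0.3, 0.0)

    def alpha(self, task_action, robot_state):
        # Stand-in for the low-level policy: maps the task-space action
        # plus robot state to per-joint motor commands (claim 12).
        vx, _wz = task_action
        return [vx * g for g in robot_state["jacobian_gains"]]

    def step(self, observation, robot_state):
        # Gamma and Beta fire only on their periods; Alpha fires every tick.
        if self.tick % self.gamma_period == 0:
            self.goal = self.gamma(observation)
        if self.tick % self.beta_period == 0:
            self.task_action = self.beta(self.goal)
        motor_cmd = self.alpha(self.task_action, robot_state)
        self.tick += 1
        return motor_cmd
```

The tick-counter scheme is one simple way to realize the relative ordering high-level < mid-level < low-level of claim 13; a real deployment would more likely run the three policies in separate threads or processes at their native rates.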

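Claims 2, 8, and 14 together enumerate the reward terms used to train the Alpha model: an exponentially-weighted joint-pose accuracy term, an L2 velocity-consistency term, a centroidal-momentum term, and a torque-velocity energy penalty. A hedged sketch of such a composite reward follows; the weights, the exponential kernel width, and the exact combination are assumptions, since the patent names the terms but not their coefficients.

```python
import numpy as np


def bam_reward(q, q_ref, qd, qd_ref, h, h_ref, tau,
               w_pose=1.0, w_vel=0.1, w_mom=0.1, w_energy=0.001,
               sigma=0.5):
    """Illustrative composite reward; weights and sigma are assumed values."""
    # Exponentially-weighted joint-pose accuracy term (claims 2, 14):
    # approaches 1 as the pose error vanishes.
    pose_term = np.exp(-np.sum((q - q_ref) ** 2) / sigma ** 2)
    # Velocity-consistency term based on an L2 norm (claims 8, 14).
    vel_term = -np.linalg.norm(qd - qd_ref)
    # Centroidal-momentum term comparing the 6-vector of linear and
    # angular momentum to reference values (claims 8, 14).
    mom_term = -np.linalg.norm(h - h_ref)
    # Energy-efficiency penalty from torque-velocity power (claim 8).
    energy_term = -np.sum(np.abs(tau * qd))
    return (w_pose * pose_term + w_vel * vel_term
            + w_mom * mom_term + w_energy * energy_term)
```

With perfect tracking (zero pose, velocity, and momentum error, zero torque) this reward evaluates to `w_pose`; any tracking error or expended power reduces it, which is the shaping behavior the claimed terms describe.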
Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of and priority to U.S. Provisional Patent Application Nos. 63/715,270, filed Nov. 1, 2024; 63/722,057, filed Nov. 18, 2024; 63/725,279, filed Nov. 26, 2024; 63/753,670, filed Feb. 4, 2025; 63/760,617, filed Feb. 19, 2025; 63/776,429, filed Mar. 24, 2025; 63/801,451, filed May 7, 2025; 63/819,533, filed Jun. 6, 2025; 63/860,403, filed Aug. 8, 2025; 63/860,580, filed Aug. 8, 2025; 63/905,666, filed Oct. 26, 2025; and 63/905,711, filed Oct. 26, 2025, each of which is expressly incorporated by reference herein in its entirety.

TECHNICAL FIELD

This disclosure relates to systems, methods, and techniques for using reinforcement learning (RL) to train one or more bipedal action models (BAMs) for use with a general-purpose humanoid robot.

BACKGROUND

Robots are increasingly utilized to automate tasks in various industries, from manufacturing to logistics and beyond. Many conventional robotic systems, such as wheeled robots or stationary articulated arms, are highly specialized for a narrow set of tasks in structured, predictable environments. While effective for their specific purpose, these non-humanoid forms typically lack the versatility and mobility required to perform a diverse array of generalized tasks in environments designed for humans, such as navigating stairs, retrieving items from shelves, or interacting with standard tools and interfaces. The development of humanoid robots aims to address this gap by providing a form factor that can operate in human spaces and perform human-like movements. However, creating a functional, general-purpose humanoid robot presents immense technical challenges. Simply adopting or modifying the structures and kinematic principles of non-humanoid systems is often insufficient, as those theoretical designs are not tethered to the complex realities of designing, testing, and manufacturing a stable and capable bipedal system.
Conventional control systems for existing robots often face significant limitations, particularly in dynamic and unstructured environments. Many traditional robotic systems rely on pre-programmed responses or behaviors. These systems struggle to adapt in real time to unforeseen obstacles or changes in their surroundings. Other systems rely on remote processing for complex decision-making, which can introduce significant latency, leading to delays and inappropriate actions in dynamic situations. Some control approaches utilize "binned pose systems" or other discretized action spaces, which can result in discontinuous, "jittery" motions rather than the fluid, continuous whole-body control required for complex manipulation and locomotion. These conventional systems often fail to provide the real-time, context-aware decision-making necessary for robust autonomous operation. Training such complex robots also presents a major hurdle. Training a control policy entirely on a physical robot is often impractical from a time perspective and can be unsafe for the robot and its surroundings. Consequently, training is often performed in a simulated environment. This, however, introduces the well-known "sim-to-real" gap: a discrepancy between the idealized physics of a simulator and the complex, noisy dynamics of the real world. A policy trained exclusively in simulation may fail when deployed on physical hardware. Moreover, common training methods like pure reinforcement learning (RL) can be highly sample-inefficient, requiring an impractical volume of interactions to learn complex skills. Therefore, there is a significant need in the art for an improved control architecture and training methodology for humanoid robots.

SUMMARY

The presently disclosed subject matter is directed to a method for controlling a humanoid robot using a hierarchical bipedal action model (BAM).
Particularly, the method comprises obtaining a base controller by training in simulation with reinforcement learning. The method includes instantiating an initial BAM including a Gamma model configured to generate intermediate goals, a Beta model configured to translate the intermediate goals into task-space actions, and an Alpha model configured to translate the task-space actions and robot state into motor commands. The method includes deploying the initial BAM such that at least the Alpha model executes on-board the humanoid robot. The method includes causing the humanoid robot to perform an initial task and logging sensor and control data to form a first dataset. The method includes, based on the first dataset, training at least one policy of the BAM to generate a refined BAM. The method includes deploying the refined BAM to control the humanoid robot autonomously. The presently disclosed subject matter is directed to a system for controlling a humanoid robot using a hierarchical bipedal action model (BAM). Particularly, the system comprises a Gamma model configured to receive multimodal inputs and generate in