US-12623350-B2 - Universal manipulation interface
Abstract
Systems, methods, and other embodiments described herein relate to an interface for training a visuomotor policy to be device agnostic and controlling a robotic device using the visuomotor policy. In one embodiment, a method includes collecting sensor data about a robotic manipulator within an environment. The method includes pre-processing the sensor data to compensate for an observation latency of at least one sensor associated with the robotic manipulator. The method includes generating actions for the robotic manipulator to perform a task according to the sensor data and a visuomotor policy. The method includes controlling the robotic manipulator, including compensating for execution latency of the robotic manipulator in performing the actions.
Inventors
- Cheng Chi
- Zhenjia Xu
- Eric A. Cousineau
- Siyuan Feng
- Benjamin Burchfiel
- Russell Louis Tedrake
- Shuran Song
- Chu Er Pan
Assignees
- Toyota Research Institute, Inc.
Dates
- Publication Date: 2026-05-12
- Application Date: 2024-06-03
Claims (20)
- 1 . An interface system, comprising: one or more processors; a memory communicably coupled to the one or more processors and storing instructions that cause the one or more processors to: collect sensor data about a robotic manipulator within an environment; pre-process the sensor data to compensate for an observation latency of at least one sensor associated with the robotic manipulator, including instructions to align observation latencies between the at least one sensor, an end-effector pose, and a gripper width according to measured latencies; generate actions for the robotic manipulator to perform a task according to the sensor data and a visuomotor policy; and control the robotic manipulator, including compensating for execution latency of the robotic manipulator in performing the actions.
- 2 . The interface system of claim 1 , wherein the visuomotor policy is device agnostic, wherein the instructions to pre-process the sensor data and compensate for the execution latency customize the actions for the robotic manipulator, and wherein the instructions to collect the sensor data include instructions to acquire at least images of the environment.
- 3 . The interface system of claim 1 , wherein the instructions to align observation latencies include instructions to align observation streams of data according to a stream with a highest latency, and wherein the observation latencies are times associated with the at least one sensor, an end effector pose, and a gripper performing associated functions.
- 4 . The interface system of claim 1 , wherein the instructions to generate the actions include instructions to generate the actions as a sequence of synchronized end-effector poses and gripper widths, wherein the instructions to compensate for the execution latency include instructions to adjust timing of execution of the actions using the execution latency to ensure the robotic manipulator reaches the synchronized end-effector poses and the gripper widths at planned times; and wherein the actions are defined according to relative trajectories in relation to an end-effector.
- 5 . The interface system of claim 1 , wherein the instructions further include instructions to collect demonstration data from a handheld gripper while the handheld gripper performs the task, including instructions to determine, using simultaneous localization and mapping (SLAM), 6 degrees of freedom (DOF) poses for an end-effector of the handheld gripper according to a series of images and IMU measurements from the handheld gripper.
- 6 . The interface system of claim 5 , wherein the instructions further include instructions to filter the demonstration data to find a subset of trajectories that is agnostic to the handheld gripper, wherein the instructions to filter include instructions to filter according to a kinematics and dynamics feasibility filtering.
- 7 . The interface system of claim 5 , wherein the instructions further include instructions to train the visuomotor policy to learn the task according to the demonstration data that has been filtered to be agnostic to the handheld gripper.
- 8 . The interface system of claim 1 , wherein the robotic manipulator is a different configuration of a robotic device from a handheld gripper that is used to acquire demonstration data for training the visuomotor policy.
- 9 . A non-transitory computer-readable medium including instructions that, when executed by one or more processors, cause the one or more processors to: collect sensor data about a robotic manipulator within an environment; pre-process the sensor data to compensate for an observation latency of at least one sensor associated with the robotic manipulator, including instructions to align observation latencies between the at least one sensor, an end-effector pose, and a gripper width according to measured latencies; generate actions for the robotic manipulator to perform a task according to the sensor data and a visuomotor policy; and control the robotic manipulator, including compensating for execution latency of the robotic manipulator in performing the actions.
- 10 . The non-transitory computer-readable medium of claim 9 , wherein the visuomotor policy is device agnostic, wherein the instructions to pre-process the sensor data and compensate for the execution latency customize the actions for the robotic manipulator, and wherein the instructions to collect the sensor data include instructions to acquire at least images of the environment.
- 11 . The non-transitory computer-readable medium of claim 9 , wherein the observation latencies are times associated with the at least one sensor, an end effector pose, and a gripper performing associated functions.
- 12 . The non-transitory computer-readable medium of claim 9 , wherein the instructions to generate the actions include instructions to generate the actions as a sequence of synchronized end-effector poses and gripper widths, wherein the instructions to compensate for the execution latency include instructions to adjust timing of execution of the actions using the execution latency to ensure the robotic manipulator reaches the synchronized end-effector poses and the gripper widths at planned times; and wherein the actions are defined according to relative trajectories in relation to an end-effector.
- 13 . The non-transitory computer-readable medium of claim 9 , wherein the instructions further include instructions to collect demonstration data from a handheld gripper while the handheld gripper performs the task, including instructions to determine, using simultaneous localization and mapping (SLAM), 6 degrees of freedom (DOF) poses for an end-effector of the handheld gripper according to a series of images and IMU measurements from the handheld gripper.
- 14 . A method, comprising: collecting sensor data about a robotic manipulator within an environment; pre-processing the sensor data to compensate for an observation latency of at least one sensor associated with the robotic manipulator, including aligning observation latencies between the at least one sensor, an end-effector pose, and a gripper width according to measured latencies; generating actions for the robotic manipulator to perform a task according to the sensor data and a visuomotor policy; and controlling the robotic manipulator, including compensating for execution latency of the robotic manipulator in performing the actions.
- 15 . The method of claim 14 , wherein the visuomotor policy is device agnostic, wherein pre-processing the sensor data and compensating for the execution latency customizes the actions for the robotic manipulator, and wherein collecting the sensor data includes acquiring at least images of the environment.
- 16 . The method of claim 14 , wherein the observation latencies are times associated with the at least one sensor, an end effector pose, and a gripper performing associated functions.
- 17 . The method of claim 14 , wherein generating the actions includes generating the actions as a sequence of synchronized end-effector poses and gripper widths, wherein compensating for the execution latency includes adjusting timing of execution of the actions using the execution latency to ensure the robotic manipulator reaches the synchronized end-effector poses and the gripper widths at desired times; and wherein the actions are defined according to relative trajectories in relation to an end-effector.
- 18 . The method of claim 14 , further comprising: collecting demonstration data from a handheld gripper while the handheld gripper performs the task, including determining, using simultaneous localization and mapping (SLAM), 6 degrees of freedom (DOF) poses for an end-effector of the handheld gripper according to a series of images and IMU measurements from the handheld gripper.
- 19 . The method of claim 18 , further comprising: filtering the demonstration data to find a subset of trajectories that is agnostic to the handheld gripper, wherein the filtering is a kinematics and dynamics feasibility filtering.
- 20 . The method of claim 18 , further comprising: training the visuomotor policy to learn the task according to the demonstration data that has been filtered to be agnostic to the handheld gripper.
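The observation-latency alignment recited in claims 1 and 3 — shifting each timestamped observation stream by its measured latency and resampling all streams at the times of the slowest stream — can be sketched as follows. This is a minimal illustration in Python; the function name `align_streams`, the data layout, and the nearest-sample resampling are illustrative assumptions, not details from the patent.

```python
from bisect import bisect_left

def align_streams(streams, latencies):
    """Align timestamped observation streams to the stream with the
    highest measured latency (illustrative helper, not from the patent).

    streams:   dict name -> list of (timestamp, value), sorted by timestamp
    latencies: dict name -> measured latency in seconds
    Returns one dict per sample of the slowest stream, with every other
    stream resampled at the corrected capture time.
    """
    # Correct each timestamp for its stream's latency: a sample received
    # at time t was actually captured at t - latency.
    corrected = {
        name: [(t - latencies[name], v) for t, v in samples]
        for name, samples in streams.items()
    }
    # The slowest stream (e.g., the camera) anchors the synchronized frames.
    anchor = max(latencies, key=latencies.get)

    def nearest(samples, t):
        # Pick the sample whose corrected timestamp is closest to t.
        times = [s[0] for s in samples]
        i = bisect_left(times, t)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(samples)]
        j = min(candidates, key=lambda k: abs(times[k] - t))
        return samples[j][1]

    aligned = []
    for t, v in corrected[anchor]:
        frame = {"time": t, anchor: v}
        for name, samples in corrected.items():
            if name != anchor:
                frame[name] = nearest(samples, t)
        aligned.append(frame)
    return aligned
```

Anchoring on the highest-latency stream matches the claim language: faster streams (end-effector pose, gripper width) are resampled back to the capture times of the slowest sensor rather than the reverse, so no frame references data that had not yet been observed.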
Description
CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 63/548,607, filed on Feb. 1, 2024, which is herein incorporated by reference in its entirety.

TECHNICAL FIELD

The subject matter described herein relates, in general, to systems and methods for controlling robotic devices to perform tasks and, more particularly, to training a visuomotor policy in a way that is device-agnostic.

BACKGROUND

Training devices, such as robotic arms with manipulators, to perform tasks can be complex. For example, various approaches attempt to train a policy for a device by demonstrating complex manipulation skills using targeted lab-based datasets generated through manual teleoperation or by leveraging unstructured videos of a person performing the task. However, these approaches have many drawbacks, including high setup costs (e.g., an expert operator and teleoperation hardware), large embodiment gaps (e.g., between humans and robots), and so on. Moreover, using one type of device to acquire information for training a policy for use by another type of device also suffers from difficulties. For example, the data collected in this way does not generally transfer and results in a lack of action diversity, where examples are constrained to simple actions or quasi-static pick-and-place actions due to insufficient visual context, limited action precision, latency discrepancies, and insufficient policy representation. Therefore, accurately training a robotic device to perform different tasks remains difficult.

SUMMARY

Example systems and methods relate to an interface for training a visuomotor policy to be device agnostic and controlling a robotic device using the visuomotor policy. As noted previously, various approaches for acquiring training data and controlling a robotic device using a policy trained on the data encounter difficulties that result in high costs and other issues.
For example, because of the way in which the data is collected, there may be large embodiment gaps and/or the learnable actions may be constrained to limited complexity. In general, this leads to limited usefulness in attempting to leverage such data for training, thereby leaving manual or other costly options to configure a device. Therefore, in various arrangements, an inventive system implements a handheld gripper to acquire demonstration data about a task and train a visuomotor policy in a way that the policy is device agnostic and further overcomes the other noted difficulties. For example, the system acquires the demonstration data in a format that relates to the gripper itself, as opposed to a statically mounted camera within an environment in which the gripper is in use. That is, the sensors that perceive/generate the demonstration data are integrated with the gripper in order to observe an end-effector and gripper position. This provides direct perception of the actions of the device without consideration of device-specific elements or of the broader environment that would otherwise need to be translated on a per-device basis. The sensors can include, for example, an inertial measurement unit (IMU), a camera, position sensors for the gripper, and so on.

Accordingly, the system collects the demonstration data of the use of the handheld gripper performing a particular task, such as folding clothing, washing dishes, tossing an object, rearranging objects, picking and placing an object, and so on, from the perspective of the device itself. Once collected, the demonstration data can then be processed into a sequence of synchronized information that pairs actions with observations and generally includes, in at least one approach, images, 6-degree of freedom (6-DOF) end-effector pose, gripper width, gripper velocity, etc.
The system may then perform kinematic filtering on the demonstration data to transform the demonstration data into a set of trajectories that is device agnostic. Thereafter, the system trains a visuomotor policy to generate robot behaviors via a conditional denoising diffusion process on a robot action space. The resulting visuomotor policy is, as mentioned, device agnostic and can be transferred to other types of robotic manipulators. For example, the system can then implement the visuomotor policy with a robotic manipulator to perform the learned task. In general, the system collects sensor data that includes at least images and gripper positions. Because the handheld gripper and the robotic manipulator have distinct observation latencies and execution latencies, the system compensates for these latencies to ensure accurate control. In at least one approach, the system pre-processes the sensor data to align different aspects of the data according to the observation latency. In one example, the system aligns the data according to an element of the stream with a highest latency (e.g., a camera). That is, different observation