US-20260124762-A1 - SYNERGIES BETWEEN PICK AND PLACE: TASK-AWARE GRASP ESTIMATION

Abstract

Systems, methods, and apparatuses for controlling a robot including a manipulator, including: determining three-dimensional (3D) geometry information about a target object based on an image of the target object; determining 3D geometry information about a scene in which the target object is to be placed based on at least one image of the scene; obtaining affordance information by providing the 3D geometry information about the target object and the 3D geometry information about the scene to at least one neural network model; commanding the robot to grasp the target object using the manipulator according to a grasp orientation corresponding to the affordance information; and commanding the robot to position the manipulator according to a placement direction corresponding to the affordance information in order to place the target object at a location in the scene.
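
For orientation only (a sketch, not part of the application), the control flow the abstract describes can be summarized in a few lines of Python. Every name below, including the Affordance container, the model methods, and the robot interface, is invented for illustration:

```python
# Illustrative sketch only: a possible shape for the pick-and-place flow
# described in the abstract. All names here are hypothetical.
from dataclasses import dataclass
import numpy as np

@dataclass
class Affordance:
    grasp_orientation: np.ndarray    # e.g., a rotation (quaternion or matrix)
    placement_direction: np.ndarray  # approach vector into the scene
    placement_location: np.ndarray   # target position in scene coordinates

def pick_and_place(robot, model, object_image, scene_images):
    # 1. Recover 3D geometry of the target object and of the scene from
    #    images (e.g., as point clouds).
    object_geometry = model.object_geometry(object_image)
    scene_geometry = model.scene_geometry(scene_images)

    # 2. At least one neural network model turns both geometries into
    #    affordance information coupling a grasp orientation with a
    #    placement direction and location.
    aff: Affordance = model.predict(object_geometry, scene_geometry)

    # 3. Grasp the object using the selected grasp orientation ...
    robot.grasp(aff.grasp_orientation)

    # 4. ... then approach along the selected placement direction and
    #    place the object at the selected location in the scene.
    robot.move_toward(aff.placement_location, aff.placement_direction)
    robot.release()
```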

Inventors

  • Nikhil Narsingh Chavan Dafle
  • Shuran Song
  • Ibrahim Volkan Isler
  • Vasileios Vasilopoulos
  • Shubham Agrawal
  • Jinwook Huh
  • Suveer Garg
  • Pedro Piacenza
  • Isaac Hisano Kasahara
  • Kazim Selim Engin
  • Zhanpeng He

Assignees

  • SAMSUNG ELECTRONICS CO., LTD.

Dates

Publication Date
2026-05-07
Application Date
2026-01-02

Claims (20)

  1. An electronic device for controlling a robot including a manipulator, the electronic device comprising: one or more processors configured to: determine three-dimensional (3D) geometry information about a target object based on an image of the target object; determine 3D geometry information about a scene in which the target object is to be placed based on at least one image of the scene; obtain affordance information by providing the 3D geometry information about the target object and the 3D geometry information about the scene to at least one neural network model; command the robot to grasp the target object using the manipulator according to a grasp orientation corresponding to the affordance information; and command the robot to position the manipulator according to a placement direction corresponding to the affordance information in order to place the target object at a location in the scene.
  2. The electronic device of claim 1, wherein the one or more processors are further configured to: determine a plurality of candidate grasp orientations for grasping the target object based on the 3D geometry information about the target object; determine a plurality of candidate placement directions for placing the target object in the scene based on the 3D geometry information about the scene; obtain a plurality of affordance maps by providing information about the plurality of candidate grasp orientations and information about the plurality of candidate placement directions to the at least one neural network model, wherein each affordance map from among the plurality of affordance maps corresponds to a candidate grasp orientation from among the plurality of candidate grasp orientations and a candidate placement direction from among the plurality of candidate placement directions; and select an affordance map from among the plurality of affordance maps, wherein the affordance information corresponds to the selected affordance map.
  3. The electronic device of claim 2, wherein the at least one neural network model comprises: an object encoder configured to output a plurality of object encodings corresponding to the plurality of candidate grasp orientations based on the 3D geometry information about the target object; a scene encoder configured to output a plurality of scene encodings corresponding to the plurality of candidate placement directions based on the 3D geometry information about the scene; and an affordance decoder configured to output the plurality of affordance maps based on the plurality of object encodings and the plurality of scene encodings.
  4. The electronic device of claim 3, wherein the object encoder, the scene encoder, and the affordance decoder are jointly trained.
  5. The electronic device of claim 2, wherein the affordance map comprises a plurality of pixels corresponding to a plurality of affordance values, and wherein each affordance value from among the plurality of affordance values indicates a probability of success for placing the target object.
  6. The electronic device of claim 5, wherein the affordance map is selected based on the plurality of pixels including a highest affordance value from among all affordance values associated with the plurality of affordance maps.
  7. The electronic device of claim 1, further comprising at least one camera configured to capture the image of the target object and the at least one image of the scene.
  8. The electronic device of claim 7, wherein the image of the target object is a depth image, and wherein the at least one image of the scene is a color image.
  9. The electronic device of claim 1, wherein the one or more processors are configured to command the robot to position the manipulator by computing a proposed trajectory based on the placement direction, and generating a velocity command corresponding to the proposed trajectory.
  10. A method for controlling a robot including a manipulator, the method comprising: determining three-dimensional (3D) geometry information about a target object based on an image of the target object; determining 3D geometry information about a scene in which the target object is to be placed based on at least one image of the scene; obtaining affordance information by providing the 3D geometry information about the target object and the 3D geometry information about the scene to at least one neural network model; commanding the robot to grasp the target object using the manipulator according to a grasp orientation corresponding to the affordance information; and commanding the robot to position the manipulator according to a placement direction corresponding to the affordance information in order to place the target object at a location in the scene.
  11. The method of claim 10, further comprising: determining a plurality of candidate grasp orientations for grasping the target object based on the 3D geometry information about the target object; determining a plurality of candidate placement directions for placing the target object in the scene based on the 3D geometry information about the scene; obtaining a plurality of affordance maps by providing information about the plurality of candidate grasp orientations and information about the plurality of candidate placement directions to the at least one neural network model, wherein each affordance map from among the plurality of affordance maps corresponds to a candidate grasp orientation from among the plurality of candidate grasp orientations and a candidate placement direction from among the plurality of candidate placement directions; and selecting an affordance map from among the plurality of affordance maps, wherein the affordance information corresponds to the selected affordance map.
  12. The method of claim 11, wherein the at least one neural network model comprises: an object encoder configured to output a plurality of object encodings corresponding to the plurality of candidate grasp orientations based on the 3D geometry information about the target object; a scene encoder configured to output a plurality of scene encodings corresponding to the plurality of candidate placement directions based on the 3D geometry information about the scene; and an affordance decoder configured to output the plurality of affordance maps based on the plurality of object encodings and the plurality of scene encodings.
  13. The method of claim 12, wherein the object encoder, the scene encoder, and the affordance decoder are jointly trained.
  14. The method of claim 11, wherein the affordance map comprises a plurality of pixels corresponding to a plurality of affordance values, and wherein each affordance value from among the plurality of affordance values indicates a probability of success for placing the target object.
  15. The method of claim 14, wherein the affordance map is selected based on the plurality of pixels including a highest affordance value from among all affordance values associated with the plurality of affordance maps.
  16. The method of claim 10, further comprising capturing the image of the target object and the at least one image of the scene.
  17. The method of claim 16, wherein the image of the target object is a depth image, and wherein the at least one image of the scene is a color image.
  18. The method of claim 10, wherein the commanding the robot to position the manipulator comprises computing a proposed trajectory based on the placement direction, and generating a velocity command corresponding to the proposed trajectory.
  19. A non-transitory computer-readable medium configured to store instructions which, when executed by at least one processor of a device for controlling a robot including a manipulator, cause the at least one processor to: determine three-dimensional (3D) geometry information about a target object based on an image of the target object; determine 3D geometry information about a scene in which the target object is to be placed based on at least one image of the scene; obtain affordance information by providing the 3D geometry information about the target object and the 3D geometry information about the scene to at least one neural network model; command the robot to grasp the target object using the manipulator according to a grasp orientation corresponding to the affordance information; and command the robot to position the manipulator according to a placement direction corresponding to the affordance information in order to place the target object at a location in the scene.
  20. The non-transitory computer-readable medium of claim 19, wherein the instructions further cause the at least one processor to: determine a plurality of candidate grasp orientations for grasping the target object based on the 3D geometry information about the target object; determine a plurality of candidate placement directions for placing the target object in the scene based on the 3D geometry information about the scene; obtain a plurality of affordance maps by providing information about the plurality of candidate grasp orientations and information about the plurality of candidate placement directions to the at least one neural network model, wherein each affordance map from among the plurality of affordance maps corresponds to a candidate grasp orientation from among the plurality of candidate grasp orientations and a candidate placement direction from among the plurality of candidate placement directions; and select an affordance map from among the plurality of affordance maps, wherein the affordance information corresponds to the selected affordance map.
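
As an illustration only (not part of the claims), the selection rule of claims 2, 5, 6, 11, 14, and 15 amounts to a global argmax over every pixel of every candidate affordance map, and claims 9 and 18 convert the chosen placement direction into a velocity command. A minimal Python sketch, assuming the maps are stacked in a numpy array indexed by (grasp candidate, placement candidate, pixel row, pixel column), with the constant-speed straight-line approach being an assumption of the sketch rather than a detail from the claims:

```python
import numpy as np

def select_affordance(affordance_maps: np.ndarray):
    """Global argmax over a stack of affordance maps.

    affordance_maps: shape (G, P, H, W) -- one H x W map for each pairing
    of the G candidate grasp orientations with the P candidate placement
    directions; each pixel holds a placement-success probability
    (claims 5 and 14).
    """
    # Pick the single highest affordance value across all maps
    # (claims 6 and 15); its indices identify the grasp orientation,
    # placement direction, and placement location at once.
    g, p, row, col = np.unravel_index(np.argmax(affordance_maps),
                                      affordance_maps.shape)
    return g, p, (row, col)

def velocity_command(placement_direction: np.ndarray, speed: float = 0.05):
    """Sketch of claims 9 and 18: follow a proposed straight-line
    trajectory along the placement direction by emitting a constant-speed
    linear end-effector velocity (m/s)."""
    return speed * placement_direction / np.linalg.norm(placement_direction)

# Example: 8 candidate grasps x 5 candidate placement directions.
maps = np.random.rand(8, 5, 64, 64)
grasp_idx, place_idx, pixel = select_affordance(maps)
```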

Description

CROSS-REFERENCE TO RELATED APPLICATION(S)

This is a Continuation of U.S. application Ser. No. 18/367,827 filed Sep. 13, 2023, which is based on and claims priority under 35 U.S.C. § 119 to U.S. Provisional Patent Application No. 63/406,853 filed on Sep. 15, 2022, U.S. Provisional Patent Application No. 63/450,908 filed on Mar. 8, 2023, and U.S. Provisional Patent Application No. 63/452,620 filed on Mar. 16, 2023, in the U.S. Patent & Trademark Office, the disclosures of which are incorporated herein by reference in their entireties.

BACKGROUND

1. Field

The disclosure relates to an apparatus and a method for robot motion control, and more particularly, to task-dependent grasp planning for object grasping and placement tasks.

2. Description of Related Art

In robot motion control, picking objects and placing objects are two fundamental skills that enable diverse robotic manipulation tasks. However, not all grasps that a robot may use to pick an object are useful for the desired task. For example, a task of placing an object in a particular scene may constrain the suitable grasps on the object.

Generally, these two skills have been explored independently. For example, different approaches, ranging from hardware design and physics-based computational tools to some recent learning-based methods, have been explored for object picking, which may refer to generating and facilitating grasps on objects in a scene with six degrees of freedom (6DoF). Separate approaches have been explored for the task of placing a grasped object while considering the geometry of the object and the environment.

Considering object picking and object placing as independent problems may provide conveniences, for example a reduction in the action search space, and may make it easier to build robust algorithms. However, estimating a grasp of an object without considering the downstream task, for example placing the object, can result in grasps that are infeasible for the task. Recent approaches that consider the implications of grasps on downstream tasks may involve placing and regrasping the object, for example by learning object reorientations which may be used for successful placement. Other approaches may use a constrained action space, for example by limiting tasks to two-dimensional top-down placement, or may use expensive supervision, for example expert demonstration on every task. However, such approaches may have limited suitability for 6DoF pick-and-place tasks, or for tasks involving novel objects and novel scenes.

SUMMARY

One or more embodiments of the present disclosure provide task-aware grasp planning for object grasping and placement tasks.
According to an aspect of the disclosure, an electronic device for controlling a robot including a manipulator includes: one or more processors configured to: determine three-dimensional (3D) geometry information about a target object based on an image of the target object; determine 3D geometry information about a scene in which the target object is to be placed based on at least one image of the scene; obtain affordance information by providing the 3D geometry information about the target object and the 3D geometry information about the scene to at least one neural network model; command the robot to grasp the target object using the manipulator according to a grasp orientation corresponding to the affordance information; and command the robot to position the manipulator according to a placement direction corresponding to the affordance information in order to place the target object at a location in the scene.

The one or more processors may be further configured to: determine a plurality of candidate grasp orientations for grasping the target object based on the 3D geometry information about the target object; determine a plurality of candidate placement directions for placing the target object in the scene based on the 3D geometry information about the scene; obtain a plurality of affordance maps by providing information about the plurality of candidate grasp orientations and information about the plurality of candidate placement directions to the at least one neural network model, wherein each affordance map from among the plurality of affordance maps corresponds to a candidate grasp orientation from among the plurality of candidate grasp orientations and a candidate placement direction from among the plurality of candidate placement directions; and select an affordance map from among the plurality of affordance maps, wherein the affordance information corresponds to the selected affordance map.

The at least one neural network model may include: an object encoder configured to output a plurality of object encodings corresponding to the plurality of candidate grasp orientations based on the 3D geometry information about the target object; a scene encoder configured to output a plurality of scene encodings corresponding to the plurality of candidate placement directions based on the 3D geometry information about the scene; and an affordance decoder configured to output the plurality of affordance maps based on the plurality of object encodings and the plurality of scene encodings.
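
To make the encoder/decoder split concrete, here is a hypothetical PyTorch sketch of the three components named above and in claims 3 and 12. The point-cloud inputs, PointNet-style pooling, layer sizes, and map resolution are all assumptions of the sketch; the application does not fix a concrete architecture here. Per claims 4 and 13, the three modules would be trained jointly.

```python
# Hypothetical sketch only: one possible realization of the object
# encoder, scene encoder, and affordance decoder. Architecture details
# are assumptions, not taken from the application.
import torch
import torch.nn as nn

class PointEncoder(nn.Module):
    """PointNet-style backbone: (B, N, 3) points -> (B, D) encoding."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                 nn.Linear(64, dim))

    def forward(self, pts):                       # pts: (B, N, 3)
        return self.mlp(pts).max(dim=1).values    # max-pool over points

class AffordanceDecoder(nn.Module):
    """Pairs one object encoding with one scene encoding and decodes an
    H x W map of per-pixel placement-success probabilities."""
    def __init__(self, dim: int = 256, h: int = 64, w: int = 64):
        super().__init__()
        self.h, self.w = h, w
        self.head = nn.Sequential(nn.Linear(2 * dim, 512), nn.ReLU(),
                                  nn.Linear(512, h * w))

    def forward(self, obj_code, scene_code):      # each (B, D)
        logits = self.head(torch.cat([obj_code, scene_code], dim=-1))
        return torch.sigmoid(logits).view(-1, self.h, self.w)

class TaskAwareGraspModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.object_encoder = PointEncoder()  # one encoding per grasp candidate
        self.scene_encoder = PointEncoder()   # one encoding per placement direction
        self.decoder = AffordanceDecoder()

    def forward(self, object_pts, scene_pts):
        # object_pts: (G, N, 3) -- object geometry, one row per candidate
        #             grasp orientation (e.g., rotated copies of the cloud)
        # scene_pts:  (P, M, 3) -- scene geometry, one row per candidate
        #             placement direction
        obj = self.object_encoder(object_pts)            # (G, D)
        scn = self.scene_encoder(scene_pts)              # (P, D)
        G, P = obj.shape[0], scn.shape[0]
        # Decode every (grasp, placement) pairing into its own map.
        obj = obj.unsqueeze(1).expand(G, P, -1).reshape(G * P, -1)
        scn = scn.unsqueeze(0).expand(G, P, -1).reshape(G * P, -1)
        maps = self.decoder(obj, scn)                    # (G*P, H, W)
        return maps.view(G, P, *maps.shape[1:])          # (G, P, H, W)
```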