
US-12617083-B2 - Manipulation method learning apparatus, manipulation method learning system, manipulation method learning method, and program

US 12617083 B2

Abstract

A manipulation method learning apparatus includes: a storage portion that stores an environment model including a device capable of manipulating a target object; an internal reward calculation portion that uses, as an input, a contact state of the device and the target object acquired by using the environment model, and calculates an internal reward from a frequency of occurrence of the contact state; an external reward calculation portion that uses a state quantity of the device acquired by using the environment model, a state quantity of the target object acquired by using the environment model, and a contact state quantity as an input, searches a manipulation strategy of the target object, and calculates an external reward for a target task solution; and a search portion that searches while probabilistically selecting a first strategy updated by using the internal reward and a second strategy updated by using the external reward.
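The two reward signals and the probabilistic strategy selection described above can be sketched in a few lines of Python. This is an illustrative interpretation only, not code from the patent: the class and function names are invented, and the 1/sqrt(N) count-based bonus is one common way to turn "frequency of occurrence of the contact state" into an internal reward.

```python
import random
from collections import defaultdict

class CountBasedIntrinsicReward:
    """Internal reward from the frequency of occurrence of a contact state:
    rarely observed contact states yield a larger reward (assumed 1/sqrt(N)
    count-based bonus; the patent does not specify the exact formula)."""

    def __init__(self):
        self.counts = defaultdict(int)

    def __call__(self, contact_state):
        # Discretize the contact state so occurrences can be counted.
        key = tuple(contact_state)
        self.counts[key] += 1
        # Bonus decays as the contact state becomes common.
        return 1.0 / (self.counts[key] ** 0.5)

def select_strategy(p_internal, internal_policy, external_policy):
    """Probabilistically select the internally-driven (exploration) strategy
    with probability p_internal, otherwise the externally-driven (task) one."""
    return internal_policy if random.random() < p_internal else external_policy
```

Under this reading, the search portion would call `select_strategy` at each step, so that exploration of rare contact states and exploitation of the task reward are interleaved rather than run in separate phases.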

Inventors

  • Naoki Morihira
  • Akinobu Hayashi

Assignees

  • HONDA MOTOR CO., LTD.

Dates

Publication Date
2026-05-05
Application Date
2023-12-05
Priority Date
2023-02-24

Claims (16)

  1. A manipulation method learning apparatus that learns a manipulation method of a target object, the manipulation method learning apparatus comprising: a memory that stores an environment model including a device capable of manipulating the target object; and a processor that is coupled to the memory and functions as: a first acquisition portion that acquires a state quantity of the device by using the environment model; a second acquisition portion that acquires a state quantity of the target object by using the environment model; a third acquisition portion that acquires a contact state of the device and the target object by using the environment model; an internal reward calculation portion that uses the contact state as an input and calculates an internal reward which is a reward independent of a target task from a frequency of occurrence of the contact state; an external reward calculation portion that uses the state quantity of the device, the state quantity of the target object, and a contact state quantity of the contact state as an input, searches a manipulation strategy of the target object, and calculates an external reward for a target task solution which is a reward for solving the target task; and a search portion that searches in consideration of the internal reward and the external reward simultaneously while probabilistically selecting a first strategy which is updated by using the internal reward and a second strategy which is updated by using the external reward.
  2. The manipulation method learning apparatus according to claim 1, wherein the contact state is at least one of information indicating a position which the device has already touched, a contact state estimated from information of a relative position attitude of the device and the target object, information indicating a position which has already been touched on the target object, information representing a force when the device and the target object are in contact with each other, and a detection value detected by a tactile sensor.
  3. The manipulation method learning apparatus according to claim 2, wherein the search portion collects, at a time of training, data based on the internal reward in advance and performs training.
  4. The manipulation method learning apparatus according to claim 2, wherein the external reward calculation portion uses, as the input, at least one of a position of the target object, an attitude of the target object, information representing a force when the device and the target object are in contact with each other, a detection value detected by a tactile sensor, an action command that has been previously used at a time of training, and a target contact position.
  5. The manipulation method learning apparatus according to claim 2, wherein the search portion stores the internal reward, the external reward, an action command to the device, and an action of the device for each search, and updates the first strategy and the second strategy for each predetermined number of the action.
  6. The manipulation method learning apparatus according to claim 2, wherein the device is a robot hand including two or more finger portions, a haptic sensor is attached to each fingertip of the robot hand, and a tactile sensor is attached to each portion of the robot hand.
  7. The manipulation method learning apparatus according to claim 2, wherein the internal reward calculation portion estimates a portion that is possibly in contact based on a relative position attitude between a model of a robot hand and a model of the target object and calculates the internal reward.
  8. The manipulation method learning apparatus according to claim 1, wherein the search portion collects, at a time of training, data based on the internal reward in advance and performs training.
  9. The manipulation method learning apparatus according to claim 1, wherein the external reward calculation portion uses, as the input, at least one of a position of the target object, an attitude of the target object, information representing a force when the device and the target object are in contact with each other, a detection value detected by a tactile sensor, an action command that has been previously used at a time of training, and a target contact position.
  10. The manipulation method learning apparatus according to claim 1, wherein the search portion stores the internal reward, the external reward, an action command to the device, and an action of the device for each search, and updates the first strategy and the second strategy for each predetermined number of the action.
  11. The manipulation method learning apparatus according to claim 1, wherein the device is a robot hand including two or more finger portions, a haptic sensor is attached to each fingertip of the robot hand, and a tactile sensor is attached to each portion of the robot hand.
  12. The manipulation method learning apparatus according to claim 1, wherein the internal reward calculation portion estimates a portion that is possibly in contact based on a relative position attitude between a model of a robot hand and a model of the target object and calculates the internal reward.
  13. The manipulation method learning apparatus according to claim 1, wherein the search portion facilitates a search that prefers a rare contact state.
  14. A manipulation method learning system that learns a manipulation method of a target object, the manipulation method learning system comprising: a device capable of manipulating the target object; and a processor that functions as: a first acquisition portion that acquires a state quantity of the device; a second acquisition portion that acquires a state quantity of the target object; a third acquisition portion that acquires a contact state of the device and the target object; an internal reward calculation portion that uses the contact state as an input and calculates an internal reward which is a reward independent of a target task from a frequency of occurrence of the contact state; an external reward calculation portion that uses the state quantity of the device, the state quantity of the target object, and a contact state quantity of the contact state as an input, searches a manipulation strategy of the target object, and calculates an external reward for a target task solution which is a reward for solving the target task; and a search portion that searches in consideration of the internal reward and the external reward simultaneously while probabilistically selecting a first strategy which is updated by using the internal reward and a second strategy which is updated by using the external reward.
  15. A manipulation method learning method by way of a computer of a manipulation method learning apparatus that learns a manipulation method of a target object, the manipulation method learning method including: acquiring, by using an environment model that includes a device capable of manipulating the target object, a state quantity of the device; acquiring a state quantity of the target object by using the environment model; acquiring a contact state of the device and the target object by using the environment model; using the contact state as an input and calculating an internal reward which is a reward independent of a target task from a frequency of occurrence of the contact state; using the state quantity of the device, the state quantity of the target object, and a contact state quantity of the contact state as an input, searching a manipulation strategy of the target object, and calculating an external reward for a target task solution which is a reward for solving the target task; and searching in consideration of the internal reward and the external reward simultaneously while probabilistically selecting a first strategy which is updated by using the internal reward and a second strategy which is updated by using the external reward.
  16. A computer-readable non-transitory recording medium including a program which causes a computer of a manipulation method learning apparatus that learns a manipulation method of a target object to: acquire, by using an environment model that includes a device capable of manipulating the target object, a state quantity of the device; acquire a state quantity of the target object by using the environment model; acquire a contact state of the device and the target object by using the environment model; use the contact state as an input and calculate an internal reward which is a reward independent of a target task from a frequency of occurrence of the contact state; use the state quantity of the device, the state quantity of the target object, and a contact state quantity of the contact state as an input, search a manipulation strategy of the target object, and calculate an external reward for a target task solution which is a reward for solving the target task; and search in consideration of the internal reward and the external reward simultaneously while probabilistically selecting a first strategy which is updated by using the internal reward and a second strategy which is updated by using the external reward.
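Claims 5 and 10 describe storing the two rewards, the action command, and the action for each search, and updating both strategies after each predetermined number of actions. A minimal sketch of that buffering behavior follows; the class name, the tuple layout, and the fixed `update_every` trigger are assumptions for illustration, not the claimed implementation.

```python
class SearchBuffer:
    """Stores (internal reward, external reward, action command, action)
    for each search step and signals a strategy update every `update_every`
    actions, as an illustrative reading of claims 5 and 10."""

    def __init__(self, update_every=128):
        self.update_every = update_every
        self.transitions = []

    def add(self, internal_reward, external_reward, action_command, action):
        # Record one search step.
        self.transitions.append(
            (internal_reward, external_reward, action_command, action)
        )
        # True when the predetermined number of actions has accumulated
        # and both strategies should be updated.
        return len(self.transitions) % self.update_every == 0
```

In use, the search portion would call `add` once per action and run one update of the first (internal-reward) strategy and one of the second (external-reward) strategy whenever `add` returns `True`.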

Description

CROSS-REFERENCE TO RELATED APPLICATION

Priority is claimed on Japanese Patent Application No. 2023-026923, filed on Feb. 24, 2023, the contents of which are incorporated herein by reference.

BACKGROUND

Field of the Invention

The present invention relates to a manipulation method learning apparatus, a manipulation method learning system, a manipulation method learning method, and a program.

Background

For example, with respect to a manipulation operation such as changing the holding state of an object by a robot hand having multiple fingers, development has been conducted using an operation plan based on a model designed by a human. Development has also been conducted on learning a prediction model from data and on acquiring a strategy without a model. For example, a method has been proposed in which, by attaching a haptic and/or tactile sensor over a wide area of a hand and gripping through trial and error, a GCN (Graph Convolutional Networks) model, which has a graph structure, is learned (for example, refer to Non-Patent Document 1 (Ken Funabashi, Tomoki Isobe, et al., "Realizing manipulation operation of various objects by multifingered robot hand using GCN and distributed tactile sensor", the 40th Annual Conference of the Robotics Society of Japan, RSJ2022, 4B2-03, 2022)). In such a method, a strategy is trained using an external reward, which is a reward for solving a given task.

SUMMARY

However, in the related art in which planning is performed using a model designed by a human, the applicable manipulations are limited, and because of errors between the model and the real world, robustness at execution time is not high.
Further, in the technique described in Non-Patent Document 1, when the learning target is a high-dimensional and complex problem, ingenuity such as incorporating a structure that compacts information is required; when a tactile sensor, a camera, and a multifingered hand are used together, no highly efficient approach has yet been developed, and the learning efficiency is poor. In addition, in the technique described in Non-Patent Document 1, useful trial and error may not be performed, and learning may not proceed.

An aspect of the present invention aims at providing a manipulation method learning apparatus, a manipulation method learning system, a manipulation method learning method, and a program that enable efficient trial and error when learning object manipulation.

A manipulation method learning apparatus according to a first aspect of the present invention is a manipulation method learning apparatus that learns a manipulation method of a target object, the manipulation method learning apparatus including: a storage portion that stores an environment model including a device capable of manipulating the target object; a first acquisition portion that acquires a state quantity of the device by using the environment model; a second acquisition portion that acquires a state quantity of the target object by using the environment model; a third acquisition portion that acquires a contact state of the device and the target object by using the environment model; an internal reward calculation portion that uses the contact state as an input and calculates an internal reward from a frequency of occurrence of the contact state; an external reward calculation portion that uses the state quantity of the device, the state quantity of the target object, and a contact state quantity as an input, searches a manipulation strategy of the target object, and calculates an external reward for a target task solution; and a search portion that searches while probabilistically selecting a first strategy which is updated by using the internal reward and a second strategy which is updated by using the external reward.

A second aspect is the manipulation method learning apparatus according to the first aspect described above, wherein the contact state may be at least one of information indicating a position which the device has already touched, a contact state estimated from information of a relative position attitude of the device and the target object, information indicating a position which has already been touched on the target object, information representing a force when the device and the target object are in contact with each other, and a detection value detected by a tactile sensor.

A third aspect is the manipulation method learning apparatus according to the first aspect described above, wherein the search portion may collect, at a time of training, data based on the internal reward in advance and perform training.

A fourth aspect is the manipulation method learning apparatus according to the first aspect described above, wherein the external reward calculation portion may use, as the input, at least one of a position of the target object, an attitude of the target object, information representing a force when the device and the target object are in contact with each other, a detection value detected by a tactile sensor