US-12619853-B2 - Decision-making device, unmanned system, decision-making method, and program
Abstract
A decision-making device (2) comprising: an action selection unit (200) for selecting one of a plurality of actions that can be taken in a first state so that an environment performs the selected action; a state acquisition unit (201) for acquiring a second state indicating the state that follows the execution of the action; a reward acquisition unit (202) for acquiring a reward serving as an indicator of whether or not the second state is desirable; and a storage processing unit (203) that stores experience data, in which the first state, the action, the second state, and the reward are associated with each other, in a storage unit (21) associated with the action, the storage unit (21) being one of a plurality of storage units.
Inventors
- Yusuke HAZUI
- Yasuo Fujishima
- Natsuki MATSUNAMI
Assignees
- MITSUBISHI HEAVY INDUSTRIES, LTD.
Dates
- Publication Date: 2026-05-05
- Application Date: 2019-06-17
- Priority Date: 2018-06-28
Claims (12)
- 1 . A decision-making device that is used in an unmanned system having a machine that operates in an unmanned manner, the device comprising: a hardware processor; and a plurality of storage units, each of which is assigned to only one corresponding action of a plurality of actions that are performed by the machine, wherein the hardware processor functions as: an action selection unit that selects any one of the plurality of actions allowed to be taken in a first state so that an environment performs the selected action; a state acquisition unit that acquires a second state indicating a state after the selected action is performed; a reward acquisition unit that acquires a reward as an index indicating whether the second state is desirable; and a storage processing unit that stores experience data in a storage unit i) that is one of the plurality of storage units and ii) that is assigned to a specific action determined to be the same as the selected action, the specific action belonging to one of the plurality of actions, the experience data being data in which the first state, the selected action, the second state, and the reward are associated with each other, wherein the decision-making device functions as an agent that observes what state the machine has transitioned to due to the selected action, and performs reinforcement learning to determine an action in response to the observed state, wherein respective experience data is stored in each of the plurality of storage units, each of which is configured to receive different experience data associated with a different selected action, thus preventing a situation in which experience data of a first selected action is overwritten by experience data of a second selected action that is performed more frequently than the first selected action in each of the plurality of storage units, and wherein the hardware processor further functions as a deletion processing unit that, in response to an amount of experience data stored in a storage unit of the plurality of storage units reaching an upper limit value, deletes the experience data that has been used most in learning.
- 2 . The decision-making device according to claim 1 , wherein the hardware processor further functions as: a deletion processing unit that, when an amount of experience data stored in a storage unit of the plurality of storage units reaches an upper limit value, deletes the oldest experience data.
- 3 . The decision-making device according to claim 1 , wherein the hardware processor further functions as: a learning unit that randomly selects and extracts a predetermined number of pieces of experience data from each of the plurality of storage units as learning data, and updates a learning model for estimating an action having a highest value in the first state based on the learning data.
- 4 . The decision-making device according to claim 3 , wherein the learning unit selects and extracts the same number of pieces of experience data from each of the plurality of storage units as the learning data.
- 5 . The decision-making device according to claim 3 , wherein, when a number of pieces of experience data stored in a storage unit of the plurality of storage units does not satisfy the predetermined number, the learning unit extracts all the pieces of experience data as the learning data.
- 6 . An unmanned system comprising: the decision-making device according to claim 1 .
- 7 . The decision-making device according to claim 1 , wherein the reward acquired by the reward acquisition unit is a continuous value based on a predetermined reward calculation expression.
- 8 . A decision-making method of a decision-making device that is used in an unmanned system having a machine that operates in an unmanned manner, the device comprising a hardware processor and a plurality of storage units, each of which is assigned to only one corresponding action of a plurality of actions that are performed by the machine, the method comprising: a step of selecting any one of the plurality of actions allowed to be taken in a first state so that an environment performs the selected action; a step of acquiring a second state indicating a state after the selected action is performed; a step of acquiring a reward as an index indicating whether the second state is desirable; and a step of storing experience data in a storage unit i) that is one of the plurality of storage units and ii) that is assigned to a specific action determined to be the same as the selected action, the specific action belonging to one of the plurality of actions, the experience data being data in which the first state, the selected action, the second state, and the reward are associated with each other, wherein the method further comprises a step of observing what state the machine has transitioned to due to the selected action, and performing reinforcement learning to determine an action in response to the observed state, wherein respective experience data is stored in each of the plurality of storage units, each of which is configured to receive different experience data associated with a different selected action, thus preventing a situation in which experience data of a first selected action is overwritten by experience data of a second selected action that is performed more frequently than the first selected action in each of the plurality of storage units, and wherein the method further comprises a step of deleting, in response to an amount of experience data stored in a storage unit of the plurality of storage units reaching an upper limit value, the experience data that has been used most in learning.
- 9 . The decision-making method according to claim 8 , wherein the reward acquired in the step of acquiring the reward is a continuous value based on a predetermined reward calculation expression.
- 10 . A non-transitory computer-readable medium that stores a program causing a computer of a decision-making device that is used in an unmanned system having a machine that operates in an unmanned manner to function, the device comprising a hardware processor and a plurality of storage units, each of which is assigned to only one corresponding action of a plurality of actions that are performed by the machine, the program causing the computer to execute: a step of selecting any one of the plurality of actions allowed to be taken in a first state so that an environment performs the selected action; a step of acquiring a second state indicating a state after the selected action is performed; a step of acquiring a reward as an index indicating whether the second state is desirable; and a step of storing experience data in a storage unit i) that is one of the plurality of storage units and ii) that is assigned to a specific action determined to be the same as the selected action, the specific action belonging to one of the plurality of actions, the experience data being data in which the first state, the selected action, the second state, and the reward are associated with each other, wherein the program further causes the computer to execute a step of observing what state the machine has transitioned to due to the selected action, and performing reinforcement learning to determine an action in response to the observed state, wherein respective experience data is stored in each of the plurality of storage units, each of which is configured to receive different experience data associated with a different selected action, thus preventing a situation in which experience data of a first selected action is overwritten by experience data of a second selected action that is performed more frequently than the first selected action in each of the plurality of storage units, and wherein the program further causes the computer to execute a step of deleting, in response to an amount of experience data stored in a storage unit of the plurality of storage units reaching an upper limit value, the experience data that has been used most in learning.
- 11 . The non-transitory computer-readable medium according to claim 10 , wherein the reward acquired in the step of acquiring the reward is a continuous value based on a predetermined reward calculation expression.
- 12 . A decision-making device that is used in an unmanned system having a machine that operates in an unmanned manner, the device comprising: a hardware processor; and a plurality of storage units, each of which is assigned to only one corresponding action of a plurality of actions that are performed by the machine, wherein the hardware processor functions as: an action selection unit that selects any one of the plurality of actions allowed to be taken in a first state so that an environment performs the selected action; a state acquisition unit that acquires a second state indicating a state after the selected action is performed; a reward acquisition unit that acquires a reward as an index indicating whether the second state is desirable; and a storage processing unit that stores experience data in a storage unit i) that is one of the plurality of storage units and ii) that is assigned to a specific action determined to be the same as the selected action, the specific action belonging to one of the plurality of actions, the experience data being data in which the first state, the selected action, the second state, and the reward are associated with each other, wherein the decision-making device functions as an agent that observes what state the machine has transitioned to due to the selected action, and performs reinforcement learning to determine an action in response to the observed state, wherein respective experience data is stored in each of the plurality of storage units, each of which is configured to receive different experience data associated with a different selected action, thus preventing a situation in which experience data of a first selected action is overwritten by experience data of a second selected action that is performed more frequently than the first selected action in each of the plurality of storage units, wherein the hardware processor further functions as a learning unit that selects and extracts experience data from the plurality of storage units as learning data, and updates a learning model for estimating an action having a highest value in the first state based on the learning data, and wherein, in response to a number of pieces of experience data stored in a storage unit of the plurality of storage units not satisfying a predetermined number of pieces of experience data from each of the plurality of storage units, the learning unit extracts all the pieces of experience data as the learning data.
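The per-action storage and sampling recited in the claims above can be illustrated with a minimal, hypothetical Python sketch. The class name PerActionReplayBuffer, the method names, and the parameter n_per_action are illustrative assumptions, not terms from the patent; the sketch uses oldest-first eviction (the policy of claim 2; the policy of claims 1, 8, and 10, deleting the experience data used most in learning, would instead require a per-entry use counter), and its batch sampling draws the same number of pieces from each buffer, taking all pieces when a buffer holds fewer than requested (claims 3 to 5 and 12).

```python
import random
from collections import deque


class PerActionReplayBuffer:
    """Hypothetical sketch: one bounded experience buffer per action."""

    def __init__(self, actions, capacity):
        # One FIFO deque per action: a frequently selected action can
        # never evict the experience data of a rarely selected action.
        self.buffers = {a: deque(maxlen=capacity) for a in actions}

    def store(self, state, action, next_state, reward):
        # Route the experience tuple to the buffer assigned to its action.
        # A full deque(maxlen=...) silently drops its oldest entry, i.e.
        # the oldest-first deletion of claim 2.
        self.buffers[action].append((state, action, next_state, reward))

    def sample_batch(self, n_per_action):
        # Draw the same number of pieces from every buffer (claims 3-4);
        # if a buffer holds fewer than n_per_action pieces, take them all
        # (claims 5 and 12).
        batch = []
        for buf in self.buffers.values():
            if len(buf) <= n_per_action:
                batch.extend(buf)
            else:
                batch.extend(random.sample(list(buf), n_per_action))
        random.shuffle(batch)
        return batch
```

A DQN-style learning unit would then update its learning model from such a batch; because each action owns its own bounded storage area, eviction pressure from one action never touches the buffers of the others.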
Description
TECHNICAL FIELD

The present disclosure relates to a decision-making device, an unmanned system, a decision-making method, and a program. The present application claims priority based on Japanese Patent Application No. 2018-123527 filed in Japan on Jun. 28, 2018, the contents of which are incorporated herein by reference.

BACKGROUND ART

In recent years, machine learning using deep learning, which has a high computational load, has become widespread owing to the high performance of computers and the like. For example, as a technology obtained by combining deep learning and reinforcement learning, there is a technology called Deep Q Network (DQN) that learns an optimal action in a certain state of a control target (environment). In the DQN, an agent, as the learning subject, observes what state the environment has transitioned to as a result of an action performed when the environment is in a certain state, and acquires a reward for this state transition. The agent collects many pieces of experience data in which the state before the transition, the action, the state after the transition, and the reward are associated with each other, and approximates, with a multilayer neural network, an action value function that gives the value of an action in a certain state, based on the experience data. In the DQN, as described above, the action value function for estimating the optimal action (the action that can be expected to obtain the most reward) in various states is learned and updated based on the experience data.

Note that, since experience data that is continuous in time series has a strong correlation, if an agent performs learning using only newly stored experience data, for example, the estimation accuracy for old experience data may be degraded and the convergence of the action value function may deteriorate. Therefore, in order to suppress the bias of the data used for learning, a technique called experience replay, in which learning data is randomly selected from the previously accumulated experience data before learning is performed, has been considered. When the storage area reaches its upper limit, the experience data accumulated for experience replay is deleted in chronological order, First In First Out (FIFO). In this manner, similar data that is close to the current time in the time series is left in the storage area.

As a method of eliminating such a bias of the experience data, for example, PTL 1 discloses a method of calculating a uniqueness parameter, which indicates how different each piece of accumulated experience data is from the other pieces, and deleting experience data having a high similarity with other pieces of experience data based on that parameter.

CITATION LIST

Patent Literature

[PTL 1] Japanese Unexamined Patent Application Publication No. 2018-005739
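The conventional experience replay just described can be summarized in a short, hypothetical Python sketch (the buffer size of 1000 and all names are illustrative assumptions, not taken from the patent or PTL 1):

```python
import random
from collections import deque

# Conventional experience replay (hypothetical sketch): a single shared
# buffer holds the tuples for all actions, and FIFO eviction removes the
# oldest tuple once the upper limit of the storage area is reached.
buffer = deque(maxlen=1000)

def store(state, action, next_state, reward):
    # Appending to a full deque drops the oldest tuple (FIFO).
    buffer.append((state, action, next_state, reward))

def sample_learning_data(batch_size):
    # Random selection suppresses time-series correlation, but if one
    # action is selected far more often than the others, most stored
    # tuples, and hence most sampled tuples, belong to that action.
    return random.sample(list(buffer), min(batch_size, len(buffer)))
```

Because all actions share one bounded storage area, tuples for a rarely selected action are gradually pushed out by tuples for frequently selected actions, which is the bias addressed by the present disclosure.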
SUMMARY OF INVENTION

Technical Problem

However, in the method in the related art, for example, in a case where the number of actions is limited, some actions may not be performed even though the actions are randomly selected. In this case, the accumulated experience data will also be biased. In addition, since the experience data includes various parameters, it may be difficult to select an appropriate uniqueness parameter. As a result, the bias of the experience data accumulated in the storage area cannot be sufficiently eliminated, and, for an action having a small number of pieces of experience data, learning opportunities may be reduced and learning accuracy may be degraded.

At least one embodiment of the present invention provides a decision-making device, an unmanned system, a decision-making method, and a program with which it is possible to suppress the bias of experience data.

Solution to Problem

According to a first aspect of the present invention, a decision-making device includes an action selection unit that selects any one of a plurality of actions allowed to be taken in a first state so that an environment performs the selected action, a state acquisition unit that acquires a second state indicating a state after the action is performed, a reward acquisition unit that acquires a reward as an index indicating whether the second state is desirable, and a storage processing unit that stores experience data in a storage unit associated with the action among a plurality of storage units, the experience data being data in which the first state, the action, the second state, and the reward are associated with each other. In this case, the decision-making device can prevent a situation in which the experience data stored in the storage units is biased depending on the selection frequency of each action.

According to a second aspect of the present invention, a decision-making device includes an action selection unit that selects any one of a plurality of actions allowed to be taken in a first state so that an environment per
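The bias-suppression effect described for the first aspect can be seen in a toy numerical comparison (the 95/5 selection ratio, the buffer sizes, and all names are hypothetical assumptions made for illustration only, not figures from the patent):

```python
import random
from collections import Counter, deque

# Toy scenario: action "a" is selected 95% of the time, action "b" 5%,
# over 10,000 steps, with a total budget of 100 stored pieces.
random.seed(0)
episode = random.choices(["a", "b"], weights=[95, 5], k=10_000)

# Shared FIFO buffer of 100 pieces: only the tail of the episode survives,
# so typically just a handful of "b" pieces remain.
shared = deque(episode, maxlen=100)
print(Counter(shared))

# Per-action buffers of 50 pieces each (action labels stand in for the
# full experience tuples): "b" keeps its own 50 slots, so its experience
# is never overwritten by the far more frequent "a".
per_action = {a: deque(maxlen=50) for a in ("a", "b")}
for a in episode:
    per_action[a].append(a)
print({a: len(buf) for a, buf in per_action.items()})  # {'a': 50, 'b': 50}
```

Under the shared buffer, the rare action's experience data is crowded out; under per-action storage units, both actions retain a full complement of experience data for learning.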