KR-102962740-B1 - Deep Neural Network Structure for Inducing Rational Reinforcement Learning Agent Behavior
Abstract
In the execution phase of a deep neural network reinforcement learning agent, an action induced by a user is provided as input to the agent. The agent calculates a final value function value by subtracting a parameter value, which reflects the degree of coercion of the induced action, from the value-function output of every action other than the induced action. The final action to be performed is then determined using the calculated final value function values. The parameter value may be changed to adjust the degree of coercion of the induced action. The present invention can effectively respond to situations in which a user of a reinforcement learning agent wishes to induce or command the agent to perform a specific action, in violation of the greedy action decision method by which the agent selects the action with the highest expected cumulative reward among the actions available in the current state.
Inventors
- 최호진
- 남제현
- 원준희
- 황예찬
- 구본홍
- 윤성열
- 최형균
- 이성후
- 심현우
- 김보라
- 이정욱
Assignees
- 한국과학기술원
- 주식회사 네비웍스
Dates
- Publication Date: 2026-05-08
- Application Date: 2022-05-31
- Priority Date: 2021-05-31
Claims (8)
- A data processing method performed by a program executed on a computing device, the data processing method being a method for inducing rational action decisions of a deep neural network reinforcement learning agent, wherein the program comprises code for causing a processor of the computing device to execute, in the execution phase after training of the deep neural network reinforcement learning agent is completed: (a) receiving an action induced by a user as input to the reinforcement learning agent; (b) calculating, in the reinforcement learning agent, an immediately and deterministically modified final value function value by subtracting a parameter (λ) value, which reflects the coercion of the induced action, from the value-function output of every action other than the induced action; and (c) determining, in the reinforcement learning agent, a final action to be performed using the calculated final value function value.
- The method of claim 1, wherein the reinforcement learning agent determines the final action by using the following equation as the modified value function q' and by simultaneously considering the induced action and the current observation information and situation: q'(s, a_j) = q(s, a_j) − λ for j ≠ i, and q'(s, a_i) = q(s, a_i), where a and s denote an action and the current situation, respectively, λ denotes the above parameter, i and j are indices of actions (a_i being the induced action), and q denotes a value function that approximates the expected reward obtained when an action is performed in the current situation.
- The method of claim 1, further comprising the step of changing the parameter value to adjust the degree of coercion of the induced action.
- The method of claim 3, wherein the step of changing the parameter value comprises changing the value of the parameter (λ) to a larger value so that the value-function outputs of all actions other than the induced action (a_i) become smaller, thereby increasing the likelihood that the reinforcement learning agent selects the induced action (a_i) as the final action.
- The method of claim 1, wherein the deep neural network is configured such that its last layer has a number of nodes equal to the number of actions the reinforcement learning agent can perform, and each node learns a value function that approximates the expected reward obtained when the corresponding action is performed in the current situation.
- The method of claim 1, wherein the induced action comprises a plurality of actions.
- A computer-executable program stored on a computer-readable recording medium for carrying out the method for inducing rational action decisions of a deep neural network reinforcement learning agent according to any one of claims 1 to 6.
- A computer-readable recording medium storing a computer-executable program for performing the method for inducing rational action decisions of a deep neural network reinforcement learning agent according to any one of claims 1 to 6.
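The value-function modification described in claims 1 to 4 can be sketched in a few lines of Python. This is a minimal illustrative implementation, not code from the patent; the function and variable names are assumed for clarity.

```python
def induced_action_selection(q_values, induced_idx, lam):
    """Select a final action per claims 1-4: subtract the coercion
    parameter lam (λ) from the value-function output of every action
    except the user-induced one, then choose greedily.

    q_values:    value-function outputs, one per action
    induced_idx: index i of the induced action a_i
    lam:         coercion parameter λ (larger -> stronger coercion)
    """
    # q'(s, a_j) = q(s, a_j) - λ for j != i; q'(s, a_i) = q(s, a_i)
    q_prime = [v if i == induced_idx else v - lam
               for i, v in enumerate(q_values)]
    # Final action: greedy argmax over the modified values
    return max(range(len(q_prime)), key=lambda i: q_prime[i])

# With λ = 0 the agent behaves greedily as usual; a sufficiently large
# λ guarantees the induced action is selected.
```

Raising λ (claim 4) shrinks every non-induced value until a_i dominates, while a small λ lets the agent override an unreasonable induced action when another action's value is much higher.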
Description
Method for Inducing Rational Action Decisions of a Deep Neural Network Reinforcement Learning Agent {Deep Neural Network Structure for Inducing Rational Reinforcement Learning Agent Behavior}

The present invention relates to a reinforcement learning agent in a deep neural network, and more specifically, to a method for inducing action decisions in a reinforcement learning agent.

Research on deep reinforcement learning, which combines reinforcement learning and deep learning, continues to advance with great success, with applications not only in gaming (e.g., Atari games, Go, and Dota 2) but also in fields such as computer vision, intelligent robotics, and natural language processing. Although the detailed structure of reinforcement learning models varies with each model's goals, they share the property of learning to make sequential decisions that maximize the expected cumulative reward in environments representable as a Markov Decision Process (MDP).

FIG. 1 is a conceptual diagram of a general reinforcement learning model that learns based on rewards. As illustrated, the reinforcement learning agent determines and executes the next action (A_t) using a policy, a function that outputs the action to perform given information about the state (S_t). The environment returns a reward (R_{t+1}) and the next state (S_{t+1}) based on the agent's current state and the action performed. The agent learns by evaluating previously performed actions through this reward signal and gradually updating its policy. A well-trained agent therefore has a policy that returns an action capable of obtaining a high cumulative reward for every possible state. Even after training is complete, the agent decides which action to perform in a given state based on its policy, just as in the training phase.
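The agent-environment loop of FIG. 1 can be expressed as a short sketch. This is a generic illustration of the standard interaction cycle, not code from the patent; the function signatures are assumptions.

```python
def run_episode(policy, env_step, initial_state, max_steps=100):
    """One episode of the agent-environment loop from FIG. 1.

    policy:   maps a state S_t to an action A_t
    env_step: maps (state, action) to (next_state, reward, done),
              i.e. returns S_{t+1} and R_{t+1}
    """
    state, total_reward = initial_state, 0.0
    for _ in range(max_steps):
        action = policy(state)                         # A_t = policy(S_t)
        state, reward, done = env_step(state, action)  # S_{t+1}, R_{t+1}
        total_reward += reward                         # accumulate return
        if done:
            break
    return total_reward
```

During training, the reward signal returned by `env_step` is used to update `policy`; after training, the loop runs with the learned policy fixed.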
At this point, the trained agent's action decision algorithm differs from that of the training phase in the following way. During the training phase, action decision algorithms (e.g., the epsilon-greedy algorithm) are frequently adopted to allow trial and error across as many situations as possible, for example by performing, with a certain probability, a random action instead of the action with the highest expected reward. After training is complete, trial and error is no longer necessary, so the agent greedily decides on the action that yields the highest expected reward. Therefore, in the general case after training, the reinforcement learning agent always selects and performs the action with the highest expected cumulative reward among the actions available in the current state.

However, users may sometimes encounter situations where they want to induce or command a reinforcement learning agent to violate this greedy decision-making method and perform a specific action. Examples include changing the driving direction of a self-driving car trained through reinforcement learning, or adjusting the play style or behavior of a game agent.

- FIG. 1 is a conceptual diagram of a general reinforcement learning model that learns based on rewards.
- FIG. 2 illustrates the input/output structure of a deep neural network when the number of possible actions in deep reinforcement learning is assumed to be N.
- FIG. 3 is a flowchart illustrating an algorithm of a method for inducing rational action decisions in a deep neural network reinforcement learning agent according to an exemplary embodiment of the present invention.
- FIG. 4 shows the process of deriving a new value function for making action decisions by considering observation information and induced actions.
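The contrast between training-phase and execution-phase action selection described above can be sketched as follows. This is a generic illustration of the standard epsilon-greedy and greedy rules, with hypothetical helper names, not code from the patent.

```python
import random

def greedy(q_values):
    """Execution-phase selection: always the highest expected reward."""
    return max(range(len(q_values)), key=lambda i: q_values[i])

def epsilon_greedy(q_values, epsilon):
    """Training-phase selection: with probability epsilon, explore by
    taking a random action for trial and error; otherwise act greedily."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return greedy(q_values)
```

Setting `epsilon` to 0 recovers the purely greedy execution-phase behavior; the present invention instead modifies the value function itself so that a user-induced action can override this greedy rule.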
Hereinafter, preferred embodiments of the present invention will be described in more detail with reference to the attached drawings. Identical components in the drawings are denoted by the same reference numerals, and redundant descriptions of identical components are omitted.

In most deep reinforcement learning with a discrete action space, the agent's actions are determined in a manner similar to solving a multi-class classification problem with a deep neural network. The last layer of the deep neural network contains a number of nodes equal to the number of actions the agent can perform, and each node learns a value function q that approximates the expected reward obtained when performing a specific action a_i in the current situation s. The problem can be solved by exploiting this structure of deep reinforcement learning: a new mechanism for guiding actions is added to the step that selects the action maximizing the value function q.

FIG. 2 illustrates the input/output structure of a deep neural network according to an exemplary embodiment of the present invention. Referring to FIG. 2, the input/output structure of the