US-12619911-B2 - Computing robust policies in offline reinforcement learning

US 12619911 B2

Abstract

According to one embodiment, a method, computer system, and computer program product for reinforcement learning is provided. The present invention may include training, using an offline dataset, a plurality of diverse reward models, and creating a policy based on an output of the reward models and a robustness operator of the reward models.

Inventors

  • Radu Marinescu
  • Parikshit Ram
  • Djallel Bouneffouf
  • Tejaswini Pedapati
  • Paulito Palmes

Assignees

  • INTERNATIONAL BUSINESS MACHINES CORPORATION

Dates

Publication Date
2026-05-05
Application Date
2022-06-17

Claims (20)

  1. A computer-implemented method for reinforcement learning, the method comprising: generating, by a processor, a set of diverse reward models, wherein the diverse reward models comprise a plurality of reinforcement learning models differing from each other in at least one of hyper-parameters and machine learning technique; training, by the processor, the diverse reward models in parallel to predict expected rewards for performing any of a plurality of state transitions and a plurality of actions using an offline dataset, wherein the offline dataset comprises the actions and the state transitions as performed by an agent and associated rewards; determining, by the processor using the plurality of diverse reward models, a robustness operator, wherein the robustness operator is a function which takes as its input a vector of the expected rewards predicted by the set of diverse reward models for any of the actions and the state transitions, and expresses the vector as a single number; computing, by the processor, a value function based on the robustness operator that determines how much reward the agent receives for the state transitions and the actions; and creating, by the processor, a policy by utilizing the value function to determine a sequence of the state transitions and the actions resulting in a maximum total reward, wherein the policy expresses the sequence.
  2. The method of claim 1, further comprising: visualizing a behavior of the policy to a user.
  3. The method of claim 1, further comprising: adjusting one or more of the hyper-parameters associated with the diverse reward models based on user feedback.
  4. The method of claim 1, wherein the plurality of diverse reward models are selected from regression models and neural networks.
  5. The method of claim 1, wherein the robustness operator comprises a tau-percentile over a k-dimensional vector of reward values.
  6. The method of claim 1, wherein the robustness operator comprises a minimum over a k-dimensional vector of reward values.
  7. The method of claim 1, wherein the robustness operator comprises a preference relation defined by a convex combination of elements in a k-dimensional reward vector (the three operators of claims 5-7 are illustrated in the sketch following the claims).
  8. A computer system for reinforcement learning, the computer system comprising: one or more processors, one or more computer-readable memories, one or more computer-readable tangible storage media, and program instructions stored on at least one of the one or more tangible storage media for execution by at least one of the one or more processors via at least one of the one or more memories, wherein the program instructions, when executed by the one or more processors, cause the computer system to perform a method comprising: generating a set of diverse reward models, wherein the diverse reward models comprise a plurality of reinforcement learning models differing from each other in at least one of hyper-parameters and machine learning technique; training the diverse reward models in parallel to predict expected rewards for performing any of a plurality of state transitions and a plurality of actions using an offline dataset, wherein the offline dataset comprises the actions and the state transitions as performed by an agent and associated rewards; determining, using the plurality of diverse reward models, a robustness operator, wherein the robustness operator is a function which takes as its input a vector of the expected rewards predicted by the set of diverse reward models for any of the actions and the state transitions, and expresses the vector as a single number; computing a value function based on the robustness operator that determines how much reward the agent receives for the state transitions and the actions; and creating a policy by utilizing the value function to determine a sequence of the state transitions and the actions resulting in a maximum total reward, wherein the policy expresses the sequence.
  9. The computer system of claim 8, further comprising: visualizing a behavior of the policy to a user.
  10. The computer system of claim 8, further comprising: adjusting one or more of the hyper-parameters associated with the diverse reward models based on user feedback.
  11. The computer system of claim 8, wherein the plurality of diverse reward models are selected from regression models and neural networks.
  12. The computer system of claim 8, wherein the robustness operator comprises a tau-percentile over a k-dimensional vector of reward values.
  13. The computer system of claim 8, wherein the robustness operator comprises a minimum over a k-dimensional vector of reward values.
  14. The computer system of claim 8, wherein the robustness operator comprises a preference relation defined by a convex combination of elements in a k-dimensional reward vector.
  15. A computer program product for reinforcement learning, the computer program product comprising: one or more computer-readable tangible storage media and program instructions stored on at least one of the one or more tangible storage media, the program instructions executable by a processor to cause the processor to perform a method comprising: generating a set of diverse reward models, wherein the diverse reward models comprise a plurality of reinforcement learning models differing from each other in at least one of hyper-parameters and machine learning technique; training the diverse reward models in parallel to predict expected rewards for performing any of a plurality of state transitions and a plurality of actions using an offline dataset, wherein the offline dataset comprises the actions and the state transitions as performed by an agent and associated rewards; determining, using the plurality of diverse reward models, a robustness operator, wherein the robustness operator is a function which takes as its input a vector of the expected rewards predicted by the set of diverse reward models for any of the actions and the state transitions, and expresses the vector as a single number; computing a value function based on the robustness operator that determines how much reward the agent receives for the state transitions and the actions; and creating a policy by utilizing the value function to determine a sequence of the state transitions and the actions resulting in a maximum total reward, wherein the policy expresses the sequence.
  16. The computer program product of claim 15, further comprising: visualizing a behavior of the policy to a user.
  17. The computer program product of claim 15, further comprising: adjusting one or more of the hyper-parameters associated with the diverse reward models based on user feedback.
  18. The computer program product of claim 15, wherein the plurality of diverse reward models are selected from regression models and neural networks.
  19. The computer program product of claim 15, wherein the robustness operator comprises a tau-percentile over a k-dimensional vector of reward values.
  20. The computer program product of claim 15, wherein the robustness operator comprises a minimum over a k-dimensional vector of reward values.
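
By way of illustration only, and not as part of the claims, the robustness operators recited in claims 5-7 each collapse the k-dimensional vector of per-model reward predictions into a single number. A minimal Python sketch, assuming NumPy; the function and parameter names are illustrative, not drawn from the specification:

```python
import numpy as np

def tau_percentile(rewards: np.ndarray, tau: float = 10.0) -> float:
    # Claim 5: the tau-th percentile over the k reward values.
    return float(np.percentile(rewards, tau))

def minimum(rewards: np.ndarray) -> float:
    # Claim 6: the worst-case (minimum) reward across the k models.
    return float(np.min(rewards))

def convex_combination(rewards: np.ndarray, weights: np.ndarray) -> float:
    # Claim 7: a preference relation as a convex combination of the
    # k reward values; weights are non-negative and sum to one.
    assert np.all(weights >= 0) and np.isclose(weights.sum(), 1.0)
    return float(np.dot(weights, rewards))
```

For example, given the vector [0.9, 0.4, 0.7] predicted by k=3 reward models for one state-action pair, the minimum operator yields the pessimistic scalar reward 0.4.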

Description

BACKGROUND

The present invention relates, generally, to the field of computing, and more particularly to machine learning. Machine learning is a technological field of inquiry concerned with creating and improving computerized systems that “learn”; that is to say, systems that leverage data to improve performance on a set of tasks. Machine learning algorithms build a model based on sample data, known as training data, to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as medicine, speech recognition, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.

SUMMARY

According to one embodiment, a method, computer system, and computer program product for reinforcement learning is provided. The present invention may include training, using an offline dataset, a plurality of diverse reward models; and creating a policy based on an output of the reward models and a robustness operator of the reward models.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings. The various features of the drawings are not to scale as the illustrations are for clarity in facilitating one skilled in the art in understanding the invention in conjunction with the detailed description. In the drawings:

FIG. 1 illustrates an exemplary networked computer environment according to at least one embodiment of the present invention;
FIG. 2 is an operational flowchart illustrating an offline learning process according to at least one embodiment of the present invention;
FIG. 3 is a system diagram depicting the components of an offline learner system according to at least one embodiment of the present invention;
FIG. 4 is a block diagram of internal and external components of computers and servers depicted in FIG. 1 according to at least one embodiment of the present invention;
FIG. 5 depicts a cloud computing environment according to an embodiment of the present invention; and
FIG. 6 depicts abstraction model layers according to an embodiment of the present invention.

DETAILED DESCRIPTION

Detailed embodiments of the claimed structures and methods are disclosed herein; however, it can be understood that the disclosed embodiments are merely illustrative of the claimed structures and methods that may be embodied in various forms. This invention may, however, be embodied in many different forms and should not be construed as limited to the exemplary embodiments set forth herein. In the description, details of well-known features and techniques may be omitted to avoid unnecessarily obscuring the presented embodiments.

Embodiments of the present invention relate to the field of computing, and more particularly to machine learning. The following described exemplary embodiments provide a system, method, and program product to, among other things, train multiple reinforcement learning models on an offline dataset and create a robust policy based on the output of the machine learning models and a robustness operator.
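
To make the summarized method concrete, the following is a minimal sketch of its first two steps: generating a set of diverse reward models and training them in parallel on an offline dataset. It assumes scikit-learn is available and that the offline transitions have already been flattened into feature rows X (state, action, next state) with observed rewards y; the particular models and hyper-parameters are illustrative choices, not the set prescribed by the specification:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.neural_network import MLPRegressor

def train_diverse_reward_models(X: np.ndarray, y: np.ndarray) -> list:
    # Models differ in machine learning technique and hyper-parameters,
    # giving the "diverse" ensemble the claims describe.
    models = [
        RandomForestRegressor(n_estimators=100, max_depth=8),
        GradientBoostingRegressor(learning_rate=0.05),
        MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500),
    ]
    # Train the reward models in parallel, one worker per model.
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        return list(pool.map(lambda m: m.fit(X, y), models))

def predict_reward_vector(models: list, x: np.ndarray) -> np.ndarray:
    # k-dimensional vector of expected rewards for one (state, action) row.
    return np.array([m.predict(x.reshape(1, -1))[0] for m in models])
```

Each entry of the returned vector is one model's expected reward for the same state transition and action; a robustness operator then reduces this vector to the single number used downstream.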
Therefore, the present embodiment has the capacity to improve the technical field of machine learning by providing an offline method of creating a policy for training an agent through reinforcement learning that is robust even with unreliable reward signals, and which can effectively tackle any sequential decision-making problem with multiple objectives and user tradeoffs.

As previously described, machine learning is a technological field of inquiry concerned with creating and improving computerized systems that “learn”; that is to say, systems that leverage data to improve performance on a set of tasks. Machine learning algorithms build a model based on sample data, known as training data, to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as medicine, speech recognition, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.

Reinforcement learning (RL) is a learning framework that handles sequential decision-making problems, wherein an ‘agent’ or decision maker learns a policy to optimize a long-term reward by interacting with an unknown environment. At each step, an RL agent obtains evaluative feedback, called reward or cost, about the performance of its action, allowing the agent to improve (maximize or minimize) the performance of subsequent actions. Reinforcement learning differs from supervised learning in that reinforcement learning does not need labelled input-output pairs to be presented, nor does it need sub-optimal actions to be explicitly corrected. Furthermore, th
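
Tying these pieces together, the following is a minimal tabular sketch of the remaining claimed steps: applying a robustness operator to the ensemble's reward predictions, computing a value function from the resulting robust rewards, and extracting a policy that maximizes total reward. It assumes a small, fully known transition model P (a simplification for illustration; the specification does not require one) and uses the minimum operator of claim 6:

```python
import numpy as np

def robust_value_iteration(P: np.ndarray, R_hat: np.ndarray,
                           gamma: float = 0.95, iters: int = 500):
    # P[s, a, s'] are transition probabilities; R_hat[s, a] holds the
    # k-vector of ensemble reward predictions for each (state, action).
    R = R_hat.min(axis=-1)             # robustness operator: worst case
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = R + gamma * (P @ V)        # Bellman backup on robust rewards
        V = Q.max(axis=1)              # value function over robust rewards
    policy = Q.argmax(axis=1)          # greedy policy: maximum total reward
    return V, policy
```

Because every Bellman backup uses the worst-case reward across the ensemble, the resulting greedy policy hedges against any single unreliable reward model, which is the robustness property the embodiments aim for.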