
US-12619912-B2 - Device, computer program and computer-implemented method for machine learning

US12619912B2

Abstract

A device, computer program, and computer-implemented method for machine learning. The method comprises: providing a task comprising an action space of a multi-armed bandit problem or a contextual bandit problem and a distribution over rewards that is conditioned on actions; providing a hyperprior, wherein the hyperprior is a distribution over the action space; and determining, depending on the hyperprior, a hyperposterior so that a lower bound for an expected reward on future bandit tasks has as large a value as possible when using priors sampled from the hyperposterior, wherein the hyperposterior is a distribution over the action space.

Inventors

  • Hamish Flynn
  • David Reeb
  • Jan Peters
  • Melih Kandemir

Assignees

  • ROBERT BOSCH GMBH

Dates

Publication Date
2026-05-05
Application Date
2022-07-06
Priority Date
2021-07-22

Claims (9)

  1. A computer-implemented method for machine learning, the method comprising the following steps: providing a task including an action space of a multi-armed bandit problem or a contextual bandit problem and a distribution over rewards that is conditioned on actions; providing a hyperprior, wherein the hyperprior is a distribution over the action space; initializing a posterior with a prior; determining from a set of behavior policies of the task a behavior policy that is associated with the task posterior, wherein the behavior policy includes a distribution over actions with a probability mass; sampling or selecting randomly from the probability mass an action; sampling a reward depending on the action from the distribution over rewards; determining a task dataset that includes the action and the reward; updating the posterior to include the determined task dataset; determining, depending on the hyperprior, a hyperposterior so that a lower bound for an expected reward on future bandit tasks is maximized, when using priors sampled from the hyperposterior, wherein the hyperposterior is a distribution over the action space; processing sensor data including digital image data or audio data, depending on a prior sampled from the hyperposterior, for classifying the sensor data; and (i) detecting presence of objects in the sensor data, or (ii) performing a semantic segmentation on the sensor data, or (iii) determining a measure for robustness of the machine learning including a probability that an expected error on a next task is not above a predetermined value, when sampling priors from the hyperposterior, or (iv) detecting an anomaly in sensor data depending on a prior sampled from the hyperposterior, or (v) learning a policy for controlling a physical system and determining a control signal for controlling the physical system depending on a prior sampled from the hyperposterior.
  2. The method according to claim 1, further comprising: determining the hyperposterior in a number of iterations; and sampling a prior of an iteration from the hyperposterior of a previous iteration.
  3. The method according to claim 2, further comprising: sampling the task of the iteration from a distribution over tasks.
  4. The method according to claim 1, further comprising: initializing the hyperposterior with the hyperprior.
  5. The method according to claim 1, wherein the providing of the task includes providing the task including a state space and a distribution over initial states, and wherein the method further comprises sampling or selecting randomly an initial state from the distribution over initial states, and wherein the distribution over rewards is conditioned on the actions and states of the state space.
  6. The method according to claim 1, the method further comprising: initializing the task dataset with an empty set and then updating the posterior in a predetermined number of rounds.
  7. The method according to claim 1, wherein the determining of the hyperposterior includes determining an approximation of an expected reward depending on a Kullback-Leibler divergence of the hyperposterior and the hyperprior.
  8. A device for machine learning, the device configured to: provide a task including an action space of a multi-armed bandit problem or a contextual bandit problem and a distribution over rewards that is conditioned on actions; provide a hyperprior, wherein the hyperprior is a distribution over the action space; initialize a posterior with a prior; determine from a set of behavior policies of the task a behavior policy that is associated with the task posterior, wherein the behavior policy includes a distribution over actions with a probability mass; sample or select randomly from the probability mass an action; sample a reward depending on the action from the distribution over rewards; determine a task dataset that includes the action and the reward; update the posterior to include the determined task dataset; determine, depending on the hyperprior, a hyperposterior so that a lower bound for an expected reward on future bandit tasks is maximized, when using priors sampled from the hyperposterior, wherein the hyperposterior is a distribution over the action space; process sensor data including digital image data or audio data, depending on a prior sampled from the hyperposterior, for classifying the sensor data; and (i) detect presence of objects in the sensor data, or (ii) perform a semantic segmentation on the sensor data, or (iii) determine a measure for robustness of the machine learning including a probability that an expected error on a next task is not above a predetermined value, when sampling priors from the hyperposterior, or (iv) detect an anomaly in sensor data depending on a prior sampled from the hyperposterior, or (v) learn a policy for controlling a physical system and determine a control signal for controlling the physical system depending on a prior sampled from the hyperposterior.
  9. A non-transitory computer-readable medium on which is stored a computer program including computer-readable instructions for machine learning, the instructions, when executed by a computer, causing the computer to perform the following steps: providing a task including an action space of a multi-armed bandit problem or a contextual bandit problem and a distribution over rewards that is conditioned on actions; providing a hyperprior, wherein the hyperprior is a distribution over the action space; initializing a posterior with a prior; determining from a set of behavior policies of the task a behavior policy that is associated with the task posterior, wherein the behavior policy includes a distribution over actions with a probability mass; sampling or selecting randomly from the probability mass an action; sampling a reward depending on the action from the distribution over rewards; determining a task dataset that includes the action and the reward; updating the posterior to include the determined task dataset; determining, depending on the hyperprior, a hyperposterior so that a lower bound for an expected reward on future bandit tasks is maximized, when using priors sampled from the hyperposterior, wherein the hyperposterior is a distribution over the action space; processing sensor data including digital image data or audio data, depending on a prior sampled from the hyperposterior, for classifying the sensor data; and (i) detecting presence of objects in the sensor data, or (ii) performing a semantic segmentation on the sensor data, or (iii) determining a measure for robustness of the machine learning including a probability that an expected error on a next task is not above a predetermined value, when sampling priors from the hyperposterior, or (iv) detecting an anomaly in sensor data depending on a prior sampled from the hyperposterior, or (v) learning a policy for controlling a physical system and determining a control signal for controlling the physical system depending on a prior sampled from the hyperposterior.
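For illustration, the per-task loop recited in claim 1 (initialize a posterior with a prior, select actions via a behavior policy, sample rewards, and update the posterior with the growing task dataset) can be sketched as follows. This is a minimal sketch assuming a Bernoulli multi-armed bandit with Beta posteriors and a Thompson-sampling-style behavior policy; the function and variable names are illustrative, not the patent's specific embodiment.

```python
import random

def run_task(reward_probs, prior, n_rounds=100, rng=None):
    """One bandit task per claim 1: initialize the posterior with a prior,
    pick actions, sample rewards, and update the posterior with the task
    dataset.

    reward_probs: true Bernoulli reward probability per arm (the task's
                  distribution over rewards, conditioned on actions).
    prior: list of (alpha, beta) Beta parameters per arm (in the lifelong
           setting, this prior would be sampled from the hyperposterior).
    """
    rng = rng or random.Random(0)
    posterior = [list(ab) for ab in prior]  # initialize posterior with prior
    dataset = []                            # task dataset, initially empty
    for _ in range(n_rounds):
        # Behavior policy (Thompson-style): sample a mean reward per arm
        # from the posterior and play the arm with the largest sample.
        samples = [rng.betavariate(a, b) for a, b in posterior]
        action = max(range(len(samples)), key=samples.__getitem__)
        # Sample a reward conditioned on the chosen action.
        reward = 1 if rng.random() < reward_probs[action] else 0
        dataset.append((action, reward))
        # Update the posterior to include the new (action, reward) pair.
        posterior[action][0] += reward
        posterior[action][1] += 1 - reward
    return posterior, dataset
```

Over the rounds, the posterior concentrates on the better arms, so the behavior policy plays them increasingly often.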

Description

CROSS REFERENCE

The present application claims the benefit under 35 U.S.C. § 119 of German Patent Application No. DE 10 2021 207 868.0, filed on Jul. 22, 2021, which is expressly incorporated herein by reference in its entirety.

BACKGROUND INFORMATION

The present invention relates to a device, a computer program, and a computer-implemented method for machine learning. "A PAC-Bayesian bound for Lifelong Learning," by Anastasia Pentina and Christoph H. Lampert, arXiv:1311.2838, describes aspects of a lifelong learning setting for machine learning.

SUMMARY

A computer-implemented method, a device, and a computer program according to the present invention provide improved machine learning, in particular based on an objective function for training a lifelong learning system. In accordance with an example embodiment of the present invention, the computer-implemented method for machine learning comprises: providing a task comprising an action space of a multi-armed bandit problem or a contextual bandit problem and a distribution over rewards that is conditioned on actions; providing a hyperprior, wherein the hyperprior is a distribution over the action space; and determining, depending on the hyperprior, a hyperposterior so that a lower bound for an expected reward on future bandit tasks has as large a value as possible when using priors sampled from the hyperposterior, wherein the hyperposterior is a distribution over the action space. This means that the tasks are either multi-armed bandit or contextual bandit problems. The lower bound is a PAC-Bayesian generalisation bound that lower bounds the expected reward on future bandit tasks when using priors sampled from a hyperposterior. The lower bound is used as the objective function for lifelong learning, in which, after observing a task, the hyperposterior is updated such that it approximately maximises the lower bound.
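The hyperposterior update described above can be sketched for the simple case of a finite set of candidate priors, where the hyperposterior is a categorical distribution over that set. Assuming a bound of the linear PAC-Bayes form E_Q[reward] − KL(Q‖P)/β, the maximizing hyperposterior Q has the closed-form Gibbs (exponential-weights) shape sketched below; the finite candidate set, the Gibbs form, and the parameter β are illustrative assumptions, not the patent's specific construction.

```python
import math

def update_hyperposterior(hyperprior, empirical_rewards, beta=1.0):
    """Exponential-weights (Gibbs) update over a finite set of candidate priors.

    For an objective of the form  E_Q[reward] - KL(Q || P) / beta,  the
    maximizing hyperposterior Q is proportional to
    P(prior) * exp(beta * reward(prior)).

    hyperprior: list of hyperprior probabilities P over candidate priors.
    empirical_rewards: mean observed reward of each candidate prior on the
                       tasks observed so far.
    """
    weights = [p * math.exp(beta * r)
               for p, r in zip(hyperprior, empirical_rewards)]
    total = sum(weights)
    return [w / total for w in weights]  # normalized hyperposterior Q
```

Candidate priors that earned higher rewards on observed tasks receive exponentially more hyperposterior mass, while the KL term keeps the update anchored to the hyperprior.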
The lower bound contains observable quantities and as a result can be computed using training data from observed tasks. This way, prior knowledge from a set of observed multi-armed bandit or contextual bandit tasks is transferred to new multi-armed bandit or contextual bandit tasks. In accordance with an example embodiment of the present invention, determining the hyperposterior may comprise determining the hyperposterior that maximises the lower bound for the expected reward. The hyperposterior that maximises the lower bound is expected to assign high probability to priors that result in low error on future tasks. Many problems can be represented as either multi-armed bandit problems or contextual bandit problems; therefore, the method can be used to transfer prior knowledge to many types of tasks. In accordance with an example embodiment of the present invention, the method may comprise: processing sensor data, in particular digital image data or audio data, depending on a prior sampled from the hyperposterior, in particular for classifying the sensor data, detecting the presence of objects in the sensor data, or performing a semantic segmentation on the sensor data; or determining a measure for robustness of the machine learning, in particular a probability that an expected error on a next task is not above a predetermined value, when sampling priors from the hyperposterior; or detecting an anomaly in sensor data depending on a prior sampled from the hyperposterior; or learning a policy for controlling a physical system and determining a control signal for controlling the physical system depending on a prior sampled from the hyperposterior.
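The transfer step itself (drawing a prior for a new task from the learned hyperposterior) is straightforward; the sketch below again assumes a categorical hyperposterior over a finite set of candidate priors, with illustrative names.

```python
import random

def sample_prior(hyperposterior, candidate_priors, rng=None):
    """Transfer step: draw a prior for a new bandit task from the learned
    hyperposterior, here a categorical distribution over candidate priors.

    hyperposterior: probability of each candidate prior.
    candidate_priors: the candidate priors themselves.
    """
    rng = rng or random.Random(0)
    return rng.choices(candidate_priors, weights=hyperposterior, k=1)[0]
```

A new task is then solved with the sampled prior, so that experience from earlier tasks shapes the starting posterior of the new one.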
Therefore, the method is used to transfer prior knowledge to new problems, in particular new classification problems, new robustness problems, new anomaly detection problems, and/or new control problems. The lower bound is also a measure for the robustness of the machine learning. The hyperposterior also provides information about an expected error on the next task when priors are sampled from the hyperposterior. The expected error is, for example, not above a certain value when priors are sampled from a certain hyperposterior. In accordance with an example embodiment of the present invention, the method may comprise determining the hyperposterior in a number of iterations, and sampling the prior of an iteration from the hyperposterior of a previous iteration. Thus, in an iteration of the machine learning, prior knowledge is transferred from earlier iterations. In accordance with an example embodiment of the present invention, the method may comprise sampling the task of the iteration from a distribution over tasks. Processing a large number of different tasks, in particular, improves the result of the machine learning. In accordance with an example embodiment of the present invention, the method preferably comprises initializing the hyperposterior with the hyperprior.
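Claim 7 ties the hyperposterior update to a Kullback-Leibler divergence between the hyperposterior and the hyperprior. For discrete distributions, that term and a generic PAC-Bayes-style lower bound built from it can be sketched as follows; the exact square-root form of the complexity term and the confidence parameter delta are illustrative assumptions, not the bound claimed in the patent.

```python
import math

def kl_divergence(q, p):
    """KL(q || p) for discrete distributions over the same support."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def lower_bound(mean_reward, q, p, n_tasks, delta=0.05):
    """Generic PAC-Bayes-style lower bound on the expected future reward:
    the empirical mean reward minus a complexity term driven by
    KL(hyperposterior || hyperprior), shrinking as more tasks are observed.

    Holds with probability at least 1 - delta in typical PAC-Bayes bounds.
    """
    complexity = math.sqrt(
        (kl_divergence(q, p) + math.log(1.0 / delta)) / (2 * n_tasks)
    )
    return mean_reward - complexity
```

A hyperposterior equal to the hyperprior contributes no KL penalty, so the bound trades off empirical reward against how far the hyperposterior moves away from the hyperprior.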