CN-114641778-B - Gated linear contextual bandits

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for selecting an action in response to each context in a sequence of context inputs. One of the methods includes maintaining data specifying a respective gated linear network corresponding to each of a plurality of actions and, for each context in a sequence of contexts: for each action, processing the context using the gated linear network corresponding to the action to generate a predictive probability; for each action, generating an action score for the action based at least on the predictive probability; and selecting an action to perform in response to the context based on the action scores.

Inventors

Ellen Sezener
Joel William Weines
Marcus Huttle
WANG JIANAN
DAVID BADEN

Assignees

DeepMind Technologies Limited (渊慧科技有限公司)

Dates

Publication Date: 20260512
Application Date: 20201008
Priority Date: 20191008

Claims (11)

1. A method of selecting an action from a set of actions to be performed in response to each context in a sequence of contexts, wherein each action in the set of actions is a recommendation for a content item, and each context in the sequence of contexts represents a feature vector characterizing a current recommendation setting including data describing an environment in which the content item is to be recommended, the method comprising:
   maintaining data specifying a respective gated linear network corresponding to each action in the set of actions, wherein each gated linear network is configured to predict a probability that a return will be received if the corresponding action is performed in response to an input context, and wherein each gated linear network comprises a plurality of layers, each layer comprising one or more neurons, wherein each neuron in each layer after a first layer is configured to (i) receive the input context, (ii) receive predictions from neurons in a previous layer, (iii) apply a gating function to the input context to select a weight vector, and (iv) generate, as an output, a geometric mixture of the predictions from the neurons in the previous layer based on the selected weight vector; and
   for each context in the sequence of contexts:
      for each action, processing the context using the gated linear network corresponding to the action to generate a predictive probability;
      for each action, generating an action score for the action based at least on the predictive probability, including:
         calculating a pseudo count for the action, wherein calculating the pseudo count includes determining an overlap between (i) a signature of the context across the gating functions of the neurons in the gated linear network for the action and (ii) a signature of any earlier context in the sequence of contexts for which the action was selected as the action performed in response to the earlier context, wherein the signature of the context characterizes the outputs of the gating functions of the neurons in the gated linear network corresponding to the action that are generated by processing the context; and
         generating the action score based on the predictive probability for the action and the pseudo count for the action; and
      selecting an action to be performed in response to the context based on the action scores.
2. A method of selecting an action from a set of actions to be performed in response to each context in a sequence of contexts, wherein each action in the set of actions is a user interface element presented to a user in a presentation setting, and each context in the sequence of contexts represents a feature vector characterizing the presentation setting, the user, or both, the method comprising:
   maintaining data specifying a respective gated linear network corresponding to each action in the set of actions, wherein each gated linear network is configured to predict a probability that a return will be received if the corresponding action is performed in response to an input context, and wherein each gated linear network comprises a plurality of layers, each layer comprising one or more neurons, wherein each neuron in each layer after a first layer is configured to (i) receive the input context, (ii) receive predictions from neurons in a previous layer, (iii) apply a gating function to the input context to select a weight vector, and (iv) generate, as an output, a geometric mixture of the predictions from the neurons in the previous layer based on the selected weight vector; and
   for each context in the sequence of contexts:
      for each action, processing the context using the gated linear network corresponding to the action to generate a predictive probability;
      for each action, generating an action score for the action based at least on the predictive probability, including:
         calculating a pseudo count for the action, wherein calculating the pseudo count includes determining an overlap between (i) a signature of the context across the gating functions of the neurons in the gated linear network for the action and (ii) a signature of any earlier context in the sequence of contexts for which the action was selected as the action performed in response to the earlier context, wherein the signature of the context characterizes the outputs of the gating functions of the neurons in the gated linear network corresponding to the action that are generated by processing the context; and
         generating the action score based on the predictive probability for the action and the pseudo count for the action; and
      selecting an action to be performed in response to the context based on the action scores.
3. The method of claim 1 or 2, wherein selecting an action to perform in response to the context based on the action score comprises selecting an action with a highest action score.
4. The method of claim 1 or 2, further comprising, for each context in the sequence of contexts: receiving a reward; and updating the gated linear network for the selected action based on the reward.
5. The method of claim 4, wherein updating the gated linear network for the selected action comprises locally updating each neuron in the gated linear network based on a neuron-specific loss.
6. The method of claim 1 or 2, wherein a last layer of the plurality of layers of each gated linear network comprises only a single neuron, and wherein the predictive probability of the gated linear network is an output of the single neuron.
7. The method of claim 1 or 2, wherein neurons in a first layer of the plurality of layers of each gated linear network receive the input context and a set of base predictions.
8. A method of selecting an action from a set of actions to be performed in response to each context in a sequence of contexts, wherein each action in the set of actions is a recommendation for a content item, and each context in the sequence of contexts represents a feature vector characterizing a current recommendation setting, the current recommendation setting comprising data describing an environment in which the content item is to be recommended, the method comprising:
   maintaining data specifying a respective gated linear network tree corresponding to each action in the set of actions, wherein the gated linear networks of each gated linear network tree are collectively configured to predict a respective probability for each of a plurality of intervals of a range of reward values, wherein the respective probability for each interval represents a likelihood that a reward falling in the interval will be received if the corresponding action is performed in response to an input context, and wherein each gated linear network of each gated linear network tree comprises a plurality of layers, each layer comprising one or more neurons, wherein each neuron in each layer following a first layer is configured to (i) receive the input context, (ii) receive predictions from neurons in a previous layer, (iii) apply a gating function to the input context to select a weight vector, and (iv) generate, as an output, a geometric mixture of the predictions from the neurons in the previous layer based on the selected weight vector; and
   for each context in the sequence of contexts:
      for each action, processing the context using the gated linear network tree corresponding to the action to generate a respective probability for each of the plurality of intervals of the range of reward values;
      for each action, generating an action score for the action based at least on the respective probabilities, including:
         calculating a pseudo count for the action, wherein calculating the pseudo count includes determining an overlap between (i) a signature of the context across the gating functions of the neurons in the gated linear network tree for the action and (ii) a signature, across the gating functions of the neurons in the gated linear network tree for the action, of any earlier context in the sequence of contexts for which the action was selected as the action performed in response to the earlier context, wherein the signature of a context refers to the outputs of the gating functions of the neurons in the gated linear network tree for the action that are generated by processing the context; and
      selecting an action to be performed in response to the context based on the action scores.
9. A method of selecting an action from a set of actions to be performed in response to each context in a sequence of contexts, wherein each action in the set of actions is a user interface element presented to a user in a presentation setting, and each context in the sequence of contexts represents a feature vector characterizing the presentation setting, the user, or both, the method comprising:
   maintaining data specifying a respective gated linear network tree corresponding to each action in the set of actions, wherein the gated linear networks of each gated linear network tree are collectively configured to predict a respective probability for each of a plurality of intervals of a range of reward values, wherein the respective probability for each interval represents a likelihood that a reward falling in the interval will be received if the corresponding action is performed in response to an input context, and wherein each gated linear network of each gated linear network tree comprises a plurality of layers, each layer comprising one or more neurons, wherein each neuron in each layer following a first layer is configured to (i) receive the input context, (ii) receive predictions from neurons in a previous layer, (iii) apply a gating function to the input context to select a weight vector, and (iv) generate, as an output, a geometric mixture of the predictions from the neurons in the previous layer based on the selected weight vector; and
   for each context in the sequence of contexts:
      for each action, processing the context using the gated linear network tree corresponding to the action to generate a respective probability for each of the plurality of intervals of the range of reward values;
      for each action, generating an action score for the action based at least on the respective probabilities, including:
         calculating a pseudo count for the action, wherein calculating the pseudo count includes determining an overlap between (i) a signature of the context across the gating functions of the neurons in the gated linear network tree for the action and (ii) a signature, across the gating functions of the neurons in the gated linear network tree for the action, of any earlier context in the sequence of contexts for which the action was selected as the action performed in response to the earlier context, wherein the signature of a context refers to the outputs of the gating functions of the neurons in the gated linear network tree for the action that are generated by processing the context; and
      selecting an action to be performed in response to the context based on the action scores.
10. A system for selecting an action from a set of actions to be performed comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the method of any one of claims 1 to 9.
11. One or more computer storage media storing instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the method of any one of claims 1 to 9.
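
The neuron recited in the claims (a gating function that selects a weight vector, followed by a geometric mixture of the previous layer's predictions, with a local per-neuron update) can be illustrated with a minimal Python sketch. The half-space gating with fixed random hyperplanes, the log-loss gradient step, the learning rate, and all names (`GatedNeuron`, `gate`, `predict`, `update`) are assumptions made for illustration; the claims do not fix these choices.

```python
import math
import random

EPS = 1e-6  # keeps probabilities away from 0 and 1 so logit stays finite


def logit(p):
    p = min(max(p, EPS), 1.0 - EPS)
    return math.log(p / (1.0 - p))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class GatedNeuron:
    """One neuron: gating picks a weight vector, then geometric mixing."""

    def __init__(self, n_inputs, context_dim, n_hyperplanes=2, seed=0):
        rng = random.Random(seed)
        # Assumption: half-space gating with fixed random hyperplanes.
        self.hyperplanes = [[rng.gauss(0.0, 1.0) for _ in range(context_dim)]
                            for _ in range(n_hyperplanes)]
        # One weight vector per gating region, initialised to a uniform mixture.
        self.weights = [[1.0 / n_inputs] * n_inputs
                        for _ in range(2 ** n_hyperplanes)]

    def gate(self, context):
        """Map the context to the index of the weight vector to use."""
        index = 0
        for bit, plane in enumerate(self.hyperplanes):
            if sum(a * b for a, b in zip(plane, context)) > 0.0:
                index |= 1 << bit
        return index

    def predict(self, context, prev_predictions):
        """Geometric mixture: sigmoid of the weighted sum of input logits."""
        w = self.weights[self.gate(context)]
        return sigmoid(sum(wi * logit(p) for wi, p in zip(w, prev_predictions)))

    def update(self, context, prev_predictions, reward, lr=0.1):
        """Local gradient step on this neuron's own log loss.

        Returns the prediction made before the step.
        """
        w = self.weights[self.gate(context)]
        p = self.predict(context, prev_predictions)
        for i, q in enumerate(prev_predictions):
            w[i] -= lr * (p - reward) * logit(q)
        return p
```

Only the weight vector selected by the gate is touched during an update, and the step needs nothing beyond the neuron's own inputs and the observed reward, which is why no backpropagation through other layers is involved in this sketch.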

Description

Gated linear contextual bandits

Technical Field

The specification relates to selecting an action in response to a contextual input.

Background

In a contextual bandit scenario, an agent iteratively selects an action to be performed from a set of possible actions. At each iteration, the agent receives a context input associated with the iteration, and then selects an action for the iteration based on the context input.

Disclosure of Invention

The specification describes a system, implemented as computer programs on one or more computers in one or more locations, that selects an action to perform in response to a received contextual input.

Particular embodiments of the subject matter described in this specification can be implemented to realize one or more of the following advantages. The described system uses gated linear networks to select actions in a contextual bandit setting, i.e., in response to contextual inputs. Such an action selection scheme will be referred to as a gated linear contextual bandit. Using gated linear networks (GLNs) to select actions results in more accurate action selection, in terms of received rewards, while reducing the amount of computing resources required to generate the action selection. This may be due to several features of the described scheme. As one example, the described approach allows the system to estimate prediction uncertainty with effectively zero algorithmic overhead by exploiting the data-dependent gating properties of the GLN, allowing more accurate pseudo counts to be calculated without increasing computational overhead and resulting in more efficient exploration of the space of possible actions.
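
As a rough illustration of how a pseudo count might be derived from gating outputs, the Python sketch below treats a context's signature as the tuple of gating indices produced by the neurons of one action's network. Taking the minimum of per-neuron match counts as the overlap measure, and adding a bonus of the form sqrt(log t / N) to the predicted probability, are assumptions chosen for illustration; the text above states only that the score depends on the predicted probability and the pseudo count. The names `SignatureCounter` and `action_score` are hypothetical.

```python
import math
from collections import Counter


class SignatureCounter:
    """Per-action store of how often each neuron's gating output was seen."""

    def __init__(self, n_neurons):
        # One counter per neuron; keys are that neuron's gating indices.
        self.counts = [Counter() for _ in range(n_neurons)]

    def pseudo_count(self, signature):
        # Assumed overlap measure: the least-visited gating region
        # among all neurons for this signature.
        return min(c[g] for c, g in zip(self.counts, signature))

    def record(self, signature):
        # Called only when this action was actually selected.
        for c, g in zip(self.counts, signature):
            c[g] += 1


def action_score(probability, pseudo_count, step, bonus_scale=1.0):
    """Predicted probability plus an assumed exploration bonus (step >= 1)."""
    if pseudo_count == 0:
        return float("inf")  # never-seen signature: always worth trying
    return probability + bonus_scale * math.sqrt(math.log(step) / pseudo_count)
```

In this sketch the only state kept per action is the set of per-neuron counters, so no past contexts or rewards need to be stored to compute the bonus.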
As another example, the system may calculate an action score for an action and update the weights of the gated linear network for the action in a single forward pass through the gated linear network, eliminating the computationally intensive backpropagation required to update the model weights in conventional systems that use a conventional deep neural network to generate action scores. Because the gated linear networks can be updated entirely online, the system does not need to store historical data other than small signature data for calculating the pseudo counts, thereby greatly reducing the memory footprint of the system relative to other techniques that use neural networks to select actions.

The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the following description, the accompanying drawings, and the claims.

Drawings

Fig. 1A illustrates an example contextual bandit system.
Fig. 1B shows an example of a gated linear network (GLN).
Fig. 2 is a flow chart of an example process for selecting an action in response to a contextual input.
Fig. 3 is a flow chart of another example process for selecting an action in response to a contextual input.

Like reference numbers and designations in the various drawings indicate like elements.

Detailed Description

The present specification generally describes a system that repeatedly selects actions to perform in response to received contextual inputs. Each action is selected from a predetermined set of actions, and the system selects actions in an attempt to maximize the returns received in response to the selected actions. Typically, the return is a value that measures the quality of the selected action.
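
The repeated select, observe, and update cycle described above can be sketched in Python as follows. To keep the example short, each action's model is a running mean of observed returns (`RunningMean`), which is a stand-in and not a gated linear network, and the exploration term `explore_bonus / (1 + count)` is likewise a placeholder for a pseudo-count-based bonus. All names and parameters are hypothetical.

```python
class RunningMean:
    """Stand-in for a per-action model (NOT a gated linear network)."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def predict(self, context):
        # Ignores the context; a real model would condition on it.
        return self.total / self.count if self.count else 0.5

    def update(self, context, reward):
        self.total += reward
        self.count += 1


def run_bandit(contexts, reward_fn, n_actions, explore_bonus=1.0):
    """Online loop: score every action, act greedily on the scores, update."""
    models = [RunningMean() for _ in range(n_actions)]
    history = []
    for context in contexts:
        # Score = predicted return plus a bonus that shrinks with experience.
        scores = [m.predict(context) + explore_bonus / (1 + m.count)
                  for m in models]
        action = max(range(n_actions), key=scores.__getitem__)
        reward = reward_fn(context, action)
        # Only the selected action's model is updated, fully online.
        models[action].update(context, reward)
        history.append((action, reward))
    return models, history
```

For instance, `run_bandit([None] * 50, lambda ctx, a: 1.0 if a == 1 else 0.0, n_actions=3)` tries each action early on and then settles on action 1, the only one that yields a return in that toy setup.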
In some implementations, the return for each action is zero or one, while in other implementations, each return is a value drawn from, for example, a continuous range between a lower return value and an upper return value.

In some cases, the action is a recommendation of a content item (e.g., a video, advertisement, image, search result, or other piece of content), and the contextual input represents a feature vector characterizing a current recommendation setting, i.e., data describing the environment in which the content item will be recommended, such as any of: a current time, attributes of the user device of the user to whom the recommendation will be displayed, attributes of previous content items that have been recommended to the user and the user's responses to those previous content items, and attributes of the setting in which the content item will be placed. In these cases, the return value measures the quality of the recommendation. For example, the value may be one if the user interacts with the recommendation, and zero if the user does not. As another example, the return value may be a value that measures the degree of the user's future engagement with content items recommended to the user after the current recommendation is made. In some other cases, the action is a user interface element that can be presented to the user in the use