
US-12620496-B2 - Simulated training data generation for a multi-armed bandit model

US 12620496 B2

Abstract

An online system adjusts a guardrail setting used by a user treatment engine based on conditions faced by the online system. The online system simulates the performance of the user treatment engine using different candidate guardrail settings and computes a score for each of the guardrail settings based on the performance of the user treatment engine using each of the guardrail settings. The online system selects a new guardrail setting for the user treatment engine based on the performance scores for the candidate guardrail settings. Furthermore, the online system generates simulated training examples to initially train a user treatment engine. The online system uses a treatment performance model to simulate the effect of treatments applied to users and generates simulated training examples based on the predicted effect of the treatments. The online system retrains the user treatment engine on real training examples that are generated based on actual treatments.

Inventors

  • Xiao Gong
  • Konrad Gustav Miziolek

Assignees

  • MAPLEBEAR INC.

Dates

Publication Date
2026-05-05
Application Date
2023-02-28

Claims (20)

  1. A method comprising, at a computer system comprising a processor and a computer-readable medium: initializing a user treatment engine comprising a multi-armed bandit model for selecting treatments to apply to users; accessing a treatment performance model, wherein the treatment performance model is a machine-learning model that is trained to predict a number of orders that will be serviced by a user within a time period after a treatment is applied to the user by an online concierge system, wherein the treatment performance model is trained to make predictions based on treatment data describing the treatment applied to the user and user data describing the user; generating a set of simulated training examples for the user treatment engine by applying the treatment performance model to user data for a first plurality of users and treatment data for a set of treatments, wherein each simulated training example of the set of simulated training examples indicates a predicted number of orders to be serviced by a user of the first plurality of users after a treatment of the set of treatments is applied to the user; training the user treatment engine based on the set of simulated training examples; generating a set of real training examples by applying treatments to a second plurality of users of the online concierge system, wherein the applied treatments are selected by the user treatment engine based on user data associated with the second plurality of users, wherein applying a treatment to a user of the second plurality of users comprises: transmitting instructions to a client device associated with the user of the second plurality of users, wherein the instructions cause the client device to present content to the user based on a treatment of the applied treatments; receiving data from the client device describing user interactions by the user with the presented content through the client device; and generating a real training example comprising user data associated with the user of the second plurality of users, treatment data for the treatment of the applied treatments, and a label describing the data describing the user interactions by the user; retraining the user treatment engine based on the set of real training examples and a subset of the simulated training examples; accessing user data associated with a target user of the online concierge system; generating a score for a candidate treatment by applying the user treatment engine to the accessed user data and treatment data for the candidate treatment; and applying the candidate treatment to the target user based on the generated score, wherein applying the candidate treatment comprises transmitting content relating to the treatment to a client device associated with the target user for display to the target user.
  2. The method of claim 1, wherein initializing the user treatment engine comprises: storing the multi-armed bandit model with a set of default parameters.
  3. The method of claim 1, wherein the treatment performance model is a neural network.
  4. The method of claim 1, further comprising: training the treatment performance model based on a set of training examples, wherein each training example indicates a number of orders serviced by a user within the time period after a treatment is applied to the user.
  5. The method of claim 1, wherein the time period is at least one of: 24 hours after a treatment is applied to a user, 48 hours after a treatment is applied to a user, or a week after a treatment is applied to a user.
  6. The method of claim 1, wherein generating the set of simulated training examples comprises: generating a minimum number of simulated training examples.
  7. The method of claim 1, further comprising: continually retraining the user treatment engine based on an increasing number of generated real training examples and a decreasing number of simulated training examples.
  8. The method of claim 7, wherein continually retraining the user treatment engine comprises: training the user treatment engine based on real training examples and based on none of the generated set of simulated training examples.
  9. The method of claim 1, wherein generating the set of real training examples comprises: selecting treatments to apply to the second plurality of users based on the user treatment engine, wherein the user treatment engine uses a guardrail setting to limit which treatments of a set of candidate treatments may be selected.
  10. The method of claim 9, further comprising: selecting a new guardrail setting for the user treatment engine.
  11. A non-transitory computer-readable medium storing instructions that, when executed by a processor, cause the processor to: initialize a user treatment engine comprising a multi-armed bandit model for selecting treatments to apply to users; access a treatment performance model, wherein the treatment performance model is a machine-learning model that is trained to predict a number of orders that will be serviced by a user within a time period after a treatment is applied to the user by an online concierge system, wherein the treatment performance model is trained to make predictions based on treatment data describing the treatment applied to the user and user data describing the user; generate a set of simulated training examples for the user treatment engine by applying the treatment performance model to user data for a first plurality of users and treatment data for a set of treatments, wherein each simulated training example of the set of simulated training examples indicates a predicted number of orders to be serviced by a user of the first plurality of users after a treatment of the set of treatments is applied to the user; train the user treatment engine based on the set of simulated training examples; generate a set of real training examples by applying treatments to a second plurality of users of the online concierge system, wherein the applied treatments are selected by the user treatment engine based on user data associated with the second plurality of users, wherein applying a treatment to a user of the second plurality of users comprises: transmitting instructions to a client device associated with the user of the second plurality of users, wherein the instructions cause the client device to present content to the user based on a treatment of the applied treatments; receiving data from the client device describing user interactions by the user with the presented content through the client device; and generating a real training example comprising user data associated with the user of the second plurality of users, treatment data for the treatment of the applied treatments, and a label describing the data describing the user interactions by the user; retrain the user treatment engine based on the set of real training examples and a subset of the simulated training examples; access user data associated with a target user of the online concierge system; generate a score for a candidate treatment by applying the user treatment engine to the accessed user data and treatment data for the candidate treatment; and apply the candidate treatment to the target user based on the generated score, wherein applying the candidate treatment comprises transmitting content relating to the treatment to a client device associated with the target user for display to the target user.
  12. The computer-readable medium of claim 11, wherein the instructions for initializing the user treatment engine comprise instructions that cause the processor to: store the multi-armed bandit model with a set of default parameters.
  13. The computer-readable medium of claim 11, wherein the treatment performance model is a neural network.
  14. The computer-readable medium of claim 11, further storing instructions that cause the processor to: train the treatment performance model based on a set of training examples, wherein each training example indicates a number of orders serviced by a user within the time period after a treatment is applied to the user.
  15. The computer-readable medium of claim 11, wherein the time period is at least one of: 24 hours after a treatment is applied to a user, 48 hours after a treatment is applied to a user, or a week after a treatment is applied to a user.
  16. The computer-readable medium of claim 11, wherein the instructions for generating the set of simulated training examples comprise instructions that cause the processor to: generate a minimum number of simulated training examples.
  17. The computer-readable medium of claim 11, further storing instructions that cause the processor to: continually retrain the user treatment engine based on an increasing number of generated real training examples and a decreasing number of simulated training examples.
  18. The computer-readable medium of claim 17, wherein the instructions for continually retraining the user treatment engine comprise instructions that cause the processor to: train the user treatment engine based on real training examples and based on none of the generated set of simulated training examples.
  19. The computer-readable medium of claim 11, wherein the instructions for generating the set of real training examples comprise instructions that cause the processor to: select treatments to apply to the second plurality of users based on the user treatment engine, wherein the user treatment engine uses a guardrail setting to limit which treatments of a set of candidate treatments may be selected.
  20. A system comprising: a processor; and a non-transitory computer-readable medium storing instructions that, when executed by the processor, cause the system to: initialize a user treatment engine comprising a multi-armed bandit model for selecting treatments to apply to users; access a treatment performance model, wherein the treatment performance model is a machine-learning model that is trained to predict a number of orders that will be serviced by a user within a time period after a treatment is applied to the user by an online concierge system, wherein the treatment performance model is trained to make predictions based on treatment data describing the treatment applied to the user and user data describing the user; generate a set of simulated training examples for the user treatment engine by applying the treatment performance model to user data for a first plurality of users and treatment data for a set of treatments, wherein each simulated training example of the set of simulated training examples indicates a predicted number of orders to be serviced by a user of the first plurality of users after a treatment of the set of treatments is applied to the user; train the user treatment engine based on the set of simulated training examples; generate a set of real training examples by applying treatments to a second plurality of users of the online concierge system, wherein the applied treatments are selected by the user treatment engine based on user data associated with the second plurality of users, wherein applying a treatment to a user of the second plurality of users comprises: transmitting instructions to a client device associated with the user of the second plurality of users, wherein the instructions cause the client device to present content to the user based on a treatment of the applied treatments; receiving data from the client device describing user interactions by the user with the presented content through the client device; and generating a real training example comprising user data associated with the user of the second plurality of users, treatment data for the treatment of the applied treatments, and a label describing the data describing the user interactions by the user; retrain the user treatment engine based on the set of real training examples and a subset of the simulated training examples; access user data associated with a target user of the online concierge system; generate a score for a candidate treatment by applying the user treatment engine to the accessed user data and treatment data for the candidate treatment; and apply the candidate treatment to the target user based on the generated score, wherein applying the candidate treatment comprises transmitting content relating to the treatment to a client device associated with the target user for display to the target user.
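The cold-start flow recited in the independent claims (simulate labels with a treatment performance model, pre-train the bandit on them, then retrain on a mix of real and simulated examples) can be sketched as follows. This is an illustrative stand-in only: `predict_orders`, the feature names, and the mixing scheme are hypothetical assumptions, not anything specified in the patent.

```python
import random

def predict_orders(user_features, treatment):
    """Stand-in for the trained treatment performance model: predicts
    orders serviced within the time period after a treatment."""
    return user_features.get("recent_orders", 0) + treatment["expected_lift"]

def simulate_training_examples(users, treatments):
    """One simulated example per (user, treatment) pair, labeled with
    the model's predicted number of serviced orders."""
    return [
        {"user": user, "treatment": t["name"],
         "label": predict_orders(user, t), "simulated": True}
        for user in users for t in treatments
    ]

def retraining_batch(real_examples, simulated_examples, sim_fraction):
    """Mix all real examples with a (shrinking) share of simulated ones,
    mirroring the claimed retraining on real data plus a subset of the
    simulated data."""
    k = int(len(simulated_examples) * sim_fraction)
    return real_examples + random.sample(simulated_examples, k)

users = [{"recent_orders": 2}, {"recent_orders": 5}]
treatments = [{"name": "coupon", "expected_lift": 1.5},
              {"name": "notice", "expected_lift": 0.5}]

sim = simulate_training_examples(users, treatments)
real = [{"user": users[0], "treatment": "coupon",
         "label": 3, "simulated": False}]
batch = retraining_batch(real, sim, sim_fraction=0.5)
```

As real interaction data accumulates, `sim_fraction` would be driven toward zero, matching dependent claims 7 and 8.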

Description

BACKGROUND

Online systems, such as an online concierge system, often apply treatments to users to encourage those users to interact with the online system. For example, an online system may notify a user of new content that is available to the user or may provide incentives to a user if the user performs an interaction with the online system. When an online system has a set of candidate treatments that could be applied to a user, the online system has to balance exploring the uncertain efficacy of some of the treatments with maximizing the known efficacy of others. An online system may use a user treatment engine to automatically balance exploration with maximization. However, treatments may incur costs to the online system when they are applied to users, and thus an online system must limit the treatments that the user treatment engine can select. For example, the online system may use a guardrail setting that establishes a limit on which treatments (or variants thereof) a user treatment engine can select.

Guardrail settings are commonly heuristics that are hardcoded by engineers to limit the actions of systems like user treatment engines. While these heuristics may work most of the time, the proper guardrail setting for a user treatment engine can change over time. For example, in the context of an online concierge system where a user treatment engine applies treatments to pickers to encourage them to service orders, a guardrail setting that is too strict may cause too few pickers to be available to be assigned orders to service by the online concierge system. Similarly, a guardrail setting that is too lax may cause too many pickers to be available and incur significant costs for the online system. Furthermore, the conditions that make a guardrail setting too strict or too lax can change over time, meaning that a hardcoded guardrail setting is likely to encounter these problems eventually.
Furthermore, a user treatment engine may encounter a cold-start problem with the machine-learning model that the engine uses to select treatments. A user treatment engine may use a multi-armed bandit model to balance exploration and maximization when applying treatments to users. However, multi-armed bandit models commonly need to be trained on existing training examples to effectively select treatments for application to users. For example, a multi-armed bandit model with no training examples may over-explore the efficacy of treatments and underutilize treatments with known efficacy. Thus, an online system using a multi-armed bandit model commonly must execute the model for a certain period of time (e.g., two weeks) without using its output while the system collects the training data the model needs to provide useful output. As a result, multi-armed bandit models can be slow to release to production, making iteration on those models difficult.

SUMMARY

In accordance with one or more aspects of the disclosure, an online concierge system dynamically adjusts guardrail settings for a user treatment engine that selects treatments to apply to users. A guardrail setting is a setting for the user treatment engine that limits the treatments or variants that the user treatment engine can select. For example, a guardrail setting may enforce a limit on the cost of an individual treatment, the total cost of a set of treatments, or how often a particular treatment is selected. The user treatment engine selects treatments to apply to users based on the guardrail setting, and the online concierge system applies the selected treatments to users. The online concierge system occasionally adjusts the guardrail settings for the user treatment engine by selecting a new guardrail setting from a set of candidate guardrail settings.
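A minimal sketch of guardrail-constrained treatment selection follows. The patent does not commit to a particular bandit algorithm or guardrail form, so both choices here are hypothetical: an epsilon-greedy policy stands in for the multi-armed bandit model, and a per-treatment cost cap stands in for the guardrail setting.

```python
import random

class GuardrailedBandit:
    """Epsilon-greedy bandit whose candidate arms are filtered by a
    guardrail setting (here, a per-treatment cost cap). Illustrative
    only; the actual engine and guardrail forms may differ."""

    def __init__(self, treatments, max_cost, epsilon=0.1):
        self.treatments = treatments  # name -> {"cost": ...}
        self.max_cost = max_cost      # the guardrail setting
        self.epsilon = epsilon
        self.mean_reward = {t: 0.0 for t in treatments}
        self.pulls = {t: 0 for t in treatments}

    def allowed(self):
        # The guardrail limits which candidate treatments may be selected.
        return [t for t, info in self.treatments.items()
                if info["cost"] <= self.max_cost]

    def select(self):
        arms = self.allowed()
        if random.random() < self.epsilon:
            return random.choice(arms)  # explore uncertain treatments
        # Exploit the treatment with the best observed efficacy so far.
        return max(arms, key=lambda t: self.mean_reward[t])

    def update(self, treatment, reward):
        # Incremental mean update from an observed outcome
        # (e.g., orders serviced after the treatment was applied).
        self.pulls[treatment] += 1
        n = self.pulls[treatment]
        self.mean_reward[treatment] += (reward - self.mean_reward[treatment]) / n

bandit = GuardrailedBandit(
    {"coupon": {"cost": 5.0}, "notice": {"cost": 0.0},
     "big_bonus": {"cost": 50.0}},
    max_cost=10.0)
```

Adjusting the guardrail setting then amounts to replacing `max_cost`, which changes the arm set without retraining the bandit's reward estimates.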
The online concierge system computes a performance score for each candidate guardrail setting by simulating how the user treatment engine would perform if it used that setting. For example, the online concierge system may predict the number of orders that users would service if the user treatment engine selected treatments using a candidate guardrail setting. The online concierge system may then compare that prediction to the number of future orders the online concierge system expects to receive within some time period (e.g., the next 24 hours). The online concierge system scores each candidate guardrail setting based on the difference between those two numbers: the number of orders the online concierge system expects to receive and the predicted number of orders that would be serviced using the candidate guardrail setting. The online concierge system selects a new guardrail setting for the user treatment engine based on the respective scores of the candidate guardrail settings. By simulat
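The scoring described above, where each candidate is judged by the gap between forecast demand and the simulated number of serviced orders, might be sketched as follows. The toy simulator is a hypothetical stand-in for the full user treatment engine simulation; `score_guardrail` and `select_guardrail` are illustrative names.

```python
def score_guardrail(candidate, expected_orders, simulate_serviced_orders):
    """Score a candidate guardrail setting: the smaller the gap between
    forecast demand and simulated serviced orders, the higher the score."""
    serviced = simulate_serviced_orders(candidate)
    return -abs(expected_orders - serviced)

def select_guardrail(candidates, expected_orders, simulate_serviced_orders):
    """Pick the candidate guardrail setting with the best score."""
    return max(candidates,
               key=lambda c: score_guardrail(c, expected_orders,
                                             simulate_serviced_orders))

# Toy simulator: a looser cost cap recruits more pickers, with
# diminishing returns once picker supply saturates.
toy_simulator = lambda cap: min(110, cap * 10)

best = select_guardrail([5, 8, 12, 20], expected_orders=100,
                        simulate_serviced_orders=toy_simulator)
```

Here a cap of 5 under-supplies pickers and a cap of 20 over-supplies them at extra cost, so an intermediate setting scores best, matching the strict-versus-lax tradeoff described in the background.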