US-12626128-B2 - Continual machine learning in a provider network
Abstract
A system and method for continual learning in a provider network. The method implements, or interfaces with a system that implements, a semi-automated or fully automated continual machine learning architecture with user-configurable model retraining and hyperparameter tuning, enabled by a provider network. This adapts a model over time to new information in the training data while providing a user-friendly, flexible, and customizable continual learning process.
Inventors
- Giovanni Zappella
- Lukas Stefan BALLES
- Beyza Ermis
- Martin WISTUBA
- Cedric Philippe Archambeau
Assignees
- AMAZON TECHNOLOGIES, INC.
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2022-09-30
Claims (19)
- 1 . A method for continual learning in a provider network, the method comprising: registering a pre-trained target machine learning model in a model registry in the provider network; sending a model updating signal from the provider network; receiving, at the provider network, a command to trigger a model updating pipeline; performing, by the model updating pipeline implemented as code executing on one or more processors in the provider network, a hyperparameter tuning job and a plurality of machine learning model retraining jobs to yield a plurality of updated machine learning models; selecting, by the model updating pipeline implemented as code executing on the one or more processors, one of the plurality of updated machine learning models as an updated machine learning model according to a machine learning model performance metric computed for each machine learning model retraining job of the plurality of machine learning model retraining jobs; registering the updated machine learning model in the model registry; and deploying the updated machine learning model to an inference endpoint in the provider network.
- 2 . The method of claim 1 , wherein performing each machine learning model retraining job of the plurality of machine learning model retraining jobs comprises executing a continual learning algorithm as part of the machine learning model retraining job.
- 3 . A method comprising: sending a model updating signal pertaining to a pre-trained target machine learning model; receiving a command to trigger a model updating pipeline for the pre-trained target machine learning model; performing, by the model updating pipeline implemented as code executing on one or more processors, a hyperparameter tuning job and a plurality of machine learning model retraining jobs to yield a plurality of updated machine learning models, each machine learning model retraining job of the plurality of machine learning model retraining jobs performed using a different set of hyperparameter values; and selecting, by the model updating pipeline implemented as code executing on the one or more processors, one of the plurality of updated machine learning models as an updated machine learning model according to a machine learning model performance metric computed for each machine learning model retraining job of the plurality of machine learning model retraining jobs.
- 4 . The method of claim 3 , further comprising: registering the updated machine learning model in a model registry; and deploying the updated machine learning model to an inference endpoint.
- 5 . The method of claim 3 , wherein performing each machine learning model retraining job of the plurality of machine learning model retraining jobs comprises executing a continual learning algorithm as part of the machine learning model retraining job.
- 6 . The method of claim 3 , wherein sending the model updating signal pertaining to the pre-trained target machine learning model is based on detecting an event indicating that the pre-trained target machine learning model should be updated.
- 7 . The method of claim 6 , wherein detecting the event indicating that the pre-trained target model should be updated comprises determining that a data quality metric for the pre-trained target model does not meet a threshold.
- 8 . The method of claim 6 , wherein detecting the event indicating that the pre-trained target model should be updated comprises determining that a model quality metric for the pre-trained target model does not meet a threshold.
- 9 . The method of claim 6 , wherein detecting the event indicating that the pre-trained target model should be updated comprises determining that a bias metric for the pre-trained target model does not meet a threshold.
- 10 . The method of claim 6 , wherein detecting the event indicating that the pre-trained target model should be updated comprises determining that a feature attribution metric for the pre-trained target model does not meet a threshold.
- 11 . The method of claim 3 , wherein the pre-trained target model is a deep artificial neural network model.
- 12 . The method of claim 3 , wherein the model updating pipeline comprises a processing step, a tuning step, and one or more retraining steps.
- 13 . A system comprising: one or more electronic devices to implement a machine learning service in a provider network, the machine learning service comprising instructions which when executed cause the machine learning service to: receive a command to trigger a model updating pipeline for a pre-trained target model; perform, by the model updating pipeline implemented as code executing on one or more processors, a hyperparameter tuning job and a plurality of machine learning model retraining jobs as part of an execution of the model updating pipeline to yield a plurality of updated machine learning models; select, by the model updating pipeline implemented as code executing on the one or more processors, one of the plurality of updated machine learning models as an updated machine learning model according to a machine learning model performance metric computed for each machine learning model retraining job of the plurality of machine learning model retraining jobs; register the updated machine learning model in a model registry in the provider network; and deploy the updated machine learning model to an inference endpoint in the provider network.
- 14 . The system of claim 13 , wherein the machine learning service further comprises instructions which when executed cause the machine learning service to execute a continual learning algorithm based on a batch of training data in a sequence of a plurality of training data batches to yield the updated machine learning model.
- 15 . The system of claim 14 , further comprising one or more electronic devices to implement a model monitoring service for monitoring the pre-trained target model deployed at the inference endpoint, the model monitoring service comprising instructions which when executed cause the model monitoring service to detect an event indicating that the pre-trained target model should be updated.
- 16 . The system of claim 15 , wherein the event comprises a metric exceeding or falling below a threshold.
- 17 . The system of claim 16 , wherein the metric pertains to data quality, model quality, bias, or feature attribution of the pre-trained target model.
- 18 . The system of claim 13 , wherein the pre-trained target model is a deep artificial neural network model.
- 19 . The system of claim 13 , wherein the model updating pipeline comprises a processing step, a tuning step, and one or more retraining steps.
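The pipeline recited in claims 1 and 3 can be sketched in plain Python: a tuning job fans out to one retraining job per candidate hyperparameter set, and the updated model with the best performance metric is selected. This is a minimal illustration, not the patent's implementation; all function and field names are hypothetical, and a real system would run these as managed training jobs in the provider network.

```python
def retraining_job(base_model, batch, hyperparams):
    """Hypothetical retraining job: continue training the pre-trained
    target model on a new batch of training data with the given
    hyperparameter values. The 'model' is a dict holding a toy score."""
    lr = hyperparams["learning_rate"]
    # Toy update rule for illustration: smaller learning rates happen
    # to fit this batch better, so the metric favors them.
    score = base_model["score"] + sum(batch) / len(batch) * (1.0 - lr)
    return {"score": score, "hyperparams": hyperparams}

def model_updating_pipeline(base_model, batch, candidate_hyperparams,
                            metric=lambda m: m["score"]):
    """Run one retraining job per hyperparameter set (the tuning job)
    and select the updated model with the best performance metric."""
    candidates = [retraining_job(base_model, batch, hp)
                  for hp in candidate_hyperparams]
    return max(candidates, key=metric)

pretrained = {"score": 0.70, "hyperparams": {"learning_rate": 0.1}}
grid = [{"learning_rate": lr} for lr in (0.01, 0.05, 0.1, 0.5)]
best = model_updating_pipeline(pretrained, batch=[0.6, 0.8, 1.0],
                               candidate_hyperparams=grid)
print(best["hyperparams"])  # the set whose retrained model scored highest
```

In a full deployment, the selected model would then be registered in the model registry and deployed to an inference endpoint, per claims 1 and 4.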
Description
TECHNICAL FIELD

The present disclosure relates generally to cloud machine learning platform systems and methods for creating, training, and deploying machine learning models in the cloud, and more specifically to a new and useful system and method for continual machine learning in the cloud machine learning platform field.

BACKGROUND

In cloud machine learning platforms, conventional systems and methods for continual machine learning rely on heuristics. For example, a machine learning model may be periodically retrained from scratch at a predetermined frequency (e.g., daily) using the most recently obtained training data within a sliding time window of predetermined length (e.g., the past three months). However, the heuristic approach comes with limitations. First, users may select a higher than necessary model retraining frequency to prevent model performance degradation (e.g., degradation in model inference accuracy). Second, retraining from scratch can waste compute resources, such as when there is no significant change in the distribution of the training data. Third, the tuning of model hyperparameters that could increase model performance is often ignored or avoided. Thus, there is a need in the cloud machine learning platform field for an improved and useful system and method for continual machine learning.

BRIEF DESCRIPTION OF DRAWINGS

Various examples in accordance with the present disclosure will be described with reference to the drawings, in which:

FIG. 1 is a schematic of a provider network system for continual learning.
FIG. 2 is a schematic of a method for continual learning in a provider network.
FIG. 3 is a schematic of a method for registering a target model in a model registry.
FIG. 4 is a schematic of a method for sending a model updating signal.
FIG. 5 is a schematic of a method for receiving a command to trigger a model updating pipeline.
FIG. 6 is a schematic of retraining the target model.
FIG. 7 is a schematic of tuning hyperparameters for the target model.
FIG. 8 illustrates a provider network environment in which the techniques disclosed herein can be implemented, according to some examples.
FIG. 9 illustrates an electronic device that can be used in an implementation of the techniques disclosed herein, according to some examples.

It will be appreciated that for simplicity or clarity of illustration, elements illustrated in the figures have not necessarily been drawn to scale. For example, the dimensions of an element may be exaggerated relative to another element for clarity. Further, where considered appropriate, reference numerals have been repeated among the figures to indicate corresponding or analogous elements.

DETAILED DESCRIPTION

The following description is not intended to limit the invention to the examples described, but rather to enable any person skilled in the art to make and use this invention.

1. Overview

The present disclosure relates to a system and a method for continual machine learning in a provider network. As shown in FIG. 1, a system 100 for continual machine learning includes at least one retraining job 102 or at least one tuning job 104 executed in a provider network 106 to yield an updated machine learning model 108. Additionally or alternatively, the system 100 can include or interface with any or all of: a machine learning service 110, a labeling service 112, a storage service 114, a monitoring and observability service 116, or any other suitable components or combination of components. As shown in FIG. 2, a method 200 for continual machine learning in a provider network includes at least one of retraining a target machine learning model S210 or tuning hyperparameters S220 to yield a new version of the target model.
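The event-driven trigger for the model updating signal (S204, detailed in claims 6 through 10) can be sketched as a simple threshold check over monitored metrics. This is an illustrative sketch only; the metric names and the `should_update` function are hypothetical, and a real monitoring and observability service would compute these metrics from live endpoint traffic.

```python
def should_update(metrics, thresholds):
    """Return the monitored metrics (e.g., data quality, model quality,
    bias, feature attribution) that fall below their thresholds. A
    non-empty result corresponds to an event indicating the pre-trained
    target model should be updated, which sends the model updating
    signal (S204)."""
    return [name for name, value in metrics.items()
            if name in thresholds and value < thresholds[name]]

# Hypothetical metric values observed for the deployed target model.
observed = {"model_quality": 0.78, "data_quality": 0.95, "bias": 0.99}
limits = {"model_quality": 0.80, "data_quality": 0.90}

violations = should_update(observed, limits)
if violations:
    print("send model updating signal:", violations)
```

Only metrics with a configured threshold are checked, so users can opt individual metrics in or out of triggering the pipeline.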
Additionally or alternatively, the method 200 can include any or all of: registering the target model in a model registry S202; sending a model updating signal S204; receiving a command to trigger a model updating pipeline S206; registering the new version of the target model in the model registry S222; deploying the new version of the target model to an inference endpoint S224; or any other suitable processes. The method can be performed with a system as described above or with any other suitable system.

2. Benefits

The system or the method for continual machine learning in a provider network can confer the benefit of model retraining or hyperparameter tuning through a semi-automated or a fully automated process, achieving continual machine learning. This in turn confers the benefit of achieving model adaptability by any or all of: refining understanding of previously learned concepts over time as new training data with new information becomes available; learning new concepts over time as new training data with new information becomes available; avoiding model performance degradation over time (e.g., catastrophic forgetting) in rapidly evolving domains (e.g., online advertising, online retail, online music streaming