US-12626194-B2 - Scalable transfer learning with expert models

US 12626194 B2

Abstract

Generally, the present disclosure is directed to systems and methods that provide a simple, scalable, yet effective strategy to perform transfer learning with a mixture of experts (MoE). In particular, the transfer of pre-trained representations can improve sample efficiency and reduce computational requirements for new tasks. However, representations used for transfer are usually generic, and are not tailored to a particular distribution of downstream tasks. In contrast, example systems and methods of the present disclosure use expert representations for transfer with a simple, yet effective, strategy.

Inventors

  • Carlos Riquelme Ruiz
  • André Susano PINTO
  • Joan Puigcerver
  • Basil Mustafa
  • Neil Matthew Tinmouth Houlsby
  • Sylvain Gelly
  • Cedric Benjamin Renggli
  • Daniel Martin Keysers

Assignees

  • GOOGLE LLC

Dates

Publication Date
2026-05-12
Application Date
2021-06-07

Claims (20)

  1. A computer-implemented method to perform transfer learning from expert models, the method comprising: accessing, by a computing system comprising one or more computing devices, a plurality of expert machine-learned models, wherein the plurality of expert machine-learned models have been respectively trained on a plurality of different training datasets, the plurality of expert machine-learned models respectively generated by addition of one or more respective expert adapter submodels to one or more existing layers of a baseline machine-learned model that was trained on a base training dataset; obtaining, by the computing system, data descriptive of a downstream task associated with a downstream training dataset; evaluating, by the computing system, a respective performance metric for each of the plurality of expert machine-learned models relative to the downstream task; selecting, by the computing system, one or more of the plurality of expert machine-learned models to serve as one or more selected machine-learned models based at least in part on the respective performance metrics for the plurality of expert machine-learned models; fine-tuning, by the computing system, the one or more selected machine-learned models using the downstream training dataset; and after fine-tuning the one or more selected machine-learned models, providing, by the computing system, the one or more selected machine-learned models for performance of the downstream task.
  2. The computer-implemented method of claim 1, wherein the plurality of different training datasets on which the plurality of expert machine-learned models have been respectively trained comprise a plurality of different subportions of the base training dataset.
  3. The computer-implemented method of claim 2, wherein training labels in the base training dataset are organized according to a hierarchy, and wherein the plurality of different subportions of the base training dataset comprise a plurality of different hierarchically-defined divisions of the base training dataset.
  4. The computer-implemented method of claim 1, wherein the plurality of expert machine-learned models are respectively generated by fully fine-tuning the baseline machine-learned model.
  5. The computer-implemented method of claim 1, wherein each expert machine-learned model comprises one or more residual skip connections respectively around the one or more expert adapter submodels.
  6. The computer-implemented method of claim 1, wherein, during training of each expert machine-learned model on its respective training dataset, the one or more respective expert adapter submodels are learned while the one or more existing layers of the baseline machine-learned model are held fixed.
  7. The computer-implemented method of claim 1, wherein, during fine-tuning of the one or more selected machine-learned models using the downstream training dataset, both the one or more respective expert adapter submodels and the one or more existing layers of the baseline machine-learned model are learned.
  8. The computer-implemented method of claim 1, wherein selecting, by the computing system, one or more of the plurality of expert machine-learned models to serve as one or more selected machine-learned models based at least in part on the respective performance metrics for the plurality of expert machine-learned models comprises employing, by the computing system, an expert prediction network to select the one or more selected machine-learned models from the plurality of expert machine-learned models based on one or more training inputs included in the downstream training dataset.
  9. The computer-implemented method of claim 1, wherein selecting, by the computing system, one or more of the plurality of expert machine-learned models to serve as one or more selected machine-learned models based at least in part on the respective performance metrics for the plurality of expert machine-learned models comprises: using, by the computing system, a generic network to predict labels for one or more training examples included in the downstream training dataset; evaluating, by the computing system, a respective KL-divergence between a distribution of the labels generated by the generic network and a per-expert prior on labels generated by each expert machine-learned model during training of such expert machine-learned model; and selecting, by the computing system, the one or more of the plurality of expert machine-learned models to serve as the one or more selected machine-learned models based at least in part on the respective KL-divergence for each of the plurality of expert machine-learned models. (See the illustrative sketch following the claims.)
  10. The computer-implemented method of claim 1, wherein selecting, by the computing system, one or more of the plurality of expert machine-learned models to serve as one or more selected machine-learned models based at least in part on the respective performance metrics for the plurality of expert machine-learned models comprises: using, by the computing system, each expert machine-learned model to produce a respective input embedding for a training input included in the downstream training dataset; evaluating, by the computing system, a performance proxy for each expert machine-learned model based on application of nearest neighbors to the respective input embedding; and selecting, by the computing system, the one or more of the plurality of expert machine-learned models to serve as the one or more selected machine-learned models based at least in part on the respective performance proxy for each of the plurality of expert machine-learned models. (See the illustrative sketch following the claims.)
  11. The computer-implemented method of claim 1, wherein accessing, by the computing system, the plurality of expert machine-learned models comprises generating and training, by the computing system, the plurality of expert machine-learned models.
  12. The computer-implemented method of claim 1, wherein the downstream task comprises an image recognition task.
  13. A computing system, comprising: one or more processors; and one or more non-transitory computer-readable media that store instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising: accessing, by the computing system, a plurality of expert machine-learned models, wherein the plurality of expert machine-learned models have been respectively trained on a plurality of different training datasets, the plurality of expert machine-learned models respectively generated by addition of one or more respective expert adapter submodels to one or more existing layers of a baseline machine-learned model that was trained on a base training dataset; obtaining, by the computing system, data descriptive of a downstream task associated with a downstream training dataset; evaluating, by the computing system, a respective performance metric for each of the plurality of expert machine-learned models relative to the downstream task; selecting, by the computing system, one or more of the plurality of expert machine-learned models to serve as one or more selected machine-learned models based at least in part on the respective performance metrics for the plurality of expert machine-learned models; fine-tuning, by the computing system, the one or more selected machine-learned models using the downstream training dataset; and after fine-tuning the one or more selected machine-learned models, providing, by the computing system, the one or more selected machine-learned models for performance of the downstream task.
  14. The computing system of claim 13, wherein the plurality of different training datasets on which the plurality of expert machine-learned models have been respectively trained comprise a plurality of different subportions of the base training dataset.
  15. The computing system of claim 14, wherein training labels in the base training dataset are organized according to a hierarchy, and wherein the plurality of different subportions of the base training dataset comprise a plurality of different hierarchically-defined divisions of the base training dataset.
  16. One or more non-transitory computer-readable media that store a downstream machine-learned model generated through performance of operations, the operations comprising: accessing, by a computing system, a plurality of expert machine-learned models, wherein the plurality of expert machine-learned models have been respectively trained on a plurality of different training datasets, the plurality of expert machine-learned models respectively generated by addition of one or more respective expert adapter submodels to one or more existing layers of a baseline machine-learned model that was trained on a base training dataset; obtaining, by the computing system, data descriptive of a downstream task associated with a downstream training dataset; evaluating, by the computing system, a respective performance metric for each of the plurality of expert machine-learned models relative to the downstream task; selecting, by the computing system, one or more of the plurality of expert machine-learned models to serve as one or more selected machine-learned models based at least in part on the respective performance metrics for the plurality of expert machine-learned models; fine-tuning, by the computing system, the one or more selected machine-learned models using the downstream training dataset; and after fine-tuning the one or more selected machine-learned models, providing, by the computing system, the one or more selected machine-learned models as the downstream machine-learned model for performance of the downstream task.
  17. The one or more non-transitory computer-readable media of claim 16, wherein: the plurality of different training datasets on which the plurality of expert machine-learned models have been respectively trained comprise a plurality of different subportions of the base training dataset; training labels in the base training dataset are organized according to a hierarchy; and the plurality of different subportions of the base training dataset comprise a plurality of different hierarchically-defined divisions of the base training dataset.
  18. The computing system of claim 13, wherein each expert machine-learned model comprises one or more residual skip connections respectively around the one or more respective expert adapter submodels.
  19. The one or more non-transitory computer-readable media of claim 16, wherein, during training of each expert machine-learned model on its respective training dataset, the one or more respective expert adapter submodels are learned while the one or more existing layers of the baseline machine-learned model are held fixed.
  20. The one or more non-transitory computer-readable media of claim 16, wherein, during fine-tuning of the one or more selected machine-learned models using the downstream training dataset, both the one or more respective expert adapter submodels and the one or more existing layers of the baseline machine-learned model are learned.
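As referenced in claims 9 and 10 above, here is a minimal, illustrative sketch of the two expert-selection proxies. It is not the patented implementation: `embed_fn` (a hypothetical per-expert embedding function), the train/validation split, and the assumption that the expert's label prior is available as a precomputed distribution are all choices made here for concreteness.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def knn_proxy(embed_fn, X_train, y_train, X_val, y_val, k=1):
    """Claim-10-style proxy: embed downstream examples with an expert
    (embed_fn is a hypothetical per-expert embedding function) and use
    k-nearest-neighbor accuracy as a cheap performance proxy."""
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(embed_fn(X_train), y_train)
    return knn.score(embed_fn(X_val), y_val)  # mean accuracy on held-out data


def kl_proxy(generic_label_dist, expert_label_prior, eps=1e-12):
    """Claim-9-style proxy: KL divergence between the label distribution
    predicted by a generic network on the downstream data and a per-expert
    prior over labels recorded while that expert was trained. Lower
    divergence suggests a better-matched expert."""
    p = np.asarray(generic_label_dist, dtype=float) + eps  # smooth zeros
    q = np.asarray(expert_label_prior, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))
```

Under this sketch, an expert would be ranked highly either by a large `knn_proxy` score or by a small `kl_proxy` value; the claims leave the exact ranking rule open.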

Description

RELATED APPLICATIONS

This application is based upon and claims the right of priority under 35 U.S.C. § 371 to International Application No. PCT/US2021/036197, filed on Jun. 7, 2021, which claims priority to U.S. Provisional Patent Application No. 63/035,341, filed Jun. 5, 2020. Each of the applications identified above is hereby incorporated by reference in its entirety.

FIELD

The present disclosure relates generally to machine learning. More particularly, the present disclosure relates to transfer learning using a mixture of experts.

BACKGROUND

Deep learning has been successful on many computer vision tasks. Unfortunately, this success often requires a large amount of per-task data and compute. To scale deep learning to new vision tasks, practitioners often turn to transfer learning: re-using models trained on a large source task and tuning them on the target task. This can improve both convergence rates and empirical performance. Transfer learning reduces per-task data or compute requirements in exchange for a large one-off pre-training cost. In practice, this one-off down payment may not need to be made by the practitioner, since pre-trained networks are made available through online platforms.

Transfer of specialist models has been studied before. However, previous approaches are limited in their scalability and task diversity. They either require expensive re-training on the source dataset for every target task, or operate at a small scale where all experts can be applied simultaneously. Further, most of them are tested only on a limited suite of natural single-object classification tasks.

SUMMARY

Aspects and advantages of embodiments of the present disclosure will be set forth in part in the following description, or can be learned from the description, or can be learned through practice of the embodiments.

One example aspect of the present disclosure is directed to a computer-implemented method to perform transfer learning from expert models. The method can include accessing, by a computing system comprising one or more computing devices, a plurality of expert machine-learned models, wherein the plurality of expert machine-learned models have been respectively trained on a plurality of different training datasets. The method can include obtaining, by the computing system, data descriptive of a downstream task associated with a downstream training dataset. The method can include evaluating, by the computing system, a respective performance metric for each of the plurality of expert machine-learned models relative to the downstream task. The method can include selecting, by the computing system, one or more of the plurality of expert machine-learned models to serve as one or more selected machine-learned models based at least in part on the respective performance metrics for the plurality of expert machine-learned models. The method can include fine-tuning, by the computing system, the one or more selected machine-learned models using the downstream training dataset. The method can include, after fine-tuning the one or more selected machine-learned models, providing, by the computing system, the one or more selected machine-learned models for performance of the downstream task.

In some implementations, the plurality of expert machine-learned models comprise a plurality of variants of a baseline machine-learned model that was trained on a base training dataset.
In some implementations, the plurality of different training datasets on which the plurality of expert machine-learned models have been respectively trained comprise a plurality of different subportions of the base training dataset. The subportions may be overlapping or non-overlapping. In some implementations, training labels in the base training dataset are organized according to a hierarchy. In some implementations, the plurality of different subportions of the base training dataset comprise a plurality of different hierarchically-defined divisions of the base training dataset. In some implementations, the plurality of expert machine-learned models are respectively generated by fully fine-tuning the baseline machine-learned model. In some implementations, the plurality of expert machine-learned models are respectively generated by addition of one or more respective expert adapter submodels to one or more existing layers of the baseline machine-learned model. In some implementations, each expert machine-learned model comprises one or more residual skip connections respectively around the one or more expert adapter submodels. In some implementations, during training of each expert machine-learned model on its respective training dataset, the one or more respective expert adapter submodels are learned while the one or more existing layers of the baseline machine-learned model are held fixed. In some implementations, during fine-tuning of the one or more selected machine-learned models using the downstream training dataset, both the one or more respective expert adapter submodels and the one or more existing layers of the baseline machine-learned model are learned.
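An expert adapter submodel with a residual skip connection, as described above, can be sketched as follows. This is an illustrative PyTorch sketch under stated assumptions, not the reference implementation: the 1x1-convolution adapter shape, the `AdaptedBlock` wrapper, and the name-based freezing rule are choices made here for concreteness.

```python
import torch
import torch.nn as nn


class ResidualAdapter(nn.Module):
    """A small expert adapter with a residual skip connection around it,
    so the adapted layer starts out as a near-identity mapping."""

    def __init__(self, channels: int):
        super().__init__()
        self.adapter = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.adapter.weight)  # adapter initially contributes nothing
        nn.init.zeros_(self.adapter.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.adapter(x)  # residual skip connection


class AdaptedBlock(nn.Module):
    """Wraps one existing layer of the baseline model with an adapter."""

    def __init__(self, base_layer: nn.Module, channels: int):
        super().__init__()
        self.base_layer = base_layer
        self.adapter = ResidualAdapter(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.base_layer(x))


def freeze_base(model: nn.Module) -> None:
    """During expert training, only adapter parameters are learned;
    the baseline layers are held fixed."""
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name
```

At downstream fine-tuning time, `freeze_base` would simply not be applied (or all parameters re-enabled), so that both the adapter submodels and the baseline layers are learned, matching the behavior described above.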