US-20260127416-A1 - BLENDED OUTPUT GENERATION IN GENERATIVE ARTIFICIAL INTELLIGENCE MODELS

Abstract

Techniques and apparatus for generating content with multiple specified attributes using a generative artificial intelligence model are described. An example method generally includes receiving a request to generate an output of a machine learning model, the request specifying a plurality of attributes of the output of the machine learning model. A set of intermediate outputs is generated via a plurality of adapters of the machine learning model. Each respective adapter of the plurality of adapters may be associated with a respective attribute of the specified plurality of attributes and include a respective mask in a low-rank dimension associated with the respective adapter. The set of intermediate outputs is merged into a combined output of the plurality of adapters of the machine learning model, and the output of the machine learning model is generated based on the combined output of the plurality of adapters.

Inventors

Aniket ROY
Shweta Mahajan
Shubhankar Mangesh BORSE
Shreya KADAMBI
Ankita NAYAK
Risheek GARREPALLI
Hyojin Park
Debasmit DAS
Munawar HAYAT
Fatih Murat PORIKLI

Assignees

QUALCOMM INCORPORATED

Dates

Publication Date: 20260507
Application Date: 20241107

Claims (20)

1. A processing system for machine learning, comprising: at least one memory having executable instructions stored thereon; and one or more processors configured to execute the executable instructions in order to cause the processing system to: receive a request to generate an output of a machine learning model, the request specifying a plurality of attributes for the output of the machine learning model; generate a set of intermediate outputs via a plurality of adapters of the machine learning model, each respective adapter of the plurality of adapters being associated with a respective attribute of the specified plurality of attributes and being trained based on a cycle-consistency loss between different attributes of the specified plurality of attributes; merge the set of intermediate outputs into a combined output of the plurality of adapters of the machine learning model; and generate the output of the machine learning model based on the combined output of the plurality of adapters.
2. The processing system of claim 1, wherein the output of the machine learning model is an image output and wherein the plurality of attributes includes: an object to be depicted in the image output generated by the machine learning model; and a style of the image output generated by the machine learning model.
3. The processing system of claim 1, wherein the machine learning model comprises a model trained further based on one or more of a content loss or a style loss, and wherein the cycle-consistency loss comprises a first cycle-consistency loss for a first forward and backward transformation of content with different styles and a second cycle-consistency loss for a second forward and backward transformation of style with different contents.
4. The processing system of claim 1, wherein the cycle-consistency loss is weighted based on a defined scaling factor.
5. The processing system of claim 1, wherein the set of intermediate outputs comprises a plurality of images, each image of the plurality of images corresponding to an image conforming to an attribute from the specified plurality of attributes.
6. The processing system of claim 1, wherein a first adapter of the plurality of adapters is biased to operating on earlier layers in the machine learning model over a second adapter of the plurality of adapters and wherein the second adapter is biased to operating on later layers in the machine learning model over the first adapter.
7. The processing system of claim 1, wherein at least one of the adapters has frozen weights and an output mask with learnable weights associated with the cycle-consistency loss.
8. The processing system of claim 1, wherein at least one of the adapters has learnable weights associated with the cycle-consistency loss in a rank dimension of the at least one of the adapters.
9. A processing system for machine learning, comprising: at least one memory having executable instructions stored thereon; and one or more processors configured to execute the executable instructions in order to cause the processing system to: receive a first data set associated with a first attribute of an output for which a machine learning model is to be trained; receive a second data set associated with a second attribute of the output for which the machine learning model is to be trained; train a first adapter of the machine learning model to finetune outputs in accordance with the first attribute based on the first data set; train a second adapter of the machine learning model to finetune outputs in accordance with the second attribute based on the second data set; train a merged adapter based on a cycle-consistency loss between the first attribute and the second attribute, the merged adapter comprising the first adapter and the second adapter; and deploy the machine learning model with the trained merged adapter.
10. The processing system of claim 9, wherein the first attribute comprises an object to be depicted in an image output generated by the machine learning model and wherein the second attribute comprises a style for the image output generated by the machine learning model.
11. The processing system of claim 9, wherein the cycle-consistency loss comprises a first cycle-consistency loss for a first forward and backward transformation of content with different styles and a second cycle-consistency loss for a second forward and backward transformation of style with different contents.
12. The processing system of claim 11, wherein to calculate the first cycle-consistency loss, the one or more processors are configured to cause the processing system to: access a sample including the content with a first style; apply a second style to the sample using the second adapter to generate a style-injected content sample; reconstruct the sample by applying the first style to the style-injected content sample to remove the second style, using the second adapter; and calculate the first cycle-consistency loss based on a difference between the sample and the reconstructed sample.
13. The processing system of claim 11, wherein to calculate the second cycle-consistency loss, the one or more processors are configured to cause the processing system to: access a sample including a first content according to the style; change the first content in the sample to a second content using the first adapter to generate a content-injected style sample; reconstruct the sample by changing the second content in the content-injected style sample to the first content, using the first adapter; and calculate the second cycle-consistency loss based on a difference between the sample and the reconstructed sample.
14. The processing system of claim 9, wherein at least one of the first adapter or the second adapter has frozen weights and an output mask with learnable weights associated with the cycle-consistency loss.
15. The processing system of claim 9, wherein at least one of the first adapter or the second adapter has learnable weights associated with the cycle-consistency loss in a rank dimension of the at least one of the first adapter or the second adapter.
16. A processor-implemented method for machine learning, comprising: receiving a request to generate an output of a machine learning model, the request specifying a plurality of attributes for the output of the machine learning model; generating a set of intermediate outputs via a plurality of adapters of the machine learning model, each respective adapter of the plurality of adapters being associated with a respective attribute of the specified plurality of attributes and being trained based on a cycle-consistency loss between different attributes of the specified plurality of attributes; merging the set of intermediate outputs into a combined output of the plurality of adapters of the machine learning model; and generating the output of the machine learning model based on the combined output of the plurality of adapters.
17. The method of claim 16, wherein the output of the machine learning model is an image output and wherein the plurality of attributes includes: an object to be depicted in the image output generated by the machine learning model; and a style of the image output generated by the machine learning model.
18. The method of claim 16, wherein the machine learning model comprises a model trained further based on one or more of a content loss or a style loss, and wherein the cycle-consistency loss comprises a first cycle-consistency loss for a first forward and backward transformation of content with different styles and a second cycle-consistency loss for a second forward and backward transformation of style with different contents.
19. The method of claim 16, wherein the cycle-consistency loss is weighted based on a defined scaling factor.
20. The method of claim 16, wherein a first adapter of the plurality of adapters is biased to operating on earlier layers in the machine learning model over a second adapter of the plurality of adapters and wherein the second adapter is biased to operating on later layers in the machine learning model over the first adapter.
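
Editorial illustration (not part of the claims): the forward-and-backward computation recited in claims 12 and 13, and the scaling factor of claims 4 and 19, may be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation. Samples are modeled as plain lists of floats, each adapter pass is modeled as an arbitrary callable, the difference is measured by mean squared error, and the weighted sum in `total_loss` is one possible way to apply a scaling factor. The names `mse`, `cycle_consistency_loss`, and `total_loss` are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Transform = Callable[[Vector], List[float]]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared difference between two equal-length samples (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("samples must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(sample: Vector,
                           forward: Transform,
                           backward: Transform) -> float:
    """Forward-then-backward reconstruction error for one sample.

    forward:  stands in for injecting the other attribute (e.g. a second style).
    backward: stands in for restoring the original attribute (e.g. the first style).
    """
    injected = forward(sample)           # e.g. style-injected content sample
    reconstructed = backward(injected)   # attempt to recover the original sample
    return mse(sample, reconstructed)


def total_loss(base_loss: float,
               content_cycle: float,
               style_cycle: float,
               scale: float) -> float:
    """Base loss plus both cycle terms, weighted by a scaling factor (assumed form)."""
    return base_loss + scale * (content_cycle + style_cycle)


if __name__ == "__main__":
    sample = [1.0, 2.0, 3.0]
    # Toy stand-ins for adapter passes: the backward pass is deliberately imperfect.
    loss = cycle_consistency_loss(
        sample,
        forward=lambda v: [x + 0.5 for x in v],
        backward=lambda v: [x - 0.25 for x in v],
    )
    print(loss, total_loss(1.0, loss, 0.0, scale=2.0))
```

In this toy setup a backward pass that exactly inverts the forward pass yields a loss of zero, which is the behavior the cycle-consistency term rewards.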

Description

INTRODUCTION

Aspects of the present disclosure relate to generative artificial intelligence models.

Generative artificial intelligence models can be used in various environments in order to generate a response to an input prompt (also referred to as a query or an input). For example, generative artificial intelligence models can be used in chatbot applications in which large language models (LLMs) are used to generate an answer, or at least a response, to an input prompt. Other examples of generative artificial intelligence models include latent diffusion models, in which a model generates an image or stream of images (e.g., video content) from an input text description of the content of the desired image or stream of images; decision transformers, in which future actions are predicted based on sequences of prior actions within a given environment; or the like.

Generally, generative artificial intelligence models have many (e.g., millions or billions of) parameters, resulting in models that are large in size and computationally expensive to train. Further, once trained, generative artificial intelligence models are often difficult (or impossible) to fine-tune, as the vast number of parameters makes overfitting (where the model fits too closely to the training data, resulting in loss of accuracy and generalization for runtime data) a major challenge (e.g., potentially relying on tremendous amounts of fine-tuning data to prevent overfitting). To allow for generative artificial intelligence models to be fine-tuned or modified, smaller model adapters may be trained for large models. For example, adapters may be trained to improve or enable video generation based on desired appearances, movement, and the like.

BRIEF SUMMARY

Certain aspects of the present disclosure provide a method for generating content using a generative artificial intelligence model.
An example method generally includes receiving a request to generate an output of a machine learning model, the request specifying a plurality of attributes of the output of the machine learning model. A set of intermediate outputs is generated via a plurality of adapters of the machine learning model. Each respective adapter of the plurality of adapters may be associated with a respective attribute of the specified plurality of attributes and include a respective mask in a low-rank dimension associated with the respective adapter. The set of intermediate outputs is merged into a combined output of the plurality of adapters of the machine learning model, and the output of the machine learning model is generated based on the combined output of the plurality of adapters.

Certain aspects of the present disclosure provide a method for training a generative artificial intelligence model to generate content. An example method generally includes receiving a first data set associated with a first attribute for which a machine learning model is to be trained and a second data set associated with a second attribute for which the machine learning model is to be trained. A first adapter of the machine learning model is trained to fine-tune outputs in accordance with the first attribute based on the first data set, the first adapter including a first mask in a low-rank dimension associated with the first adapter. A second adapter of the machine learning model is trained to fine-tune outputs in accordance with the second attribute based on the second data set, the second adapter including a second mask in a low-rank dimension associated with the second adapter, such that an output of the first adapter is orthogonal to an output of the second adapter. A merged adapter is trained, with the merged adapter comprising the trained first adapter and the trained second adapter. The machine learning model is deployed with the trained merged adapter.
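
Editorial illustration: the masked low-rank adapters and the merging step described above may be sketched as follows. This is a minimal sketch of one construction that yields orthogonal adapter outputs, and the disclosure text here does not confirm that it is the construction actually used. It assumes binary masks that select disjoint rank indices, and an up-projection shared by both adapters whose columns are orthonormal (the identity matrix below). All names and numeric values (`UP`, `DOWN_CONTENT`, `DOWN_STYLE`, `MASK_CONTENT`, `MASK_STYLE`, `adapter_output`, `merged_output`) are hypothetical.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]
Vector = Sequence[float]


def matvec(matrix: Matrix, vec: Vector) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def adapter_output(up: Matrix, down: Matrix, mask: Vector, x: Vector) -> List[float]:
    """Low-rank adapter pass: up @ (mask * (down @ x)).

    The mask acts on the rank dimension, zeroing the rank components
    that this adapter is not allowed to use.
    """
    hidden = matvec(down, x)                          # project into rank space
    masked = [m * h for m, h in zip(mask, hidden)]    # apply mask in rank dimension
    return matvec(up, masked)                         # project back to output space


def merged_output(base_out: Vector, adapter_outs: Sequence[Vector]) -> List[float]:
    """Combine the base layer output with the sum of all adapter outputs."""
    merged = list(base_out)
    for out in adapter_outs:
        merged = [a + b for a, b in zip(merged, out)]
    return merged


# Assumption: shared up-projection with orthonormal columns (identity, rank 4).
UP = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
# Hypothetical per-attribute down-projections (rank 4, input dimension 3).
DOWN_CONTENT = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
DOWN_STYLE = [[0, 0, 1], [1, 0, 1], [0, 1, 0], [1, 1, 1]]
# Assumption: disjoint binary masks over the rank dimension.
MASK_CONTENT = [1, 1, 0, 0]
MASK_STYLE = [0, 0, 1, 1]

if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    content = adapter_output(UP, DOWN_CONTENT, MASK_CONTENT, x)
    style = adapter_output(UP, DOWN_STYLE, MASK_STYLE, x)
    print(content, style, dot(content, style))
    print(merged_output([0.5, 0.5, 0.5, 0.5], [content, style]))
```

Under these assumptions the two adapters write to non-overlapping rank components, so their outputs have zero inner product for any input, and summing them does not let one attribute overwrite the other. With separately learned up-projections, disjoint masks alone would not guarantee orthogonality.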
Certain aspects of the present disclosure provide a method for generating content using a generative artificial intelligence model. An example method generally includes receiving a request to generate an output of a machine learning model, the request specifying a plurality of attributes for the output of the machine learning model. A set of intermediate outputs is generated via a plurality of adapters of the machine learning model. Generally, each respective adapter of the plurality of adapters is associated with a respective attribute of the specified plurality of attributes and is trained based on a cycle-consistency loss between different attributes of the specified plurality of attributes. The set of intermediate outputs is merged into a combined output of the plurality of adapters of the machine learning model, and the output of the machine learning model is generated based on the combined output of the plurality of adapters.

Certain aspects of the present disclosure provide a method for training a generative artificial intelligence model to generate content. An example method generally inclu