
US-12626420-B2 - Segmentation free guidance in diffusion models


Abstract

Certain aspects of the present disclosure provide techniques for generating an output image based on a text prompt. A method may include receiving the text prompt; providing a user interface comprising one or more input elements associated with one or more words of the text prompt; receiving input corresponding to at least one of the one or more input elements, the input indicating a semantic importance for each of at least one of the one or more words associated with the at least one of the one or more input elements; and generating the output image based on the text prompt and the input.

Inventors

  • Kambiz Azarian Yazdi
  • Fatih Murat PORIKLI
  • Qiqi Hou
  • Debasmit DAS

Assignees

  • QUALCOMM INCORPORATED

Dates

Publication Date
2026-05-12
Application Date
2023-11-16

Claims (20)

  1. An apparatus configured to generate an output image based on a text prompt, comprising: one or more memories configured to store a latent image representation; and one or more processors, coupled to the one or more memories, configured to: obtain the text prompt; encode the text prompt into a plurality of conditioning tokens; for each of a plurality of patches of the latent image representation: calculate a respective plurality of cross-attention weights corresponding to the plurality of conditioning tokens based on the patch as a query and the plurality of conditioning tokens as a key; and modify a maximum value cross-attention weight among the respective plurality of cross-attention weights to generate a modified respective plurality of cross-attention weights; perform an iteration of denoising using the modified respective plurality of cross-attention weights for each of the plurality of patches to obtain a modified latent image representation; and generate the output image based on the modified latent image representation.
  2. The apparatus of claim 1, wherein to modify the maximum value cross-attention weight among the respective plurality of cross-attention weights comprises to reduce the maximum value cross-attention weight.
  3. The apparatus of claim 1, wherein to modify the maximum value cross-attention weight among the respective plurality of cross-attention weights comprises to multiply the maximum value cross-attention weight by a negative scalar value.
  4. The apparatus of claim 3, wherein the one or more processors are configured to receive the negative scalar value as a user-specified parameter.
  5. The apparatus of claim 1, wherein to modify the maximum value cross-attention weight among the respective plurality of cross-attention weights comprises to set the maximum value cross-attention weight to zero.
  6. The apparatus of claim 1, wherein the one or more processors are configured to perform one or more initial denoising iterations using classifier-free guidance prior to performing the iteration of denoising using the modified respective plurality of cross-attention weights for each of the plurality of patches.
  7. The apparatus of claim 1, further comprising a display, coupled to the one or more processors, configured to display a user interface configured to receive input indicative of an emphasis strength associated with one or more words of the text prompt, wherein the emphasis strength controls an amount to modify the maximum value cross-attention weight among the respective plurality of cross-attention weights.
  8. The apparatus of claim 7, wherein the user interface includes one or more interface elements configured to receive the input, the one or more interface elements including at least one of a slider, a numerical input, or a keyword highlight.
  9. The apparatus of claim 1, wherein the one or more processors are configured to: for each of a plurality of patches of the modified latent image representation: calculate a respective second plurality of cross-attention weights corresponding to the plurality of conditioning tokens based on the patch as another query and the plurality of conditioning tokens as another key; and modify another maximum value cross-attention weight among the respective second plurality of cross-attention weights to generate another modified respective plurality of second cross-attention weights; and perform a second iteration of denoising using the modified respective plurality of second cross-attention weights for each of the plurality of patches of the modified latent image representation to obtain a second modified latent image representation, wherein to generate the output image based on the modified latent image representation comprises to decode the second modified latent image representation using a decoder to generate the output image.
  10. The apparatus of claim 1, further comprising a display, coupled to the one or more processors, configured to display the output image.
  11. The apparatus of claim 1, further comprising: a modem, coupled to the one or more processors, configured to modulate one or more carrier wave signals with data indicative of the output image.
  12. The apparatus of claim 11, further comprising: one or more antennas, coupled to the modem, configured to transmit the one or more carrier wave signals to a device.
  13. The apparatus of claim 1, wherein to perform the iteration of denoising, the one or more processors are configured to: generate a first output based on the latent image representation and the respective plurality of cross-attention weights for each of the plurality of patches of the latent image representation; generate a second output based on the latent image representation and the modified respective plurality of cross-attention weights for each of the plurality of patches of the latent image representation; and subtract the second output from the first output to obtain the modified latent image representation.
  14. The apparatus of claim 1, wherein to perform the iteration of denoising, the one or more processors are configured to: generate the modified latent image representation based on the latent image representation and the modified respective plurality of cross-attention weights for each of the plurality of patches of the latent image representation.
  15. An apparatus configured to generate an output image based on a text prompt, comprising: one or more memories configured to store the output image; and one or more processors, coupled to the one or more memories, configured to: receive the text prompt; provide a user interface comprising one or more input elements associated with one or more words of the text prompt; receive input corresponding to at least one of the one or more input elements, the input indicating a semantic importance for each of at least one of the one or more words associated with the at least one of the one or more input elements; and generate the output image based on the text prompt and the input.
  16. The apparatus of claim 15, further comprising a display, coupled to the one or more processors, configured to: display the user interface; and display the output image.
  17. The apparatus of claim 15, wherein to generate the output image, the one or more processors are configured to generate the output image using the text prompt and the input as inputs to a generative artificial intelligence (AI) model.
  18. The apparatus of claim 15, wherein a first input element of the one or more input elements is a slider element configured to increase or decrease the importance of a first word of the one or more words.
  19. The apparatus of claim 15, wherein a first input element of the one or more input elements is a dial element configured to increase or decrease the importance of a first word of the one or more words.
  20. The apparatus of claim 15, wherein the one or more processors are configured to modify an appearance of the at least one of the one or more words based on the indicated semantic importance.
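The per-patch attention modification recited in claims 1, 3, and 5 can be illustrated with a minimal NumPy sketch. This is not the patented implementation; the function name, the `scale` parameter (standing in for the claimed negative scalar value or zeroing), and the toy dimensions are assumptions for illustration only. Each latent patch acts as a query against the conditioning tokens (keys), and the single largest attention weight per patch is then rescaled:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the token axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def modified_cross_attention(patches, tokens, scale=-1.0):
    """Sketch of claims 1/3/5: compute cross-attention weights with each
    patch as a query and the conditioning tokens as keys, then modify
    the maximum weight per patch.

    patches: (P, d) array of latent-image patches (queries)
    tokens:  (T, d) array of conditioning tokens (keys)
    scale:   hypothetical factor applied to the max weight; a negative
             value mirrors claim 3, and 0.0 mirrors claim 5's zeroing.
    Returns (original_weights, modified_weights), each of shape (P, T).
    """
    d = patches.shape[-1]
    logits = patches @ tokens.T / np.sqrt(d)      # (P, T) scaled dot products
    weights = softmax(logits, axis=-1)            # rows sum to 1
    modified = weights.copy()
    max_idx = modified.argmax(axis=-1)            # max-weight token per patch
    rows = np.arange(modified.shape[0])
    modified[rows, max_idx] = scale * modified[rows, max_idx]
    return weights, modified
```

Only the single maximum entry in each patch's row is touched; the remaining weights are left as-is, matching the claim language, which modifies "a maximum value cross-attention weight among the respective plurality of cross-attention weights".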

Description

FIELD OF THE DISCLOSURE

Aspects of the present disclosure relate to image generation techniques using diffusion models.

DESCRIPTION OF RELATED ART

Diffusion models are a class of generative deep learning models that are capable of generating high-fidelity images from text prompts. The underlying principle involves the model adding noise to training data, such as an original image, and learning to recover the data by reversing the noising process. For example, during a training phase, the model may add noise to an image of an apple and learn to recover the image of the apple from noise (e.g., noisy data that may be generated randomly). Accordingly, the model learns to generate an apple from noise. The model may be trained on many images of many different objects, and accordingly learns to generate many different images of many different objects.

During inference, to generate an image from a text prompt, the diffusion model starts with an image of noise (e.g., random data) and iteratively denoises it to eventually generate an output image aligned more closely with the text prompt. During this iterative process, the diffusion model generates latent image representations that represent the output image at each stage of the iteration. Although diffusion models can produce a wide variety of images, they sometimes fall short of achieving true fidelity to the given prompt. To address this, various methods, termed guidance or conditioning techniques, have been devised. These strategies enhance image fidelity (quality) but may reduce diversity (coverage). One common guidance method is classifier guidance, in which a separate classifier model is trained alongside the diffusion model. The classifier provides gradients and likelihoods that steer the diffusion model to remain consistent and produce images more in line with the text prompt. However, introducing an additional classifier model adds computational complexity and latency.
Thus, classifier-free guidance was introduced to simplify the process by using the diffusion model itself as the classifier. This is performed by running the diffusion model twice: once with the full prompt and once with an empty prompt, to generate two conditional scores. The contrasting conditional scores, reflecting residual noise levels from the two iterations, then serve as a guiding signal. While existing methods produce images adhering to input text prompts, limitations remain in localized fidelity and semantic consistency. The generated images often contain unrelated or contradictory visual features in different spatial regions, which reduces quality and alignment with the input textual concepts.

SUMMARY

One aspect provides a method for generating an output image based on a text prompt. The method includes obtaining the text prompt; encoding the text prompt into a plurality of conditioning tokens; for each of one or more patches of a latent image representation: calculating a respective plurality of cross-attention weights corresponding to the plurality of conditioning tokens based on the patch as a query and the plurality of conditioning tokens as a key; and modifying a maximum value cross-attention weight among the respective plurality of cross-attention weights to generate a modified respective plurality of cross-attention weights; performing an iteration of denoising using the modified respective plurality of cross-attention weights for each of the one or more patches to obtain a modified latent image representation; and generating the output image based on the modified latent image representation. Another aspect provides a method for generating an output image based on a text prompt.
The method includes receiving the text prompt; providing a user interface comprising one or more input elements associated with one or more words of the text prompt; receiving input corresponding to at least one of the one or more input elements, the input indicating a semantic importance for each of at least one of the one or more words associated with the at least one of the one or more input elements; and generating the output image based on the text prompt and the input. Other aspects provide: an apparatus operable, configured, or otherwise adapted to perform any one or more of the aforementioned methods and/or those described elsewhere herein; a non-transitory, computer-readable media comprising instructions that, when executed by a processor of an apparatus, cause the apparatus to perform the aforementioned methods as well as those described elsewhere herein; a computer program product embodied on a computer-readable storage medium comprising code for performing the aforementioned methods as well as those described elsewhere herein; and/or an apparatus comprising means for performing the aforementioned methods as well as those described elsewhere herein. By way of example, an apparatus may comprise a processing system, a device with a processing system, or processing systems
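The two guidance signals discussed in this document can be sketched side by side: the standard classifier-free guidance described in the related art (combining a conditional and an unconditional noise estimate), and the subtraction of the modified-attention output from the unmodified output recited in claim 13. This is a minimal illustration, not the patented implementation; the function names, the `guidance_scale` default, and the `strength` knob (loosely mirroring the claimed emphasis strength) are assumptions for illustration:

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=7.5):
    """Standard classifier-free guidance: the difference between the
    conditional and unconditional noise estimates steers the denoising
    step toward the prompt."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def max_weight_guidance(out_unmodified, out_modified, strength=1.0):
    """Sketch of claim 13: subtract the output produced with the
    modified (max-suppressed) cross-attention weights from the output
    produced with the unmodified weights. `strength` is a hypothetical
    scaling knob; the claim itself recites a plain subtraction."""
    return out_unmodified - strength * out_modified
```

In both cases the guidance is a contrast between two forward passes of the same model; the patented approach differs from classifier-free guidance in that the second pass perturbs the cross-attention weights rather than blanking the prompt.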