US-12626431-B2 - Utilizing machine learning models to generate image editing directions in a latent space
Abstract
The present disclosure relates to systems, non-transitory computer-readable media, and methods for utilizing machine learning models to generate modified digital images. In particular, in some embodiments, the disclosed systems generate image editing directions between textual identifiers of two visual features utilizing a language prediction machine learning model and a text encoder. In some embodiments, the disclosed systems generate an inversion of a digital image utilizing a regularized inversion model to guide forward diffusion of the digital image. In some embodiments, the disclosed systems utilize cross-attention guidance to preserve structural details of a source digital image when generating a modified digital image with a diffusion neural network.
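The regularized inversion summarized above can be illustrated with a small sketch. The idea is that inverted (forward-diffused) noise should look like uncorrelated Gaussian noise, so a penalty on spatial auto-correlation keeps the inversion close to the distribution the diffusion model expects. The function below is an illustrative approximation, not the patented objective; the `max_shift` parameter and the rolled-correlation form are assumptions made for this sketch.

```python
import numpy as np

def autocorrelation_penalty(noise: np.ndarray, max_shift: int = 4) -> float:
    """Penalize spatial auto-correlation in a predicted noise map.

    Uncorrelated Gaussian noise has near-zero correlation with shifted
    copies of itself, so driving this penalty down keeps an inverted
    noise map close to the white-noise distribution a diffusion model
    expects during sampling.
    """
    total = 0.0
    for shift in range(1, max_shift + 1):
        for axis in (0, 1):
            # squared correlation with a copy rolled along one spatial axis
            total += float(np.mean(noise * np.roll(noise, shift, axis=axis)) ** 2)
    return total

rng = np.random.default_rng(0)
white = rng.standard_normal((32, 32))        # behaves like proper noise
blurry = np.cumsum(white, axis=0) / 10.0     # strongly auto-correlated rows

# The penalty is far larger for the correlated map than for white noise,
# so minimizing it during inversion discourages structured residuals.
```

In an actual inversion loop this scalar would be differentiated with respect to the predicted noise and used to regularize each forward-diffusion step.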
Inventors
- Yijun Li
- Richard Zhang
- Krishna Kumar Singh
- Jingwan Lu
- Gaurav Parmar
- Jun-Yan Zhu
Assignees
- ADOBE INC.
Dates
- Publication Date: 2026-05-12
- Application Date: 2023-03-03
Claims (20)
- 1 . A computer-implemented method comprising: generating, utilizing a language prediction machine learning model, for a digital image: a first plurality of phrases based on a first textual identifier of a source visual feature of the digital image, the first plurality of phrases comprising the first textual identifier; and a second plurality of phrases based on a second textual identifier of a target visual feature, the second plurality of phrases comprising the second textual identifier; generating, utilizing a text encoder, a source embedding of the first plurality of phrases comprising the first textual identifier and a target embedding of the second plurality of phrases comprising the second textual identifier; and determining an image editing direction between the source visual feature and the target visual feature by comparing the source embedding of the first plurality of phrases comprising the first textual identifier and the target embedding of the second plurality of phrases comprising the second textual identifier.
- 2 . The computer-implemented method of claim 1 , wherein generating the first plurality of phrases comprises determining the first textual identifier for the source visual feature by extracting the source visual feature from a source digital image utilizing a vision-language machine learning model.
- 3 . The computer-implemented method of claim 1 , wherein generating the second plurality of phrases comprises determining the target visual feature based on the source visual feature.
- 4 . The computer-implemented method of claim 1 , further comprising: receiving, from a client device, a natural language editing input; and identifying the source visual feature and the target visual feature from the natural language editing input.
- 5 . The computer-implemented method of claim 1 , wherein determining the image editing direction between the source visual feature and the target visual feature comprises determining a mean difference between embedded phrases of the source embedding and embedded phrases of the target embedding.
- 6 . The computer-implemented method of claim 1 , further comprising generating, from the image editing direction and a source digital image portraying the source visual feature, a modified digital image portraying the target visual feature utilizing a generative machine learning model.
- 7 . The computer-implemented method of claim 6 , further comprising: generating, utilizing the text encoder, a caption embedding of an image caption describing the source digital image; creating an image editing encoding by combining the caption embedding with the image editing direction; and generating the modified digital image portraying the target visual feature from the image editing encoding utilizing the generative machine learning model.
- 8 . The computer-implemented method of claim 6 , wherein generating the modified digital image portraying the target visual feature utilizing the generative machine learning model comprises generating the modified digital image portraying the target visual feature utilizing a diffusion neural network.
- 9 . A system comprising: one or more memory devices; and one or more processors coupled to the one or more memory devices that cause the system to perform operations comprising: determining a first textual identifier for a source visual feature within a source digital image; determining a second textual identifier for a target visual feature, the target visual feature comprising an edit to the source visual feature; generating, utilizing a language prediction model, for the source digital image: a first plurality of phrases based on the first textual identifier, the first plurality of phrases comprising the first textual identifier, and a second plurality of phrases based on the second textual identifier, the second plurality of phrases comprising the second textual identifier; generating, utilizing a text encoder, a source embedding of the first plurality of phrases comprising the first textual identifier and a target embedding of the second plurality of phrases comprising the second textual identifier; and determining an image editing direction between the source visual feature and the target visual feature by comparing the source embedding of the first plurality of phrases comprising the first textual identifier and the target embedding of the second plurality of phrases comprising the second textual identifier.
- 10 . The system of claim 9 , wherein determining the second textual identifier comprises: receiving, via natural language input from a client device, instructions to edit the source visual feature; and analyzing the natural language input to determine the target visual feature and the second textual identifier.
- 11 . The system of claim 9 , wherein determining the first textual identifier comprises: generating, utilizing a vision-language machine learning model, an image caption describing the source digital image; and identifying the first textual identifier from the image caption.
- 12 . The system of claim 9 , wherein the operations further comprise generating, utilizing a generative machine learning model, a modified digital image portraying the target visual feature from the image editing direction and the source digital image.
- 13 . The system of claim 12 , wherein generating the modified digital image comprises: generating, utilizing a vision-language machine learning model, an image caption describing the source digital image; generating, utilizing the text encoder, a caption embedding of the image caption; and generating, utilizing the generative machine learning model, the modified digital image based on the caption embedding and the image editing direction.
- 14 . The system of claim 13 , wherein generating the modified digital image based on the caption embedding and the image editing direction comprises combining the caption embedding and the image editing direction with an inversion of the source digital image utilizing a diffusion neural network.
- 15 . A non-transitory computer readable medium storing instructions thereon that, when executed by at least one processor, cause the at least one processor to perform operations comprising: generating, utilizing a language prediction model, for a digital image: a first plurality of phrases based on a first textual identifier of a source visual feature of the digital image, the first plurality of phrases comprising the first textual identifier; and a second plurality of phrases based on a second textual identifier of a target visual feature, the second plurality of phrases comprising the second textual identifier; generating, utilizing a text encoder, a source embedding of the first plurality of phrases comprising the first textual identifier and a target embedding of the second plurality of phrases comprising the second textual identifier; and determining an image editing direction between the source visual feature and the target visual feature by comparing the source embedding of the first plurality of phrases comprising the first textual identifier and the target embedding of the second plurality of phrases comprising the second textual identifier.
- 16 . The non-transitory computer readable medium of claim 15 , wherein the operations further comprise: identifying the source visual feature within a source digital image; determining the first textual identifier for the source visual feature; and determining the second textual identifier for the target visual feature, the target visual feature comprising an edit to the source visual feature.
- 17 . The non-transitory computer readable medium of claim 15 , wherein determining the second textual identifier comprises receiving natural language input from a client device, the natural language input indicating an edit to the source visual feature.
- 18 . The non-transitory computer readable medium of claim 15 , wherein the operations further comprise: generating, utilizing a vision-language machine learning model, an image caption describing a source digital image portraying the source visual feature; and generating, utilizing the text encoder, a caption embedding of the image caption.
- 19 . The non-transitory computer readable medium of claim 18 , wherein the operations further comprise: combining the caption embedding with the image editing direction to create an image editing encoding; and generating, utilizing a diffusion neural network, a modified digital image portraying the target visual feature by combining the image editing encoding and an inversion of the source digital image.
- 20 . The non-transitory computer readable medium of claim 19 , wherein generating the modified digital image utilizing the diffusion neural network comprises: generating an inversion of the source digital image based on the caption embedding; denoising, utilizing a first channel of the diffusion neural network, the inversion of the source digital image; and generating the modified digital image by denoising, utilizing a second channel of the diffusion neural network, the inversion of the source digital image based on the image editing encoding with guidance from the denoising by the first channel of the diffusion neural network.
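The direction computation recited in claims 1, 5, and 9 can be sketched as follows: a bank of phrases is generated around each textual identifier, every phrase is embedded, and the image editing direction is the mean difference between the two embedding banks. The hash-seeded `embed` function below is a deterministic stand-in for the text encoder (e.g., a CLIP-style encoder), and the hard-coded phrase banks stand in for the language prediction model's output; both are assumptions made for this sketch.

```python
import zlib
import numpy as np

def embed(phrase: str, dim: int = 64) -> np.ndarray:
    """Deterministic stand-in for a text encoder. A real system would use
    a learned encoder (e.g., a CLIP-style text encoder) as recited in the
    claims."""
    rng = np.random.default_rng(zlib.crc32(phrase.encode("utf-8")))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def edit_direction(source_phrases, target_phrases) -> np.ndarray:
    """Mean difference between the embedded target and source phrase banks
    (the comparison described in claim 5)."""
    src = np.mean([embed(p) for p in source_phrases], axis=0)
    tgt = np.mean([embed(p) for p in target_phrases], axis=0)
    return tgt - src

# Hypothetical phrase banks a language prediction model might produce for
# the identifiers "cat" (source) and "dog" (target); each bank contains
# its own identifier, as the claims require.
source_phrases = ["cat", "a photo of a cat", "a cat sitting on a sofa"]
target_phrases = ["dog", "a photo of a dog", "a dog sitting on a sofa"]
direction = edit_direction(source_phrases, target_phrases)

# Per claim 7, the direction can be combined with a caption embedding to
# form an image editing encoding that conditions the generative model.
editing_encoding = embed("a photo of a cat in a garden") + direction
```

Averaging over many phrasings, rather than embedding the two identifiers alone, is what makes the direction robust to wording: incidental phrasing differences cancel out in the mean, leaving the semantic cat-to-dog shift.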
Description
BACKGROUND

Recent years have seen significant improvements in hardware and software platforms for digital image processing and editing. For example, conventional systems have leveraged recent computing advancements to modify digital images utilizing a variety of digital tools and models. To illustrate, conventional systems utilize large-scale text-to-image generative models to synthesize digital images. Despite these advancements, however, conventional systems continue to suffer from a number of technical deficiencies, particularly with regard to accuracy, efficiency, and flexibility in generating and modifying digital images.

BRIEF SUMMARY

Embodiments of the present disclosure provide benefits and/or solve one or more of the foregoing or other problems in the art with systems, non-transitory computer-readable media, and methods for utilizing machine learning models to modify digital images. For example, in some embodiments, the disclosed systems utilize a regularized inversion model to increase the accuracy of inverted (embedded) digital images, improve the efficiency and flexibility of introducing modifications to inverted digital images, and thus increase the fidelity of modified digital images upon image reconstruction. Further, in some embodiments, the disclosed systems utilize an edit direction generation model to determine image editing directions between two visual features within an embedded space. Moreover, in some embodiments, the disclosed systems utilize a cross-attention guidance model to preserve structural details of digital images when generating modified digital images with a diffusion neural network. Additional features and advantages of one or more embodiments of the present disclosure are outlined in the description which follows, and in part will be obvious from the description, or may be learned by the practice of such example embodiments.
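The cross-attention guidance mentioned in the summary can be sketched as a structural-consistency loss: attention maps recorded while reconstructing the source image serve as a reference, and the editing pass is penalized for deviating from them. The squared-error form and per-layer summation below are assumptions made for illustration; the disclosed system computes and applies its guidance inside a diffusion network's denoising steps.

```python
import numpy as np

def cross_attention_loss(reference_maps, editing_maps) -> float:
    """Sum of squared differences between reference cross-attention maps
    (recorded while reconstructing the source image) and the maps produced
    during the edit. In a diffusion model, gradients of this scalar with
    respect to the latent would steer each denoising step toward the
    source image's spatial structure."""
    return sum(
        float(np.sum((ref - edit) ** 2))
        for ref, edit in zip(reference_maps, editing_maps)
    )

rng = np.random.default_rng(1)
# Toy stand-ins for per-layer attention maps (tokens x height x width).
reference = [rng.random((8, 16, 16)) for _ in range(3)]
drifted = [m + 0.1 * rng.random(m.shape) for m in reference]

# Identical maps give zero loss; structural drift increases it.
```

Because cross-attention maps localize where each text token attends in the image, keeping them close to the reference preserves layout while the editing direction changes appearance.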
BRIEF DESCRIPTION OF THE DRAWINGS

The detailed description provides one or more embodiments with additional specificity and detail through the use of the accompanying drawings, as briefly described below.

FIG. 1 illustrates a diagram of an environment in which an image modification system can operate in accordance with one or more embodiments.
FIG. 2 illustrates an overview of an image modification system generating a modified digital image in accordance with one or more embodiments.
FIG. 3 illustrates an image modification system inverting a digital image utilizing regularized forward diffusion in accordance with one or more embodiments.
FIG. 4 illustrates an image modification system determining an auto-correlation regularization loss in accordance with one or more embodiments.
FIG. 5 illustrates an image modification system generating an image editing direction in accordance with one or more embodiments.
FIG. 6 illustrates an image modification system generating a reference encoding of a digital image in accordance with one or more embodiments.
FIG. 7 illustrates an image modification system determining a cross-attention loss in accordance with one or more embodiments.
FIG. 8 illustrates an image modification system generating a modified digital image utilizing diffusion-based editing in accordance with one or more embodiments.
FIG. 9 illustrates an image modification system generating a modified digital image utilizing a diffusion neural network with a conditioning mechanism in accordance with one or more embodiments.
FIGS. 10-13 illustrate comparative experimental results for an image modification system in accordance with one or more embodiments.
FIG. 14 illustrates a schematic diagram of an image modification system in accordance with one or more embodiments.
FIG. 15 illustrates a flowchart of a series of acts for generating a modified noise map in accordance with one or more embodiments.
FIG. 16 illustrates a flowchart of a series of acts for generating an image editing direction in accordance with one or more embodiments.
FIG. 17 illustrates a flowchart of a series of acts for generating a modified digital image with cross-attention guidance in accordance with one or more embodiments.
FIG. 18 illustrates a block diagram of an example computing device for implementing one or more embodiments of the present disclosure.

DETAILED DESCRIPTION

This disclosure describes one or more embodiments of an image modification system that utilizes machine learning models to modify digital images. In particular, in one or more embodiments, the image modification system utilizes one or more of an edit direction generation model, a regularized inversion model, or a cross-attention guidance model as part of a generative machine learning approach to incorporate one or more edits into an embedded image space and generate a modified digital image. In some embodiments, for instance, the image modification system utilizes an edit direction generation model to determine an image editing direction between a source visual feature portrayed within a source digital imag