
US-12620216-B2 - Latent diffusion model autodecoders

US12620216B2

Abstract

Described is a system for improving machine learning models. In some cases, the system improves such models by identifying an autoencoder for a latent diffusion machine learning model, where the latent diffusion machine learning model is trained to receive text as input and output an image based on the received text. The system identifies a number of channels in a decoder of the autoencoder, the decoder being configured to receive latent features as input and output images. The system further identifies a performance characteristic of the decoder and changes the node topology of the decoder based on the performance characteristic to generate an updated decoder. The system retrains the latent diffusion machine learning model using the updated decoder by inputting latent features to the updated decoder, receiving an outputted image from the updated decoder, and updating one or more weights of the updated decoder based on an assessment of the outputted image.
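The channel-selection step summarized above can be sketched in heavily simplified form. The function below is illustrative only — the function name, the dictionary layout, and the use of mean absolute output magnitude as the performance characteristic are assumptions, not taken from the specification. It ranks decoder channels by a per-channel output magnitude and keeps the top half, in the spirit of removing the channels with the lowest values for the performance characteristic:

```python
def prune_low_magnitude_channels(channel_activations):
    """Rank decoder channels by mean absolute output magnitude and
    return the ids of the top half (the rest would be removed from
    the decoder's node topology).

    channel_activations: {channel_id: [sampled output activations]}
    """
    # Performance characteristic per channel: mean absolute magnitude.
    magnitudes = {
        ch: sum(abs(a) for a in acts) / len(acts)
        for ch, acts in channel_activations.items()
    }
    # Sort channels from highest to lowest magnitude.
    ranked = sorted(magnitudes, key=magnitudes.get, reverse=True)
    # Keep the top half; the lowest-magnitude half is pruned.
    keep = ranked[: max(1, len(ranked) // 2)]
    return sorted(keep)
```

A real implementation would operate on convolutional layers of the decoder and rebuild the weight tensors after pruning; this sketch only captures the ranking-and-halving logic.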

Inventors

  • Pavlo Chemerys
  • Sergey Tulyakov
  • Huan WANG
  • Colin Eles
  • Ju Hu
  • Qing Jin
  • Yanyu Li
  • Ergeta Muca
  • Jian Ren
  • Dhritiman Sagar
  • Aleksei Stoliar

Assignees

  • SNAP INC.

Dates

Publication Date
2026-05-05
Application Date
2023-12-29

Claims (20)

  1. A system comprising: at least one processor; and at least one memory component storing instructions that, when executed by the at least one processor, cause the at least one processor to perform operations comprising: identifying an autoencoder for a latent diffusion machine learning model, the latent diffusion machine learning model trained to receive text as input and output an image based on the received text; identifying a number of channels in a decoder of the autoencoder, the decoder configured to receive latent features as input and to output images; identifying a first performance characteristic of the decoder; changing a node topology of the decoder based on the first performance characteristic to generate an updated decoder; and retraining the latent diffusion machine learning model using the updated decoder by performing operations comprising: inputting latent features into the updated decoder; receiving an outputted image from the updated decoder; and updating one or more weights of the updated decoder based on an assessment of the outputted image.
  2. The system of claim 1, wherein the first performance characteristic includes a magnitude of the output for a particular channel.
  3. The system of claim 2, wherein changing the node topology includes: determining that the magnitude for the particular channel is above a minimum threshold; and changing the node topology for the particular channel based on the determination that the magnitude is above the minimum threshold.
  4. The system of claim 1, wherein the first performance characteristic includes a reconstruction loss that measures a loss of reconstruction by the autoencoder converting text to image data.
  5. The system of claim 1, wherein changing the node topology includes removing one or more channels of the decoder.
  6. The system of claim 1, the operations further comprising: identifying the first performance characteristic for each channel in the decoder, and wherein changing the node topology of the decoder comprises removing half of the channels with the lowest values for the first performance characteristic.
  7. The system of claim 1, wherein changing the node topology includes reducing channel dimensions in at least one layer of the decoder to obtain a compressed decoder, wherein the compressed decoder processes latent features with lower latency and fewer parameters than the decoder.
  8. The system of claim 1, wherein changing the node topology includes removing or adding one or more channels of the decoder until two or more performance thresholds are met.
  9. The system of claim 8, the operations further comprising: removing one or more channels in response to a first performance threshold being met, and adding one or more channels in response to a second performance threshold being met.
  10. The system of claim 8, the operations further comprising: repeatedly adding and removing channels of the decoder until two performance thresholds are met.
  11. The system of claim 8, wherein adding the one or more channels of the decoder includes identifying one or more existing channels of the decoder that meet a performance threshold for a second performance characteristic, and copying the one or more existing channels of the decoder to add as new channels to the decoder.
  12. The system of claim 11, wherein the first performance characteristic and second performance characteristic are of a same type.
  13. The system of claim 11, wherein the first performance characteristic and second performance characteristic are of different types.
  14. The system of claim 1, wherein the assessment of the outputted image comprises comparing the outputted image of the updated decoder with an outputted image of the decoder, wherein the outputted image of the updated decoder and the outputted image of the decoder are generated using the same input data.
  15. The system of claim 14, the operations further comprising: generating latent features by inputting random noise data into the latent diffusion machine learning model, wherein data that is input into the updated decoder and the decoder includes the latent features, the same input data including the random noise data.
  16. The system of claim 14, wherein comparing the outputted image of the updated decoder with an outputted image of the decoder includes determining a mean squared error between the outputted image of the updated decoder and the outputted image of the decoder, and the operations further comprise: updating one or more weights of the updated decoder based on the mean squared error; and repeatedly inputting random noise data into the updated decoder with the updated weights, comparing the outputted image of the updated decoder with an outputted image of the decoder, and updating the weights of the updated decoder until the mean squared error between the outputted images meets a mean squared error threshold.
  17. The system of claim 1, wherein the outputted image is a frame of a video, wherein the operations performed by the at least one processor are repeated to generate other frames for the video.
  18. The system of claim 1, the operations further comprising: generating a prompt based on user interaction data; inputting the prompt into a latent feature generator of the latent diffusion machine learning model causing the latent feature generator to output latent features that are inputted into the updated decoder with the updated weights; and receiving an image generated by the updated decoder based on the latent features inputted into the updated decoder with the updated weights.
  19. A method comprising: identifying an autoencoder for a latent diffusion machine learning model, the latent diffusion machine learning model trained to receive text as input and output an image based on the received text; identifying a number of channels in a decoder of the autoencoder, the decoder configured to receive latent features as input and to output images; identifying a first performance characteristic of the decoder; changing a node topology of the decoder based on the first performance characteristic to generate an updated decoder; and retraining the latent diffusion machine learning model using the updated decoder by performing operations comprising: inputting latent features into the updated decoder; receiving an outputted image from the updated decoder; and updating one or more weights of the updated decoder based on an assessment of the outputted image.
  20. A non-transitory computer-readable storage medium, the computer-readable storage medium including instructions that, when executed by a computer, cause the computer to perform operations comprising: identifying an autoencoder for a latent diffusion machine learning model, the latent diffusion machine learning model trained to receive text as input and output an image based on the received text; identifying a number of channels in a decoder of the autoencoder, the decoder configured to receive latent features as input and to output images; identifying a first performance characteristic of the decoder; changing a node topology of the decoder based on the first performance characteristic to generate an updated decoder; and retraining the latent diffusion machine learning model using the updated decoder by performing operations comprising: inputting latent features into the updated decoder; receiving an outputted image from the updated decoder; and updating one or more weights of the updated decoder based on an assessment of the outputted image.
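The retraining loop recited in claims 14-16 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function and parameter names are hypothetical, the weight update is a toy stand-in for a gradient step, and real decoders operate on image tensors rather than flat lists. It compares the outputs of the original and updated decoders on the same latent input and updates the pruned decoder until the mean squared error meets a threshold:

```python
def mse(a, b):
    """Mean squared error between two flat image vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def retrain_until_threshold(original_decoder, updated_decoder, update_weights,
                            sample_latents, threshold, max_steps=100):
    """Drive the updated (pruned) decoder toward the original decoder's
    outputs: both decoders see the same latents, the MSE between their
    images is measured, and the updated decoder's weights are adjusted
    until the error meets the threshold (or a step budget is exhausted)."""
    err = float("inf")
    for _ in range(max_steps):
        latents = sample_latents()          # latents derived from random noise
        target = original_decoder(latents)  # original decoder's image
        output = updated_decoder(latents)   # pruned decoder's image
        err = mse(output, target)
        if err <= threshold:
            break
        update_weights(err)                 # stand-in for a gradient step
    return err
```

In an actual training setup, `update_weights` would be a backpropagation step through the updated decoder with the MSE as the loss; here it is left abstract so the control flow of the claims stands out.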

Description

CLAIM OF PRIORITY

This application claims the benefit of priority to U.S. Provisional Application Ser. No. 63/504,563, filed on May 26, 2023, which is incorporated herein by reference in its entirety.

TECHNICAL FIELD

The present disclosure relates generally to machine learning models, and more specifically to text-to-image machine learning models.

BACKGROUND

As the popularity of Artificial Intelligence (AI) grows, companies use machine learning models in ways that are transforming how we process, analyze, and interact with visual data. The use of AI in image processing involves training algorithms, particularly deep learning models such as Convolutional Neural Networks (CNNs), to perform tasks ranging from low-level image manipulation to high-level understanding and generation of visual content. Prominent applications of AI in images include image classification, object detection, image segmentation, facial recognition, and style transfer.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

In the drawings, which are not necessarily drawn to scale, like numerals may describe similar components in different views. To identify the discussion of any particular element or act more easily, the most significant digit or digits in a reference number refer to the figure number in which that element is first introduced. Some non-limiting examples are illustrated in the figures of the accompanying drawings, in which:

FIG. 1 is a diagrammatic representation of a networked environment in which the present disclosure may be deployed, according to some examples.
FIG. 2 is a diagrammatic representation of an interaction system that has both client-side and server-side functionality, according to some examples.
FIG. 3 is a diagrammatic representation of a data structure as maintained in a database, according to some examples.
FIG. 4 illustrates an architecture for a latent diffusion model, according to some examples.
FIG. 5 illustrates an example routine for changing the node topology of a decoder for latent diffusion machine learning models, according to some examples.
FIG. 6 illustrates a latent diffusion machine learning model with an autoencoder to convert text to image, according to some examples.
FIG. 7 illustrates an updated decoder with removed channels and newly added channels, according to some examples.
FIG. 8 illustrates a comparison of the prior version of the decoder and the updated decoder for retraining, according to some examples.
FIG. 9 illustrates removing and adding channels efficiently to quickly identify an optimal decoder architecture, according to some examples.
FIG. 10 illustrates a modified decoder of a latent diffusion model, according to some examples.
FIG. 11 is a diagrammatic representation of a message, according to some examples.
FIG. 12 illustrates a system including a head-wearable apparatus with a selector input device, according to some examples.
FIG. 13 is a diagrammatic representation of a machine in the form of a computer system within which a set of instructions may be executed to cause the machine to perform any one or more of the methodologies discussed herein, according to some examples.
FIG. 14 is a block diagram showing a software architecture within which examples may be implemented.
FIG. 15 illustrates a machine-learning pipeline, according to some examples.
FIG. 16 illustrates training and use of a machine-learning program, according to some examples.

DETAILED DESCRIPTION

Text-to-image machine learning models generate images based on textual descriptions. These models use deep learning techniques and are trained on large datasets of paired text and image examples. During the training phase, the model learns the correlation between textual descriptions and corresponding images. Once trained, the model generates images from textual descriptions. The generated image may not be an exact replica of the input description but captures the essence and key elements described in the text.

Text-to-image diffusion models create images from natural language descriptions that rival the work of professional artists and photographers. However, these models are large, with intricate network architectures, and require many denoising iterations, making them computationally expensive and slow to run. Running such models requires high-end Graphics Processing Units (GPUs) and often relies on cloud-based inference, restricting scalability and accessibility. Cloud-based inference also involves sending user data to third-party servers, which raises privacy concerns because sensitive information is exposed to external entities. Users may be hesitant to share their data, especially when dealing with personal or confidential content. This approach is costly and has privacy implications.
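The denoising iterations discussed above are the main source of this computational cost. A heavily simplified sketch follows; the function and its arguments are illustrative assumptions, and real samplers such as DDPM or DDIM apply scheduled per-timestep coefficients rather than a fixed step size:

```python
def denoise(latent, predict_noise, steps, step_size=0.1):
    """Toy sketch of the iterative denoising loop: at each timestep the
    model estimates the noise in the latent, and a fraction of that
    estimate is subtracted. Each step is a full network forward pass,
    which is why many steps make inference slow."""
    for t in range(steps, 0, -1):
        noise = predict_noise(latent, t)  # one forward pass of the model
        latent = [x - step_size * n for x, n in zip(latent, noise)]
    return latent
```

Reducing the cost of the decoder that turns the final latent into pixels, as described in this disclosure, complements efforts to reduce the number of these denoising steps.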