CN-122029547-A - Hardware-aware efficient architecture for text-to-image diffusion models
Abstract
A processor-implemented method includes receiving text semantic input at a first stage of a neural network, the first stage including a first convolution block and not including an attention layer. The method receives a first output from the first stage at a second stage. The second stage includes a first downsampling block that includes a first attention layer and a second convolution block. The method receives a second output from the second stage at a third stage. The third stage includes a first upsampling block comprising a second attention layer and a first set of convolution blocks. The method receives, at a fourth stage, the first output from the first stage and a third output from the third stage. The fourth stage includes a second upsampling block that does not include an attention layer and includes a second set of convolution blocks. The method generates an image at the fourth stage based on the text semantic input.
Inventors
- S. M. Boser
- R. Gary Pali
- HOU QIQI
- ZHENG ZHIXIU
- S. Kardambi
- M. Hayat
- F.M. Polykeri
Assignees
- Qualcomm Incorporated
Dates
- Publication Date: 2026-05-12
- Application Date: 2024-09-04
- Priority Date: 2023-10-23
Claims (20)
- 1. An apparatus, the apparatus comprising: at least one memory; and at least one processor coupled to the at least one memory, the at least one processor configured to: receive text semantic input at a first stage of a neural network, the first stage including a first convolution block and not including an attention layer; receive a first output from the first stage at a second stage, the second stage comprising a first downsampling block comprising a first attention layer and a second convolution block; receive a second output from the second stage at a third stage, the third stage comprising a first upsampling block comprising a second attention layer and a first set of convolution blocks; receive the first output from the first stage and a third output from the third stage at a fourth stage, the fourth stage comprising a second upsampling block, the second upsampling block not comprising an attention layer and comprising a second set of convolution blocks; and generate an image at the fourth stage based on the text semantic input.
- 2. The apparatus of claim 1, wherein the neural network comprises a diffusion-based text-to-image generation model.
- 3. The apparatus of claim 1, wherein the neural network comprises a UNet.
- 4. The apparatus of claim 1, wherein the first stage comprises a first additional convolution block, the second stage comprises a second additional convolution block, the third stage comprises a third additional convolution block, and the fourth stage comprises a fourth additional convolution block.
- 5. The apparatus of claim 4, wherein the at least one processor is further configured to: train the neural network to obtain a converged neural network; and train a pruned neural network based on the converged neural network.
- 6. The apparatus of claim 5, wherein the converged neural network comprises a teacher neural network and the pruned neural network comprises a student neural network, the at least one processor being further configured to train the student neural network based on block-wise error calculations of each stage of the student neural network relative to a same stage of the teacher neural network.
- 7. A processor-implemented method, the processor-implemented method comprising: receiving text semantic input at a first stage of a neural network, the first stage including a first convolution block and not including an attention layer; receiving a first output from the first stage at a second stage, the second stage comprising a first downsampling block comprising a first attention layer and a second convolution block; receiving a second output from the second stage at a third stage, the third stage comprising a first upsampling block comprising a second attention layer and a first set of convolution blocks; receiving the first output from the first stage and a third output from the third stage at a fourth stage, the fourth stage comprising a second upsampling block, the second upsampling block not comprising an attention layer and comprising a second set of convolution blocks; and generating an image at the fourth stage based on the text semantic input.
- 8. The method of claim 7, wherein the neural network comprises a diffusion-based text-to-image generation model.
- 9. The method of claim 7, wherein the neural network comprises a UNet.
- 10. The method of claim 7, wherein the first stage comprises a first additional convolution block, the second stage comprises a second additional convolution block, the third stage comprises a third additional convolution block, and the fourth stage comprises a fourth additional convolution block.
- 11. The method of claim 10, the method further comprising: training the neural network to obtain a converged neural network; and training a pruned neural network based on the converged neural network.
- 12. The method of claim 11, wherein the converged neural network comprises a teacher neural network and the pruned neural network comprises a student neural network, the method further comprising training the student neural network based on block-wise error calculations of each stage of the student neural network relative to a same stage of the teacher neural network.
- 13. A non-transitory computer-readable medium having program code recorded thereon, the program code being executed by a processor and comprising: program code to receive text semantic input at a first stage of a neural network, the first stage including a first convolution block and not including an attention layer; program code to receive a first output from the first stage at a second stage, the second stage comprising a first downsampling block comprising a first attention layer and a second convolution block; program code to receive a second output from the second stage at a third stage, the third stage comprising a first upsampling block comprising a second attention layer and a first set of convolution blocks; program code to receive the first output from the first stage and a third output from the third stage at a fourth stage, the fourth stage comprising a second upsampling block, the second upsampling block not comprising an attention layer and comprising a second set of convolution blocks; and program code to generate an image at the fourth stage based on the text semantic input.
- 14. The non-transitory computer-readable medium of claim 13, wherein the neural network comprises a diffusion-based text-to-image generation model.
- 15. The non-transitory computer-readable medium of claim 13, wherein the neural network comprises a UNet.
- 16. The non-transitory computer-readable medium of claim 13, wherein the first stage comprises a first additional convolution block, the second stage comprises a second additional convolution block, the third stage comprises a third additional convolution block, and the fourth stage comprises a fourth additional convolution block.
- 17. The non-transitory computer-readable medium of claim 16, wherein the program code further comprises: program code to train the neural network to obtain a converged neural network; and program code to train a pruned neural network based on the converged neural network.
- 18. The non-transitory computer-readable medium of claim 17, wherein the converged neural network comprises a teacher neural network and the pruned neural network comprises a student neural network, the program code further comprising program code to train the student neural network based on block-wise error calculations of each stage of the student neural network relative to a same stage of the teacher neural network.
- 19. An apparatus, the apparatus comprising: means for receiving text semantic input at a first stage of a neural network, the first stage including a first convolution block and not including an attention layer; means for receiving a first output from the first stage at a second stage, the second stage comprising a first downsampling block comprising a first attention layer and a second convolution block; means for receiving a second output from the second stage at a third stage, the third stage comprising a first upsampling block comprising a second attention layer and a first set of convolution blocks; means for receiving, at a fourth stage, the first output from the first stage and a third output from the third stage, the fourth stage comprising a second upsampling block, the second upsampling block not comprising an attention layer and comprising a second set of convolution blocks; and means for generating an image at the fourth stage based on the text semantic input.
- 20. The apparatus of claim 19, wherein the neural network comprises a diffusion-based text-to-image generation model.
Description
Hardware-Aware Efficient Architecture for Text-to-Image Diffusion Models

Cross Reference to Related Applications

The present application claims priority from U.S. patent application Ser. No. 18/492,572, entitled "HARDWARE-AWARE EFFICIENT ARCHITECTURES FOR TEXT-TO-IMAGE DIFFUSION MODELS," filed on October 23, 2023, the disclosure of which is expressly incorporated herein by reference in its entirety.

Technical Field

Aspects of the present disclosure relate generally to machine learning, and more particularly to a hardware-aware efficient architecture for text-to-image diffusion models.

Background

An artificial neural network (ANN) may include an interconnected set of artificial neurons (e.g., neuron models). An ANN may be a computing device or a method performed by a computing device. Convolutional neural networks (CNNs) are one type of feedforward ANN. A convolutional neural network may include a set of neurons, where each neuron has a receptive field and the neurons collectively tile an input space. Convolutional neural networks, such as deep convolutional neural networks (DCNs), have numerous applications. In particular, these neural network architectures are used in technologies such as image recognition, speech recognition, acoustic scene classification, keyword spotting, autonomous driving, and other classification tasks.

A diffusion model is a generative model designed to transform easily generated data into more complex and realistic data through a series of reversible transformations. For example, a diffusion model may generate images from text prompts. A diffusion model may employ a CNN and attention layers, which help the model handle long sequences by focusing attention on specific portions of the data.
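The forward (noising) half of the diffusion process described above can be sketched in a few lines. This is a toy illustration using the standard DDPM-style parameterization as an assumption; the patent does not specify a particular noising schedule. The symbol `alpha_bar_t` is the usual cumulative noise-schedule coefficient, not a term from this disclosure.

```python
# Toy sketch of the forward diffusion process: clean data is progressively
# mixed with Gaussian noise; the generative model is trained to reverse it.
# Assumption: standard DDPM-style noising, not taken from the patent text.
import numpy as np

rng = np.random.default_rng(0)

def forward_diffuse(x0, alpha_bar_t):
    """Sample q(x_t | x_0): scale the clean sample and add Gaussian noise."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * noise

x0 = np.ones(8)                              # a "clean" toy sample
x_t = forward_diffuse(x0, alpha_bar_t=0.9)   # one noisy timestep
print(x_t.shape)  # (8,)
```

As `alpha_bar_t` decreases toward 0 over timesteps, the sample approaches pure noise; the denoising network (the UNet discussed in this disclosure) learns the reverse mapping.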
Because of the complex computations specified by the CNN and the attention layers of a diffusion model, diffusion models require a large amount of computation, resulting in a tradeoff between training time and the quality of the generated data.

Disclosure of Invention

Aspects of the present disclosure relate to an apparatus. The apparatus includes one or more memories and one or more processors coupled to the one or more memories. The one or more processors are configured to receive text semantic input at a first stage of a neural network. The first stage includes a first convolution block and does not include an attention layer. The one or more processors are further configured to receive a first output from the first stage at a second stage. The second stage includes a first downsampling block that includes a first attention layer and a second convolution block. The one or more processors are further configured to receive a second output from the second stage at a third stage. The third stage includes a first upsampling block comprising a second attention layer and a first set of convolution blocks. The one or more processors are still further configured to receive, at a fourth stage, the first output from the first stage and a third output from the third stage. The fourth stage includes a second upsampling block that does not include an attention layer and includes a second set of convolution blocks. The one or more processors are further configured to generate an image at the fourth stage based on the text semantic input.

In other aspects of the disclosure, a method includes receiving text semantic input at a first stage of a neural network. The first stage includes a first convolution block and does not include an attention layer. The method also includes receiving, at a second stage, a first output from the first stage. The second stage includes a first downsampling block that includes a first attention layer and a second convolution block.
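The four-stage data flow described above can be sketched with stand-in operations. Everything below is an assumption for illustration: `conv`, `attention`, `down`, and `up` are toy 1-D stand-ins for the real blocks, and the crop in stage 4 is a simplification to make the skip connection line up. The essential structure follows the disclosure: attention appears only in the middle stages (2 and 3), the outermost stages (1 and 4) are convolution-only, and stage 1's output is skip-connected into stage 4.

```python
# Minimal sketch of the four-stage layout (not Qualcomm's implementation):
# stage 1: conv only, no attention; stage 2: downsample + attention + conv;
# stage 3: upsample + attention + conv; stage 4: upsample + conv, no attention,
# with a skip connection from stage 1.
import numpy as np

def conv(x):
    # stand-in for a convolution block: 3-tap moving average, same length
    return np.convolve(x, np.ones(3) / 3.0, mode="same")

def attention(x):
    # stand-in for a self-attention layer over a 1-D sequence
    scores = np.outer(x, x) / np.sqrt(len(x))
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

def down(x):  # downsample by 2
    return x[::2]

def up(x):    # nearest-neighbour upsample by 2
    return np.repeat(x, 2)

def unet_forward(text_emb):
    s1 = conv(text_emb)                 # stage 1: conv block, no attention
    s2 = conv(attention(down(s1)))      # stage 2: downsampling + attention + conv
    s3 = conv(attention(up(s2)))        # stage 3: upsampling + attention + conv
    s4 = conv(up(s3)[: len(s1)] + s1)   # stage 4: upsampling + conv + skip, no
    return s4                           # attention (crop is a toy simplification)

img = unet_forward(np.linspace(0.0, 1.0, 16))
print(img.shape)  # (16,)
```

Omitting attention from the first and last stages, where spatial resolution is highest and attention is therefore most expensive, is what makes the architecture hardware-friendly relative to a UNet with attention at every stage.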
The method further includes receiving a second output from the second stage at a third stage. The third stage includes a first upsampling block comprising a second attention layer and a first set of convolution blocks. The method still further includes receiving, at a fourth stage, the first output from the first stage and a third output from the third stage. The fourth stage includes a second upsampling block that does not include an attention layer and includes a second set of convolution blocks. The method also includes generating an image at the fourth stage based on the text semantic input.

In other aspects of the disclosure, a non-transitory computer-readable medium having program code recorded thereon is disclosed. The program code is executed by a processor and includes program code to receive text semantic input at a first stage of a neural network. The first stage includes a first convolution block and does not include an attention layer. The program code also includes program code to receive, at a second stage, a first output from the first stage. The second stage includes a first downsampling block that includes a first attention layer and a second convolution block. The program code further includes program code to receive a second output from the second stage at the third stage.
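The block-wise distillation recited in claims 6, 12, and 18 trains each stage of the pruned student against the output of the same stage of the converged teacher. A minimal sketch follows; the plain per-stage mean-squared error is an assumption, since the claims speak only of "block-wise error calculations" without fixing the metric.

```python
# Hedged sketch of block-wise distillation: sum per-stage errors between the
# student's stage outputs and the matching stage outputs of the teacher.
# Assumption: MSE as the error metric (the claims do not specify one).
import numpy as np

def blockwise_distillation_loss(teacher_stage_outputs, student_stage_outputs):
    """Sum of per-stage mean-squared errors between matching stages."""
    assert len(teacher_stage_outputs) == len(student_stage_outputs)
    total = 0.0
    for t, s in zip(teacher_stage_outputs, student_stage_outputs):
        total += float(np.mean((t - s) ** 2))
    return total

# Two toy stages: the student matches stage 1 exactly, misses stage 2 by 0.5.
teacher = [np.ones(4), np.zeros(4)]
student = [np.ones(4), np.ones(4) * 0.5]
loss = blockwise_distillation_loss(teacher, student)
print(loss)  # 0.25
```

Matching stage-by-stage, rather than only at the final output, gives the pruned student a training signal at every resolution of the four-stage network.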