CN-122029548-A - Distillation for guided diffusion model

Abstract

A method for training a diffusion model includes randomly selecting a teacher model of a set of teacher models for each iteration of a step-wise distillation training process. The method also includes applying a cropped input space within the step-wise distillation of the randomly selected teacher model at each iteration. The method further includes updating parameters of the diffusion model based on the guidance from the randomly selected teacher model at each iteration.
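The training loop described in the abstract can be sketched as follows. This is an illustrative sketch, not the patented implementation: the linear student/teacher models, batch sizes, learning rate, and MSE matching loss are all assumptions chosen to keep the example self-contained; only the random per-iteration teacher selection and the parameter update from the selected teacher's guidance come from the text.

```python
import random
import numpy as np

def distill_step(W_student, teachers, x, lr, rng):
    """One step-wise distillation iteration (illustrative sketch): randomly
    pick a teacher, regress the linear student onto the teacher's guidance,
    and take one gradient step on the student's parameters."""
    W_teacher = rng.choice(teachers)      # random teacher per iteration
    target = x @ W_teacher                # teacher guidance on this batch
    pred = x @ W_student                  # student prediction
    err = pred - target
    loss = float(np.mean(err ** 2))
    grad = 2.0 * x.T @ err / err.size     # dL/dW for the MSE matching loss
    return W_student - lr * grad, loss

rng = random.Random(0)
np_rng = np.random.default_rng(0)
teachers = [np_rng.standard_normal((4, 3)), np_rng.standard_normal((4, 3))]
W = 5.0 * np_rng.standard_normal((4, 3))  # student starts far from the teachers
losses = []
for _ in range(200):
    x = np_rng.standard_normal((16, 4))   # fresh (cropped) input batch
    W, loss = distill_step(W, teachers, x, 0.1, rng)
    losses.append(loss)
```

Because a different teacher may be drawn on each iteration, the student is pulled toward a consensus of the teacher set rather than any single teacher.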

Inventors

  • R. Gary Pali
  • S. M. Boser
  • J. Zheng
  • Q. Hou
  • S. Kadabi
  • M. Hayat
  • F. M. Policles

Assignees

  • QUALCOMM Incorporated

Dates

Publication Date
2026-05-12
Application Date
2024-08-29
Priority Date
2023-10-23

Claims (20)

  1. An apparatus for training a diffusion model, the apparatus comprising: one or more processors; and one or more memories coupled with the one or more processors and storing instructions that, when executed by the one or more processors, are operable to cause the apparatus to: randomly select a teacher model of a set of teacher models for each iteration of a step-wise distillation training process; apply a cropped input space within the step-wise distillation of the randomly selected teacher model at each iteration; and update parameters of the diffusion model based on guidance from the randomly selected teacher model at each iteration.
  2. The apparatus of claim 1, wherein the set of teacher models includes a guidance conditioned teacher model and a classifier-free guidance (CFG) teacher model.
  3. The apparatus of claim 1, wherein execution of the instructions further causes the apparatus to apply a signal-to-noise ratio (SNR) loss according to a schedule during the step-wise distillation training process.
  4. The apparatus of claim 3, wherein the schedule disables the SNR loss for at least a first iteration or a first gradient update within the step-wise distillation training process.
  5. The apparatus of claim 1, wherein execution of the instructions further causes the apparatus to perform an end-to-end fine-tuning process on the diffusion model at the end of the step-wise distillation training process to regularize the diffusion model.
  6. The apparatus of claim 5, wherein the end-to-end fine-tuning process regularizes score function estimates.
  7. The apparatus of claim 1, wherein execution of the instructions further causes the apparatus to perform diffusion inference based on training the diffusion model.
  8. A method for training a diffusion model, the method comprising: randomly selecting a teacher model of a set of teacher models for each iteration of a step-wise distillation training process; applying a cropped input space within the step-wise distillation of the randomly selected teacher model at each iteration; and updating parameters of the diffusion model based on guidance from the randomly selected teacher model at each iteration.
  9. The method of claim 8, wherein the set of teacher models includes a guidance conditioned teacher model and a classifier-free guidance (CFG) teacher model.
  10. The method of claim 8, further comprising applying a signal-to-noise ratio (SNR) loss according to a schedule during the step-wise distillation training process.
  11. The method of claim 10, wherein the schedule disables the SNR loss for at least a first iteration or a first gradient update within the step-wise distillation training process.
  12. The method of claim 8, further comprising performing an end-to-end fine-tuning process on the diffusion model at the end of the step-wise distillation training process to regularize the diffusion model.
  13. The method of claim 12, wherein the end-to-end fine-tuning process regularizes score function estimates.
  14. The method of claim 8, further comprising performing diffusion inference based on training the diffusion model.
  15. A non-transitory computer-readable medium having program code recorded thereon for training a diffusion model, the program code being executed by a processor and comprising: program code to randomly select a teacher model of a set of teacher models for each iteration of a step-wise distillation training process; program code to apply a cropped input space within the step-wise distillation of the randomly selected teacher model at each iteration; and program code to update parameters of the diffusion model based on guidance from the randomly selected teacher model at each iteration.
  16. The non-transitory computer-readable medium of claim 15, wherein the set of teacher models includes a guidance conditioned teacher model and a classifier-free guidance (CFG) teacher model.
  17. The non-transitory computer-readable medium of claim 15, further comprising program code to apply a signal-to-noise ratio (SNR) loss according to a schedule during the step-wise distillation training process.
  18. The non-transitory computer-readable medium of claim 17, wherein the schedule disables the SNR loss for at least a first iteration or a first gradient update within the step-wise distillation training process.
  19. The non-transitory computer-readable medium of claim 15, further comprising program code to perform an end-to-end fine-tuning process on the diffusion model at the end of the step-wise distillation training process to regularize the diffusion model.
  20. The non-transitory computer-readable medium of claim 19, wherein the end-to-end fine-tuning process regularizes score function estimates.
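The SNR loss schedule of claims 3-4 (and 10-11, 17-18) can be sketched as a simple weighting function. The clipped-SNR weighting after the initial disabled updates is my assumption (a common truncated-SNR convention), not something the claims specify; only "disabled for at least a first iteration or a first gradient update" comes from the text.

```python
def snr_loss_weight(step, snr, disable_first_n=1, snr_clip=5.0):
    """Hypothetical schedule for the SNR loss term: disabled for the first
    gradient update(s) per claims 3-4, then applied. The clipping constant
    is an illustrative assumption, not taken from the patent."""
    if step < disable_first_n:
        return 0.0                 # schedule disables the SNR loss early on
    return min(snr, snr_clip)      # truncated-SNR-style weighting thereafter

# Usage: weight the SNR term of the distillation loss by this factor.
weights = [snr_loss_weight(s, 12.0) for s in range(3)]
```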

Description

Distillation for guided diffusion model

Cross Reference to Related Applications

The present application claims priority to U.S. patent application Ser. No. 18/492,508, filed October 23, 2023, entitled "DISTILLATION FOR GUIDED DIFFUSION MODELS", the disclosure of which is expressly incorporated by reference in its entirety.

Technical Field

Aspects of the present disclosure relate generally to improving distillation of guided diffusion models.

Background

An artificial neural network may include an interconnected set of artificial neurons (e.g., neuron models). An artificial neural network (ANN) may be a computing device, or may be represented as a method to be performed by a computing device. Convolutional neural networks (CNNs) are one type of feedforward ANN. A convolutional neural network may include a set of neurons, where each neuron has a receptive field and the neurons collectively tile an input space. Convolutional neural networks, such as deep convolutional networks (DCNs), have numerous applications. In particular, these neural network architectures are used in technologies such as image recognition, speech recognition, acoustic scene classification, keyword spotting, autonomous driving, and other classification tasks.

In machine learning and data generation, diffusion refers to a class of generative models that transform data through a sequence of reversible transformations. These generative models may be referred to as diffusion models. During the diffusion process, a diffusion model begins with a distribution (typically a Gaussian distribution) and gradually transforms samples from it toward the desired data distribution, facilitating tasks such as image synthesis and denoising. Diffusion models require significant computational resources, such as power, memory, and/or processor load, resulting in a tradeoff between training time and the quality of the generated data.
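The background's description of gradually transforming between a Gaussian distribution and the data distribution corresponds to the standard Gaussian forward noising process; a minimal sketch follows, with an illustrative (not patent-specified) noise schedule.

```python
import numpy as np

def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) for the standard Gaussian forward process:
    x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(1)
alpha_bar = np.linspace(1.0, 0.01, 100)  # illustrative monotone schedule
x0 = rng.standard_normal((8, 16))        # stand-in for clean data
x_early = forward_diffuse(x0, 1, alpha_bar, rng)   # nearly clean
x_late = forward_diffuse(x0, 99, alpha_bar, rng)   # nearly pure noise
```

At small t the sample stays close to the data; at the final t it is almost entirely Gaussian noise, which is the starting point the reverse (generative) process learns to invert.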
Disclosure of Invention

Some aspects of the present disclosure relate to a method for training a diffusion model, the method comprising compressing the diffusion model by removing one or more model parameters and/or one or more giga multiply-accumulate operations (GMACs). The method further includes performing guided conditioning to train the compressed diffusion model, the guided conditioning combining the conditional output and the unconditional output from the respective teacher model. The method also includes performing step-wise distillation on the compressed diffusion model after the guided conditioning.

Some other aspects of the disclosure relate to an apparatus comprising means for compressing a diffusion model by removing one or more model parameters and/or one or more GMACs. The apparatus further includes means for performing guided conditioning to train the compressed diffusion model, the guided conditioning combining the conditional output and the unconditional output from the respective teacher model. The apparatus also includes means for performing step-wise distillation on the compressed diffusion model after the guided conditioning.

In some other aspects of the present disclosure, a non-transitory computer-readable medium having program code recorded thereon is disclosed. The program code is executed by a processor and includes program code to compress a diffusion model by removing one or more model parameters and/or one or more GMACs. The program code also includes program code to perform guided conditioning to train the compressed diffusion model, the guided conditioning combining the conditional output and the unconditional output from the respective teacher model. The program code further includes program code to perform step-wise distillation on the compressed diffusion model after the guided conditioning.
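The combining of "the conditional output and the unconditional output from the respective teacher model" matches the standard classifier-free guidance blend, sketched below. The exact combination rule used in the patent may differ; this is the conventional CFG form, shown as an assumption.

```python
import numpy as np

def cfg_combine(eps_cond, eps_uncond, guidance_scale):
    """Classifier-free guidance: blend a teacher's conditional and
    unconditional outputs into one guided prediction (standard CFG form;
    the patent's precise rule is not specified here)."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Usage: scale 1.0 recovers the conditional output; 0.0 the unconditional.
guided = cfg_combine(np.ones(3), np.zeros(3), 7.5)
```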
Additionally, some other aspects of the disclosure relate to an apparatus having one or more processors and one or more memories coupled with the one or more processors and storing instructions that, when executed by the one or more processors, are operable to cause the apparatus to compress a diffusion model by removing one or more model parameters and/or one or more GMACs. Execution of the instructions also causes the apparatus to perform guided conditioning to train the compressed diffusion model, the guided conditioning combining the conditional output and the unconditional output from the respective teacher model. Execution of the instructions further causes the apparatus to perform step-wise distillation on the compressed diffusion model after the guided conditioning.

In some aspects of the disclosure, a method for training a diffusion model includes randomly selecting a teacher model of a set of teacher models for each iteration of a step-wise distillation training process. The method further includes applying a cropped input space within the step-wise distillation of the randomly selected teacher model at each iteration. The method also includes updating parameters of the diffusion model based on the guidance from the randomly selected teacher model at each iteration.