
KR-20260066072-A - Multistage distillation of a diffusion model by moment matching


Abstract

Methods, systems, and computer programs for training a diffusion model used to generate data frames, such as frames of image data. The described technique trains, or "distills," a student diffusion model using a teacher diffusion model; the student diffusion model can then generate data frames using significantly fewer steps than the teacher diffusion model, for example fewer than 20 time steps. Data frames can be generated using the trained student diffusion model.

Inventors

  • Heek, Jonathan
  • Salimans, Tim
  • Hoogeboom, Emiel

Assignees

  • GDM Holding LLC

Dates

Publication Date
2026-05-12
Application Date
2025-05-20
Priority Date
2024-05-22

Claims (20)

  1. A computer-implemented method of training a diffusion model, the method comprising: obtaining a teacher diffusion model configured to process a noisy frame corresponding to a diffusion time between an initial time and a final time to generate a teacher prediction frame for denoising the noisy frame; initializing a student diffusion model, the student diffusion model being configured to process a noisy frame corresponding to a diffusion time to generate a student prediction frame for denoising the noisy frame; and training the student diffusion model using the teacher diffusion model, wherein the training comprises: obtaining a training frame from a training dataset; determining a target time and a training time that lies within the duration of one time step after the target time; determining a noisy training frame for the training time by processing the training frame using a noise schedule that defines a variation of a frame noise level over time; generating a student prediction frame that defines a predicted denoised frame for the final time by processing the noisy training frame using the student diffusion model; determining a less-noisy training frame corresponding to the target time by processing the noisy training frame and the predicted denoised frame for the final time using the noise schedule; and updating learnable parameters of the student diffusion model using a student model training objective that depends on the predicted denoised frame for the final time and the less-noisy training frame.
  2. The computer-implemented method of claim 1, wherein updating the learnable parameters of the student diffusion model comprises backpropagating a gradient of the student model training objective, and applying a stop-gradient to the less-noisy training frame so that the gradient is not propagated to the student diffusion model through the less-noisy training frame.
  3. The computer-implemented method of claim 1 or 2, further comprising: generating a teacher predicted denoised frame by processing the less-noisy training frame corresponding to the target time using the teacher diffusion model; and updating the learnable parameters of the student diffusion model using a student model training objective that depends on the predicted denoised frame for the final time and the teacher predicted denoised frame.
  4. The computer-implemented method of any one of claims 1 to 3, further comprising: maintaining an auxiliary diffusion model configured to process a noisy frame corresponding to a diffusion time to generate an auxiliary prediction frame for denoising the noisy frame; generating an auxiliary prediction frame that defines an auxiliary predicted denoised frame for the final time by processing the less-noisy training frame using the auxiliary diffusion model; generating a teacher prediction frame that defines a teacher predicted denoised frame for the final time by processing the less-noisy training frame using the teacher diffusion model; updating learnable parameters of the auxiliary diffusion model using an auxiliary model training objective that depends on a difference between the predicted denoised frame for the final time and the auxiliary predicted denoised frame for the final time; and updating the learnable parameters of the student diffusion model using a student model training objective that depends on i) the predicted denoised frame for the final time and ii) a difference between the auxiliary predicted denoised frame for the final time and the teacher predicted denoised frame for the final time.
  5. The computer-implemented method of claim 4, wherein updating the learnable parameters of the student diffusion model comprises backpropagating a gradient of the student model training objective, and applying a stop-gradient to the difference between the auxiliary predicted denoised frame for the final time and the teacher predicted denoised frame for the final time.
  6. The computer-implemented method of claim 4 or 5, further comprising normalizing the auxiliary model training objective using the difference between the auxiliary predicted denoised frame for the final time and the teacher predicted denoised frame for the final time.
  7. The computer-implemented method of any one of claims 1 to 3, comprising: obtaining a first training frame and a second training frame from the training dataset; determining a noisy first training frame for the training time by processing the first training frame using the noise schedule; determining a noisy second training frame for the training time by processing the second training frame using the noise schedule; generating a first student prediction frame that defines a first predicted denoised frame for the final time by processing the noisy first training frame using the student diffusion model; generating a second student prediction frame that defines a second predicted denoised frame for the final time by processing the noisy second training frame using the student diffusion model; determining a second less-noisy training frame corresponding to the target time by processing the noisy second training frame and the second predicted denoised frame for the final time using the noise schedule; generating an auxiliary prediction frame by processing the second less-noisy training frame using the teacher diffusion model; determining a teacher model gradient of an auxiliary objective function that depends on a difference between the second predicted denoised frame for the final time and the auxiliary prediction frame; and updating the learnable parameters of the student diffusion model using a student model training objective that depends on the first predicted denoised frame for the final time and the teacher model gradient.
  8. The computer-implemented method of claim 7, comprising determining the student model training objective as depending on a product of a vector proportional to the first predicted denoised frame for the final time and the teacher model gradient.
  9. The computer-implemented method of claim 7 or 8, further comprising: determining a first less-noisy training frame corresponding to the target time by processing the noisy first training frame and the first predicted denoised frame for the final time using the noise schedule; determining the teacher model gradient with respect to the learned parameters of the teacher diffusion model; determining a scaled teacher model gradient by multiplying the teacher model gradient by a scaling matrix; determining a Jacobian with respect to the learned parameters of the teacher diffusion model evaluated at the first less-noisy training frame; obtaining a training product by determining a product of the Jacobian with respect to the learned parameters of the teacher diffusion model and the scaled teacher model gradient; and updating the learnable parameters of the student diffusion model using a student model training objective that depends on a product of the first predicted denoised frame for the final time and the training product.
  10. The computer-implemented method of claim 9, wherein updating the learnable parameters of the student diffusion model comprises applying a stop-gradient to the training product.
  11. The computer-implemented method of any one of claims 7 to 10, comprising: obtaining a batch of first training frames from the training dataset; obtaining a batch of second training frames from the training dataset; determining the student model training objective from the batch of first training frames; and determining the teacher model gradient from the batch of second training frames.
  12. The computer-implemented method of any one of claims 1 to 11, wherein the student prediction frame comprises the predicted denoised frame for the final time, and wherein generating the predicted denoised frame for the final time comprises processing the noisy training frame using the student diffusion model.
  13. The computer-implemented method of any one of claims 1 to 12, wherein the teacher diffusion model and the student diffusion model each comprise a respective denoising neural network configured to process a diffusion time and a noisy frame corresponding to the diffusion time to generate a respective teacher prediction frame or student prediction frame.
  14. A computer-implemented method of training a diffusion model, the method comprising: obtaining a teacher diffusion model configured to process a noisy frame corresponding to a diffusion time between an initial time and a final time to generate a teacher prediction frame for denoising the noisy frame; initializing a student diffusion model, the student diffusion model being configured to process a noisy frame corresponding to a diffusion time to generate a student prediction frame for denoising the noisy frame; and training the student diffusion model using the teacher diffusion model by matching one or more statistical moments of a distribution of prediction frames from the student diffusion model to one or more statistical moments of a distribution of prediction frames from the teacher diffusion model.
  15. The computer-implemented method of claim 14, comprising training the student diffusion model using a moment matching objective that depends on the distribution of prediction frames from the student diffusion model and the distribution of prediction frames from the teacher diffusion model, and that is minimized when training the teacher diffusion model on the prediction frames from the student diffusion model does not change the learned parameters of the teacher diffusion model.
  16. The computer-implemented method of claim 14 or 15, comprising obtaining a training frame from a training dataset, wherein matching the one or more statistical moments comprises matching a statistical moment defined by a denoising loss determined by a difference between a first prediction frame generated by the student diffusion model and a second prediction frame generated by the teacher diffusion model.
  17. The computer-implemented method of claim 16, comprising: obtaining the first prediction frame by processing a noisy version of the training frame using the student diffusion model; and obtaining the second prediction frame from the noisy version of the training frame and the first prediction frame according to a noise schedule that defines a variation of a frame noise level over time.
  18. The computer-implemented method of claim 17, wherein obtaining the second prediction frame comprises: determining a less-noisy training frame from the first prediction frame and the noisy version of the training frame according to the noise schedule; and generating the second prediction frame by processing the less-noisy training frame using the teacher diffusion model.
  19. The computer-implemented method of claim 14 or 15, comprising obtaining a training frame from a training dataset, wherein matching the one or more statistical moments comprises matching a statistical moment defined by a denoising loss determined by a difference between a first prediction frame generated by the student diffusion model and a second prediction frame generated using an auxiliary diffusion model, and wherein the method further comprises training the auxiliary diffusion model to generate prediction frames that match prediction frames generated by the teacher diffusion model.
  20. The computer-implemented method of claim 19, comprising: determining a less-noisy training frame from a noisy version of the training frame and a prediction frame from the teacher diffusion model generated by processing the training frame; generating an auxiliary prediction frame by processing the less-noisy training frame using the auxiliary diffusion model; generating a teacher prediction frame by processing the noisy training frame using the teacher diffusion model; and training the auxiliary diffusion model using an objective that depends on a difference between the auxiliary prediction frame and the teacher prediction frame.
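For orientation only (this sketch is not part of the claims), the single distillation step recited in claims 1 and 3 might look roughly as follows in plain NumPy. Everything here is an assumption for illustration: the cosine variance-preserving noise schedule, the fixed step size, and all function and variable names are hypothetical, the `student` and `teacher` callables stand in for full denoising neural networks, and the gradient backpropagation and stop-gradient of claim 2 are elided.

```python
import numpy as np

def noise_schedule(t):
    """Cosine variance-preserving schedule (an assumption; the claims
    only require some schedule defining the frame noise level over time)."""
    return np.cos(np.pi * t / 2), np.sin(np.pi * t / 2)  # (alpha_t, sigma_t)

def ddim_step(z_t, x_pred, t, s):
    """Move the noisy frame z_t to the less-noisy target time s < t,
    given the predicted fully denoised frame x_pred."""
    a_t, sig_t = noise_schedule(t)
    a_s, sig_s = noise_schedule(s)
    eps_pred = (z_t - a_t * x_pred) / sig_t   # implied noise estimate
    return a_s * x_pred + sig_s * eps_pred    # less-noisy frame z_s

def distillation_step(student, teacher, x, rng, step_size=0.125):
    """One training step in the spirit of claims 1 and 3.
    `student` and `teacher` are callables (z_t, t) -> predicted denoised x."""
    s = rng.uniform(0.0, 1.0 - step_size)      # target time
    t = s + step_size                          # training time, one step later
    a_t, sig_t = noise_schedule(t)
    z_t = a_t * x + sig_t * rng.standard_normal(x.shape)  # noisy training frame
    x_student = student(z_t, t)                # predicted denoised frame
    z_s = ddim_step(z_t, x_student, t, s)      # less-noisy training frame
    x_teacher = teacher(z_s, s)                # teacher target (as in claim 3)
    # A squared-error objective; the stop-gradient on the teacher branch
    # is implicit here because no gradients are computed in this sketch.
    return float(np.mean((x_student - x_teacher) ** 2))
```

If the student already predicts exactly what the teacher would recover from the less-noisy frame, the objective is zero, which matches the intuition that a perfectly distilled student leaves nothing for the teacher to correct.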

Description

Multistage distillation of a diffusion model by moment matching

This specification relates to processing data using machine learning models. A neural network is a machine learning model that employs one or more layers of non-linear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to the output layer. The output of each hidden layer is used as input to the next layer of the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.

Figure 1 illustrates an example system for training a diffusion model. Figure 2 is a flowchart of an example process for training a diffusion model. Figure 3 is a flowchart of an example implementation of the process of Figure 2. Figure 4 illustrates the operation of a process for training a diffusion model. Figure 5 is a flowchart of an example process for generating a data frame using a trained diffusion model. Figure 6 illustrates an example of an image generated by a trained diffusion model as described in this application. Like reference numbers and designations in the various drawings indicate like elements.

This specification describes a computer-implemented method for training a diffusion model, in particular a student diffusion model, using a teacher diffusion model. This technique is generally referred to as distillation. The teacher diffusion model is configured to process a noisy frame corresponding to a diffusion time, and, depending on the implementation, also the diffusion time itself, to generate a teacher prediction frame for denoising the noisy frame. Processing the diffusion time can improve the quality of the teacher prediction frame.
Generally, a prediction frame is a data frame generated by a diffusion model described in this application, in particular by the neural network of the diffusion model, when the diffusion model processes an input data frame. Typically, the prediction frame has the same dimensions as the input data frame. In this application, references to sampling or processing the values of a data frame refer to sampling or processing the values of the data elements defined by the data frame. In some implementations, the data elements defined by a data frame may be, for example, pixel values of an image frame or audio signal values of an audio frame (e.g., instantaneous amplitude values).

In some implementations, instead of operating in an output space such as pixel space, the teacher and student models may operate in a latent space. That is, the teacher diffusion model and the student diffusion model may be latent diffusion models. The data elements defined by a data frame may, in this case, be latent representations of, for example, the pixel values of an image frame or the audio signal values of an audio frame. The technique described in this application can be used without modification whether it operates in an output space, i.e., a space in which output data items are generated, or in a latent space, i.e., a space of latent representations of the output space.

As described in this application, a trained diffusion model, in particular a student diffusion model, can be used to generate a data frame by performing a series of denoising steps. The generated data frame may be a data frame in the output space, in which case its values may be values of a suitable type of data item, e.g., image pixel values or amplitude values of an audio signal, or it may be a data frame in a latent space, in which case its values may be values of a latent representation of an output data item in the output space.
Typically, such a latent space has lower dimensionality than the output space. When generating data frames in the latent space, the system described in this application can generate a final data frame in the output space by processing the latent-space data frame using a decoder neural network, for example a decoder neural network pre-trained in an auto-encoder framework. During training, the system can encode training data items in the output space using an encoder neural network, for example a neural network pre-trained together with the decoder in an auto-encoder framework, thereby generating latent-space training data items for the diffusion neural network. In such implementations, the data frame is a latent representation, and the values of the representation are, for example, learned latent values rather than pixel values when the data frame represents an image.

Generally, a prediction data frame from the teacher diffusion model or the student diffusion model can be used to generate a less-noisy version of the input data frame. For example, in some implementations, the teacher prediction frame includes a prediction of the noise i