CN-121998853-A - Method, device, equipment and storage medium for training image generation model


Abstract

Embodiments of the present disclosure relate to a method, apparatus, device, and storage medium for training an image generation model. The method includes: performing a multi-round denoising process on an initial noise based on a sample prompt word to generate a reference image; determining a plurality of sampling results corresponding to the multi-round denoising process, and determining a target noise; processing the reference image with an image generation model to determine a predicted noise of the image generation model at a target time step, wherein the target time step is determined by sampling from a plurality of time steps; and adjusting parameters of the image generation model based on a difference between the predicted noise and the target noise. In this way, embodiments of the present disclosure can improve the efficiency of model processing.

Inventors

XIA XIN
Shao Huiyang
XIAO XUEFENG

Assignees

Beijing Zitiao Network Technology Co., Ltd. (北京字跳网络技术有限公司)

Dates

Publication Date: 2026-05-08
Application Date: 2024-11-07

Claims (11)

1. A method of training an image generation model, comprising: performing a multi-round denoising process on an initial noise based on a sample prompt word to generate a reference image; determining a plurality of sampling results corresponding to the multi-round denoising process, and determining a target noise; processing the reference image with the image generation model to determine a prediction noise of the image generation model at a target time step, wherein the target time step is determined from a plurality of time steps; and adjusting parameters of the image generation model based on a difference between the prediction noise and the target noise.
2. The method of claim 1, further comprising: determining weight information corresponding to the plurality of time steps using a time step sampler, the weight information indicating a degree of influence of the plurality of time steps on a noise adding process; and determining the target time step from the plurality of time steps based on the weight information.
3. The method of claim 2, wherein determining the target time step from the plurality of time steps based on the weight information comprises: constructing a time distribution based on the weight information; and sampling the target time step from the time distribution.
4. The method of claim 2, further comprising: training the time step sampler based on a plurality of losses corresponding to the multi-round denoising process.
5. The method of claim 2, further comprising: optimizing the time step sampler based on the reference image, the prediction noise, and the sampled target time step.
6. The method of claim 1, wherein determining the plurality of sampling results corresponding to the multi-round denoising process and determining the target noise comprises: determining the target noise based on an expected value of the plurality of sampling results.
7. The method of claim 1, wherein the initial noise is a first initial noise, the method further comprising: acquiring a target prompt word; and performing at least one round of a denoising process on a second initial noise by using the trained image generation model based on the target prompt word, so as to generate a target image.
8. The method of claim 1, wherein the image generation model is a diffusion model.
9. An apparatus for training an image generation model, comprising: a generation module configured to perform a multi-round denoising process on an initial noise based on a sample prompt word to generate a reference image; a determining module configured to determine a plurality of sampling results corresponding to the multi-round denoising process, and determine a target noise; a processing module configured to process the reference image with the image generation model to determine a prediction noise of the image generation model at a target time step, wherein the target time step is determined from a plurality of time steps; and an adjustment module configured to adjust parameters of the image generation model based on a difference between the prediction noise and the target noise.
10. An electronic device, comprising: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, wherein the instructions, when executed by the at least one processing unit, cause the electronic device to perform the method of any one of claims 1 to 8.
11. A computer readable storage medium having stored thereon a computer program executable by a processor to implement the method of any of claims 1 to 8.
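
As an illustration of claims 2, 3 and 6 only, the following minimal Python sketch shows one way weight information could be normalized into a time distribution, a target time step sampled from it, and a target noise taken as the element-wise mean of several sampling results. All function names, the example weights, and the example time steps are hypothetical; the claims do not specify how the time step sampler produces its weights or what form the sampling results take, so fixed lists of numbers are assumed here.

```python
import random


def build_time_distribution(weights):
    """Normalize non-negative per-time-step weights into probabilities."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def sample_target_time_step(time_steps, weights, rng):
    """Draw one target time step from the distribution built from the weights."""
    probs = build_time_distribution(weights)
    return rng.choices(time_steps, weights=probs, k=1)[0]


def expected_target_noise(sampling_results):
    """Element-wise mean of the sampling results (assumed: equal-length lists)."""
    count = len(sampling_results)
    return [sum(values) / count for values in zip(*sampling_results)]


# Hypothetical values for illustration; a real sampler would output the weights.
rng = random.Random(0)
time_steps = [999, 749, 499, 249]
weights = [0.1, 0.2, 0.3, 0.4]

probs = build_time_distribution(weights)
target_step = sample_target_time_step(time_steps, weights, rng)
target_noise = expected_target_noise([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

In this sketch a time step with a larger weight is simply drawn more often; whether the actual sampler is a learned network or a lookup table is left open by the claims.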

Description

Method, device, equipment and storage medium for training image generation model

Technical Field

Example embodiments of the present disclosure relate generally to the field of computers and, more particularly, to a method, apparatus, device, and computer-readable storage medium for training an image generation model.

Background

With the development of computer technology, generative artificial intelligence technology is gradually being applied in various fields. For example, a diffusion model can generate content such as an image by gradually adding noise and removing noise. For generative models, the processing efficiency of the model is a focus of attention.

Disclosure of Invention

In a first aspect of the present disclosure, a method of training an image generation model is provided. The method includes: performing a multi-round denoising process on an initial noise based on a sample prompt word to generate a reference image; determining a plurality of sampling results corresponding to the multi-round denoising process, and determining a target noise; processing the reference image using an image generation model to determine a predicted noise of the image generation model at a target time step, wherein the target time step is determined by sampling from a plurality of time steps; and adjusting parameters of the image generation model based on a difference between the predicted noise and the target noise.

In a second aspect of the present disclosure, an apparatus for training an image generation model is provided. The apparatus comprises a generation module, a determination module, a processing module, and an adjustment module. The generation module is configured to perform a multi-round denoising process on an initial noise based on a sample prompt word to generate a reference image. The determination module is configured to determine a plurality of sampling results corresponding to the multi-round denoising process, and to determine a target noise. The processing module is configured to process the reference image using an image generation model to determine a predicted noise of the image generation model at a target time step, wherein the target time step is determined by sampling from a plurality of time steps. The adjustment module is configured to adjust parameters of the image generation model based on a difference between the predicted noise and the target noise.

In a third aspect of the present disclosure, an electronic device is provided. The electronic device includes at least one processing unit, and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit. The instructions, when executed by the at least one processing unit, cause the electronic device to perform the method of the first aspect.

In a fourth aspect of the present disclosure, a computer-readable storage medium is provided. The computer-readable storage medium has stored thereon a computer program executable by a processor to implement the method of the first aspect.

It should be understood that what is described in this section is not intended to identify key features or essential features of the embodiments of the disclosure, nor is it intended to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following description.
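
The method of the first aspect can be pictured with a deliberately simplified, scalar toy loop in Python. This is only a sketch under strong assumptions and not the disclosed implementation: the "image" is a single number, `toy_teacher` is a hypothetical fixed denoiser standing in for whatever performs the multi-round denoising process, the trained model is a two-parameter linear function of the reference value and a normalized time step, the target noise is assumed to be the mean of the per-round results, and the target time step is drawn uniformly.

```python
import random


def toy_teacher(x):
    """Hypothetical fixed denoiser: estimates the noise as half of the input."""
    return 0.5 * x


def multi_round_denoise(initial_noise, teacher, rounds):
    """Run several denoising rounds; return the final value and per-round estimates."""
    x, sampling_results = initial_noise, []
    for _ in range(rounds):
        eps = teacher(x)
        sampling_results.append(eps)
        x -= eps / rounds  # remove a fraction of the estimated noise
    return x, sampling_results


def train_step(params, reference, target_noise, t_norm, lr=0.05):
    """One gradient step on the squared difference between predicted and target noise."""
    w, b = params
    predicted = w * reference + b * t_norm  # toy "model" prediction at time step t
    error = predicted - target_noise
    loss = error ** 2
    new_params = (w - lr * 2 * error * reference, b - lr * 2 * error * t_norm)
    return new_params, loss


rng = random.Random(0)
time_steps = [1000, 750, 500, 250]  # assumed candidate time steps
params, losses = (0.0, 0.0), []
for _ in range(200):
    initial_noise = rng.gauss(0.0, 1.0)
    reference, results = multi_round_denoise(initial_noise, toy_teacher, rounds=4)
    target_noise = sum(results) / len(results)  # assumed: mean of sampling results
    t = rng.choice(time_steps)  # uniform here; a learned sampler could replace this
    params, loss = train_step(params, reference, target_noise, t / 1000)
    losses.append(loss)
```

The point of the sketch is the data flow: the reference value and the target noise both come from the multi-round process, while the model being trained produces its prediction in a single evaluation at one sampled time step.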
Drawings

The above and other features, advantages and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference numerals denote like or similar elements, in which:

FIG. 1 illustrates a schematic diagram of an example environment in which embodiments in accordance with the present disclosure may be implemented;
FIG. 2 illustrates a flowchart of an example process of training an image generation model in accordance with some embodiments of the present disclosure;
FIG. 3 illustrates a schematic comparison of a diffusion process according to an embodiment of the present disclosure and a conventional diffusion process;
FIG. 4 illustrates a schematic block diagram of an example apparatus for training an image generation model in accordance with some embodiments of the present disclosure; and
FIG. 5 illustrates a block diagram of an electronic device capable of implementing various embodiments of the present disclosure.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are illustrated in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be more thorough and complete. It should b