CN-122021971-A - Diffusion model quantization method and system based on pareto search and decoupling operator
Abstract
The application provides a diffusion model quantization method and system based on Pareto search and decoupling operators. The method comprises: performing full-time-sequence sensitivity analysis on a pre-trained diffusion model to obtain a sensitivity weight for each network layer in the diffusion model at each time step; constructing a decoupling candidate set based on bit width and operator configuration; obtaining the real inference latency of the decoupling candidate set and constructing a latency lookup table; performing Pareto search on the decoupling candidate set based on the sensitivity weights and the latency lookup table to obtain a full-time-sequence optimal quantization strategy path; and quantizing the pre-trained diffusion model based on the full-time-sequence optimal quantization strategy path to obtain a quantized diffusion model. The quantized diffusion model can effectively support generation tasks for data such as images and videos, avoid resource waste, and minimize structural collapse during the generation process.
Inventors
- ZHANG YULUN
- ZHANG SHAOQIU
- Ding Zizhong
- YANG KAICHENG
- WU JUNYI
- LIU RUONAN
- KONG LINGHE
- YANG XIAOKANG
Assignees
- Shanghai Jiao Tong University (上海交通大学)
Dates
- Publication Date
- 20260512
- Application Date
- 20260126
Claims (10)
- 1. A diffusion model quantization method based on Pareto search and decoupling operators, characterized by comprising the following steps: performing full-time-sequence sensitivity analysis on a pre-trained diffusion model to obtain a sensitivity weight for each network layer in the diffusion model at each time step; constructing a decoupling candidate set based on bit width and operator configuration; obtaining the real inference latency of the decoupling candidate set and constructing a latency lookup table; performing Pareto search on the decoupling candidate set based on the sensitivity weights and the latency lookup table to obtain a full-time-sequence optimal quantization strategy path; and quantizing the pre-trained diffusion model based on the full-time-sequence optimal quantization strategy path to obtain a quantized diffusion model.
- 2. The diffusion model quantization method based on Pareto search and decoupling operators according to claim 1, wherein performing full-time-sequence sensitivity analysis on the pre-trained diffusion model to obtain a sensitivity weight for each network layer at each time step comprises: constructing a calibration data set and inputting it into the pre-trained diffusion model; obtaining the Fisher information of each network layer at each time step of the diffusion process, and taking the Fisher information as the quantization sensitivity index of the corresponding network layer at each time step; and, for each network layer, sorting all time steps by sensitivity index from high to low, assigning each time step a sensitivity weight according to the relative magnitude of its sensitivity index, and normalizing the sensitivity weights so that they sum to 1 within each network layer, wherein a higher sensitivity index yields a higher sensitivity weight for the corresponding time step.
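As a rough illustration of the normalization step in the claim above, the following Python sketch assumes a simple proportional weighting scheme; the function name and the proportional rule are assumptions, since the claim only requires that higher Fisher information yield a higher weight and that the weights of one layer sum to 1:

```python
def sensitivity_weights(fisher_by_step):
    """Normalize per-time-step Fisher information into sensitivity
    weights summing to 1 for one network layer (higher Fisher
    information -> higher weight). Proportional scheme is illustrative."""
    total = float(sum(fisher_by_step))
    if total == 0.0:
        # Degenerate case: spread the weight uniformly across steps.
        return [1.0 / len(fisher_by_step)] * len(fisher_by_step)
    return [f / total for f in fisher_by_step]

# Example: one layer measured over 4 diffusion time steps.
w = sensitivity_weights([4.0, 2.0, 1.0, 1.0])
```

Here the most sensitive time step (Fisher information 4.0) receives the largest weight, and the four weights sum to 1, matching the normalization condition of the claim.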
- 3. The diffusion model quantization method based on Pareto search and decoupling operators according to claim 1, wherein constructing a decoupling candidate set based on bit width and operator configuration comprises: constructing a globally unified decoupling candidate set that provides selectable candidate strategies for all network layers and all time steps of the pre-trained diffusion model, each candidate strategy consisting of a bit width parameter and an operator configuration parameter, specifically: a high-fidelity mode strategy, wherein the bit width parameter is a preset high bit width and the operator configuration parameter enables Hadamard rotation; a standard mode strategy, wherein the bit width parameter is a preset low bit width and the operator configuration parameter enables Hadamard rotation; and an ultra-fast mode strategy, wherein the bit width parameter is a preset low bit width and the operator configuration parameter disables Hadamard rotation.
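The three-mode candidate set described above can be represented as a small data structure. In this sketch the concrete bit widths (8 and 4) are illustrative stand-ins for the "preset high bit width" and "preset low bit width", which the claim does not fix:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CandidateStrategy:
    name: str                # mode label
    bit_width: int           # quantization bit width parameter
    hadamard_rotation: bool  # operator configuration: rotation on/off

# Illustrative presets; the claim only distinguishes "high" vs. "low".
HIGH_BITS, LOW_BITS = 8, 4

CANDIDATE_SET = (
    CandidateStrategy("high_fidelity", HIGH_BITS, True),   # high bits + rotation
    CandidateStrategy("standard",      LOW_BITS,  True),   # low bits + rotation
    CandidateStrategy("ultra_fast",    LOW_BITS,  False),  # low bits, no rotation
)
```

Because the set is globally unified, the same three strategies are offered to every network layer at every time step; the search (claims 5-7) then picks one per layer per step.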
- 4. The diffusion model quantization method based on Pareto search and decoupling operators according to claim 3, wherein obtaining the real inference latency of the decoupling candidate set and constructing a latency lookup table comprises: selecting a target hardware platform; on the target hardware platform, measuring the real inference latency of every candidate strategy in the decoupling candidate set on each network layer of the pre-trained diffusion model; and constructing a latency lookup table from the measurement results, wherein the latency lookup table is indexed by network layer and candidate strategy and stores the corresponding real inference latency data in association with those indices.
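A minimal sketch of building the latency lookup table by direct measurement on the target hardware. The callable `run_layer` is hypothetical, standing in for one on-device execution of a layer under a given strategy; in practice repeated timed runs with warm-up would be used:

```python
import time

def build_latency_lut(layers, candidate_set, run_layer):
    """Measure the real inference latency of each (layer, strategy)
    pair and store it keyed by both, as in the claim above.

    run_layer(layer, strategy): hypothetical callable that executes
    the layer once under the given quantization strategy.
    """
    lut = {}
    for layer in layers:
        for strat in candidate_set:
            start = time.perf_counter()
            run_layer(layer, strat)
            lut[(layer, strat)] = time.perf_counter() - start
    return lut
```

The resulting dictionary realizes the claim's index scheme: `lut[(layer, strategy)]` returns the measured latency for that pair.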
- 5. The diffusion model quantization method based on Pareto search and decoupling operators according to claim 3, wherein performing Pareto search on the decoupling candidate set based on the sensitivity weights and the latency lookup table to obtain a full-time-sequence optimal quantization strategy path comprises: constructing a search tree, wherein each leaf node corresponds to one combination unit, and a combination unit comprises a network layer of the pre-trained diffusion model, a time step at which that layer participates in inference, and a candidate strategy adopted by that layer at that time step; constructing a multi-objective comprehensive loss function based on the sensitivity weights and the latency lookup table; and, with the multi-objective comprehensive loss function, executing a Pareto grid search over the candidate strategies in the search tree to obtain the full-time-sequence optimal quantization strategy path.
- 6. The diffusion model quantization method based on Pareto search and decoupling operators according to claim 5, wherein the multi-objective comprehensive loss function is expressed as: L = Σ_t Σ_i ( w_{i,t} · L_{i,t} + λ · D_{i,t} ); wherein L_{i,t} represents the loss of the i-th network layer at time step t; w_{i,t} represents the sensitivity weight of the i-th network layer at time step t; D_{i,t} represents the real inference latency data stored in the latency lookup table, indexed by the i-th network layer and its candidate strategy at time step t; and λ is a preset trade-off coefficient.
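One possible reading of the multi-objective loss, as a hedged Python sketch: each per-layer, per-step choice contributes a sensitivity-weighted layer loss plus a latency term scaled by a trade-off coefficient. The exact patented form is defined by the claim; all names and data shapes here are illustrative:

```python
def path_loss(path, sens_w, lut, lam):
    """Accumulated multi-objective loss of a strategy path.

    path: iterable of (layer, t, strategy, layer_loss) choices.
    sens_w[(layer, t)]: sensitivity weight of the layer at step t.
    lut[(layer, strategy)]: measured latency from the lookup table.
    lam: preset trade-off coefficient between fidelity loss and latency.
    """
    return sum(sens_w[(i, t)] * loss + lam * lut[(i, s)]
               for (i, t, s, loss) in path)

# Example: one layer, one time step, one chosen strategy.
total = path_loss(
    [("layer1", 1, "standard", 4.0)],    # layer loss 4.0 at step 1
    {("layer1", 1): 0.5},                # sensitivity weight
    {("layer1", "standard"): 2.0},       # latency from the LUT
    lam=0.1,
)
```

With these illustrative numbers the loss is 0.5 · 4.0 + 0.1 · 2.0 = 2.2, showing how a sensitive step (large weight) is pushed toward low-error strategies while λ penalizes slow ones.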
- 7. The diffusion model quantization method based on Pareto search and decoupling operators according to claim 5, wherein executing the Pareto search over the candidate strategies in the search tree with the multi-objective comprehensive loss function to obtain the full-time-sequence optimal quantization strategy path comprises: proceeding along the inference time axis of the diffusion model, with the time step as the iteration dimension, and performing the following operations step by step: under the rule that each network layer selects exactly one leaf node per time step, selecting one corresponding leaf node for each network layer from the node set of the t-th time step and combining the selections, each combination result being a group of candidate strategy combinations at the t-th time step; generating all paths at the t-th time step: if t = 1, each group of candidate strategy combinations at the t-th time step is itself a path; if t > 1, retrieving all base paths retained from the previous time step and splicing each base path in time order with each group of candidate strategy combinations of the t-th time step, each splicing result being one path of the t-th time step; at the t-th time step, invoking the multi-objective comprehensive loss function to obtain the loss of each path, the loss of a path being the accumulated loss of all network layers over time steps 1 to t; performing Pareto pruning on all paths and retaining the K paths with minimum loss as the base paths for the iteration of the (t+1)-th time step; and when the iteration reaches the final time step, terminating and taking the path with minimum loss as the optimal quantization strategy path.
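The step-wise procedure above amounts to a beam search along the time axis that keeps the K lowest-loss paths at each step. The following simplified sketch illustrates it; `step_loss` is a hypothetical helper that bundles the sensitivity-weighted error and latency terms of claim 6 for one per-layer strategy combination:

```python
from itertools import product

def pareto_beam_search(layers, steps, candidates, step_loss, K):
    """Time-step-wise search keeping the K lowest-loss paths.

    step_loss(t, combo): hypothetical combined loss of assigning one
    candidate per layer at step t; combo has one candidate per layer.
    """
    beams = [([], 0.0)]  # (path so far, accumulated loss)
    for t in steps:
        expanded = []
        # Every combination: exactly one candidate strategy per layer.
        for combo in product(candidates, repeat=len(layers)):
            cost = step_loss(t, combo)
            for path, acc in beams:
                expanded.append((path + [combo], acc + cost))
        # Pruning: retain the K paths with minimum accumulated loss.
        expanded.sort(key=lambda pc: pc[1])
        beams = expanded[:K]
    return beams[0]  # lowest-loss full-time-sequence path

# Toy example: 2 layers, 2 steps, candidates "a" (free) and "b" (cost 1).
best_path, best_loss = pareto_beam_search(
    ["l1", "l2"], [1, 2], ["a", "b"],
    step_loss=lambda t, combo: combo.count("b"), K=3)
```

In the toy run the search keeps only 3 of the 4 per-step combinations alive, yet still recovers the globally cheapest path (all-"a" at both steps), illustrating how pruning bounds the exponential path count.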
- 8. The diffusion model quantization method based on Pareto search and decoupling operators according to claim 7, wherein quantizing the pre-trained diffusion model based on the full-time-sequence optimal quantization strategy path to obtain a quantized diffusion model comprises: during inference of the pre-trained diffusion model, quantizing the bit width and operator configuration of each network layer at each time step according to the candidate strategy combinations of the full-time-sequence optimal quantization strategy path at the different time steps, thereby realizing the quantization of the pre-trained diffusion model.
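Applying the searched path during inference might look like the following sketch, where both the model and strategy representations are illustrative dictionaries rather than any real framework API:

```python
def apply_strategy_path(model, path):
    """Reconfigure each layer at every denoising step per the path.

    model: maps layer name -> mutable config dict (illustrative).
    path[t]: maps layer name -> chosen strategy dict for time step t,
             with "bit_width" and "hadamard_rotation" keys (illustrative).
    """
    for per_layer in path:                     # one entry per time step
        for layer_name, strat in per_layer.items():
            cfg = model[layer_name]
            cfg["bits"] = strat["bit_width"]             # quantization bit width
            cfg["rotate"] = strat["hadamard_rotation"]   # operator on/off
```

The point is that reconfiguration happens per time step, so a layer can run in high-fidelity mode during early composition steps and switch to the ultra-fast mode for late texture steps.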
- 9. A diffusion model quantization system based on Pareto search and decoupling operators, comprising: a weight analysis module for performing full-time-sequence sensitivity analysis on a pre-trained diffusion model to obtain a sensitivity weight for each network layer in the diffusion model at each time step; a decoupling module for constructing a decoupling candidate set based on bit width and operator configuration; a latency acquisition module for obtaining the real inference latency of the decoupling candidate set and constructing a latency lookup table; a path generation module for performing Pareto search on the decoupling candidate set based on the sensitivity weights and the latency lookup table to obtain a full-time-sequence optimal quantization strategy path; and a quantization module for quantizing the pre-trained diffusion model based on the full-time-sequence optimal quantization strategy path to obtain a quantized diffusion model.
- 10. An image generation method, comprising: determining a pre-trained FLUX text-to-image model; quantizing the pre-trained FLUX text-to-image model by the diffusion model quantization method based on Pareto search and decoupling operators according to any one of claims 1 to 8 to determine a quantized FLUX text-to-image model; and inputting a preset image generation text into the quantized FLUX text-to-image model to determine the generated image.
Description
Diffusion model quantization method and system based on Pareto search and decoupling operator
Technical Field
The application relates to the technical field of artificial intelligence and computer vision, in particular to a diffusion model quantization method and system based on Pareto search and decoupling operators.
Background
With the wide application of Diffusion Transformers (DiTs) in image and video generation (e.g., Flux, CogVideoX), their huge parameter counts and multi-step iterative generation mechanism bring enormous computational overhead and memory occupation. To enable edge-side deployment, model quantization has become a key technology. However, existing post-training quantization (PTQ) techniques for diffusion models suffer from the following significant drawbacks:
1. The static quantization strategy cannot adapt to the dynamic generation process. Existing quantization methods (e.g., SVDQuant, Q-DiT) typically employ a "static" strategy: the model is forced to use a uniform bit width (e.g., W4A4) or a uniform operator configuration (e.g., always-on rotation) throughout all time steps (timesteps) of the denoising process. This ignores the temporal non-uniformity of the diffusion generation process: the early generation stage (composition period) is extremely sensitive to outliers and structural information and needs high precision or strong operator protection, while the late generation stage (texture period) has sparse activations and high noise tolerance, so large computational redundancy exists. The static strategy thus creates a contradiction between insufficient precision at the critical steps and wasted compute at the redundant steps.
2. Decoupling of the search space from hardware features. Existing mixed-precision search methods typically use only bit width as the search variable, ignoring the impact of quantization operators (e.g., whether Hadamard rotation is enabled, whether reordering is performed) on model performance. In practice, the strength of the quantization operator is coupled with time-step sensitivity (e.g., a strong rotation operator allows low bit widths to be used early, but increases computational overhead). In addition, most existing methods use theoretical computation (Bit-Ops) as the search cost function, which cannot reflect operator-switching overhead and memory-access latency on real hardware, so the acceleration of the searched strategy in actual deployment is not obvious.
3. Lack of global consideration of error accumulation. Existing layer-wise or block-wise quantization methods (e.g., CLQ, S2Q-VDiT) focus mainly on the current local reconstruction error. However, diffusion generation essentially solves an ordinary differential equation (ODE), and small quantization errors in early steps undergo trajectory deviation and accumulate over the time steps. Optimizing only single-step errors while ignoring full-time-sequence trajectory consistency can lead to collapse of the generated image structure or semantic drift.
According to a search of the technical literature, Chinese patent publication CN117892792A provides a mixed-precision quantization method for image generation, which allocates quantization bit widths to different layers according to their sensitivity to quantization, thereby accelerating diffusion model generation more reasonably and efficiently.
The above application performs dynamic resource allocation, but lacks global consideration of error accumulation, and structures generated at low bit widths are prone to collapse. Therefore, there is a need for a diffusion model quantization method and system that can implement a dynamic quantization strategy while still maintaining semantic consistency of the generated content at a low average bit width.
Disclosure of Invention
Aiming at the above defects in the prior art, the application aims to provide a diffusion model quantization method and system based on Pareto search and decoupling operators. According to a first aspect of the present application, there is provided a diffusion model quantization method based on Pareto search and decoupling operators, comprising: performing full-time-sequence sensitivity analysis on a pre-trained diffusion model to obtain a sensitivity weight for each network layer in the diffusion model at each time step; constructing a decoupling candidate set based on bit width and operator configuration; obtaining the real inference latency of the decoupling candidate set and constructing a latency lookup table; performing Pareto search on the decoupling candidate set based on the sensitivity weights and the latency lookup table to obtain a full-time-sequence optimal quantization strategy path; and carrying out q