CN-121981255-A - Diffusion model inference acceleration method based on space-time joint caching
Abstract
A diffusion model inference acceleration method based on space-time joint caching comprises: at each time step of the diffusion model, calculating, for each module, difference quotients along the time dimension and the space dimension as stability metrics to evaluate cache tendency scores; and selecting the optimal operation for each module based on the cache tendency scores and a set threshold, the optimal operation being full computation, time caching, or space caching. When the selected operation is full computation, the module executes its forward pass and updates the cached features. When the selected operation is time caching, an interpolation-based feature prediction method produces a more accurate feature approximation, which is used as the module's output. When the selected operation is space caching, the module reuses the cached features of the previous module as its output. By caching features in the time and space dimensions simultaneously, the invention significantly improves the inference acceleration of the diffusion model while maintaining generation quality.
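The three-way choice described in the abstract can be illustrated with a minimal Python sketch. The names `Op`, `CacheState`, and `select_operation` are hypothetical, and so is the decision rule itself: it assumes that a higher cache tendency score means a more stable feature, that the dimension with the higher score is preferred, and that the consecutive-cache counters reset only on full computation. The text above does not preserve the exact rule, so this is one plausible reading rather than the patented procedure.

```python
from dataclasses import dataclass
from enum import Enum


class Op(Enum):
    FULL = "full"          # run the module's forward pass and refresh the cache
    TEMPORAL = "temporal"  # predict the output from features cached at earlier time steps
    SPATIAL = "spatial"    # reuse the cached feature of the previous module


@dataclass
class CacheState:
    temporal_run: int = 0  # consecutive time-cache uses since the last full computation
    spatial_run: int = 0   # consecutive space-cache uses since the last full computation


def select_operation(score_t, score_s, threshold, state, max_t, max_s):
    """Choose an operation for one module at one time step.

    Assumed rule (not confirmed by the source): a dimension is eligible when
    its score reaches the threshold and its consecutive-cache count is below
    its limit; among eligible dimensions the higher score wins.
    """
    temporal_ok = score_t >= threshold and state.temporal_run < max_t
    spatial_ok = score_s >= threshold and state.spatial_run < max_s
    if temporal_ok and (not spatial_ok or score_t >= score_s):
        state.temporal_run += 1
        return Op.TEMPORAL
    if spatial_ok:
        state.spatial_run += 1
        return Op.SPATIAL
    # Neither dimension is eligible: fall back to full computation and reset.
    state.temporal_run = 0
    state.spatial_run = 0
    return Op.FULL
```

With both limits set to 1 and both scores above the threshold, three successive calls on the same state select time caching, then space caching, then a forced full computation.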
Inventors
- DENG YONGHENG
- DU CHENXI
- REN JU
Assignees
- Tsinghua University
Dates
- Publication Date
- 20260505
- Application Date
- 20251229
Claims (10)
- 1. A diffusion model inference acceleration method based on space-time joint caching, characterized by comprising the following steps: at each time step of the diffusion model, calculating, for each module, difference quotients of the time dimension and the space dimension as stability metrics to evaluate cache tendency scores, and selecting the optimal operation of each module based on the cache tendency scores and a set threshold, wherein the optimal operation is full computation, time caching, or space caching; when the optimal operation selected for a given module is full computation, the module executes its forward pass and updates the cached features; when the optimal operation selected for a given module is time caching, an interpolation-based feature prediction method is used to obtain a more accurate feature approximation, which is used as the output of the module; when the optimal operation selected for a given module is space caching, the module directly reuses the cached features of the previous module as its output.
- 2. The diffusion model inference acceleration method of claim 1, wherein calculating, for each module, the difference quotients of the time dimension and the space dimension as stability metrics to evaluate the cache tendency scores comprises: for a given module, denoting the sequence of the $M$ most recent historical cached features as $\{F_{x_1}, F_{x_2}, \dots, F_{x_M}\}$, and calculating the $(M-1)$-th order difference quotient of the sequence recursively: $F[x_i, \dots, x_j] = \dfrac{F[x_{i+1}, \dots, x_j] - F[x_i, \dots, x_{j-1}]}{x_j - x_i}$, with $F[x_i] = F_{x_i}$; wherein, for the time dimension, $x_i$ and $x_j$ denote the $i$-th and $j$-th time steps of full computation in the historical cached feature sequence, and the resulting quotient is the time-dimension difference quotient; for the space dimension, $x_i$ and $x_j$ denote the $i$-th and $j$-th modules of full computation in the historical cached feature sequence, and the resulting quotient is the space-dimension difference quotient; $i, j \in \{1, \dots, M\}$ and $i < j$; $F_{x_i}$ is the cached feature of the $i$-th time step or module; and a cache tendency score is defined for each dimension as [formula missing in source].
- 3. The diffusion model inference acceleration method of claim 2, wherein selecting the optimal operation of each module based on the cache tendency scores and the set threshold comprises: if [condition missing in source], selecting full computation; if [condition missing in source] and [condition missing in source], selecting time caching; if [condition missing in source] and [condition missing in source], selecting space caching.
- 4. The diffusion model inference acceleration method of claim 2, wherein [symbol missing in source] is a positive integer from 2 to 4.
- 5. The diffusion model inference acceleration method of claim 1, wherein, for a given module, the sequence of the 3 most recent historical cached features is denoted $\{F_{x_1}, F_{x_2}, F_{x_3}\}$, and the second-order difference quotient $F[x_1, x_2, x_3]$ is used as the stability metric, calculated as: $F[x_1, x_2, x_3] = \dfrac{F[x_2, x_3] - F[x_1, x_2]}{x_3 - x_1}$, wherein $F[x_1, x_2] = \dfrac{F_{x_2} - F_{x_1}}{x_2 - x_1}$ is the first-order difference quotient calculated from the historical cached features $F_{x_1}$ and $F_{x_2}$, and $F[x_2, x_3] = \dfrac{F_{x_3} - F_{x_2}}{x_3 - x_2}$ is the first-order difference quotient calculated from the historical cached features $F_{x_2}$ and $F_{x_3}$; for the time dimension, $x_1, x_2, x_3$ correspond to the 3 most recent time steps at which the given module performed full computation, and for the space dimension, $x_1, x_2, x_3$ correspond to the 3 nearest preceding modules that performed full computation.
- 6. The diffusion model inference acceleration method of claim 1, wherein obtaining a more accurate feature approximation using the interpolation-based feature prediction method when the selected optimal operation is time caching comprises: denoting the predicted feature approximation at the current time step $t$ as $\hat{F}_t$, calculated as: $\hat{F}_t = \sum_{i=1}^{M} w_i(t)\, F_{t_i}$, wherein $F_{t_i}$ is the cached feature of historical time step $t_i$; each module maintains its own $M$ most recent cached features, $i \in \{1, \dots, M\}$; and $w_i(t)$ is an interpolation weighting function based on the relative positions of the time steps, representing the contribution of the cached feature of historical time step $t_i$ to the predicted feature approximation at the current time step $t$.
- 7. The diffusion model inference acceleration method of claim 1, wherein a bounded cache distance control mechanism is introduced when performing the time caching or space caching, comprising: defining the maximum consecutive caching distance allowed in the time dimension as a time limit value, and the maximum consecutive caching distance allowed in the space dimension as a space limit value; when the consecutive cache count of either the time or the space dimension reaches its set limit, caching in that dimension is no longer considered in the optimal operation selection, and if both the time and space dimensions reach their set limits, full computation is forced.
- 8. A diffusion model inference acceleration device based on space-time joint caching, characterized by comprising: a redundancy-guided operation selector configured to calculate, for each module at each time step of the diffusion model, difference quotients of the time dimension and the space dimension as stability metrics to evaluate cache tendency scores, and to select the optimal operation of each module, being full computation, time caching, or space caching, based on the cache tendency scores and a set threshold; and an operation executor configured to: when the optimal operation selected for a given module is full computation, cause the module to execute its forward pass and update the cached features; when the optimal operation selected for a given module is time caching, use an interpolation-based feature prediction method to obtain a more accurate feature approximation, which is used as the output of the module; when the optimal operation selected for a given module is space caching, cause the module to directly reuse the cached features of the previous module as its output.
- 9. The diffusion model inference acceleration device of claim 8, further comprising a bounded cache distance controller configured to: define the maximum consecutive caching distance allowed in the time dimension as a time limit value, and the maximum consecutive caching distance allowed in the space dimension as a space limit value; when the consecutive cache count of either the time or the space dimension reaches its set limit, no longer consider caching in that dimension in the optimal operation selection, and if both the time and space dimensions reach their set limits, force full computation.
- 10. A computer-readable storage medium storing computer instructions for causing a computer to execute the diffusion model inference acceleration method of any one of claims 1-7.
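To make the difference-quotient stability metric (claims 2 and 5) and the interpolation-based feature prediction (claim 6) concrete, here is a minimal Python sketch on flattened feature vectors. The function names are hypothetical. The choice of Lagrange basis polynomials for the interpolation weights and of the Euclidean norm to summarize the quotient are assumptions; the claims only state that the weights depend on the relative positions of the time steps.

```python
import math


def divided_difference(xs, feats):
    """Return the (len(xs) - 1)-th order difference quotient of cached features.

    xs: positions (time-step or module indices) at which full computation ran.
    feats: one flattened feature vector (list of floats) per position.
    """
    table = [list(f) for f in feats]
    n = len(xs)
    for order in range(1, n):
        # Each pass raises the order by one using the standard recursion.
        table = [
            [(b - a) / (xs[i + order] - xs[i]) for a, b in zip(table[i], table[i + 1])]
            for i in range(n - order)
        ]
    return table[0]


def stability_norm(xs, feats):
    """Euclidean norm of the difference quotient (assumed summary statistic)."""
    return math.sqrt(sum(v * v for v in divided_difference(xs, feats)))


def lagrange_weights(ts, t):
    """Interpolation weights from relative time-step positions (assumed Lagrange basis)."""
    weights = []
    for i, ti in enumerate(ts):
        w = 1.0
        for j, tj in enumerate(ts):
            if j != i:
                w *= (t - tj) / (ti - tj)
        weights.append(w)
    return weights


def predict_feature(ts, feats, t):
    """Approximate the feature at time step t as a weighted sum of cached features."""
    ws = lagrange_weights(ts, t)
    dim = len(feats[0])
    return [sum(w * f[k] for w, f in zip(ws, feats)) for k in range(dim)]
```

For a feature that varies linearly across the cached positions the second-order quotient is zero, which is the intuition behind using it as a stability signal: a small norm suggests that cached features can be extrapolated or reused with little error.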
Description
Diffusion model inference acceleration method based on space-time joint caching

Technical Field

The invention relates to the technical field of deep learning model inference acceleration, and in particular to a training-free inference acceleration method for diffusion models based on space-time joint caching.

Background

Diffusion models are a class of generative models that learn a data distribution by modeling the reverse of a process that adds noise step by step. They have been remarkably successful in generation tasks such as image synthesis, video generation, 3D modeling, and multi-modal learning. A diffusion model comprises a forward process and a reverse process. The forward process draws a data sample $x_0$ from the data distribution and gradually adds Gaussian noise according to a predefined noise variance schedule, so that after $T$ steps the data is completely converted to noise. The reverse process starts from random Gaussian noise and iteratively denoises it through a neural network, gradually recovering the original data sample $x_0$.

While early diffusion models mainly adopted a U-Net architecture, Diffusion Transformer (DiT) models based on the Transformer architecture have become mainstream in recent years owing to their scalability, modeling capability, and generation quality. A DiT model generally consists of $L$ identical Transformer modules, and the generation process at each time step can be regarded as $f_L \circ f_{L-1} \circ \cdots \circ f_1$, in which $f_l$ represents the operation of the $l$-th Transformer module. However, the computational cost of diffusion models is high, mainly because of the large number of iterative sampling steps required and the inherent complexity of the denoising network at each step. This makes deploying DiT models in real-time or resource-constrained scenarios a significant challenge.
To accelerate diffusion model inference, existing research has mainly developed along two orthogonal directions: (1) methods that reduce the number of sampling steps, including advanced stochastic differential equation (SDE) solvers and ordinary differential equation (ODE) solvers (such as DDIM and DPM++), progressive distillation, and consistency models; and (2) methods that reduce the computation cost per step, including model compression (quantization, pruning) and lightweight model design.

In recent years, cache-based methods have received widespread attention as an acceleration direction that requires no additional training. The core idea is to exploit the similarity of features between adjacent time steps (i.e., temporal redundancy) and to reduce redundant computation by caching and reusing intermediate results. Specifically, existing caching methods divide the inference timeline into uniform groups. For each group, comprising a sequence of N time steps, full computation is performed for all modules only at the first time step of the group, and the output of each module, computed from that module's input tensor at that time step, is cached. In the subsequent N-1 time steps within the group, all modules directly reuse the cached features. Thus, there is still room for improvement in caching strategies.

Disclosure of Invention

The present invention aims to solve, at least to some extent, one of the technical problems in the related art. To this end, the invention provides a diffusion model inference acceleration method based on space-time joint caching, a training-free acceleration strategy that can flexibly perform feature caching in both the time and space dimensions.
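For comparison, the uniform-group caching scheme described in the Background can be sketched as a toy Python loop. The function name `run_uniform_group_caching` is hypothetical, modules are modeled as plain callables, and the sampler update between time steps is omitted, so this only illustrates where full computation happens and where cached features are reused.

```python
def run_uniform_group_caching(modules, x, num_steps, group_size):
    """Run num_steps time steps, recomputing all modules once per group.

    modules: list of callables, one per module, applied in sequence.
    Returns the final output and the number of full module computations.
    """
    cache = [None] * len(modules)
    full_calls = 0
    for step in range(num_steps):
        refresh = step % group_size == 0  # first time step of each group
        h = x
        for idx, module in enumerate(modules):
            if refresh:
                cache[idx] = module(h)  # full computation, then cache the output
                full_calls += 1
            h = cache[idx]  # on other steps, reuse the cached feature directly
        x = h  # a real sampler would apply its denoising update here
    return x, full_calls
```

Every module follows the same fixed schedule regardless of how quickly its features change, which is the rigidity the per-module selection of the invention is meant to address.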
To achieve the above purpose, the present invention adopts the following technical scheme: The diffusion model inference acceleration method based on space-time joint caching provided by the first aspect of the invention comprises the following steps: at each time step of the diffusion model, calculating, for each module, difference quotients of the time dimension and the space dimension as stability metrics to evaluate cache tendency scores, and selecting the optimal operation of each module based on the cache tendency scores and a set threshold, wherein the optimal operation is full computation, time caching, or space caching; when the optimal operation selected for a given module is full computation, the module executes its forward pass and updates the cached features; when the optimal operation selected for a given module is time caching, an interpolation-based feature prediction method is used to obtain a more accurate feature approximation, which is used as the output of the module; when the optimal operation selected for a given module is space caching, the module directly reuses the cached features of the previous module as its output. In some embodiments, each modul