CN-121835936-B - End cloud collaborative speculative decoding acceleration method based on credit inertia

Abstract

The invention discloses a credit inertia-based end cloud collaborative speculative decoding acceleration method, belonging to the field of edge cloud collaborative computing. The method improves the efficiency and stability of end cloud collaborative generation through credit-inertia sensing, adaptive threshold prediction, and asynchronous parallel scheduling. First, a robust distribution-level verification criterion is constructed by using the Jensen-Shannon divergence to measure the distribution difference between the draft model and the cloud target model. Then, the 'credit inertia' characteristics of a historical window are extracted through an exponentially weighted moving average, and the acceptance threshold and draft window length are adaptively adjusted by a lightweight threshold prediction network. Finally, a progressive pre-aiming concurrency mechanism decouples draft generation, cloud pre-examination, and full examination into an asynchronous pipeline. End-to-end latency is thereby significantly reduced, invalid rollback is decreased, and the effective token throughput is improved while generation quality is preserved. The method is suitable for high-real-time applications such as mobile albums, multimodal retrieval, and description generation.

Inventors

  • WANG HAOZHAO
  • LI SENYAO
  • LI RUIXUAN
  • YI XIAOQUAN
  • LI SHIWEI

Assignees

  • Huazhong University of Science and Technology (华中科技大学)

Dates

Publication Date
2026-05-08
Application Date
2026-03-12

Claims (6)

  1. An end cloud collaborative speculative decoding acceleration method based on credit inertia, characterized by comprising the following steps:
     S1, a draft window formed by the 1st to W-th tokens predicted and output by a draft model at the edge side is taken as the current decoding window and sent to the cloud side, while the draft model continues predicting and outputting; the input to the draft model is the token sequence corresponding to the natural-language text entered by the user;
     S2, the target model at the cloud side verifies the first token in the current decoding window according to the verification threshold of the current decoding window; if the verification passes, the target model sequentially verifies the other tokens in the current decoding window according to the same verification threshold; otherwise, the target model generates a whole-window rejection signal, discards all tokens in the current decoding window, resamples to generate a new token, and transmits the new token and the whole-window rejection signal to the draft model, whereupon the draft model updates its context according to the new token and verification of the current decoding window is finished. When the other tokens in the current decoding window are verified sequentially according to the verification threshold of the current decoding window and a token fails verification, that token and the remaining tokens in the current decoding window are discarded, a new token is resampled, and the new token together with the serial number of the last token that passed verification is transmitted to the draft model; the draft model discards all tokens after the last verified token, appends the resampled token to the tail of the verified tokens, updates its context, and verification of the current decoding window is finished;
     S3, the draft windows formed by the (W+1)-th to 2W-th tokens, the (2W+1)-th to 3W-th tokens, ..., and the (A·W+1)-th to (A+1)·W-th tokens predicted and output by the draft model are each taken in turn as the current decoding window, returning to S2, until a termination condition is met, wherein A > 1.
     In step S2, if the distribution divergence D_i of the i-th token in the current decoding window is smaller than or equal to the verification threshold of the current decoding window, the verification is passed; otherwise, the verification is not passed. Here D_i = JS(p_i, q_i) = (1/2)·KL(p_i ‖ m_i) + (1/2)·KL(q_i ‖ m_i), where p_i and q_i are the probability distributions of the i-th token output by the draft model and the target model respectively, m_i = (1/2)·(p_i + q_i) is their mixture distribution, KL(p_i ‖ m_i) is the Kullback-Leibler divergence of p_i under the mixture m_i, and KL(q_i ‖ m_i) is that of q_i under m_i (see the first sketch following the claims).
  2. The method of claim 1, wherein in step S2, before the target model at the cloud side verifies the first token in the current decoding window according to the verification threshold of the current decoding window, the method further comprises:
     (1) calculating the instant pass rate r_j of each of the K historical decoding windows t-1, t-2, ..., t-K adjacent to the current decoding window t: r_j = (1/W)·Σ_{i=1}^{W} 1(D_{j,i} ≤ T_j), j = t-1, t-2, ..., t-K, where D_{j,i} is the distribution divergence of the i-th token in historical decoding window j, T_j is the verification threshold of historical decoding window j, and 1(·) is the indicator function, equal to 1 when D_{j,i} ≤ T_j and 0 otherwise;
     (2) calculating the average pass rate r̄_t of the K historical decoding windows t-1, t-2, ..., t-K adjacent to the current decoding window t: r̄_t = (1/K)·Σ_{j=t-K}^{t-1} r_j;
     (3) calculating, from r̄_t, the smooth pass rate s_t of the K historical decoding windows t-1, t-2, ..., t-K adjacent to the current decoding window t: s_t = β·r̄_t + (1-β)·s_{t-1}, where β is a weight coefficient and s_{t-1} is the smooth pass rate of the K historical decoding windows t-2, t-3, ..., t-1-K adjacent to the previous decoding window t-1;
     (4) inputting the instant pass rates and the smooth pass rate s_t to a pre-trained predictor to obtain a predicted value T̂_t of the verification threshold T_t of the current decoding window t; if the absolute value of the difference between T̂_t and T_t exceeds a set value, and the absolute value of the difference between the predicted value and the verification threshold of each of the P historical decoding windows adjacent to the current decoding window also exceeds the set value, the value of T_t is updated to the value of T̂_t; otherwise the value of T_t remains unchanged (see the credit-inertia tracking sketch following the claims).
  3. The method of claim 2, wherein the training process of the predictor comprises:
     (1) constructing a training data set: setting the verification threshold of all decoding windows to a given threshold T, decoding the natural-language text samples input by the user, and calculating the joint cost of each decoding window under different given-threshold configurations T; selecting the N decoding windows with the minimum joint cost, and taking the smooth pass rate and the instant pass rates of the K historical decoding windows adjacent to each of the N decoding windows, together with the corresponding given threshold, as the training data set; wherein the joint cost of each decoding window w under a given threshold configuration T combines, through a trade-off coefficient λ, the end-to-end delay L_w(T) and the pass rate R_w(T); L_w(T) denotes the end-to-end delay of w under the given threshold T, including the time for the draft model to generate w, the time for the target model to verify w, and the time required to return the verification result to the draft model; R_w(T) denotes the pass rate of w under the given threshold T; λ is the trade-off coefficient;
     (2) training the predictor by taking the smooth pass rate and the instant pass rates of the K historical decoding windows adjacent to each decoding window in the training data set as the input of the predictor, and taking as the objective the minimization of the difference between the predicted verification threshold output by the predictor and the given threshold; the predictor is a neural network (see the training sketch following the claims).
  4. An electronic device, comprising a computer-readable storage medium and a processor; the computer-readable storage medium is configured to store executable instructions; the processor is configured to read the executable instructions stored in the computer-readable storage medium and perform the method of any one of claims 1-3.
  5. A computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions for causing a processor to perform the method of any one of claims 1-3.
  6. A computer program product comprising a computer program or instructions which, when executed by a processor, implement the method of any one of claims 1-3.
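
A minimal Python sketch of the distribution-level verification criterion in claim 1, assuming per-token probability vectors from both models are available on the cloud side; the function names and the eps smoothing constant are illustrative choices, not from the patent.

```python
import numpy as np

def js_divergence(p, q, eps: float = 1e-12) -> float:
    """Jensen-Shannon divergence: 0.5*KL(p||m) + 0.5*KL(q||m), m = (p+q)/2."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl_pm = float(np.sum(p * np.log((p + eps) / (m + eps))))
    kl_qm = float(np.sum(q * np.log((q + eps) / (m + eps))))
    return 0.5 * (kl_pm + kl_qm)

def verify_window(draft_dists, target_dists, threshold):
    """Sequentially verify one draft window (claim 1, step S2).

    Returns (accepted_count, rejected_index). rejected_index is None when the
    whole window passes; otherwise the failing token and everything after it
    are to be discarded and a replacement token resampled upstream."""
    for i, (p, q) in enumerate(zip(draft_dists, target_dists)):
        if js_divergence(p, q) > threshold:
            return i, i  # tokens 0..i-1 accepted, token i rejected
    return len(draft_dists), None
```

In this sketch, the whole-window rejection of the claim corresponds to rejected_index == 0.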
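The credit-inertia statistics of claim 2 can be sketched similarly. The streak-based update below approximates the claim's condition that the predicted threshold must also deviate from the threshold of the P adjacent historical windows; the hyper-parameter defaults (K, beta, delta, P, init_threshold) are assumptions, not values from the patent.

```python
from collections import deque
import numpy as np

class CreditInertiaTracker:
    """Per-window pass-rate statistics from claim 2 (hyper-parameters assumed)."""

    def __init__(self, K=8, beta=0.3, delta=0.05, P=3, init_threshold=0.2):
        self.beta = beta            # EWMA weight for the smooth pass rate
        self.delta = delta          # the "set value" for |predicted - current|
        self.P = P                  # windows of persistent deviation required
        self.threshold = init_threshold
        self.instant_rates = deque(maxlen=K)  # r_j of the K most recent windows
        self.smooth_rate = None               # s_{t-1}
        self.deviation_streak = 0

    def record_window(self, divergences):
        """Instant pass rate r_j: fraction of tokens with D_{j,i} <= T_j."""
        self.instant_rates.append(
            float(np.mean(np.asarray(divergences) <= self.threshold)))

    def step(self, predictor):
        """Update the smooth pass rate and, if deviation persists, the threshold."""
        if not self.instant_rates:
            return self.threshold
        avg = float(np.mean(self.instant_rates))  # average pass rate r_bar_t
        self.smooth_rate = (avg if self.smooth_rate is None
                            else self.beta * avg + (1 - self.beta) * self.smooth_rate)
        predicted = predictor(list(self.instant_rates), self.smooth_rate)
        if abs(predicted - self.threshold) > self.delta:
            self.deviation_streak += 1
            if self.deviation_streak >= self.P:  # deviation held for P windows
                self.threshold = predicted
                self.deviation_streak = 0
        else:
            self.deviation_streak = 0
        return self.threshold
```

record_window is called once per verified window with that window's per-token divergences; step is then called before the next window is verified, matching the ordering in claim 2.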
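A hypothetical sketch of the predictor training in claim 3. The exact joint-cost formula is not recoverable from the claim text; this sketch assumes cost = delay − λ·pass_rate (lower is better on both axes), and the small PyTorch MLP stands in for the unspecified neural network.

```python
import numpy as np
import torch
import torch.nn as nn

def build_training_set(windows, lam=0.5, N=256):
    """windows: list of dicts with keys 'delay', 'pass_rate', 'features', 'T',
    one entry per decoding window per given-threshold configuration. The
    assumed joint cost is delay - lam * pass_rate."""
    scored = sorted(windows, key=lambda w: w["delay"] - lam * w["pass_rate"])
    keep = scored[:N]  # the N windows with minimum joint cost
    X = torch.tensor(np.stack([w["features"] for w in keep]), dtype=torch.float32)
    y = torch.tensor([w["T"] for w in keep], dtype=torch.float32).unsqueeze(1)
    return X, y

def train_predictor(X, y, epochs=200, lr=1e-3):
    """Features: the K instant pass rates plus the smooth pass rate (K+1 dims)."""
    model = nn.Sequential(nn.Linear(X.shape[1], 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X), y)  # match the given threshold T
        loss.backward()
        opt.step()
    return model
```

The trained network can then back the predictor callable in the previous sketch, e.g. lambda rates, s: float(model(torch.tensor(rates + [s], dtype=torch.float32))).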

Description

End cloud collaborative speculative decoding acceleration method based on credit inertia

Technical Field

The invention belongs to the field of edge cloud collaborative computing, and particularly relates to a credit inertia-based end cloud collaborative speculative decoding acceleration method.

Background

With the wide deployment of large models in key services such as intelligent customer service, code generation, and traffic scheduling, interactive applications oriented to end users are rapidly moving from cloud-centralized serving to end cloud collaboration. In this process, the latency, energy consumption, and cost of large language model inference directly affect business experience and system scalability. Because cloud-side large models have huge parameter scales and intensive computation, real-time response requirements under massive concurrent requests are difficult to meet by cloud-side inference alone, while sinking the complete model onto edge or mobile devices runs into the combined constraints of computing power, storage, and power consumption. How to realize low-latency, high-precision generative services through the collaborative inference of a lightweight small model and a cloud large model under an end cloud collaborative framework has therefore become a key problem for current intelligent application infrastructure.

Among the many acceleration techniques, speculative decoding offers a new system-level acceleration path for large model inference by introducing a draft-verification dual-model structure: a candidate token sequence is first generated by a small model at or near the edge side, then verified and corrected by the cloud large model, improving decoding throughput without noticeably sacrificing accuracy. However, most existing speculative decoding schemes are designed for stand-alone or homogeneous cluster scenarios and implicitly assume that computation and communication resources are relatively abundant and stable. When this paradigm is migrated to a real end cloud collaborative environment, limited uplink bandwidth, pronounced network delay fluctuation, and strong end-to-cloud computing-power heterogeneity amplify the risks caused by the distribution deviation between the draft model and the target model: once the draft sequence is rolled back on a large scale, not only is the expected acceleration gain lost, but additional recomputation and communication overhead is introduced, so end-to-end latency increases instead.

On the other hand, existing methods typically rely on a preset fixed acceptance threshold or a simplified statistical assumption to decide whether a draft token is 'acceptable' to the target model. Such a static threshold exhibits an obvious rigidity problem under real traffic: when the input context, task type, or network condition changes, an overly conservative threshold frequently rejects acceptable high-quality drafts and lowers effective throughput, while an overly aggressive threshold causes error accumulation and large-scale rollback, degrading generation quality.

In scenarios such as continuous dialogue, multi-round interaction, and long-text generation, the accepted-or-rejected patterns of drafts often show clear temporal correlation, but this kind of cross-window statistical information and 'inertial' characteristic has not been systematically mined and exploited, so end cloud collaborative speculative decoding can hardly adapt its strategy to historical behaviour. Meanwhile, from the system perspective, existing end cloud collaborative inference frameworks often exhibit an obvious mutual-waiting phenomenon between draft generation and cloud verification: the edge side must wait for the verification result of the previous token before it can safely advance its context, while the cloud side depends on the edge side uploading a complete draft sequence before efficient parallel verification can start; superimposed on unstable network transmission delay, this easily forms a performance bottleneck in the inference pipeline (see the sketch at the end of this section). The lack of joint modelling and dynamic scheduling of edge-side computation, cloud-side load, and network conditions also makes it difficult for the system to balance multi-dimensional targets such as response time, cloud energy consumption, and quality of service across different service scenarios.

Disclosure of Invention

Aiming at the defects or improvement demands of the prior art, the invention provides a credit inertia-based end cloud collaborative speculative decoding acceleration method, so that the performance bottlenecks caused by fixed-threshold rigidity, end cloud mutual waiting, and network delay fluctuation in traditional end cloud collaborative speculative decoding are alleviated.
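
As a rough illustration of how an asynchronous pipeline relieves the mutual-waiting bottleneck described above, the sketch below decouples edge-side drafting from cloud-side verification with a bounded queue, so the drafter keeps producing windows while earlier windows are being verified. This is our own minimal construction, not the patent's progressive pre-aiming concurrency mechanism.

```python
import queue
import threading

def run_pipeline(draft_step, verify_window, num_windows, window_len):
    """Drafting and verification run concurrently instead of in lock-step."""
    pending = queue.Queue(maxsize=2)  # draft windows awaiting cloud verification

    def drafter():
        for _ in range(num_windows):
            window = [draft_step() for _ in range(window_len)]
            pending.put(window)       # blocks only if verification falls behind
        pending.put(None)             # sentinel: drafting finished

    threading.Thread(target=drafter, daemon=True).start()

    accepted = []
    while (window := pending.get()) is not None:
        n_ok, rejected = verify_window(window)  # cloud-side check of one window
        accepted.extend(window[:n_ok])
        if rejected is not None:
            # rollback: in the full method the draft model would resync its
            # context from the resampled token; the sketch simply stops
            break
    return accepted
```

The bounded queue caps how far drafting runs ahead, which limits wasted edge-side work whenever a window is rejected and rolled back.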