CN-120806155-B - Diffusion model-based large language model reasoning method, device, terminal and medium

Abstract

The application provides a diffusion-model-based large language model reasoning method, device, terminal and medium. A draft model built on a diffusion-model framework generates a draft sequence, which is then verified by the large language model. Because a diffusion model naturally supports parallel generation, the draft sequence can be made significantly longer, reducing the number of verification passes the large model must perform and thereby improving both the reasoning efficiency of the large language model and the utilization of computing resources.

Inventors

  • Request for anonymity
  • Request for anonymity

Assignees

  • 上海光羽芯辰科技有限公司

Dates

Publication Date
2026-05-08
Application Date
2025-07-08

Claims (6)

  1. A large language model reasoning method based on a diffusion model, characterized by comprising the following steps: obtaining a pre-trained draft model based on a diffusion-model framework and determining the large language model to be used, wherein training the draft model specifically comprises constructing the draft model's training set using the large language model; configuring the draft model according to preset draft-model configuration parameters; obtaining context information to be inferred and converting it into a context sequence to be inferred; taking the context sequence to be inferred as the current context sequence and performing a decoding operation on it to obtain an intermediate output sequence; performing a termination-judgment operation on the intermediate output sequence according to a preset termination condition to obtain a judgment result; if the judgment result meets the termination condition, taking the intermediate output sequence as the final output sequence; and if it does not, taking the intermediate output sequence as the current context sequence and repeating the decoding operation until a judgment result meeting the termination condition is obtained. The decoding operation comprises, executed in sequence: a draft-sequence generation operation based on the parameter-configured draft model, a verification operation based on the large language model, a prefix-acceptance determination operation, and a condition-judgment-and-correction operation. The draft-sequence generation operation comprises performing reverse-diffusion processing on the current context sequence with the parameter-configured draft model to generate a plurality of token representations, and converting the token representations into a draft sequence according to a vocabulary. The verification operation comprises inputting the current context sequence and the draft sequence into the large language model for parallel evaluation to obtain an optimal token sequence. The prefix-acceptance determination operation comprises outputting, starting from the first token of the draft sequence, the token portion consistent with the optimal token sequence as a prefix sequence, and outputting the number of tokens in the prefix sequence as the number of tokens accepted by the large language model. The condition-judgment-and-correction operation comprises, if the number of accepted tokens is smaller than the length of the draft sequence, executing a correction operation to obtain a corrected intermediate output sequence and taking it as the final intermediate output sequence. Finally, the final output sequence is converted into text, and the converted result is taken as the reasoning-result text information corresponding to the context information to be inferred.
  2. The large language model reasoning method based on a diffusion model of claim 1, wherein the draft-model configuration parameters comprise the length of the draft sequence and the number of denoising steps of the draft model.
  3. The large language model reasoning method based on a diffusion model of claim 1, wherein the correction operation comprises inputting the prefix sequence and the current context sequence into the large language model to obtain a corrected token, and splicing the corrected token onto the preliminary intermediate output sequence to obtain the corrected intermediate output sequence.
  4. A large language model reasoning apparatus based on a diffusion model, comprising: a model acquisition module for acquiring a pre-trained draft model based on a diffusion-model framework and determining the large language model to be used, wherein training the draft model specifically comprises constructing the draft model's training set using the large language model; a model configuration module for configuring the draft model according to pre-obtained draft-model configuration parameters; a reasoning module for acquiring the context information to be inferred, converting it into a context sequence to be inferred, taking that sequence as the current context sequence, performing a decoding operation on it to obtain an intermediate output sequence, performing a termination-judgment operation on the intermediate output sequence according to a preset termination condition, taking the intermediate output sequence as the final output sequence if the judgment result meets the termination condition, and otherwise taking the intermediate output sequence as the current context sequence and repeating the decoding operation until a judgment result meeting the termination condition is obtained, wherein the decoding operation comprises, executed in sequence, a draft-sequence generation operation based on the parameter-configured draft model, a verification operation based on the large language model, a prefix-acceptance determination operation, and a condition-judgment-and-correction operation; the draft-sequence generation operation comprises performing reverse-diffusion processing on the current context sequence with the parameter-configured draft model to generate a plurality of token representations and converting them into a draft sequence according to a vocabulary; the verification operation comprises inputting the current context sequence and the draft sequence into the large language model for parallel evaluation to obtain an optimal token sequence; the prefix-acceptance determination operation comprises outputting, from the first token of the draft sequence, the token portion consistent with the optimal token sequence as a prefix sequence and outputting the number of tokens in the prefix sequence as the number of tokens accepted by the large language model; and the condition-judgment-and-correction operation comprises, if the number of accepted tokens is smaller than the length of the draft sequence, executing a correction operation to obtain a corrected intermediate output sequence and taking it as the final intermediate output sequence; and a result generation module for converting the final output sequence into text and taking the converted result as the reasoning-result text information corresponding to the context information to be inferred.
  5. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 3.
  6. An electronic terminal comprising a memory, a processor, and a computer program stored in the memory, characterized in that the processor executes the computer program to implement the method of any one of claims 1 to 3.
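The decoding loop in claim 1 — generate a draft, verify it in one parallel pass, accept the longest matching prefix, and correct the first mismatch — can be sketched as follows. This is an illustrative reading of the claim, not the patent's implementation; `draft_model` and `target_llm` are hypothetical placeholder callables operating on token lists.

```python
# Hypothetical sketch of the claim-1 decoding loop: draft -> verify ->
# accept prefix -> correct, repeated until the termination condition.
# draft_model(seq, k) proposes k draft tokens in parallel (diffusion-style);
# target_llm(seq, draft) returns the target model's tokens at those positions.

def speculative_decode(context, draft_model, target_llm, draft_len, max_len):
    seq = list(context)
    while len(seq) < max_len:              # preset termination condition
        draft = draft_model(seq, draft_len)
        if not draft:                      # nothing proposed: stop
            break
        best = target_llm(seq, draft)      # one parallel verification pass
        # prefix acceptance: longest prefix of the draft matching the target
        n = 0
        while n < len(draft) and draft[n] == best[n]:
            n += 1
        seq.extend(draft[:n])              # n tokens accepted at once
        if n < len(draft):                 # correction operation
            seq.append(best[n])            # splice in the corrected token
    return seq
```

With a well-matched draft model, each iteration commits up to `draft_len + 1` tokens while costing only one target-model forward pass, which is where the claimed speedup over token-by-token autoregressive decoding comes from.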

Description

Diffusion model-based large language model reasoning method, device, terminal and medium

Technical Field

The application relates to the technical field of large language models, and in particular to a large language model reasoning method, device, terminal and medium based on a diffusion model.

Background

Large language models (LLMs), such as models based on the Transformer architecture, exhibit excellent performance in many natural language processing tasks. However, their autoregressive reasoning mechanism, which generates text token by token, leads to high reasoning latency and limits their deployment in application scenarios with strict real-time requirements. To alleviate this problem, speculative decoding techniques have been developed. Such a technique typically uses a small, fast draft model to pre-generate candidate token sequences, which are then validated in parallel by the target large language model. If verification passes, multiple tokens can be accepted at once, accelerating reasoning. In the prior art, draft models are typically miniaturized versions of the target large language model (e.g., small models obtained by knowledge distillation, or jointly trained models with few parameters) or structurally simplified autoregressive models. However, conventional speculative decoding has the following problems: (1) The draft generation length is limited. Existing autoregressive draft models require k serial decoding steps to generate k tokens, so to keep draft-generation latency from cancelling out the acceleration, k is usually kept very small, for example only 3 to 5 tokens, which fundamentally limits the maximum acceleration speculative decoding can deliver. (2) The number of actually accepted tokens is limited. Because k is small and the draft model's output distribution may deviate from that of the target large language model, the number n of tokens actually verified and accepted by the large model is usually small, often averaging only 2 to 3, so the effective acceleration of a single speculative-decoding round is very limited. (3) The overall acceleration is modest. Since each round accepts few tokens, many decoding iterations are needed to generate a sequence of the target length, and the overall speedup of speculative decoding falls short of the ever-higher demands on large-model reasoning speed. (4) Draft models face a trade-off between generation efficiency and quality. An overly simple draft model is fast but produces low-quality drafts with a low acceptance rate, while a higher-quality draft model may not be fast enough because of its autoregressive nature, weakening the benefit of speculative decoding while the number of tokens generated per round remains limited.
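The latency asymmetry described in problem (1) can be made concrete with a small contrast sketch. This is an illustration of the general argument, not code from the patent: an autoregressive draft model must make k sequential calls to propose k tokens, whereas a diffusion-style draft model refines all k positions in a fixed number of denoising passes, so draft length and latency are decoupled.

```python
# Illustrative contrast (hypothetical stubs, not the patent's models):
# autoregressive drafting costs k serial steps; diffusion drafting costs
# a fixed number of denoising passes regardless of draft length k.

def autoregressive_draft(step_fn, context, k):
    """k serial calls: latency grows linearly with draft length k."""
    seq = list(context)
    for _ in range(k):
        seq.append(step_fn(seq))   # each token depends on the previous one
    return seq[len(context):]

def diffusion_draft(denoise_fn, context, k, steps):
    """`steps` denoising passes, each refining all k positions at once."""
    MASK = None
    draft = [MASK] * k             # start from fully masked positions
    for _ in range(steps):
        draft = denoise_fn(context, draft)
    return draft
```

With `steps` fixed (and typically much smaller than k), the diffusion draft can be made much longer at roughly constant latency, which is the property the application exploits.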
Disclosure of Invention

In view of the above drawbacks of the prior art, an object of the present application is to provide a large language model reasoning method, device, terminal and medium based on a diffusion model, so as to solve the problem that overall acceleration is poor because the candidate token sequences generated by existing draft models are short and few tokens are actually accepted per round. To achieve this and other related objects, a first aspect of the present application provides a large language model reasoning method based on a diffusion model, comprising: obtaining a pre-trained draft model based on a diffusion-model architecture and determining the large language model to be used; configuring the draft model according to pre-obtained draft-model configuration parameters; obtaining context information to be inferred and converting it into a context sequence to be inferred; performing a decoding operation on the current context sequence to obtain an intermediate output sequence; performing a termination-judgment operation on the intermediate output sequence according to a preset termination condition to obtain a judgment result; if the judgment result is that the termination condition is met, taking the intermediate output sequence as the final output sequence; and if not, taking the intermediate output sequence as the current context sequence and performing the decoding operation again until the termination condition is met, wh
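The "back diffusion" (reverse-diffusion) draft generation referred to above might look like the following masked-denoising sketch. This is an assumption about one plausible mechanism, not the patent's disclosed model: `predict_fn`, the `MASK` sentinel, and the commit-by-confidence schedule are all hypothetical; the draft starts fully masked and each denoising step commits the model's most confident predictions.

```python
# Hypothetical sketch of reverse-diffusion draft generation: start from
# all-masked draft positions and, over `num_steps` denoising passes,
# commit the highest-confidence token predictions each pass.
# predict_fn(context, draft) -> list of (token, confidence), one per position.

MASK = -1  # sentinel for a still-masked draft position (assumed convention)

def denoise_draft(predict_fn, context, draft_len, num_steps):
    draft = [MASK] * draft_len
    per_step = max(1, draft_len // num_steps)   # tokens committed per pass
    for _ in range(num_steps):
        preds = predict_fn(context, draft)
        # rank still-masked positions by model confidence
        masked = [i for i in range(draft_len) if draft[i] == MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:
            draft[i] = preds[i][0]              # commit confident tokens
    for i in range(draft_len):                  # flush any remaining masks
        if draft[i] == MASK:
            draft[i] = predict_fn(context, draft)[i][0]
    return draft
```

Converting each committed token representation to text via the vocabulary then yields the draft sequence that the large language model verifies in parallel.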