
CN-122021564-A - Method and device for generating text

CN 122021564 A

Abstract

Embodiments of this specification provide a method for generating text that comprises a multi-round iterative process. In any round of the iteration, a draft model generates a draft token sequence for a current token sequence, where the current token sequence comprises an input token sequence of a target task and the accepted generated token sequence. The draft model comprises a plurality of first network layers, at least one of which is a dynamic computation layer that dynamically selects a subset of tokens from the current token sequence for attention computation. A target model, which has more parameters than the draft model, then verifies whether each draft token in the draft token sequence is correct, and every draft token preceding the first draft token verified to be incorrect is added to the generated token sequence; the generated token sequence is used to form the generated text of the target task. The inference speed of a large language model can thereby be effectively improved.
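For orientation, the following is a minimal sketch of one draft-then-verify round of the kind summarized above, assuming greedy decoding and hypothetical `draft_model` / `target_model` callables that map a 1-D token-id tensor to per-position logits. It illustrates the accept-every-token-before-the-first-error rule only; it is not the patented implementation.

```python
import torch

def speculative_round(draft_model, target_model, current_ids, num_draft=4):
    """One draft-then-verify round. `current_ids` is a 1-D LongTensor holding the
    input tokens of the target task plus the already accepted generated tokens."""
    # 1) Drafting: the small draft model proposes `num_draft` tokens autoregressively.
    draft_ids = current_ids.clone()
    for _ in range(num_draft):
        logits = draft_model(draft_ids)          # assumed shape: (seq_len, vocab_size)
        next_id = logits[-1].argmax().view(1)
        draft_ids = torch.cat([draft_ids, next_id])

    # 2) Verification: the large target model scores the extended sequence in a
    #    single forward pass; a draft token is accepted if it matches the target
    #    model's own greedy choice, and acceptance stops at the first mismatch.
    target_logits = target_model(draft_ids)      # assumed shape: (seq_len, vocab_size)
    accepted = current_ids
    for i in range(num_draft):
        pos = current_ids.numel() + i            # position of the i-th draft token
        if target_logits[pos - 1].argmax() == draft_ids[pos]:
            accepted = torch.cat([accepted, draft_ids[pos].view(1)])
        else:
            break                                # tokens after the first error are discarded
    return accepted
```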

Inventors

  • WANG XIMING
  • ZHU JIANGCAI
  • SHAO KAILAI
  • CHEN CHAO
  • HU HAIXIANG

Assignees

  • 支付宝(杭州)数字服务技术有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-01-12

Claims (17)

  1. A method of generating text, comprising a multi-round iterative process, wherein any round of the iteration comprises: generating, for a current token sequence, a draft token sequence by using a draft model, wherein the current token sequence comprises an input token sequence of a target task and an accepted generated token sequence, the draft model comprises a plurality of first network layers, and at least one of the plurality of first network layers is a dynamic computation layer configured to dynamically select a subset of tokens from the current token sequence for attention computation; and verifying, by using a target model, whether each draft token in the draft token sequence is correct, and adding each draft token preceding the first draft token verified to be incorrect to the generated token sequence, wherein the generated token sequence is used to form a generated text of the target task, and the target model has more parameters than the draft model.
  2. The method of claim 1, wherein the first network layer comprises a self-attention sub-layer and a multi-layer perceptron.
  3. The method of claim 1, wherein the dynamic computation layer comprises a self-attention sub-layer, a multi-layer perceptron, and a routing unit, the routing unit being configured to output an importance score for each token in the current token sequence according to the current representation of each token, and to select the k tokens with the highest scores to route to the self-attention sub-layer and the multi-layer perceptron, while the remaining tokens skip the self-attention sub-layer and the multi-layer perceptron.
  4. The method of claim 3, wherein the routing unit employs a feed-forward network or a linear network.
  5. The method of claim 1, wherein the last first network layer of the plurality of first network layers is not a dynamic computation layer, and the remaining first network layers are dynamic computation layers.
  6. The method of claim 3, wherein the plurality of first network layers include at least two dynamic computation layers, and the k values of different dynamic computation layers decrease according to the order in which they process the input data.
  7. The method of claim 1, wherein the target model comprises a plurality of second network layers, at least one of the plurality of second network layers being the dynamic computation layer.
  8. The method of claim 7, wherein the first network layer or the second network layer is a Transformer layer.
  9. The method of claim 1, wherein the input of the draft model includes a token representation of each token in the current token sequence, the token representation of any target token being a superposition of an initial representation of the target token and features of the preceding token output by a plurality of hidden layers of the target model.
  10. The method of claim 1, wherein the draft model and the target model are models obtained by pre-training, respectively.
  11. The method of claim 1, wherein the draft model is a model obtained through distillation, learning the output probability distribution or hidden states of the target model.
  12. The method of claim 3, wherein the parameters of the routing unit are trained in an end-to-end manner together with the entire draft model.
  13. The method of claim 12, wherein the training loss of the draft model comprises a cross-entropy loss between the draft token sequence and the verified generated token sequence, and a routing loss positively correlated with the number of tokens selected in the dynamic computation layer.
  14. The method of claim 1, wherein, during training, the draft model adjusts its model parameters based on the prediction losses of multiple tokens.
  15. An apparatus for generating text, configured to perform a multi-round iterative process, comprising the following units for performing any round of the iteration: a generation unit, configured to generate, for a current token sequence, a draft token sequence by using a draft model, wherein the current token sequence comprises an input token sequence of a target task and an accepted generated token sequence, and the draft model comprises a plurality of first network layers; and a verification unit, configured to verify, by using a target model, whether each draft token in the draft token sequence generated by the generation unit is correct, and to add each draft token preceding the first draft token verified to be incorrect to the generated token sequence, wherein the generated token sequence is used to form a generated text of the target task, and the target model has more parameters than the draft model.
  16. A computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any one of claims 1-14.
  17. A computing device comprising a memory and a processor, wherein executable code is stored in the memory, and the processor, when executing the executable code, implements the method of any one of claims 1-14.

Description

Method and device for generating text

Technical Field

One or more embodiments of the present specification relate to the field of computers, and more particularly, to a method and apparatus for generating text.

Background

A large language model (LLM) is a computational model built on deep learning techniques, especially neural networks, that is capable of understanding and generating natural language. Such models typically have a large number of parameters that capture complex patterns and contextual relationships in language and generate natural language text from a given text; they are widely used for natural language processing tasks such as text generation, translation, and question-answering systems. In recent years, LLMs have made breakthrough progress in the field of natural language processing and have demonstrated strong capabilities in various applications. However, as model scale continues to expand, the computational resources and time required for inference also increase dramatically, which has become a major bottleneck for the widespread deployment and real-time application of LLMs. To address this problem, researchers have proposed various inference acceleration techniques, of which speculative decoding is a representative method. In the prior art, speculative decoding involves a trade-off between performance and efficiency, so the inference speed of a large language model cannot be effectively improved.

Disclosure of Invention

One or more embodiments of the present specification describe a method and apparatus for generating text that can effectively improve the inference speed of a large language model.

In a first aspect, a method of generating text is provided, comprising a multi-round iterative process, wherein any round of the iteration comprises: generating, for a current token sequence, a draft token sequence by using a draft model, wherein the current token sequence comprises an input token sequence of a target task and an accepted generated token sequence, the draft model comprises a plurality of first network layers, and at least one of the plurality of first network layers is a dynamic computation layer used to dynamically select a subset of tokens from the current token sequence for attention computation; and verifying, by using a target model, whether each draft token in the draft token sequence is correct, and adding each draft token preceding the first draft token verified to be incorrect to the generated token sequence, wherein the generated token sequence is used to form a generated text of the target task, and the target model has more parameters than the draft model.

In one possible implementation, the first network layer includes a self-attention sub-layer and a multi-layer perceptron.

In one possible implementation, the dynamic computation layer comprises a self-attention sub-layer, a multi-layer perceptron, and a routing unit; the routing unit outputs an importance score for each token according to the current representation of each token in the current token sequence, selects the k tokens with the highest scores to route to the self-attention sub-layer and the multi-layer perceptron, and the remaining tokens skip the self-attention sub-layer and the multi-layer perceptron. Further, the routing unit employs a feed-forward network or a linear network.
In one possible implementation, the last first network layer of the plurality of first network layers is not a dynamic computation layer, and the remaining first network layers are dynamic computation layers. Further, the plurality of first network layers include at least two dynamic computation layers, and the k values of different dynamic computation layers decrease according to the order in which they process the input data.

In one possible implementation, the target model comprises a plurality of second network layers, and at least one second network layer of the plurality of second network layers is the dynamic computation layer. Further, the first network layer or the second network layer is a Transformer layer.

In one possible implementation, the input of the draft model includes a token representation of each token in the current token sequence, and the token representation of any target token is a superposition of an initial representation of the target token and features of the preceding token output by a plurality of hidden layers of the target model.

In one possible implementation, the draft model and the target model are models obtained by pre-training, respectively. In one possible embodiment, the draft model is a model obtained through distillation, learning the output probability distribution or hidden states of the target model. Further, the parameters of the routing unit are trained in an end-to-end manner together with the entire draft model. Further, the training loss of the draft model comprises a cross-entropy loss between the draft token sequence and the verified generated token sequence, and a routing loss positively correlated with the number of tokens selected in the dynamic computation layer.
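The training objective sketched below combines, as described above, a cross-entropy loss on the draft tokens against the verified generated tokens with a routing term. Using the mean sigmoid of the router scores as a soft proxy for the number of routed tokens, and the weighting factor lambda_route, are illustrative assumptions rather than details from the specification.

```python
import torch
import torch.nn.functional as F

def draft_training_loss(draft_logits, target_ids, routing_scores, lambda_route=0.01):
    """Sketch of the combined draft-model training loss.

    draft_logits:   (batch, seq, vocab) logits produced by the draft model
    target_ids:     (batch, seq) verified generated tokens used as labels
    routing_scores: list of (batch, seq) raw router scores, one per dynamic layer
    """
    # Cross-entropy between draft predictions and the verified token sequence.
    ce = F.cross_entropy(draft_logits.flatten(0, 1), target_ids.flatten())
    # Soft routing penalty: grows with how many tokens the routers tend to select.
    route = torch.stack([torch.sigmoid(s).mean() for s in routing_scores]).sum()
    return ce + lambda_route * route
```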