WO-2026092843-A1 - PARAMETER-FREE METHOD FOR EFFICIENT AND ACCURATE LLM INFERENCE ACCELERATION VIA SPECULATIVE DECODING
Abstract
In some examples, apparatus and methods are provided for selecting a draft token sequence for verification by using a large language model, LLM. Different sources of statistics on text data (prompt, generated output, large dataset of text data) can be utilized in order to choose candidates to use for speculative decoding via look-ups.
Inventors
- MARZOLLO, Michele
- MUELLER, Lorenz
- ZHUANG, Jiawei
- ROEMER, Niklas
- CAVIGELLI, Lukas
Assignees
- HUAWEI TECHNOLOGIES CO., LTD.
Dates
- Publication Date
- 20260507
- Application Date
- 20241030
Claims (15)
- 1. Apparatus comprising: at least one processor; and at least one memory including computer program code (902) for selecting a draft token sequence for verification by using a large language model, LLM, wherein a token comprises a text portion, the at least one memory and computer program code being configured to, with the at least one processor, cause the apparatus to: select an input token sequence from an input for the LLM; generate a first set of candidate tokens, wherein the first set of candidate tokens are selected from a first datastore on the basis of the input token sequence, wherein the first datastore comprises a corpus of text data; generate a second set of candidate tokens, wherein the second set of candidate tokens are selected from a second datastore on the basis of the input token sequence, wherein the second datastore comprises data representing the input for the LLM and a set of verified token sequences, wherein the set of verified token sequences comprises a set of tokens previously output by the LLM in response to the input for the LLM; select, from the first and second sets of candidate tokens, first and second subsets of candidate tokens, wherein the first and second subsets of candidate tokens comprise tokens having a probability of representing a continuation to the input for the LLM above a predefined threshold value; generate a tree data structure using a combination of the first and second subsets of candidate tokens, wherein the tree data structure comprises a set of nodes, each node comprising a token of the first and second subsets of candidate tokens, wherein at least a subset of the nodes are logically linked to one another with edges, wherein each edge represents a probability value associated with a continuation from one node to the next as the tree data structure is traversed from a root node of the tree data structure; and select the draft token sequence for input to the LLM by using the tree data structure.
- 2. The apparatus as claimed in claim 1, wherein the at least one memory and computer program code are configured to, with the at least one processor, further cause the apparatus to: assign a first weight to at least a proportion of the corpus of text data of the first datastore; assign a second weight to at least a proportion of the data representing the input for the LLM and the set of verified token sequences of the second datastore; and modify the probability associated with the first and second subsets of candidate tokens on the basis of the first and second weights.
- 3. The apparatus as claimed in claim 1 or 2, wherein the at least one memory and computer program code are configured to, with the at least one processor, further cause the apparatus to: provide, to the LLM, the generated tree structure and a corresponding mask, wherein the corresponding mask comprises an indication of valid token selections from the tree data structure for use by the LLM.
- 4. The apparatus as claimed in any preceding claim, wherein the at least one memory and computer program code are configured to, with the at least one processor, further cause the apparatus to: generate, using the corpus of text data of the first datastore, a suffix array comprising a sorted array of all suffixes, wherein a continuation of a selected prefix is constructed by sampling over continuations of the prefix found in the suffix array.
- 5. The apparatus as claimed in any preceding claim, wherein the at least one memory and computer program code are configured to, with the at least one processor, further cause the apparatus to: organize the data representing the input for the LLM and the set of verified token sequences of the second datastore into a second tree data structure comprising nodes connected by edges, wherein each node comprises a prefix corresponding to a portion of text of the input for the LLM and the set of verified token sequences and wherein each edge represents a number of occurrences of the portion of text of the input for the LLM and the set of verified token sequences of the second datastore for a given path of the second tree data structure.
- 6. The apparatus as claimed in claim 4 or 5, wherein the at least one memory and computer program code are configured to, with the at least one processor, further cause the apparatus to: match the input token sequence by searching, in at least one of the first and second datastores, for a set of prefix continuations, wherein a probability of a prefix continuation comprises a measure for the number of occurrences of a prefix and a continuation for the prefix, divided by the number of occurrences of the prefix.
- 7. A method for selecting a draft token sequence for verification using a large language model, LLM, wherein a token comprises a text portion, the method comprising: generating an input token sequence from an input for the LLM; generating a first set of candidate tokens, wherein the first set of candidate tokens are selected from a first datastore on the basis of the input token sequence, wherein the first datastore comprises a corpus of text data; generating a second set of candidate tokens, wherein the second set of candidate tokens are selected from a second datastore on the basis of the input token sequence, wherein the second datastore comprises data representing the input for the LLM and a set of verified token sequences, wherein the set of verified token sequences comprises a set of tokens previously output by the LLM in response to the input for the LLM; selecting, from the first and second sets of candidate tokens, first and second subsets of candidate tokens, wherein the first and second subsets of candidate tokens comprise tokens having a probability of representing a continuation to the input for the LLM above a predefined threshold value; generating a tree data structure using a combination of the first and second subsets of candidate tokens, wherein the tree data structure comprises a set of nodes, each node comprising a token of the first and second subsets of candidate tokens, wherein at least a subset of the nodes are logically linked to one another with edges, wherein each edge represents a probability value associated with a continuation from one node to the next as the tree data structure is traversed from a root node of the tree data structure; and selecting the draft token sequence for input to the LLM by using the tree data structure.
- 8. The method as claimed in claim 7, further comprising: assigning a first weight to at least a proportion of the corpus of text data of the first datastore; assigning a second weight to at least a proportion of the data representing the input for the LLM and the set of verified token sequences of the second datastore; and on the basis of the first and second weights, modifying the probability associated with the first and second subsets of candidate tokens.
- 9. The method as claimed in claim 7 or 8, further comprising: providing, to the LLM, the generated tree structure and a corresponding mask, wherein the corresponding mask comprises an indication of valid token selections from the tree data structure for use by the LLM.
- 10. The method as claimed in any of claims 7 to 9, further comprising: generating, using the corpus of text data of the first datastore, a suffix array comprising a sorted array of all suffixes from which the tree of a selected prefix is constructed by sampling over continuations of the prefix found in the suffix array.
- 11. The method as claimed in any of claims 7 to 10, further comprising: organizing the data representing the input for the LLM and the set of verified token sequences of the second datastore into a second tree data structure comprising nodes connected by edges, wherein each node comprises a prefix corresponding to a portion of text of the input for the LLM and the set of verified token sequences and wherein each edge represents a number of occurrences of the portion of text of the input for the LLM and the set of verified token sequences of the second datastore for a given path of the second tree data structure.
- 12. The method as claimed in claim 10 or 11, further comprising: matching the input token sequence by searching, in at least one of the first and second datastores, for a set of prefix continuations, wherein a probability of a prefix continuation comprises a measure for the number of occurrences of a prefix and a continuation for the prefix, divided by the number of occurrences of the prefix.
- 13. A computer program stored on a non-transitory medium and including code instructions, which, when executed on one or more processors, cause the one or more processors to execute a method for selecting a draft token sequence for verification using a large language model, LLM, wherein a token comprises a text portion, the method comprising: selecting an input token sequence from an input for the LLM; generating a first set of candidate tokens, wherein the first set of candidate tokens are selected from a first datastore on the basis of the input token sequence, wherein the first datastore comprises a corpus of text data; generating a second set of candidate tokens, wherein the second set of candidate tokens are selected from a second datastore on the basis of the input token sequence, wherein the second datastore comprises data representing the input for the LLM and a set of verified token sequences, wherein the set of verified token sequences comprises a set of tokens previously output by the LLM in response to the input for the LLM; selecting, from the first and second sets of candidate tokens, first and second subsets of candidate tokens, wherein the first and second subsets of candidate tokens comprise tokens having a probability of representing a continuation to the input for the LLM above a predefined threshold value; generating a tree data structure using a combination of the first and second subsets of candidate tokens, wherein the tree data structure comprises a set of nodes, each node comprising a token of the first and second subsets of candidate tokens, wherein at least a subset of the nodes are logically linked to one another with edges, wherein each edge represents a probability value associated with a continuation from one node to the next as the tree data structure is traversed from a root node of the tree data structure; and selecting the draft token sequence for input to the LLM by using the tree data structure.
- 14. The computer program as claimed in claim 13, wherein the code instructions, which, when executed on one or more processors, cause the one or more processors to execute a method, further comprising: assigning a first weight to at least a proportion of the corpus of text data of the first datastore; assigning a second weight to at least a proportion of the data representing the input for the LLM and the set of verified token sequences of the second datastore; and on the basis of the first and second weights, modifying the probability associated with the first and second subsets of candidate tokens.
- 15. The computer program as claimed in claim 13 or 14, wherein the code instructions, which, when executed on one or more processors, cause the one or more processors to execute a method, further comprising: providing, to the LLM, the generated tree structure and a corresponding mask, wherein the corresponding mask comprises an indication of valid token selections from the tree data structure for use by the LLM.
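The claimed selection of candidate tokens (claims 1, 6 and 12) can be illustrated with a small sketch. The sketch below is illustrative only and not the claimed implementation: it estimates the probability of a prefix continuation as the number of occurrences of the prefix followed by a given token, divided by the number of occurrences of the prefix; filters candidates against a threshold; merges candidates from a static corpus and a dynamic (input/output) store into a tree whose edges carry continuation probabilities; and greedily extracts a draft sequence. All function names are hypothetical.

```python
from collections import defaultdict

def continuation_probs(store_tokens, prefix):
    """Estimate P(next token | prefix) as count(prefix + token) / count(prefix),
    by scanning the token sequence (a suffix array would do this faster)."""
    n, m = len(store_tokens), len(prefix)
    counts = defaultdict(int)
    prefix_count = 0
    for i in range(n - m):
        if store_tokens[i:i + m] == prefix:
            prefix_count += 1
            counts[store_tokens[i + m]] += 1
    if prefix_count == 0:
        return {}
    return {tok: c / prefix_count for tok, c in counts.items()}

def select_candidates(probs, threshold):
    """Keep only tokens whose continuation probability exceeds the threshold."""
    return {tok: p for tok, p in probs.items() if p > threshold}

def build_draft_tree(static_store, dynamic_store, prefix, threshold, depth):
    """Grow a tree of draft tokens; each node holds a token and each edge
    carries the continuation probability, merged over both datastores."""
    if depth == 0:
        return {}
    merged = dict(select_candidates(continuation_probs(static_store, prefix),
                                    threshold))
    dyn = select_candidates(continuation_probs(dynamic_store, prefix), threshold)
    for tok, p in dyn.items():
        merged[tok] = max(merged.get(tok, 0.0), p)
    return {tok: (p, build_draft_tree(static_store, dynamic_store,
                                      prefix + [tok], threshold, depth - 1))
            for tok, p in merged.items()}

def best_draft_sequence(tree):
    """Traverse the tree greedily from the root to pick one draft sequence."""
    seq = []
    while tree:
        tok, (p, subtree) = max(tree.items(), key=lambda kv: kv[1][0])
        seq.append(tok)
        tree = subtree
    return seq
```

In a real system the draft tree (together with the mask of claim 3) would be verified by the LLM in a single forward pass; here the tree is built and traversed in isolation to show the data structure.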
Description
PARAMETER-FREE METHOD FOR EFFICIENT AND ACCURATE LLM INFERENCE ACCELERATION VIA SPECULATIVE DECODING

TECHNICAL FIELD

The present disclosure relates, in general, to inference in large language models (LLMs). More specifically, although not exclusively, the present disclosure relates to speculative decoding in natural language processing.

BACKGROUND

As the demand for LLMs increases, they become ever more complex, with (currently) hundreds of billions of parameters. During the inference phase of an LLM, output text is generated by the model in response to a text input query. In order to reduce the latency between an input query and generation of the corresponding output, which increases as the complexity of the underlying models increases, various techniques can be implemented. One such technique is speculative decoding, which rests on the assumption that, when generating the next token in response to an input query, the choice of the token to pick from a vocabulary (the set of all possible tokens) is often simple and should therefore not require the full complexity of the LLM to be determined. Speculative decoding consists of predicting more than one token at a time, by proposing multiple tokens and using any free computing power to verify, in parallel, the proposed tokens.

SUMMARY

An objective of the present disclosure is to provide a parameter-free method and apparatus for efficient and accurate LLM inference acceleration via speculative decoding. The foregoing and other objectives are achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the Figures.
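The propose-then-verify structure of speculative decoding described above can be sketched as follows. This is a minimal illustration, not the disclosed method: `target_next` stands in for the full LLM's next-token choice, `propose_draft` for any cheap drafting source, and verification is shown sequentially although in practice the draft tokens are checked in a single parallel pass.

```python
def speculative_decode(target_next, propose_draft, prompt, max_new):
    """Propose-then-verify loop: a cheap drafter proposes several tokens,
    the target model verifies them (sequentially here for clarity; in
    parallel in a real system), and the verified prefix is kept plus one
    token always emitted by the target itself."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        for tok in propose_draft(out):
            if target_next(out) == tok:  # draft token matches the target's choice
                out.append(tok)
            else:
                break  # first mismatch: discard the rest of the draft
        out.append(target_next(out))  # the target always contributes one token
    return out[len(prompt):len(prompt) + max_new]
```

When the drafter is accurate, several tokens are accepted per target-model invocation, which is where the latency reduction comes from.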
A first aspect of the present disclosure provides an apparatus comprising at least one processor, and at least one memory including computer program code (902) for selecting a draft token sequence for verification by using a large language model, LLM, wherein a token comprises a text portion, the at least one memory and computer program code being configured to, with the at least one processor, cause the apparatus to select an input token sequence from an input for the LLM, generate a first set of candidate tokens, wherein the first set of candidate tokens are selected from a first datastore on the basis of the input token sequence, wherein the first datastore comprises a corpus of text data, generate a second set of candidate tokens, wherein the second set of candidate tokens are selected from a second datastore on the basis of the input token sequence, wherein the second datastore comprises data representing the input for the LLM and a set of verified token sequences, wherein the set of verified token sequences comprises a set of tokens previously output by the LLM in response to the input for the LLM, select, from the first and second sets of candidate tokens, first and second subsets of candidate tokens, wherein the first and second subsets of candidate tokens comprise tokens having a probability of representing a continuation to the input for the LLM above a predefined threshold value, generate a tree data structure using a combination of the first and second subsets of candidate tokens, wherein the tree data structure comprises a set of nodes, each node comprising a token of the first and second subsets of candidate tokens, wherein at least a subset of the nodes are logically linked to one another with edges, wherein each edge represents a probability value associated with a continuation from one node to the next as the tree data structure is traversed from a root node of the tree data structure, and select the draft token sequence for input to the LLM by 
using the tree data structure. Using different sources of statistics on text data (such as an input prompt (or part thereof), generated output, or a large dataset of text data), candidates for speculative decoding (SD) can be chosen via look-ups. Different sources are good at predicting tokens in different settings, and using the input together with a large dataset provides almost complementary prediction capabilities, boosting the acceptance rate of candidate tokens. In an example, a calibration method for combining the data sources to achieve the best acceptance rate is provided. The calibration method is easy to run, and gives better acceptance rates than other approaches. As candidates coming from the large dataset can be stored on disk for fast retrieval (or subsampled if using a suffix array implementation), the cost of retrieving candidates, usually a disadvantage of SD, is reduced. That is, according to an example, the cost becomes negligible. In an implementation of the first aspect, the at least one memory and computer program code can be configured to, with the at least one processor, further cause the apparatus to assign a first weight to at least a proportion of the corpus of text data of the first datastore, assign a second weight to at least a proportion of the data representing the input for the LLM and
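The per-source weighting just introduced (and claimed in claims 2, 8 and 14) can be sketched as a weighted combination of the candidate probabilities from the two datastores, followed by renormalization. This is one plausible reading, not the disclosed calibration method; the function name and the specific combination rule are illustrative assumptions.

```python
def weighted_merge(static_probs, dynamic_probs, w_static, w_dynamic):
    """Combine candidate-token probabilities from the static corpus and the
    dynamic (input/output) datastore using per-source weights, then
    renormalize so the merged values again sum to one."""
    combined = {}
    for tok, p in static_probs.items():
        combined[tok] = combined.get(tok, 0.0) + w_static * p
    for tok, p in dynamic_probs.items():
        combined[tok] = combined.get(tok, 0.0) + w_dynamic * p
    total = sum(combined.values())
    return {tok: v / total for tok, v in combined.items()}
```

Calibrating the weights on held-out generations (picking whichever pair maximizes the acceptance rate) would match the "easy to run" calibration the description alludes to, though the document does not specify the procedure here.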