US-12619821-B2 - Expediting generative token production using speculative sampling, added guidance, and language models of different capacities

US 12619821 B2

Abstract

A technique accelerates the generative production of tokens using a target language model that operates in cooperation with a draft language model. The target language model is more capable, but slower, compared to the draft language model. In operation, the draft language model transforms prompt tokens into draft tokens. The target language model edits the draft tokens, e.g., by selecting zero, one, or more of the draft tokens, and by also predicting a next token to follow the draft token(s) (if any) that are selected. Further, the target language model produces guidance vector information. In a subsequent cycle, the draft language model uses the guidance vector information to produce an updated set of draft tokens. The guidance vector information informs the draft language model of the embedding space being used by the target language model. This achieves a more effective cooperative relationship between the two models.

Inventors

  • Ayyoob IMANIGOOGHARI
  • Mohsen Fayyaz
  • Eric Chris Wolfgang SOMMERLADE

Assignees

  • MICROSOFT TECHNOLOGY LICENSING, LLC

Dates

Publication Date
2026-05-05
Application Date
2023-12-22

Claims (20)

  1. A method for accelerating generation of output tokens using a target language model, which operates in cooperation with a draft language model, comprising: receiving, by the target language model, a set of draft tokens produced by the draft language model based on, at least in part, prompt tokens provided to the draft language model; wherein the target language model and the draft language model are two different neural networks, and wherein the target language model has more parameters and is more accurate compared to the draft language model, and wherein the target language model consumes more memory and processing resources compared to the draft language model, and wherein the target language model is slower in operation compared to the draft language model; producing, using a first head neural network of the target language model, one or more target output tokens based on the prompt tokens and the set of draft tokens, the one or more target output tokens including zero, one, or more draft tokens chosen from among the set of draft tokens, and an additional target output token which is predicted by the target language model to follow the zero, one, or more draft tokens that are selected; generating, using a second head neural network of the target language model, guidance vector information based on the prompt tokens and the set of draft tokens, the second head neural network being different than the first head neural network; and forwarding the one or more target output tokens and the guidance vector information to the draft language model, a combination of the prompt tokens, the one or more target output tokens, and the guidance vector information being used by the draft language model as input tokens to be transformed into updated draft tokens.
  2. The method of claim 1, wherein the target language model applies an attention operation to input tokens that are input to the target language model, and the draft language model applies an attention operation to input tokens that are input to the draft language model.
  3. The method of claim 1, wherein a server system implements the target language model and a local system implements the draft language model, the server system being accessible to the local system via a computer network.
  4. The method of claim 1, wherein both the target language model and the draft language model are implemented by a same system.
  5. The method of claim 1, wherein the producing accepts a particular draft token of the zero, one, or more draft tokens upon determining that a probability generated by the target language model for the particular draft token is greater than a probability generated by the draft language model for the particular draft token.
  6. The method of claim 1, wherein the target language model operates by: transforming the prompt tokens and the set of draft tokens to hidden state information using a base target language model; transforming the hidden state information to output token probability information using the first head neural network, on the basis of which the one or more target output tokens are produced; and transforming the hidden state information to the guidance vector information using the second head neural network.
  7. The method of claim 1, further comprising producing the one or more target output tokens in a single forward pass of the target language model.
  8. The method of claim 1, wherein the draft tokens in the set of draft tokens are produced auto-regressively by the draft language model.
  9. The method of claim 1, further comprising transforming, using the draft language model, the combination of the prompt tokens, the one or more target output tokens, and the guidance vector information to the updated draft tokens, wherein the target language model has been trained based on a first loss measure that depends on a difference between first ground-truth information and the one or more target output tokens, and a second loss measure that depends on a difference between second ground-truth information and the updated draft tokens.
  10. The method of claim 9, wherein the draft language model has also been trained based on the second loss measure.
  11. The method of claim 9, wherein the first ground-truth information is text that is manually specified by a human reviewer as being correct, and wherein the second ground-truth information is text auto-regressively generated by the target language model.
  12. The method of claim 1, further comprising switching to a mode in which the target language model is asked by the draft language model to generate an instance of guidance vector information for initial prompt tokens, and wherein generation of output tokens thereafter takes place based on the guidance vector information using the draft language model independent of interaction with the target language model.
  13. A computing system for using a draft language model to accelerate generation of output tokens using a target language model, comprising: an instruction data store for storing computer-readable instructions; and a processing system for executing the computer-readable instructions in the instruction data store, to perform operations including: receiving a set of target output tokens produced by the target language model, and guidance vector information produced by the target language model, wherein a combination of the prompt tokens, the set of target output tokens, and the guidance vector information comprise input tokens; transforming the input tokens into draft tokens, wherein the target language model and the draft language model are two different neural networks, and wherein the draft language model has fewer parameters and is less accurate compared to the target language model, and wherein the draft language model consumes less memory and processing resources compared to the target language model, and wherein the draft language model is faster in operation compared to the target language model; and sending the draft tokens to the target language model, for use by the target language model in producing an updated set of target output tokens using a first head neural network and updated guidance vector information using a second head neural network that is different than the first head neural network.
  14. The computing system of claim 13, wherein the updated set of target output tokens are produced by the target language model by selecting from among the draft tokens.
  15. The computing system of claim 13, wherein the operations further comprise switching to a mode in which the target language model is asked by the draft language model to generate an instance of guidance vector information for initial prompt tokens, and wherein generation of output tokens thereafter takes place based on the guidance vector information using the draft language model independent of interaction with the target language model.
  16. A computer-readable storage medium for storing computer-readable instructions, a processing system executing the computer-readable instructions to perform operations, the operations comprising: receiving, by a target language model, a set of draft tokens produced by a draft language model based on, at least in part, prompt tokens provided to the draft language model; wherein the target language model and the draft language model are two different neural networks, and wherein the target language model has more parameters and is more accurate compared to the draft language model, and wherein the target language model consumes more memory and processing resources compared to the draft language model, and wherein the target language model is slower in operation compared to the draft language model; producing, using a first head neural network of the target language model, one or more target output tokens based on the prompt tokens and the set of draft tokens, the one or more target output tokens including zero, one, or more draft tokens chosen from among the set of draft tokens, and an additional target output token which is predicted by the target language model to follow the zero, one, or more draft tokens that are selected; generating, using a second head neural network of the target language model, guidance vector information based on the prompt tokens and the set of draft tokens, wherein the second head neural network is different than the first head neural network, and wherein a combination of the prompt tokens, the set of target output tokens, and the guidance vector information comprise input tokens; and transforming, using the draft language model, the input tokens to updated draft tokens, wherein the target language model has been trained based on a first loss measure that depends on a difference between first ground-truth information and the one or more target output tokens, and a second loss measure that depends on a difference between second ground-truth information and the updated draft tokens.
  17. The computer-readable storage medium of claim 16, wherein the draft language model has also been trained based on the second loss measure.
  18. The computer-readable storage medium of claim 16, wherein the first ground-truth information is text that is manually specified by a human reviewer as being correct, and wherein the second ground-truth information is text auto-regressively generated by the target language model.
  19. The method of claim 12, wherein the mode is selected automatically based on one or more factors, as expressed in one or more input signals.
  20. The computer-readable storage medium of claim 16, wherein the target language model operates by: transforming the prompt tokens and the set of draft tokens to hidden state information using a base target language model; transforming the hidden state information to output token probability information using the first head neural network, on the basis of which the one or more target output tokens are produced; and transforming the hidden state information to the guidance vector information using the second head neural network.
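The token-editing step recited in claims 1 and 5 can be sketched in a few lines. The following is an illustrative toy, not the patented implementation: the per-position probability tables and the greedy choice of the additional target output token are assumptions made purely for the example.

```python
def edit_draft_tokens(draft_tokens, p_draft, p_target):
    """Toy sketch of the editing step: accept each draft token only while
    the target model's probability for it exceeds the draft model's
    probability (claim 5), then append one additional token predicted by
    the target model (claim 1). p_target must contain
    len(draft_tokens) + 1 distributions: one extra for the position that
    follows a fully accepted run."""
    accepted = []
    for position, token in enumerate(draft_tokens):
        if p_target[position].get(token, 0.0) > p_draft[position].get(token, 0.0):
            accepted.append(token)
        else:
            break  # reject this draft token and everything after it
    # The additional target output token that follows the accepted prefix.
    next_distribution = p_target[len(accepted)]
    next_token = max(next_distribution, key=next_distribution.get)
    return accepted + [next_token]
```

For instance, if the target model agrees with the first draft token but assigns the second a lower probability than the draft model did, the edited output is the one-token accepted prefix plus the target model's own prediction for the next position.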

Description

BACKGROUND

Large language models use a large number of parameters. In some cases, for instance, a large language model includes several billion parameters. The latency of a language model grows with its size. As a consequence, a large language model may fail to provide a response in a sufficiently timely manner to satisfy the demands of some applications. The technical literature has proposed the use of smaller language models. But reducing the size of a language model also negatively impacts the model's ability to understand a broad range of queries. One way of addressing this constraint is by fine-tuning the smaller language model to perform particular tasks of interest to a user. This manner of customizing a language model, however, is resource-intensive and time-consuming, and does not yield a language model that is capable of satisfactorily performing other tasks for which it was not fine-tuned.

SUMMARY

A technique is described herein for accelerating the generative production of tokens using a target language model that operates in cooperation with a draft language model. The target language model is more capable, but slower, compared to the draft language model. In operation, the draft language model transforms prompt tokens into draft tokens. The target language model edits the draft tokens, e.g., by selecting zero, one, or more of the draft tokens, and by also predicting a next token to follow the draft token(s) (if any) that are selected. Further, the target language model produces guidance vector information. In a subsequent cycle, the draft language model uses the guidance vector information to produce an updated set of draft tokens. The guidance vector information gives the draft language model insight into the embedding space used by the target language model, and therefore enables more effective interaction between these two models.
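The two-head arrangement recited in claims 6 and 20 (a base target model that produces hidden state information, with one head producing token probabilities and a second head producing guidance vectors) can be sketched with NumPy. All sizes, weight matrices, and the tanh/softmax layers below are illustrative assumptions; a real target model would be a full transformer with billions of parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN, GUIDE = 8, 4, 3  # illustrative sizes, not from the patent

# Stand-in parameters for the base target model and its two heads.
W_base = rng.normal(size=(HIDDEN, HIDDEN))
W_token_head = rng.normal(size=(HIDDEN, VOCAB))     # first head neural network
W_guidance_head = rng.normal(size=(HIDDEN, GUIDE))  # second head neural network

def target_forward(input_embeddings):
    # Base target language model: input tokens -> hidden state information.
    hidden = np.tanh(input_embeddings @ W_base)
    # First head: hidden states -> output token probability information.
    logits = hidden @ W_token_head
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    # Second head: the same hidden states -> guidance vector information.
    guidance = hidden @ W_guidance_head
    return probs, guidance

embeddings = rng.normal(size=(5, HIDDEN))  # five input token embeddings
probs, guidance = target_forward(embeddings)
```

The point of the sketch is that both outputs are computed from the same hidden states in one forward pass: the token-probability head supports editing the draft tokens, while the separate guidance head exposes the target model's embedding space to the draft model.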
In many cases, the target language model will confirm at least some of the draft tokens produced by the draft language model as being correct. This allows the technique to adopt these tokens without the time-intensive need to auto-regressively generate them using the target language model. Although the confirmed tokens have been auto-regressively generated using the draft language model, it takes considerably less time to produce tokens using the draft language model compared to the target language model. Thus, overall, the technique reduces the amount of time that is required to generatively produce tokens.

According to some implementations, the target language model is provided by a server system, and each instantiation of the draft language model is provided by a local system, such as a user computing device. In other implementations, a single system implements both the target language model and the draft language model.

According to some implementations, the target language model edits the draft tokens in a single pass. The draft language model, on the other hand, auto-regressively produces the draft tokens.

According to some implementations, in another mode of operation, the target language model generates only a single instance of guidance vector information at the outset of the processing of a query. The draft language model uses the guidance vector information to produce output tokens without asking the target language model to perform verification via token editing. This mode is faster than the above-summarized mode, but produces output tokens of lower quality.

The above-summarized technology is capable of being manifested in various types of systems, devices, components, methods, computer-readable storage media, data structures, graphical user interface presentations, articles of manufacture, and so on.
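The cooperative cycle summarized above can be sketched end to end with toy stand-ins for both models. Everything below is a contrived assumption made only so the loop runs: the four-token vocabulary, the OFFSET constant (which plays the role of knowledge held only by the target model and communicated to the draft model as a one-value "guidance vector"), and the modular-arithmetic "models" themselves.

```python
VOCAB = ["a", "b", "c", "d"]
OFFSET = 2  # stands in for knowledge held only by the target model

def draft_generate(context, guidance, k):
    # The fast draft model proposes k tokens, one per position, using the
    # guidance it last received (a real draft model would produce these
    # auto-regressively).
    return [VOCAB[(len(context) + i + guidance) % len(VOCAB)] for i in range(k)]

def target_edit(context, draft_tokens):
    # Single-pass verification: the slow target model compares the draft
    # tokens against its own preferred continuation, keeps the matching
    # prefix, predicts one additional token, and emits fresh guidance.
    preferred = [VOCAB[(len(context) + i + OFFSET) % len(VOCAB)]
                 for i in range(len(draft_tokens) + 1)]
    accepted = []
    for token, ref in zip(draft_tokens, preferred):
        if token != ref:
            break  # reject this draft token and everything after it
        accepted.append(token)
    return accepted, preferred[len(accepted)], OFFSET

context, guidance, accepted_counts = ["a"], 0, []
for _ in range(3):  # three cooperative cycles
    drafts = draft_generate(context, guidance, k=3)
    accepted, next_token, guidance = target_edit(context, drafts)
    accepted_counts.append(len(accepted))
    context = context + accepted + [next_token]
```

In the first cycle the draft model guesses without guidance and every draft token is rejected; once the target model has communicated its offset, subsequent drafts are accepted in full, which is the acceleration the technique aims for.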
This Summary is provided to introduce a selection of concepts in a simplified form; these concepts are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 shows a method of generating tokens using speculative sampling (also referred to as speculative decoding), in a mode in which the generation of guidance vector information is turned off. The method makes use of a token-generating system that includes a target language model and a smaller and quicker draft language model.
FIG. 2 shows an example of the first two phases shown in FIG. 1.
FIG. 3 shows illustrative logic for editing draft tokens produced by a draft language model.
FIG. 4 shows a method for generating tokens using speculative sampling, in a mode in which the generation of guidance vector information is turned on.
FIG. 5 shows an example of the application of the method of FIG. 4.
FIG. 6 shows a method of generating tokens in a mode in which a target language model is consulted at the outset of decoding