EP-4740130-A1 - EFFICIENT TRAINING AND UTILIZATION OF LARGE LANGUAGE MODELS

EP 4740130 A1

Abstract

Implementations relate to a method implemented by one or more processors, the method including: receiving natural language (NL) based input associated with a client device; generating, using a large language model (LLM) and based on processing the NL based input, LLM output; determining, based on the LLM output, a sequence of LLM responses, the sequence of LLM responses including at least one intermediate LLM response and a final LLM response. In some implementations, the method may further include causing the final LLM response to be rendered at the client device. In additional or alternative implementations, the method may further include storing, as an instance of training data for fine-tuning the LLM or an additional LLM, the NL based input along with the final LLM response.

Inventors

  • MISHRA, Swaroop
  • KUANG, Chenkai
  • SONG, Xinying
  • CHENG, Heng-Tze
  • CHI, Ed H.
  • LE, Quoc
  • KOTIKALAPUDI, Ragha
  • POTLURI, Sahitya
  • BOS, Taylor
  • LI, Yaguang
  • LIN, Hanzhao
  • ZHENG, Steven
  • DU, Yu
  • ZHU, Chen

Assignees

  • Google LLC

Dates

Publication Date
2026-05-13
Application Date
2024-06-24

Claims (20)

  1. A method implemented by one or more processors, the method comprising: receiving natural language (NL) based input associated with a client device; generating, using a large language model (LLM) and based on processing the NL based input, LLM output; determining, based on the LLM output, a sequence of LLM responses, the sequence of LLM responses comprising at least one intermediate LLM response and a final LLM response; and causing the final LLM response to be rendered at the client device.
  2. The method of claim 1, wherein generating the LLM output is performed using a single inference call to the LLM.
  3. The method of claim 1 or claim 2, wherein the sequence of LLM responses comprises a plurality of sequential intermediate LLM responses and the final LLM response, wherein each subsequent one of the plurality of sequential intermediate LLM responses is generated subsequent to a preceding one of the plurality of sequential intermediate LLM responses.
  4. The method of any preceding claim, further comprising: determining, based on the LLM output, a critique response of the at least one intermediate LLM response, wherein the final LLM response is generated based at least in part on the critique response of the intermediate LLM response that immediately precedes the final LLM response in the sequence of LLM responses.
  5. The method of claim 4, wherein the critique response comprises an analysis of the at least one intermediate LLM response.
  6. The method of claim 4 or claim 5, wherein the critique response comprises an indication of areas for improvement for the at least one intermediate LLM response.
  7. The method of any preceding claim, wherein generating, using the LLM and based on processing the NL based input, the LLM output comprises: generating an LLM input based on the NL based input; and processing, using the LLM, the LLM input to generate the LLM output.
  8. The method of claim 7, wherein the LLM input comprises a plurality of requests and a plurality of fields for output that are responsive to the requests, and wherein the LLM output is indicative of output that is responsive to the requests to be entered into each of the fields.
  9. The method of claim 8, wherein the plurality of requests comprises a first request based on the NL based input, at least one second request for generating the at least one intermediate LLM response, and a third request for generating the final LLM response.
  10. The method of claim 8 or claim 9, wherein the plurality of requests further comprises at least one fourth request for generating a critique response for the at least one intermediate LLM response.
  11. The method of any one of claims 7 to 10, wherein generating the LLM input is based on a predefined template.
  12. The method of claim 11, wherein generating the LLM input further comprises modifying the template based on modification data.
  13. The method of claim 12, wherein the modification data is based on one or more of: the NL based input, information associated with a user of the client device, or context data.
  14. The method of any preceding claim, further comprising: bypassing rendering of the at least one intermediate LLM response at the client device.
  15. The method of any preceding claim, further comprising: causing the sequence of LLM responses to be rendered sequentially at the client device.
  16. The method of any preceding claim, further comprising: causing reasoning information to be rendered at the client device, wherein the reasoning information is based on a critique response of the at least one intermediate LLM response determined from the LLM output.
  17. A method implemented by one or more processors, the method comprising: obtaining natural language (NL) based input; generating, using a large language model (LLM) and based on processing the NL based input, LLM output; determining, based on the LLM output, a plurality of LLM responses, the plurality of LLM responses comprising an initial LLM response and a final LLM response; and storing, as an instance of training data for fine-tuning the LLM or an additional LLM, the NL based input along with the final LLM response.
  18. The method of claim 17, wherein generating the LLM output is performed using a single inference call to the LLM.
  19. The method of claim 17 or claim 18, wherein the plurality of LLM responses comprises a plurality of sequential intermediate LLM responses and the final LLM response, wherein each subsequent one of the plurality of sequential intermediate LLM responses is generated subsequent to a preceding one of the plurality of sequential intermediate LLM responses.
  20. The method of any one of claims 17 to 19, further comprising: determining, based on the LLM output, a critique response of the at least one intermediate LLM response, wherein the final LLM response is generated based at least in part on the critique response of the intermediate LLM response that immediately precedes the final LLM response in the plurality of LLM responses.
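
Claims 7 to 13 describe assembling the LLM input from a predefined template that contains a plurality of requests, each followed by an output field the LLM is prompted to fill, so that one inference call yields the intermediate response, the critique, and the final response. The following is a minimal sketch of that prompt-assembly-and-parsing flow; the field markers, function names, and record structure are illustrative assumptions, not taken from the patent:

```python
import re

# Hypothetical template (claims 8-10): a first request carrying the user's
# NL based input, a second request for an intermediate (draft) response, a
# fourth request for a critique, and a third request for the final response,
# each followed by an empty field for the LLM to fill.
TEMPLATE = (
    "Request 1 (user input): {nl_input}\n"
    "Request 2: Write a draft response to the user input.\n"
    "[DRAFT RESPONSE]:\n"
    "Request 3: Critique the draft response above.\n"
    "[CRITIQUE]:\n"
    "Request 4: Write an improved final response using the critique.\n"
    "[FINAL RESPONSE]:\n"
)

def build_llm_input(nl_input: str, modification: str = "") -> str:
    """Instantiate the predefined template with the NL based input,
    optionally prepending modification data such as user or context
    information (claims 11-13)."""
    prompt = TEMPLATE.format(nl_input=nl_input)
    return modification + "\n" + prompt if modification else prompt

def parse_sequence(llm_output: str) -> dict:
    """Split the output of a single decoding step back into the sequence
    of responses: draft, critique, and final (claims 1 and 4)."""
    fields = {}
    pattern = r"\[(DRAFT RESPONSE|CRITIQUE|FINAL RESPONSE)\]:\n(.*?)(?=\n\[|\Z)"
    for name, text in re.findall(pattern, llm_output, flags=re.DOTALL):
        fields[name.lower().replace(" ", "_")] = text.strip()
    return fields
```

At inference time, only `fields["final_response"]` would be rendered at the client device (claim 14), while the draft and critique remain unrendered intermediate output.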

Description

EFFICIENT TRAINING AND UTILIZATION OF LARGE LANGUAGE MODELS

BACKGROUND

[0001] Large language models (LLMs) are particular types of machine learning models that can perform various natural language processing (NLP) tasks, such as language generation, machine translation, and question-answering. These LLMs are typically trained on enormous amounts of diverse data including data from, but not limited to, webpages, electronic books, software code, electronic news articles, and machine translation data. Accordingly, these LLMs leverage the underlying data on which they were trained in performing these various NLP tasks. For instance, in performing a language generation task, these LLMs can process a natural language (NL) based input that is received from a client device, and generate an NL based output that is responsive to the NL based input and that is to be rendered at the client device.

[0002] In some cases, an LLM can include millions of parameters, hundreds of millions of parameters, billions of parameters, or even one hundred billion or more parameters. As such, given the large numbers of parameters included in an LLM, performance of NLP tasks using an LLM can consume relatively large amounts of resources (e.g., in terms of computing resources used in completing the NLP task, time taken to complete performance of the NLP task, energy consumed to complete performance of the NLP task, etc.). Furthermore, again owing to the size of LLMs, it can be difficult to adequately train an LLM such that it can reliably perform a given NLP task according to that task's respective constraints. It is therefore beneficial in terms of computational resource usage for LLMs to generate responses to NL based inputs that do not necessitate additional follow-up NL based inputs.

SUMMARY

[0003] Implementations described herein can serve to reduce the number of follow-up NL based inputs that may be received by an LLM.
Although any given user may decide to provide a follow-up NL based input, any "on average" reduction in the number of follow-up NL based inputs can be hugely beneficial in terms of computational resource usage.

[0004] More specifically, some implementations described herein relate to utilizing an LLM to generate a sequence of responses to an NL based input in a single inference call. Since each of the responses is generated at least in part based on the preceding response (e.g., by virtue of an attention mechanism or other memory employed by the LLM), it can be assumed that each subsequent response is an improvement on the preceding response. Some of these implementations described herein relate to using the described techniques at inference time, to provide an improved response to a user that is responsive to the user's NL based input (e.g., such that a likelihood of the user providing a follow-up NL based input after receiving the improved response is reduced). Some additional or alternative implementations described herein relate to using the described techniques to generate training data by storing the NL based input along with an improved response that is responsive to the NL based input as an instance of training data. This training data can be used to fine-tune an LLM (e.g., such that responses generated using the fine-tuned LLM can be less likely to result in follow-up NL based inputs).

[0005] Some implementations described herein include an LLM being used to process an NL based input to generate a sequence of responses, including at least one intermediate response and a final response, in a single decoding step (or in other words, in a single inference call to the LLM). The LLM can generate the sequence of responses in a single decoding step based on being provided with an LLM input that includes the NL based input.
The LLM input can also include, for example, requests to generate the at least one intermediate response and a request to generate the final response. For instance, the LLM input can be formatted according to a template that is not provided by the user that provided the NL based input, and can be provided to the LLM along with the NL based input even without the user's knowledge. The template can include, for example, a space or entry field after each of the requests to prompt the LLM to fill the space or entry field with output that is responsive to the respective requests. Each of the responses can then be generated, in turn, using the LLM, by taking into account the preceding responses (e.g., via an attention mechanism or other memory employed by the LLM). In this way, each subsequent response can be improved, or refined, relative to the preceding response. It is to be noted that, in some cases, it can be determined that further improvement of a response is not necessary. For instance, if the appropriate response to an NL based input is either "yes" or "no", or if it is determined that a particular response (e.g., prior to the final response) is correct, then no improvement in the subsequ
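
Paragraph [0004] above describes storing the NL based input together with the improved final response as an instance of training data for fine-tuning the LLM or an additional LLM. A minimal sketch of such a storage step follows; the JSON-lines record schema and the function name are illustrative assumptions, as the patent does not specify a storage format:

```python
import json

def store_training_instance(nl_input: str, final_response: str, path: str) -> dict:
    """Store the NL based input along with the final (improved) LLM response
    as one fine-tuning example, appended here as a JSON-lines record.
    The {"input": ..., "target": ...} schema is hypothetical."""
    record = {"input": nl_input, "target": final_response}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Because only the final response is stored as the target, a model fine-tuned on such records can learn to produce the improved response directly, without the intermediate drafts and critiques appearing in the training target.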