US-12620394-B2 - Query response interface with server side generative model(s)

US 12620394 B2

Abstract

Various implementations include processing, at a client device, an instance of audio data capturing a user voice query using an automatic speech recognition model to generate a sequence of instances of tokenizable query text. In many implementations, one or more instances of the sequence can be transmitted to a remote computing system prior to generating the entire sequence. In a variety of implementations, each instance in the sequence can be processed using a generative model which includes a streaming multi-head attention portion. Responsive output can be transmitted from the remote computing system to the client device, where the client device renders the responsive output to the user. In many implementations, the time between the user speaking the user query and the client device rendering the responsive output is reduced, thus decreasing latency in the system.

Inventors

  • Dongeek Shin

Assignees

  • GOOGLE LLC

Dates

Publication Date
2026-05-05
Application Date
2024-04-22

Claims (18)

  1. A system comprising: memory storing instructions; and one or more processors that execute the instructions, stored in the memory, to: identify an instance of audio data capturing a user voice query captured via one or more microphones of a client device; generate, at the client device, a sequence of instances of tokenizable query text by processing the instance of audio data using an automatic speech recognition (ASR) model, where the sequence of instances of tokenizable query text are a text representation of the user voice query; for each instance, and in the sequence of instances of tokenizable query text: transmit the sequence of instances of tokenizable query text to a computing system remote from the client device, where one or more of the sequence of instances of tokenizable query text are transmitted from the client device to the computing system before generation of an entire sequence has completed; process the instance using a generative model (GM) at the computing system, wherein the processing the instance using the GM at the computing system comprises: update one or more portions of an input layer of the GM based on the instance; generate streamed multi-head attention output based on processing the updated one or more portions of the input layer using the GM; while continuing to receive subsequent instances in the sequence, transmitted from the client device to the computing system, generate, at the computing system, streaming output responsive to the user voice query based on processing the streamed multi-head attention output using the GM; transmit, from the computing system to the client device, one or more portions of the streaming output responsive to the user voice query; and render output based on the streaming output responsive to the user voice query via one or more user interface output devices of the client device.
  2. The system of claim 1, wherein the GM is a large language model.
  3. The system of claim 1, wherein the GM is a decoder portion of a transformer model.
  4. The system of claim 1, wherein the one or more user interface output devices of the client device are one or more speakers, and wherein rendering output based on the streaming output responsive to the user voice query via the one or more speakers of the client device comprises: generating, at the client device, output audio data responsive to the user voice query based on processing the streaming output responsive to the user voice query using a text to speech model; and rendering the output audio data responsive to the user voice query via the one or more speakers of the client device.
  5. The system of claim 1, wherein the one or more user interface output devices of the client device are one or more display devices, and wherein rendering output based on the streaming output responsive to the user voice query via the one or more display devices of the client device comprises: rendering text output based on the streaming output responsive to the user voice query via the one or more display devices of the client device.
  6. The system of claim 1, wherein the sequence of instances of tokenizable query text are a sequence of instances of words, characters, or word pieces.
  7. A method implemented by one or more processors, the method comprising: identifying an instance of audio data capturing a user voice query captured via one or more microphones of a client device; generating, at the client device, a sequence of instances of tokenizable query text by processing the instance of audio data using an automatic speech recognition model, where the sequence of instances of tokenizable query text are a text representation of the user voice query; for each instance, and in the sequence of instances of tokenizable query text: transmitting the sequence of instances of tokenizable query text to a computing system remote from the client device, where one or more of the instances are transmitted before generation of an entire sequence has completed; receiving, from the computing system, streaming output responsive to the user voice query, wherein the streaming output responsive to the user voice query is generated at the computing system by processing the sequence of instances of tokenizable query text using a generative model (GM) which includes streamed multi-head attention to generate the streaming output responsive to the user voice query; and rendering output based on the streaming output responsive to the user voice query via one or more user interface output devices of the client device.
  8. The method of claim 7, wherein the GM is a large language model.
  9. The method of claim 7, wherein the GM is a decoder portion of a transformer model.
  10. The method of claim 7, wherein the one or more user interface output devices of the client device are one or more speakers, and wherein rendering output based on the streaming output responsive to the user voice query via the one or more speakers of the client device comprises: generating, at the client device, output audio data responsive to the user voice query based on processing the streaming output responsive to the user voice query using a text to speech model; and rendering the output audio data responsive to the user voice query via the one or more speakers of the client device.
  11. The method of claim 7, wherein the one or more user interface output devices of the client device are one or more display devices, and wherein rendering output based on the streaming output responsive to the user voice query via the one or more display devices of the client device comprises: rendering text output based on the streaming output responsive to the user voice query via the one or more display devices of the client device.
  12. The method of claim 7, wherein the sequence of instances of tokenizable query text are a sequence of instances of words, characters, or word pieces.
  13. A method implemented by one or more processors, the method comprising: receiving, at a computing system remote from a client device, a sequence of instances of tokenizable query text, wherein the sequence of instances of tokenizable query text are a text representation of a user voice query generated at the client device, where the sequence of instances of tokenizable query text are generated by processing audio data capturing the user voice query using an automatic speech recognition model, and wherein one or more instances of the sequence of instances of tokenizable query text are transmitted before generation of the entire sequence at the client device has completed; for each instance, and in the sequence of instances of tokenizable query text: processing the instance using a generative model (GM), wherein the processing the instance using the GM comprises: updating one or more portions of an input layer of the GM based on the instance; generating streamed multi-head attention output based on processing the updated one or more portions of the input layer using the GM; while continuing to receive subsequent instances in the sequence, generating streaming output responsive to the user voice query based on the processing the streamed multi-head attention output using the GM; and transmitting one or more portions of the streaming output responsive to the user voice query to the client device.
  14. The method of claim 13, wherein the GM is a large language model.
  15. The method of claim 13, wherein the GM is a decoder portion of a transformer model.
  16. The method of claim 13, wherein one or more user interface output devices of the client device are one or more speakers, and wherein rendering output based on the streaming output responsive to the user voice query via the one or more speakers of the client device comprises: generating, at the client device, output audio data responsive to the user voice query based on processing the streaming output responsive to the user voice query using a text to speech model; and rendering the output audio data responsive to the user voice query via the one or more speakers of the client device.
  17. The method of claim 13, wherein the one or more user interface output devices of the client device are one or more display devices, and wherein rendering output based on the streaming output responsive to the user voice query via the one or more display devices of the client device comprises: rendering text output based on the streaming output responsive to the user voice query via the one or more display devices of the client device.
  18. The method of claim 13, wherein the sequence of instances of tokenizable query text are a sequence of instances of words, characters, or word pieces.
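
Claims 4-5 and 10-11 branch on the client device's output device type. A minimal sketch of that branch follows; the `render_output` helper and the `tts` callable (standing in for a real text-to-speech model) are hypothetical names introduced for illustration only.

```python
def render_output(streaming_output, device_type, tts=None):
    """Dispatch streamed responsive output to a client output device.

    Speakers: convert the streamed text to audio with a text-to-speech
    model at the client (claims 4 and 10). Display devices: render the
    text directly (claims 5 and 11). `tts` is a hypothetical stand-in
    for a real TTS model.
    """
    if device_type == "speakers":
        return ("audio", tts(streaming_output))
    if device_type == "display":
        return ("text", streaming_output)
    raise ValueError(f"unsupported output device: {device_type}")
```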

Description

BACKGROUND

Humans may engage in human-to-computer dialogs with interactive software applications referred to herein as “automated assistants” (also referred to as “chatbots”, “interactive personal assistants”, “intelligent personal assistants”, “personal voice assistants”, “conversational agents”, etc.). Automated assistants typically rely upon a pipeline of components for interpreting and responding to natural language (NL) based inputs received during a dialog session. Generative models (GMs), such as large language models (LLMs), are particular types of machine learning models that are trained on enormous amounts of diverse data and that can perform various natural language processing (NLP) tasks. Recent developments have integrated aspects of LLMs into this pipeline of components for interpreting and responding to the NL based inputs. Generally, a dialog session with an automated assistant that is integrated with aspects of LLMs is initiated by a user providing a NL based input, and the automated assistant can generate a response to the NL based inputs using the aforementioned pipeline of components.

SUMMARY

Techniques described herein are directed towards reducing latency during a dialog session between a user and a client device, where at least part of the output responsive to a user voice query is generated using a generative model at a computing system remote from the client device. For example, a user can speak a user voice query, where audio data (e.g., audio data capturing the user voice query) can be captured via one or more microphones of a client device (e.g., a mobile phone). In some implementations, the audio data can be processed using an automatic speech recognition (ASR) model to generate a text representation of the user voice query.
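
The client/server flow described above can be sketched with plain Python generators. Everything here (the function names, the "partial"/"final" output format) is illustrative, not the patent's implementation; the point is only that the server begins producing output while query-text instances are still arriving.

```python
def client_stream(asr_instances):
    # Client side: each tokenizable instance is transmitted as soon as
    # the ASR model emits it, before the full transcription exists.
    for instance in asr_instances:
        yield instance

def server_generate(incoming):
    # Server side: the generative model's input is updated per instance,
    # and streamed output can be produced while instances still arrive.
    received = []
    for instance in incoming:
        received.append(instance)
        yield f"partial:{len(received)}"    # streamed responsive output
    yield "final:" + " ".join(received)     # after the full sequence arrives

outputs = list(server_generate(client_stream(["what", "time", "is", "it"])))
```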
In some of those implementations, the ASR model can be stored locally at the client device, and processing of the audio data using the ASR model to generate the text representation of the user voice query can occur at the client device. In some implementations, the text representation of the spoken utterance can be generated as a sequence of instances of tokenizable query text. As used herein, tokenizable text is text that can be broken into smaller portions (e.g., tokens) such as sentences, words, word pieces, characters, etc. Additionally or alternatively, tokenizable text as used herein can be text represented by a sequence of instances of tokens such as a sequence of sentences, a sequence of words, a sequence of word pieces, a sequence of characters, etc. In some implementations, the instances in the sequence can be transmitted to the remote computing system after they are generated, and prior to generating the entire sequence. For example, the sequence of instances of tokenizable query text can be a sequence of characters, a sequence of words, a sequence of word pieces, or one or more additional or alternative sequences. One or more of the instances can be transmitted to the remote computing system as soon as they are generated and/or soon after they are generated, where the one or more instances of tokenizable query text in the sequence are transmitted prior to processing the entire instance of audio data using the ASR model. In existing techniques, the system waits until the entire text representation of the user query is generated before transmitting the text representation to the remote computing system (e.g., input side batching of the text representation of the user query). In some implementations, the generative model is stored at the remote computing system to support the heavy computational resources necessary to process user query text using the generative model.
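
The granularities named above (words, characters, word pieces) can be illustrated with a small sketch. The greedy longest-match word-piece routine and its toy vocabulary are assumptions for illustration in the style of WordPiece tokenizers, not the patent's tokenizer.

```python
def tokenize_words(text):
    # word-level instances of tokenizable query text
    return text.split()

def tokenize_characters(text):
    # character-level instances (spaces dropped for brevity)
    return [c for c in text if c != " "]

def tokenize_wordpieces(word, vocab):
    # greedy longest-match-first over a subword vocabulary,
    # in the style of WordPiece tokenizers (illustrative only)
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return ["[UNK]"]
        pieces.append(word[start:end])
        start = end
    return pieces
```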
However, waiting to transmit the complete text representation to the remote computing system increases latency in the system. For example, input side transcription batching can cause a bottleneck in the processing of the user query text to generate responsive output. In contrast, implementations described herein can transmit one or more instances of the sequence to the remote computing system before generation of the entire sequence at the client device. By transmitting one or more portions of the query text to the remote computing system while the client device is still processing the audio data, the computing system can begin to process the query text while the client device is concurrently processing the audio data to generate the full transcription of the user query. In some implementations, the generative model can include a streamed multi-head attention layer to process the tokenized instances of query text while the full transcription of the user query is generated at the client device. Once the remote computing system has received the entire sequence of instances of tokenizable query text, output from the streamed multi-head attention layer can be processed using one or more additional layers of the generative model to generate output responsive to the user query. The output re
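
One plausible reading of a "streamed multi-head attention" layer is incremental attention over a growing key/value cache: each arriving token instance is appended to the cache and attended against every instance received so far, without waiting for the full sequence. The pure-Python sketch below uses identity projections and tiny dimensions purely for illustration; the patent does not specify this implementation.

```python
import math

class StreamedAttention:
    """Minimal sketch of streamed multi-head attention with a key/value
    cache. Identity projections stand in for learned Wq/Wk/Wv weights;
    this is an assumed illustration, not the patent's architecture."""

    def __init__(self, n_heads, d_head):
        self.n_heads, self.d_head = n_heads, d_head
        self.k_cache, self.v_cache = [], []  # one vector per token seen

    def step(self, x):
        # x: flat embedding of the newest token, len == n_heads * d_head
        self.k_cache.append(x)
        self.v_cache.append(x)
        out = []
        for h in range(self.n_heads):
            lo = h * self.d_head
            q = x[lo:lo + self.d_head]
            # scaled dot-product scores against every cached key
            scores = [
                sum(qi * k[lo + i] for i, qi in enumerate(q)) / math.sqrt(self.d_head)
                for k in self.k_cache
            ]
            m = max(scores)
            w = [math.exp(s - m) for s in scores]
            z = sum(w)
            w = [wi / z for wi in w]  # softmax over tokens received so far
            for i in range(self.d_head):
                out.append(sum(wi * v[lo + i] for wi, v in zip(w, self.v_cache)))
        return out
```

With a single cached token the softmax weight is 1, so the output equals that token's value projection; each later step attends over the full cache accumulated while instances continue to arrive.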