JP-7857477-B2 - Using large language models when generating automated assistant responses
Inventors
- Martin Baeuml
- Thushan Amarasiriwardena
- Roberto Pieraccini
- Vikram Sridhar
- Daniel De Freitas Adiwardana
- Noam M. Shazeer
- Quoc Le
Assignees
- Google LLC
Dates
- Publication Date: 2026-05-12
- Application Date: 2025-05-29
- Priority Date: 2021-11-22
Claims (18)
- A method implemented by one or more processors, the method comprising: as part of a dialogue session between a user of a client device and an automated assistant implemented by the client device: receiving a stream of audio data that captures a spoken utterance of the user, wherein the stream of audio data is generated by one or more microphones of the client device and the spoken utterance includes an assistant query; determining, based on processing the stream of audio data, a set of assistant outputs, wherein each assistant output in the set of assistant outputs is responsive to the assistant query included in the spoken utterance; processing, using a large language model (LLM), the set of assistant outputs and a context of the dialogue session to generate a set of modified assistant outputs and to generate, based on the context of the dialogue session and based on the assistant query included in the spoken utterance, an additional assistant query that is related to the spoken utterance, wherein each of the modified assistant outputs in the set of modified assistant outputs reflects a first personality of a plurality of distinct personalities; and providing, for presentation to the user, a given modified assistant output from the set of modified assistant outputs, wherein the given modified assistant output includes an additional assistant output that is responsive to the additional assistant query (a sketch of this flow follows the claim list).
- The method according to claim 1, wherein processing, using the LLM, the set of assistant outputs and the context of the dialogue session to generate the set of modified assistant outputs that each reflect the first personality causes the LLM to adapt the set of assistant outputs to a first vocabulary associated with the first personality in generating the set of modified assistant outputs.
- The method according to claim 2, wherein the first personality is distinct from a second personality, and wherein the first vocabulary that is associated with the first personality and utilized in generating the set of modified assistant outputs is distinct from a second vocabulary associated with the second personality.
- The method according to any one of claims 1 to 3, wherein providing the given modified assistant output for presentation to the user comprises: generating, using a first set of prosodic properties associated with the first personality, synthesized speech audio data that captures the given modified assistant output; and causing the synthesized speech audio data capturing the given modified assistant output to be rendered audibly via one or more speakers of the client device (see the prosody sketch after the claim list).
- The method according to claim 4, wherein the first personality is distinct from a second personality, and wherein the first set of prosodic properties that is associated with the first personality and utilized in generating the synthesized speech audio data is distinct from a second set of prosodic properties associated with the second personality.
- The method according to any one of claims 1 to 5, further comprising selecting the given modified assistant output from the set of modified assistant outputs based on a probability distribution, generated using the LLM, over a sequence of words and/or phrases.
- The method according to any one of claims 1 to 6, wherein the LLM is a first LLM, of a plurality of distinct LLMs, that is specific to the first personality.
- The method according to any one of claims 1 to 7, wherein the first personality is defined by the user in settings of a software application.
- The method according to any one of claims 1 to 8, further comprising, as part of a subsequent dialogue session between the user and the automated assistant implemented by the client device: receiving a subsequent stream of audio data that captures a subsequent spoken utterance of the user, wherein the subsequent stream of audio data is generated by the one or more microphones of the client device and the subsequent spoken utterance includes a subsequent assistant query; determining, based on processing the subsequent stream of audio data, a subsequent set of assistant outputs, wherein each assistant output in the subsequent set of assistant outputs is responsive to the subsequent assistant query included in the subsequent spoken utterance; processing, using the LLM or an additional LLM, the subsequent set of assistant outputs and a subsequent context of the subsequent dialogue session to generate a subsequent set of modified assistant outputs that each reflect a second personality distinct from the first personality; and providing, for presentation to the user, a given subsequent modified assistant output from the subsequent set of modified assistant outputs.
- The method according to claim 9, wherein the subsequent set of assistant outputs and the subsequent context of the subsequent dialogue session are processed using the additional LLM to generate the subsequent set of modified assistant outputs that each reflect the second personality, and wherein the LLM is associated with the first personality and the additional LLM is associated with the second personality.
- A method implemented by one or more processors, the method comprising: as part of a dialogue session between a user of a client device and an automated assistant implemented by the client device: receiving a stream of audio data that captures a spoken utterance of the user, wherein the stream of audio data is generated by one or more microphones of the client device and the spoken utterance includes an assistant query; determining, based on processing the stream of audio data, a set of assistant outputs, wherein each assistant output in the set of assistant outputs is responsive to the assistant query included in the spoken utterance; processing the set of assistant outputs and a context of the dialogue session to generate, using one or more LLM outputs generated using a large language model (LLM), a set of modified assistant outputs, and to generate, based on the context of the dialogue session and based on the assistant query included in the spoken utterance, an additional assistant query that is related to the spoken utterance, wherein each of the one or more LLM outputs is determined based at least in part on the context of the dialogue session and on one or more of the assistant outputs included in the set of assistant outputs, and wherein generating the set of modified assistant outputs using the one or more LLM outputs comprises generating a set of first personality replies based on (i) the set of assistant outputs, (ii) the context of the dialogue session, and (iii) one or more first LLM outputs, from among the one or more LLM outputs, that reflect a first personality of a plurality of distinct personalities; and providing, for presentation to the user, a given modified assistant output from the set of modified assistant outputs, wherein the given modified assistant output includes an additional assistant output that is responsive to the additional assistant query.
- A method implemented by one or more processors, the method comprising: as part of a dialogue session between a user of a client device and an automated assistant implemented by the client device: receiving a stream of audio data that captures a spoken utterance of the user, wherein the stream of audio data is generated by one or more microphones of the client device and the spoken utterance includes an assistant query; determining, based on processing the stream of audio data, a set of assistant outputs, wherein each assistant output in the set of assistant outputs is responsive to the assistant query included in the spoken utterance; processing the set of assistant outputs and a context of the dialogue session to generate, using one or more LLM outputs generated using a large language model (LLM), a set of modified assistant outputs, and to generate, based on the context of the dialogue session and based on the assistant query included in the spoken utterance, an additional assistant query that is related to the spoken utterance, wherein each of the one or more LLM outputs is determined based at least in part on the context of the dialogue session and on one or more of the assistant outputs included in the set of assistant outputs, and wherein each of the modified assistant outputs in the set of modified assistant outputs reflects a first personality of a plurality of distinct personalities; and providing, for presentation to the user, a given modified assistant output from the set of modified assistant outputs, wherein the given modified assistant output includes an additional assistant output that is responsive to the additional assistant query.
- A method implemented by one or more processors, the method comprising: as part of a dialogue session between a user of a client device and an automated assistant implemented by the client device: receiving a stream of audio data that captures a spoken utterance of the user, wherein the stream of audio data is generated by one or more microphones of the client device and the spoken utterance includes an assistant query; determining, based on processing the stream of audio data, a set of assistant outputs, wherein each assistant output in the set of assistant outputs is responsive to the assistant query included in the spoken utterance; determining, based on processing the spoken utterance, whether to modify one or more of the assistant outputs included in the set of assistant outputs; and, in response to determining to modify one or more of the assistant outputs included in the set of assistant outputs: processing the set of assistant outputs and a context of the dialogue session to generate, using one or more LLM outputs generated using a large language model (LLM), a set of modified assistant outputs, and to generate, based on the context of the dialogue session and based on the assistant query included in the spoken utterance, an additional assistant query that is related to the spoken utterance, wherein each of the one or more LLM outputs is determined based at least in part on the context of the dialogue session and on one or more of the assistant outputs included in the set of assistant outputs, and wherein each of the modified assistant outputs in the set of modified assistant outputs reflects a first personality of a plurality of distinct personalities; and providing, for presentation to the user, a given modified assistant output from the set of modified assistant outputs, wherein the given modified assistant output includes an additional assistant output that is responsive to the additional assistant query (see the gating sketch after the claim list).
- The method according to claim 13, wherein determining, based on processing the spoken utterance, whether to modify one or more of the assistant outputs included in the set of assistant outputs comprises: processing, using an automatic speech recognition (ASR) model, the stream of audio data to generate a stream of ASR output; processing, using a natural language understanding (NLU) model, the stream of ASR output to generate a stream of NLU data; identifying, based on the stream of NLU data, an intent of the user in providing the spoken utterance; and determining whether to modify the one or more assistant outputs based on the intent of the user in providing the spoken utterance.
- The method according to claim 13 or 14, wherein determining whether to modify one or more of the assistant outputs included in the set of assistant outputs is further based on one or more computational costs associated with modifying the one or more assistant outputs.
- The method according to claim 15, wherein the one or more computational costs associated with modifying the one or more assistant outputs include one or more of: battery consumption, processor consumption, or latency associated with modifying the one or more assistant outputs.
- A system comprising: at least one processor; and memory storing instructions that, when executed, cause the at least one processor to perform operations corresponding to the method according to any one of claims 1 to 16.
- A non-transitory computer-readable storage medium storing instructions that, when executed, cause at least one processor to perform operations corresponding to the method according to any one of claims 1 to 16.
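The independent claims share a common core: candidate assistant outputs produced by the conventional pipeline, together with the context of the dialogue session, are passed through an LLM that both recasts the candidates to reflect a selected personality and generates an additional, related assistant query (claims 1, 11, 12, and 13), with the final candidate optionally chosen via the LLM's probability distribution (claim 6). The Python sketch below is a minimal, hypothetical illustration of that flow, not the patented implementation; the `ScoringLLM` interface, prompt strings, and function names are all assumptions.

```python
"""Hypothetical sketch of claims 1 and 6: rewrite candidate assistant
outputs to reflect a chosen personality, generate a related follow-up
assistant query, and select a candidate via the LLM's probabilities."""
from dataclasses import dataclass
from typing import Protocol


class ScoringLLM(Protocol):
    """Assumed LLM interface: generates text and scores candidates."""
    def generate(self, prompt: str) -> str: ...
    def log_prob(self, prompt: str, completion: str) -> float: ...


@dataclass
class ModifiedOutput:
    text: str
    score: float  # log-probability under the LLM (claim 6)


def modify_assistant_outputs(
    llm: ScoringLLM,
    assistant_outputs: list[str],  # candidates responsive to the query
    dialog_context: str,           # running transcript of the session
    assistant_query: str,          # query parsed from the utterance
    personality: str,              # e.g. "butler": one of several personas
) -> tuple[ModifiedOutput, str]:
    # Generate an additional assistant query related to the utterance,
    # conditioned on the dialogue context (claim 1).
    followup_query = llm.generate(
        f"Context:\n{dialog_context}\nUser asked: {assistant_query}\n"
        "Suggest one related follow-up question the assistant could answer:"
    )

    # Rewrite each candidate so it reflects the selected personality and
    # folds in an answer to the follow-up query.
    candidates = []
    for output in assistant_outputs:
        prompt = (
            f"Rewrite in the voice of a {personality}, then also answer "
            f"'{followup_query}':\n{output}"
        )
        modified = llm.generate(prompt)
        # Claim 6: rank candidates by the LLM's probability distribution
        # over the generated word sequence.
        candidates.append(
            ModifiedOutput(modified, llm.log_prob(prompt, modified))
        )

    best = max(candidates, key=lambda c: c.score)
    return best, followup_query
```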
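Claims 4 and 5 tie each personality to its own set of prosodic properties at speech-synthesis time. The sketch below illustrates that association under assumed names; `ProsodyProfile`, the example profiles, and the `synthesize` call are illustrative stand-ins, not the patent's (or any real TTS backend's) API.

```python
"""Hypothetical sketch of claims 4-5: each personality carries its own
prosodic properties, applied when synthesizing the modified output."""
from dataclasses import dataclass


@dataclass(frozen=True)
class ProsodyProfile:
    """Assumed prosodic properties; real systems expose many more."""
    speaking_rate: float    # 1.0 = neutral pace
    pitch_semitones: float  # shift relative to the base voice
    energy: float           # loudness scaling


# Distinct personalities map to distinct prosody sets (claim 5).
PERSONALITY_PROSODY = {
    "butler": ProsodyProfile(speaking_rate=0.9, pitch_semitones=-2.0, energy=0.8),
    "pirate": ProsodyProfile(speaking_rate=1.1, pitch_semitones=1.5, energy=1.2),
}


def render_modified_output(tts_engine, modified_output: str, personality: str) -> bytes:
    """Generate synthesized speech audio data that captures the modified
    output, using the personality's prosodic properties (claim 4)."""
    profile = PERSONALITY_PROSODY[personality]
    # `tts_engine.synthesize` is a stand-in for whatever TTS backend is
    # available; the returned audio bytes would then be rendered audibly
    # via the client device's speakers.
    return tts_engine.synthesize(
        text=modified_output,
        rate=profile.speaking_rate,
        pitch=profile.pitch_semitones,
        energy=profile.energy,
    )
```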
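Claims 13 through 16 add a gating step: modification only runs when the user's intent warrants it and the computational costs (battery consumption, processor consumption, latency) are acceptable. A minimal sketch of such a gate follows; the intent list and all thresholds are assumptions for illustration, not values from the patent.

```python
"""Hypothetical sketch of claims 13-16: decide whether LLM modification
is worth running, based on the user's intent and on computational cost."""

# Intents where a bare fulfillment answer suffices and LLM rewriting is
# unlikely to add value (illustrative list, not from the patent).
LOW_VALUE_INTENTS = {"set_timer", "stop_alarm", "volume_change"}


def should_modify(
    intent: str,              # identified from the NLU data stream (claim 14)
    battery_fraction: float,  # 0.0-1.0 remaining charge
    cpu_load: float,          # 0.0-1.0 current processor utilization
    est_latency_ms: float,    # predicted added latency of LLM rewriting
) -> bool:
    # Claim 14: gate on the user's intent in providing the utterance.
    if intent in LOW_VALUE_INTENTS:
        return False
    # Claims 15-16: gate on computational costs: battery consumption,
    # processor consumption, and latency. Thresholds are assumptions.
    if battery_fraction < 0.15 or cpu_load > 0.85 or est_latency_ms > 800:
        return False
    return True
```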
Description
Humans may engage in human-to-computer dialogues with interactive software applications referred to herein as "automated assistants" (also referred to as "chatbots," "interactive personal assistants," "intelligent personal assistants," "personal voice assistants," "conversational agents," etc.). Automated assistants typically rely on a pipeline of components in interpreting and responding to spoken utterances. For example, an automatic speech recognition (ASR) engine can process audio data that corresponds to a spoken utterance of a user to generate ASR output, such as an ASR hypothesis of the utterance (i.e., a sequence of terms and/or other tokens). Further, a natural language understanding (NLU) engine can process the ASR output (or touch/typed input) to generate NLU output, such as a request (e.g., an intent) expressed by the user in providing the utterance (or the touch/typed input) and, optionally, slot values for parameters associated with that intent. Finally, the NLU output can be processed by various fulfillment components to generate fulfillment output, such as responsive content and/or one or more actions to be performed in response to the utterance.

Generally, a dialogue session with an automated assistant is initiated by a user providing an utterance, and the automated assistant responds using the component pipeline described above. The user can continue the session by providing an additional utterance, to which the assistant again responds using the same pipeline. Put another way, these dialogue sessions are turn-based: the user takes a turn to provide an utterance, the automated assistant takes a turn to respond, the user takes a further turn, and so on.
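As a concrete point of reference, a minimal sketch of that conventional pipeline is shown below. The component interfaces (`asr_model.transcribe`, `nlu_model.parse`, `fulfillment.execute`) are illustrative stand-ins, not the API of any particular assistant stack.

```python
"""Minimal sketch of the conventional pipeline described above:
ASR -> NLU (intent + slot values) -> fulfillment. All component
interfaces here are assumed stand-ins, not a real assistant API."""
from dataclasses import dataclass, field


@dataclass
class NLUOutput:
    intent: str  # e.g. "get_weather"
    slots: dict[str, str] = field(default_factory=dict)  # e.g. {"location": "beach"}


def run_pipeline(audio_data: bytes, asr_model, nlu_model, fulfillment) -> str:
    # 1. ASR: audio -> hypothesis (a sequence of terms and/or tokens).
    asr_hypothesis: str = asr_model.transcribe(audio_data)
    # 2. NLU: hypothesis -> intent plus slot values for its parameters.
    nlu_output: NLUOutput = nlu_model.parse(asr_hypothesis)
    # 3. Fulfillment: intent/slots -> responsive content or an action.
    return fulfillment.execute(nlu_output.intent, nlu_output.slots)
```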
From the user's perspective, however, these turn-based dialogue sessions can feel unnatural because they do not reflect how humans actually converse. For example, if a first human provides an utterance to convey a thought to a second human during a dialogue session (e.g., "I'm going to the beach today"), the second human can consider that utterance in the context of the conversation in formulating a reply (e.g., "Sounds fun, what are you going to do at the beach?" or "Nice, have you looked at the weather?"). Notably, in replying, the second human can offer utterances that keep the first human naturally engaged; rather than one party driving the conversation, both contribute utterances that move it forward.

If the second human is replaced with an automated assistant, however, the assistant may fail to provide responses that keep the first human engaged in the dialogue session. In response to "I'm going to the beach today," the assistant could, in principle, further the session by proactively asking what the first human plans to do at the beach, proactively looking up the weather forecast for the beach the first human frequents and folding it into the reply, or proactively drawing inferences from that forecast. Instead, it may simply respond "sounds fun" or "nice" without any additional response that facilitates the conversation. As a result, the assistant's responses do not reflect a natural conversation between humans and may not resonate with the first human. Moreover, the first human may have to provide additional utterances to explicitly request information the assistant could have offered proactively (e.g., the beach weather forecast), increasing the quantity of utterances directed to the assistant and consuming computing resources of the client device used to process them.

Brief description of the drawings:
- A block diagram of an example environment that demonstrates various aspects of the present disclosure and in which implementations disclosed herein can be implemented.
- A process flow that demonstrates using large language models in generating assistant outputs, in accordance with various implementations.
- A flowchart illustrating an example method of using large language models in an offline manner to generate assistant outputs for subsequent use in an online manner, in accordance with various implementations.
- A flowchart illustrating an example method of using large language models in generating assistant outputs based on generating assistant queries, in accordance with various implementations.
- A flowchart illustrating an example of using