US-20260128038-A1 - SELECTIVELY MASKING QUERY CONTENT TO PROVIDE TO A SECONDARY DIGITAL ASSISTANT

US 20260128038 A1

Abstract

Systems and methods are described for obfuscating and/or omitting potentially sensitive information in a spoken query before providing the query to a secondary automated assistant. A general automated assistant may be invoked by a user, followed by a query. The audio data can be processed to omit and/or obfuscate potentially sensitive information before one or more processed queries are provided to secondary automated assistants, based on a trust metric associated with each of the secondary automated assistants. The trust metric for a secondary automated assistant indicates how much that assistant is trusted with sensitive information. In response, the secondary automated assistants can generate responses, which can be filtered to select a response to provide to the user.

Inventors

  • Matthew Sharifi
  • Victor Carbune

Assignees

  • GOOGLE LLC

Dates

Publication Date
2026-05-07
Application Date
2026-01-05

Claims (20)

  1. A method implemented by one or more processors, the method comprising: responsive to detecting occurrence of an assistant invocation event at a client device: processing, by a general automated assistant, audio data that captures a spoken query of a user and that is generated by one or more microphones of the client device; identifying a previously generated trust metric for a secondary automated assistant; determining whether the trust metric for the secondary automated assistant satisfies a threshold; in response to determining the trust metric satisfies the threshold: providing, to the secondary automated assistant, first content that is based on the audio data; and in response to determining the trust metric fails to satisfy the threshold: providing, to the secondary automated assistant, second content that is based on the audio data, wherein the second content differs from the first content.
  2. The method of claim 1, wherein providing, to the secondary automated assistant, the first content that is based on the audio data comprises: providing, to the secondary automated assistant, a portion of the audio data.
  3. The method of claim 1, wherein providing, to the secondary automated assistant, the second content that is based on the audio data comprises: omitting and/or obfuscating a portion of the audio data; and providing, to the secondary automated assistant, the second content, wherein the second content includes the audio data that does not include the omitted portion of the audio data and/or includes the obfuscated portion of the audio data.
  4. The method of claim 3, wherein processing the audio data comprises: determining a type of sensitive information included in the audio data, wherein the threshold is based on the type of sensitive information.
  5. The method of claim 4, wherein omitting and/or obfuscating the portion of the audio data comprises: omitting and/or obfuscating the type of sensitive information included in the portion of the audio data.
  6. The method of claim 3, wherein obfuscating the portion of the audio data comprises: determining a generalization of the portion of the audio data; and replacing the portion of the audio data with the generalization.
  7. The method of claim 3, wherein omitting and/or obfuscating the portion of the audio data comprises: determining that the portion of the audio data includes background audio; and omitting and/or obfuscating the portion of the audio data that includes the background audio.
  8. A system comprising: memory storing instructions; and one or more processors operable to execute the instructions to: responsive to detecting occurrence of an assistant invocation event at a client device: process, by a general automated assistant, audio data that captures a spoken query of a user and that is generated by one or more microphones of the client device; identify a previously generated trust metric for a secondary automated assistant; determine whether the trust metric for the secondary automated assistant satisfies a threshold; in response to determining the trust metric satisfies the threshold: provide, to the secondary automated assistant, first content that is based on the audio data; and in response to determining the trust metric fails to satisfy the threshold: provide, to the secondary automated assistant, second content that is based on the audio data, wherein the second content differs from the first content.
  9. The system of claim 8, wherein in providing, to the secondary automated assistant, the first content that is based on the audio data, one or more of the processors are to: provide, to the secondary automated assistant, a portion of the audio data.
  10. The system of claim 8, wherein in providing, to the secondary automated assistant, the second content that is based on the audio data, one or more of the processors are to: omit and/or obfuscate a portion of the audio data; and provide, to the secondary automated assistant, the second content, wherein the second content includes the audio data that does not include the omitted portion of the audio data and/or includes the obfuscated portion of the audio data.
  11. The system of claim 10, wherein in processing the audio data, one or more of the processors are to: determine a type of sensitive information included in the audio data, wherein the threshold is based on the type of sensitive information.
  12. The system of claim 11, wherein in omitting and/or obfuscating the portion of the audio data, one or more of the processors are to: omit and/or obfuscate the type of sensitive information included in the portion of the audio data.
  13. The system of claim 10, wherein in obfuscating the portion of the audio data, one or more of the processors are to: determine a generalization of the portion of the audio data; and replace the portion of the audio data with the generalization.
  14. The system of claim 10, wherein in omitting and/or obfuscating the portion of the audio data, one or more of the processors are to: determine that the portion of the audio data includes background audio; and omit and/or obfuscate the portion of the audio data that includes the background audio.
  15. A non-transitory computer readable storage medium configured to store instructions that, when executed by one or more processors, cause one or more of the processors to: responsive to detecting occurrence of an assistant invocation event at a client device: process, by a general automated assistant, audio data that captures a spoken query of a user and that is generated by one or more microphones of the client device; identify a previously generated trust metric for a secondary automated assistant; determine whether the trust metric for the secondary automated assistant satisfies a threshold; in response to determining the trust metric satisfies the threshold: provide, to the secondary automated assistant, first content that is based on the audio data; and in response to determining the trust metric fails to satisfy the threshold: provide, to the secondary automated assistant, second content that is based on the audio data, wherein the second content differs from the first content.
  16. The non-transitory computer readable storage medium of claim 15, wherein in providing, to the secondary automated assistant, the first content that is based on the audio data, one or more of the processors are to: provide, to the secondary automated assistant, a portion of the audio data.
  17. The non-transitory computer readable storage medium of claim 15, wherein in providing, to the secondary automated assistant, the second content that is based on the audio data, one or more of the processors are to: omit and/or obfuscate a portion of the audio data; and provide, to the secondary automated assistant, the second content, wherein the second content includes the audio data that does not include the omitted portion of the audio data and/or includes the obfuscated portion of the audio data.
  18. The non-transitory computer readable storage medium of claim 17, wherein in processing the audio data, one or more of the processors are to: determine a type of sensitive information included in the audio data, wherein the threshold is based on the type of sensitive information.
  19. The non-transitory computer readable storage medium of claim 18, wherein in omitting and/or obfuscating the portion of the audio data, one or more of the processors are to: omit and/or obfuscate the type of sensitive information included in the portion of the audio data.
  20. The non-transitory computer readable storage medium of claim 17, wherein in obfuscating the portion of the audio data, one or more of the processors are to: determine a generalization of the portion of the audio data; and replace the portion of the audio data with the generalization.
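For illustration, the trust-gated routing recited in claim 1 (combined with the generalization-based obfuscation of claim 6) might be sketched as follows. This is a minimal, hypothetical sketch: it operates on a transcript rather than raw audio, and all names, values, and the specific sensitive string are illustrative, not from the patent.

```python
# Hypothetical sketch of the claim 1 flow: route full or masked query
# content to a secondary assistant based on a previously generated
# trust metric. TRUST_METRICS, TRUST_THRESHOLD, and the helper names
# are all illustrative assumptions.

TRUST_METRICS = {"assistant_a": 0.9, "assistant_b": 0.4}  # previously generated
TRUST_THRESHOLD = 0.7

def mask_sensitive_portions(query_text):
    """Obfuscate a hypothetical sensitive portion by replacing it with
    a generalization (per claim 6)."""
    return query_text.replace("123 Main Street", "an address")

def content_for_assistant(assistant_id, query_text):
    """Return first content (unmodified) if the trust metric satisfies
    the threshold, else second content (obfuscated)."""
    trust = TRUST_METRICS.get(assistant_id, 0.0)
    if trust >= TRUST_THRESHOLD:
        return query_text                        # first content
    return mask_sensitive_portions(query_text)   # second content

query = "Order a taxi to 123 Main Street"
print(content_for_assistant("assistant_a", query))  # trusted: full query
print(content_for_assistant("assistant_b", query))  # untrusted: generalized
```

In a real system, the masking would operate on audio (or on speech-recognition output that is re-synthesized or sent as text), and the threshold could vary by the type of sensitive information detected, as claims 4 and 5 recite.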

Description

BACKGROUND

Humans may engage in human-to-computer dialogs with interactive software applications referred to herein as “automated assistants” (also referred to as “digital agents,” “chatbots,” “interactive personal assistants,” “intelligent personal assistants,” “assistant applications,” “conversational agents,” etc.). For example, humans (who, when they interact with automated assistants, may be referred to as “users”) may provide commands and/or requests to an automated assistant using spoken natural language input (i.e., utterances), which may in some cases be converted into text and then processed, and/or by providing textual (e.g., typed) natural language input. An automated assistant responds to a request by providing responsive user interface output, which can include audible and/or visual user interface output.

As mentioned above, many automated assistants are configured to be interacted with via spoken utterances, such as an invocation indication followed by a spoken query. To preserve user privacy and/or to conserve resources, a user must often explicitly invoke an automated assistant before the automated assistant will fully process a spoken utterance. The explicit invocation of an automated assistant typically occurs in response to certain user interface input being received at a client device. The client device includes an assistant interface that provides, to a user of the client device, an interface for interfacing with the automated assistant (e.g., receives spoken and/or typed input from the user, and provides audible and/or graphical responses), and that interfaces with one or more additional components that implement the automated assistant (e.g., remote server device(s) that process user inputs and generate appropriate responses).
Some user interface inputs that can invoke an automated assistant via a client device include a hardware and/or virtual button at the client device for invoking the automated assistant (e.g., a tap of a hardware button, a selection of a graphical interface element displayed by the client device). Many automated assistants can additionally or alternatively be invoked in response to one or more spoken invocation phrases, which are also known as “hot words/phrases” or “trigger words/phrases”. For example, a spoken invocation phrase such as “Hey Assistant,” “OK Assistant,” and/or “Assistant” can be spoken to invoke an automated assistant. Often, a client device that includes an assistant interface includes one or more locally stored models that the client device utilizes to monitor for an occurrence of a spoken invocation phrase. Such a client device can locally process received audio data utilizing the locally stored model, and discard any audio data that does not include the spoken invocation phrase. However, when local processing of received audio data indicates an occurrence of a spoken invocation phrase, the client device will then cause that audio data and/or following audio data to be further processed by the automated assistant. For instance, if a spoken invocation phrase is “Hey, Assistant”, and a user speaks “Hey, Assistant, what time is it”, audio data corresponding to “what time is it” can be processed by an automated assistant based on detection of “Hey, Assistant”, and utilized to provide an automated assistant response of the current time. If, on the other hand, the user simply speaks “what time is it” (without first speaking an invocation phrase or providing alternate invocation input), no response from the automated assistant will be provided as a result of “what time is it” not being preceded by an invocation phrase (or other invocation input).
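The invocation gating described above, where audio that lacks a detected invocation phrase is discarded locally, can be sketched as follows. This is an illustrative simplification, not the patented implementation: it matches invocation phrases against a transcript, whereas real systems run on-device acoustic models over raw audio, and all names here are hypothetical.

```python
# Illustrative sketch of local invocation-phrase gating: an utterance is
# forwarded for full processing only if it begins with an invocation phrase.

INVOCATION_PHRASES = ("hey assistant", "ok assistant", "assistant")

def process_utterance(transcript):
    """Return the query to forward, or None if no invocation phrase is present."""
    lowered = transcript.lower().strip()
    for phrase in INVOCATION_PHRASES:
        if lowered.startswith(phrase):
            # Strip the invocation phrase; forward the remainder.
            return transcript[len(phrase):].lstrip(" ,")
    return None  # no invocation detected: discard locally

print(process_utterance("Hey Assistant, what time is it"))  # "what time is it"
print(process_utterance("what time is it"))                 # None
```

The ordering of `INVOCATION_PHRASES` matters in this sketch: longer phrases are checked before the bare “assistant” so the correct prefix is stripped.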
SUMMARY

Implementations described herein relate to determining, based on a trust level associated with a secondary automated assistant, what content, related to a spoken utterance of a user, to provide to the secondary automated assistant for use, by the secondary automated assistant, in resolving the spoken utterance. In those implementations, content provided to a given secondary automated assistant for a given spoken utterance will vary in dependence on the trust level for the given secondary automated assistant. This can result in first content (e.g., that includes audio data capturing the spoken utterance) being provided to a first secondary automated assistant for the given spoken utterance, but different second content (e.g., that obfuscates at least a portion of the audio data, includes an obfuscated version of the audio data, or omits at least a portion of the original content) being provided to a second secondary automated assistant for a same instance (or another instance) of the given spoken utterance. For example, the first content can include audio data capturing the spoken utterance, whereas the second content can omit at least a portion of the audio data or include an obfuscated version of the audio data. For instance, the second content can omit the audio data entirely and include only speech r