JP-7857347-B2 - Hot word threshold auto-tuning


Inventors

Aishanee Shah
Alexander H. Gruenstein
Ian C. McGraw

Assignees

Google LLC

Dates

Publication Date: 2026-05-12
Application Date: 2024-07-01
Priority Date: 2020-06-10

Claims (18)

A computer-implemented method that, when executed on data processing hardware, causes the data processing hardware to perform operations including: receiving a near-miss marker and audio data characterizing a hotword detected by a hotword detector in streaming audio captured by a user device, wherein the near-miss marker indicates that the hotword detector detected the hotword in the streaming audio within a threshold period after generating a previous probability score that failed to satisfy a hotword detection threshold of the hotword detector by a threshold margin; processing the audio data to confirm that the hotword was correctly detected by the hotword detector in the streaming audio; and, based on the near-miss marker and the confirmation that the hotword was correctly detected by the hotword detector in the streaming audio, determining whether a false rejection rate satisfies a false rejection rate threshold, the false rejection rate being based on the number of false rejection instances, identified by the hotword detector during a false rejection period, in which the hotword detector failed to detect the hotword in audio features of the streaming audio, and, when the false rejection rate satisfies the false rejection rate threshold, adjusting the hotword detection threshold of the hotword detector.
The computer-implemented method according to claim 1, wherein adjusting the hotword detection threshold of the hotword detector includes reducing the hotword detection threshold of the hotword detector.
The computer-implemented method according to claim 1, wherein processing the audio data includes performing speech recognition on the audio data and confirming that the hotword was correctly detected by the hotword detector in the streaming audio when the hotword is recognized in the audio data.
The computer-implemented method according to claim 1, wherein processing the audio data includes processing the audio data, without performing semantic analysis or speech recognition processing on the audio data, to confirm that the hotword was correctly detected by the hotword detector in the streaming audio.
The computer-implemented method according to claim 1, wherein the hotword detector is configured to: generate a probability score indicating the presence of the hotword in audio features of the streaming audio captured by the user device; and detect the hotword in the streaming audio when the probability score satisfies the hotword detection threshold of the first-stage hotword detector.
The computer-implemented method according to claim 5, wherein the previous probability score indicates the presence of the hotword in previous audio features of the streaming audio captured by the user device.
The computer-implemented method according to claim 6, wherein the operations further include: identifying false rejection instances at the hotword detector indicating that the hotword detector failed to detect the hotword in the previous audio features of the streaming audio; and determining whether the false rejection rate associated with the hotword detector satisfies the false rejection rate threshold, wherein adjusting the hotword detection threshold of the hotword detector is based on determining whether the false rejection rate associated with the hotword detector satisfies the false rejection rate threshold.
The computer-implemented method according to claim 1, wherein the hotword detector operates on the user device.
The computer-implemented method according to claim 1, wherein the hotword detector includes a neural network trained to detect the presence of the hotword in the streaming audio without performing semantic analysis or speech recognition processing on the streaming audio.
A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that, when executed on the data processing hardware, cause the data processing hardware to perform operations including: receiving a near-miss marker and audio data characterizing a hotword detected by a hotword detector in streaming audio captured by a user device, wherein the near-miss marker indicates that the hotword detector detected the hotword in the streaming audio within a threshold period after generating a previous probability score that failed to satisfy a hotword detection threshold of the hotword detector by a threshold margin; processing the audio data to confirm that the hotword was correctly detected by the hotword detector in the streaming audio; and, based on the near-miss marker and the confirmation that the hotword was correctly detected by the hotword detector in the streaming audio, determining whether a false rejection rate satisfies a false rejection rate threshold, the false rejection rate being based on the number of false rejection instances, identified by the hotword detector during a false rejection period, in which the hotword detector failed to detect the hotword in audio features of the streaming audio, and, when the false rejection rate satisfies the false rejection rate threshold, adjusting the hotword detection threshold of the hotword detector.
The system according to claim 10, wherein adjusting the hotword detection threshold of the hotword detector includes reducing the hotword detection threshold of the hotword detector.
The system according to claim 10, wherein processing the audio data includes performing speech recognition on the audio data and confirming that the hotword was correctly detected by the hotword detector in the streaming audio when the hotword is recognized in the audio data.
The system according to claim 10, wherein processing the audio data includes processing the audio data, without performing semantic analysis or speech recognition processing on the audio data, to confirm that the hotword was correctly detected by the hotword detector in the streaming audio.
The system according to claim 10, wherein the hotword detector is configured to: generate a probability score indicating the presence of the hotword in audio features of the streaming audio captured by the user device; and detect the hotword in the streaming audio when the probability score satisfies the hotword detection threshold of the first-stage hotword detector.
The system according to claim 14, wherein the previous probability score indicates the presence of the hotword in previous audio features of the streaming audio captured by the user device.
The system according to claim 15, wherein the operations further include: identifying false rejection instances at the hotword detector indicating that the hotword detector failed to detect the hotword in the previous audio features of the streaming audio; and determining whether the false rejection rate associated with the hotword detector satisfies the false rejection rate threshold, wherein adjusting the hotword detection threshold of the hotword detector is based on determining whether the false rejection rate associated with the hotword detector satisfies the false rejection rate threshold.
The system according to claim 10, wherein the hotword detector operates on the user device.
The system according to claim 10, wherein the hotword detector includes a neural network trained to detect the presence of the hotword in the streaming audio without performing semantic analysis or speech recognition processing on the streaming audio.
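
For illustration only, the tuning loop recited in claims 1, 2, and 10 might be sketched in Python as follows. The sketch is not part of the claims. The class name `ThresholdTuner`, every field name, and every default value are hypothetical. "Satisfies" is assumed to mean "greater than or equal to", and the false rejection rate is assumed to be a simple count of false rejection instances within a sliding time period. The confirmation step (for example, server-side speech recognition) is reduced to a boolean argument.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ThresholdTuner:
    """Hypothetical tuner; all names and default values are assumptions."""
    threshold: float = 0.80   # hotword detection threshold
    margin: float = 0.10      # threshold margin that defines a near miss
    window_s: float = 10.0    # threshold period following a near miss
    period_s: float = 3600.0  # false rejection period
    rate_threshold: int = 3   # false rejections per period that trigger tuning
    step: float = 0.02        # amount by which the threshold is reduced
    floor: float = 0.50       # lower bound so the threshold cannot collapse
    last_near_miss: Optional[float] = None
    false_rejections: List[float] = field(default_factory=list)

    def observe(self, score: float, now: float) -> bool:
        """Return True when a detection should carry a near-miss marker."""
        if score >= self.threshold:
            # Detection: mark it if a near miss happened shortly before.
            marked = (self.last_near_miss is not None
                      and now - self.last_near_miss <= self.window_s)
            self.last_near_miss = None
            return marked
        if score >= self.threshold - self.margin:
            # The score failed the threshold, but only by the margin.
            self.last_near_miss = now
        return False

    def confirm(self, now: float, hotword_verified: bool) -> bool:
        """Handle a marked detection; return True if the threshold changed."""
        if not hotword_verified:
            return False
        # The earlier near miss is counted as one false rejection instance.
        self.false_rejections.append(now)
        self.false_rejections = [t for t in self.false_rejections
                                 if now - t <= self.period_s]
        if len(self.false_rejections) >= self.rate_threshold:
            # Reduce the threshold (claim 2), never going below the floor.
            self.threshold = max(self.floor, self.threshold - self.step)
            self.false_rejections.clear()
            return True
        return False
```

In this sketch, three confirmed near-miss detections inside one period lower the threshold from 0.80 to about 0.78. A real system would choose the period, step size, and rate definition empirically.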

Description

This disclosure relates to automatic tuning of hotword thresholds.
A voice-enabled environment (e.g., home, work, school, car) allows a user to speak queries or commands aloud to a computer-based system, which then responds to those queries and/or performs functions based on those commands. A voice-enabled environment can be implemented using a network of connected microphone devices distributed throughout various rooms or areas of the environment. These devices use hotwords to help identify when a given utterance is directed at the system rather than at another individual present in the environment. The devices may therefore operate in a sleep or hibernation state and wake up only when a detected utterance contains a hotword. Typically, a system used to detect hotwords in streaming audio generates a probability score indicating the likelihood that a hotword is present in the streaming audio. When the probability score satisfies a predetermined threshold, the device initiates the wake-up process.
This is a schematic diagram of an example system that performs automatic tuning of hotword thresholds.
This is a schematic diagram of exemplary components of a hotword detection threshold adjuster.
This is a schematic diagram of a hotword detection threshold adjuster that increments a false acceptance count.
This is a schematic diagram illustrating an example of a false acceptance.
This is a schematic diagram illustrating an example of a false rejection.
This is a schematic diagram illustrating an example of a false rejection.
This is a flowchart of an exemplary arrangement of operations for a method of automatically tuning hotword thresholds.
This is a flowchart of another exemplary arrangement of operations for a method of automatically tuning hotword thresholds.
This is a schematic diagram of an exemplary computing device that may be used to implement the systems and methods described herein.
Similar reference numerals in the various drawings indicate similar elements.
Voice-enabled devices (e.g., user devices running voice assistants) allow users to speak queries or commands aloud, and they respond to those queries and/or perform functions based on those commands. A "hotword" (also called a "keyword," "attention word," "wake-up phrase/word," "trigger phrase," or "voice action initiation command") is a predetermined word or phrase spoken to attract the attention of a voice-enabled device. By using hotwords, a voice-enabled device can distinguish utterances directed at the system (i.e., utterances that should initiate a wake-up process for processing one or more words following the hotword) from utterances directed at individuals in the environment. Typically, voice-enabled devices operate in a sleep state to conserve power and do not process input audio data unless the input audio data follows a spoken hotword. For example, while in the sleep state, a voice-enabled device captures input audio via one or more microphones and uses a trained hotword detector to detect whether a hotword is present in the input audio. When a hotword is detected in the input audio, the voice-enabled device initiates a wake-up process to process the hotword and/or any other words in the input audio that follow it.
Hotword detection is like finding a needle in a haystack: the hotword detector must continuously listen to the streaming audio, ignore most of it, and trigger accurately and immediately when a hotword is present. To address the complexity of detecting a hotword within a continuous audio stream, hotword detectors commonly use neural networks. Typically, the neural network receives the streaming audio and generates a probability score indicating whether a hotword is present in it. The hotword detector then determines whether the probability score satisfies a detection threshold.
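
The comparison described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the detector of this disclosure: the class `HotwordDetector`, the function `process_stream`, and the threshold value 0.8 are hypothetical, the neural network is replaced by a precomputed list of scores, and "satisfies" is assumed to mean "greater than or equal to".

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class HotwordDetector:
    """Toy detector that compares a model score with a detection threshold."""
    threshold: float = 0.8  # assumed value, not taken from the disclosure

    def detect(self, score: float) -> bool:
        # "Satisfies" is interpreted as score >= threshold (an assumption).
        return score >= self.threshold


def process_stream(detector: HotwordDetector,
                   scores: Iterable[float]) -> Optional[int]:
    """Return the index of the first score that triggers wake-up, or None."""
    for i, score in enumerate(scores):
        if detector.detect(score):
            return i  # the device would begin its wake-up process here
    return None  # the device stays in its sleep state
```

For example, `process_stream(HotwordDetector(0.8), [0.1, 0.5, 0.85])` returns the index of the third score, because it is the first one at or above the threshold.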
If the probability score satisfies the detection threshold, the hotword detector determines that a hotword is present in the streaming audio and may then initiate the device wake-up process. The hotword detection threshold is traditionally set to a predetermined value that balances the false acceptance rate against the false rejection rate. A false acceptance occurs when the hotword detector detects a hotword (i.e., the probability score satisfies the hotword detection threshold) although the streaming audio does not actually contain the hotword. In a false acceptance, the hotword detector nevertheless initiates the wake-up process on the voice-enabled device even though the user did not intend to invoke the device. A false rejection, on the other hand, occurs when the streaming audio contains a hotword but the hotword detector determines that the hotword is not present (i.e., the probability score fails to satisfy the hotword detection threshold). False rejection by the hotword detector is frustrating for th