
CN-116615779-B - Frozen word

CN116615779B

Abstract

A method (300) for detecting frozen words includes receiving audio data (118) corresponding to an utterance (119) spoken by a user (10) and captured by a user device (110) associated with the user. The method also includes processing the audio data using a speech recognizer (200) to determine that the utterance includes a query (122) for a digital assistant (109) to perform an operation. The speech recognizer is configured to trigger termination of the utterance after a predetermined duration of non-speech in the audio data. Prior to the predetermined duration of non-speech, the method includes detecting a frozen word in the audio data (123). In response to detecting the frozen word in the audio data, the method further includes triggering a hard microphone off event (125) at the user device. The hard microphone off event prevents the user device from capturing any audio that follows the frozen word.
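The control flow in the abstract can be illustrated with a short Python sketch: a streaming loop that normally endpoints after trailing non-speech, but closes the microphone immediately when a frozen word arrives first. Everything here is illustrative, not from the patent: the `Frame` type, the timeout value, and the word-level framing are assumptions made only for the example.

```python
from dataclasses import dataclass

# "Predetermined duration of non-speech" before normal endpointing, in
# seconds. The value is illustrative, not taken from the patent.
NON_SPEECH_TIMEOUT = 1.0

@dataclass
class Frame:
    text: str                # recognized word, or "" for a non-speech frame
    duration: float          # frame duration in seconds
    is_frozen: bool = False  # True if the frozen-word detector fires here

def process_utterance(frames):
    """Consume frames until a frozen word or the non-speech timeout.

    Returns (transcript, mic_hard_off). A frozen word triggers a hard
    microphone off event, so no audio after it is ever captured.
    """
    silence, words, mic_hard_off = 0.0, [], False
    for f in frames:
        if f.is_frozen:
            mic_hard_off = True  # hard microphone off: stop capturing audio
            break
        if f.text:
            silence = 0.0
            words.append(f.text)
        else:
            silence += f.duration
            if silence >= NON_SPEECH_TIMEOUT:
                break            # normal endpointing on trailing non-speech
    return " ".join(words), mic_hard_off
```

For example, an utterance ending in a frozen word endpoints immediately with the microphone hard-off, while an utterance trailed only by silence endpoints once the timeout elapses.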

Inventors

  • Matthew Sharifi
  • Aleksandar Kracun

Assignees

  • Google LLC

Dates

Publication Date
2026-05-08
Application Date
2021-11-17
Priority Date
2020-12-08

Claims (20)

  1. A method for detecting frozen words, the method comprising: receiving, at data processing hardware, audio data corresponding to an utterance spoken by a user and captured by a user device associated with the user; processing, by the data processing hardware, the audio data using a speech recognizer to determine that the utterance includes a query for a digital assistant to perform an operation, wherein the speech recognizer is configured to trigger termination of the utterance after a predetermined duration of non-speech in the audio data; prior to the predetermined duration of non-speech in the audio data, detecting, by the data processing hardware, a frozen word in the audio data, the frozen word following the query in the utterance spoken by the user and captured by the user device; and, in response to detecting the frozen word in the audio data: triggering, by the data processing hardware, a hard microphone off event at the user device to prevent the user device from capturing any further audio data of the utterance that follows the frozen word; modifying, by the data processing hardware, a speech recognition result of the audio data by stripping the frozen word from the speech recognition result; and providing, by the data processing hardware, the modified speech recognition result for output from the user device.
  2. The method of claim 1, wherein the frozen word comprises one of: a predefined frozen word comprising one or more fixed terms across all users in a given language; a user-selected frozen word comprising one or more terms specified by the user of the user device; or an action-specific frozen word associated with the operation to be performed by the digital assistant.
  3. The method of claim 1, wherein detecting the frozen word in the audio data comprises: extracting audio features from the audio data; generating a frozen word confidence score by processing the extracted audio features using a frozen word detection model executing on the data processing hardware; and, when the frozen word confidence score satisfies a frozen word confidence threshold, determining that the audio data corresponding to the utterance includes the frozen word.
  4. The method of claim 1, wherein detecting the frozen word in the audio data comprises recognizing the frozen word in the audio data using the speech recognizer executing on the data processing hardware.
  5. The method of claim 1, further comprising, in response to detecting the frozen word in the audio data: instructing, by the data processing hardware, the speech recognizer to stop any active processing of the audio data; and instructing, by the data processing hardware, the digital assistant to complete execution of the operation.
  6. The method of claim 1, wherein processing the audio data to determine that the utterance includes the query for the digital assistant to perform the operation comprises: processing the audio data using the speech recognizer to generate a speech recognition result for the audio data; and performing semantic interpretation on the speech recognition result of the audio data to determine that the audio data includes the query to perform the operation.
  7. The method of claim 6, further comprising, in response to detecting the frozen word in the audio data: instructing, by the data processing hardware, the digital assistant to perform the operation requested by the query using the speech recognition result.
  8. The method of claim 1, further comprising, prior to processing the audio data using the speech recognizer: detecting, by the data processing hardware, a hotword preceding the query in the audio data using a hotword detection model; and, in response to detecting the hotword, triggering, by the data processing hardware, the speech recognizer to process the audio data by performing speech recognition on the hotword and/or one or more terms in the audio data that follow the hotword.
  9. The method of claim 8, further comprising verifying, by the data processing hardware, the presence of the hotword detected by the hotword detection model based on detecting the frozen word in the audio data.
  10. The method of claim 8, wherein: detecting the frozen word in the audio data comprises executing, on the data processing hardware, a frozen word detection model configured to detect the frozen word in the audio data without performing speech recognition on the audio data; and the frozen word detection model and the hotword detection model each comprise the same or different neural network-based models.
  11. A method for detecting frozen words, the method comprising: receiving, at data processing hardware, a first instance of audio data corresponding to a dictation-based query for a digital assistant to dictate audible content spoken by a user, the dictation-based query spoken by the user and captured by an assistant-enabled device associated with the user; receiving, at the data processing hardware, a second instance of the audio data corresponding to an utterance of the audible content spoken by the user and captured by the assistant-enabled device; processing, by the data processing hardware, the second instance of the audio data using a speech recognizer to generate a transcription of the audible content; while processing the second instance of the audio data, detecting, by the data processing hardware, a frozen word in the second instance of the audio data, the frozen word following the audible content in the utterance spoken by the user and captured by the assistant-enabled device; and, in response to detecting the frozen word in the second instance of the audio data: stripping, by the data processing hardware, the frozen word from the end of the transcription of the audible content spoken by the user; and providing, by the data processing hardware, the transcription of the audible content spoken by the user for output from the assistant-enabled device.
  12. The method of claim 11, further comprising, in response to detecting the frozen word in the second instance of the audio data: initiating, by the data processing hardware, a hard microphone off event at the assistant-enabled device to prevent the assistant-enabled device from capturing any further audio of the utterance that follows the frozen word; and stopping, by the data processing hardware, any active processing of the second instance of the audio data.
  13. The method of claim 11, further comprising: processing, by the data processing hardware, the first instance of the audio data using the speech recognizer to generate a speech recognition result; and performing, by the data processing hardware, semantic interpretation on the speech recognition result of the first instance of the audio data to determine that the first instance of the audio data includes the dictation-based query for dictating the audible content spoken by the user.
  14. The method of claim 13, further comprising, prior to initiating processing of the second instance of the audio data to generate the transcription: determining, by the data processing hardware, that the dictation-based query specifies the frozen word based on the semantic interpretation performed on the speech recognition result of the first instance of the audio data; and instructing, by the data processing hardware, an endpointer to increase a termination timeout duration for terminating the utterance of the audible content.
  15. A system for detecting frozen words, the system comprising: data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that, when executed on the data processing hardware, cause the data processing hardware to perform operations comprising: receiving audio data corresponding to an utterance spoken by a user and captured by a user device associated with the user; processing the audio data using a speech recognizer to determine that the utterance includes a query for a digital assistant to perform an operation, wherein the speech recognizer is configured to trigger termination of the utterance after a predetermined duration of non-speech in the audio data; prior to the predetermined duration of non-speech in the audio data, detecting a frozen word in the audio data, the frozen word following the query in the utterance spoken by the user and captured by the user device; and, in response to detecting the frozen word in the audio data: triggering a hard microphone off event at the user device to prevent the user device from capturing any further audio data of the utterance that follows the frozen word; modifying the speech recognition result of the audio data by stripping the frozen word from the speech recognition result; and providing the modified speech recognition result for output from the user device.
  16. The system of claim 15, wherein the frozen word comprises one of: a predefined frozen word comprising one or more fixed terms across all users in a given language; a user-selected frozen word comprising one or more terms specified by the user of the user device; or an action-specific frozen word associated with the operation to be performed by the digital assistant.
  17. The system of claim 15, wherein detecting the frozen word in the audio data comprises: extracting audio features from the audio data; generating a frozen word confidence score by processing the extracted audio features using a frozen word detection model executing on the data processing hardware; and, when the frozen word confidence score satisfies a frozen word confidence threshold, determining that the audio data corresponding to the utterance includes the frozen word.
  18. The system of claim 15, wherein detecting the frozen word in the audio data comprises recognizing the frozen word in the audio data using the speech recognizer executing on the data processing hardware.
  19. The system of claim 15, wherein the operations further comprise, in response to detecting the frozen word in the audio data: instructing the speech recognizer to stop any active processing of the audio data; and instructing the digital assistant to complete execution of the operation.
  20. The system of claim 15, wherein processing the audio data to determine that the utterance includes the query for the digital assistant to perform the operation comprises: processing the audio data using the speech recognizer to generate a speech recognition result for the audio data; and performing semantic interpretation on the speech recognition result of the audio data to determine that the audio data includes the query to perform the operation.
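Claims 3 and 17 describe detection as extracting audio features, scoring them with a frozen word detection model, and comparing the score to a confidence threshold. A minimal Python sketch of that decision rule follows; the feature extraction, the threshold value, and the `score_model` callable are stand-ins (a real implementation would use trained acoustic features and a neural network, which the patent does not specify here).

```python
# Illustrative threshold; claims 3/17 only say the score must "meet" a
# frozen word confidence threshold, so the value is an assumption.
FROZEN_WORD_CONFIDENCE_THRESHOLD = 0.8

def extract_audio_features(samples, frame_size=160):
    """Stand-in for real acoustic features (e.g. log-mel filterbanks):
    here, just the mean absolute amplitude per fixed-size frame."""
    return [
        sum(abs(s) for s in samples[i:i + frame_size]) / frame_size
        for i in range(0, len(samples), frame_size)
    ]

def detect_frozen_word(samples, score_model):
    """Claim-3-style rule: features -> detection model -> threshold test.

    `score_model` plays the role of the frozen word detection model and
    returns a confidence score in [0, 1].
    """
    features = extract_audio_features(samples)
    confidence = score_model(features)
    return confidence >= FROZEN_WORD_CONFIDENCE_THRESHOLD
```

The same skeleton covers claim 10's point that the detection model operates without running speech recognition on the audio: the decision depends only on the features and the model's score.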

Description

Technical Field

The present disclosure relates to frozen words.

Background

A voice-enabled environment (e.g., a home, workplace, school, car, etc.) allows a user to speak a query or command aloud to a computer-based system that fields and answers the query and/or performs a function based on the command. A voice-enabled environment may be implemented using a network of connected microphone devices distributed throughout the various rooms or areas of the environment. These devices may use hotwords to help discern when a given utterance is directed at the system, as opposed to an utterance directed at another individual present in the environment. Accordingly, a device may operate in a sleep or hibernation state and wake up only when a detected utterance includes a hotword. Once awake, the device can proceed with more expensive processing, such as automated speech recognition (ASR) performed fully on-device or server-based ASR.

Summary

One aspect of the present disclosure provides a method of detecting frozen words. The method includes receiving, at data processing hardware, audio data corresponding to an utterance spoken by a user and captured by a user device associated with the user. The method also includes processing, by the data processing hardware, the audio data using a speech recognizer to determine that the utterance includes a query for a digital assistant to perform an operation. The speech recognizer is configured to trigger termination (endpointing) of the utterance after a predetermined duration of non-speech in the audio data. Prior to the predetermined duration of non-speech in the audio data, the method includes detecting, by the data processing hardware, a frozen word in the audio data. The frozen word follows the query in the utterance spoken by the user and captured by the user device.
In response to detecting the frozen word in the audio data, the method includes triggering, by the data processing hardware, a hard microphone off event at the user device to prevent the user device from capturing any audio that follows the frozen word. Implementations of the disclosure may include one or more of the following optional features. In some implementations, the frozen word includes one of: a predefined frozen word including one or more fixed terms across all users in a given language; a user-selected frozen word including one or more terms specified by the user of the user device; or an action-specific frozen word associated with the operation to be performed by the digital assistant. In some examples, detecting the frozen word in the audio data includes extracting audio features from the audio data, generating a frozen word confidence score by processing the extracted audio features using a frozen word detection model executing on the data processing hardware, and determining that the audio data corresponding to the utterance includes the frozen word when the frozen word confidence score satisfies a frozen word confidence threshold. Detecting the frozen word in the audio data may include recognizing the frozen word in the audio data using a speech recognizer executing on the data processing hardware. Optionally, in response to detecting the frozen word in the audio data, the method may further include instructing, by the data processing hardware, the speech recognizer to stop any active processing of the audio data, and instructing, by the data processing hardware, the digital assistant to complete execution of the operation.
In some implementations, processing the audio data to determine that the utterance includes the query for the digital assistant to perform the operation includes processing the audio data using the speech recognizer to generate a speech recognition result for the audio data, and performing semantic interpretation on the speech recognition result of the audio data to determine that the audio data includes the query to perform the operation. In these implementations, in response to detecting the frozen word in the audio data, the method further includes modifying, by the data processing hardware, the speech recognition result of the audio data by stripping the frozen word from the speech recognition result, and instructing, by the data processing hardware, the digital assistant to perform the operation requested by the query using the modified speech recognition result. In some examples, before processing the audio data using the speech recognizer, the method further includes detecting, by the data processing hardware, a hotword preceding the query in the audio data using a hotword detection model, and, in response to detecting the hotword, triggering, by the data processing hardware, the speech recognizer to process the audio data by performing speech recognition on the hotword and/or one or more terms in the audio data that follow the hotword. In these examples, the method may further include verifying, by the data processing hardware, the presence of the hotword detected by the hotword detection model based on detecting the frozen word in the audio data.
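Both independent claims recite stripping the frozen word from the end of the speech recognition result before it is provided for output. That post-processing step is simple to sketch in Python; the frozen-word list below is hypothetical (the patent allows predefined, user-selected, or action-specific frozen words), and the suffix match is deliberately naive.

```python
# Hypothetical frozen words; in the patent these may be predefined,
# user-selected, or action-specific.
FROZEN_WORDS = ("stop", "thanks assistant")

def strip_frozen_word(transcription: str) -> str:
    """Remove a trailing frozen word so it never appears in the output.

    Simplified suffix match: a production system would respect word
    boundaries (so "nonstop" is not clipped) and use recognizer timing.
    """
    text = transcription.rstrip()
    lowered = text.lower()
    for fw in FROZEN_WORDS:
        if lowered.endswith(fw):
            return text[: len(text) - len(fw)].rstrip()
    return text
```

For example, `strip_frozen_word("send the message to Bob stop")` yields `"send the message to Bob"`, matching the claimed behavior of providing the transcription with the frozen word removed.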