US-12626697-B2 - System and method for keyword false alarm reduction

US 12626697 B2

Abstract

A method includes extracting, using a keyword detection model, audio features from audio data. The method also includes processing the audio features by a first layer of the keyword detection model configured to predict a first likelihood that the audio data includes speech. The method also includes processing the audio features by a second layer of the keyword detection model configured to predict a second likelihood that the audio data includes keyword-like speech. The method also includes processing the audio features by a third layer of the keyword detection model configured to predict a third likelihood, for each of a plurality of possible keywords, that the audio data includes the keyword. The method also includes identifying a keyword included in the audio data. The method also includes generating instructions to perform an action based at least in part on the identified keyword.
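The three-stage scoring described in the abstract can be sketched in Python. The combination rule below (a simple product of the three likelihoods) and all names and numbers are illustrative assumptions; the abstract does not specify how the inference combination layer fuses the three scores.

```python
def combine_likelihoods(p_speech, p_keyword_like, p_keywords):
    """Hypothetical inference-combination step: scale each per-keyword
    likelihood by the coarser speech and keyword-like likelihoods, then
    pick the highest-scoring keyword."""
    scores = [p_speech * p_keyword_like * p for p in p_keywords]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Made-up likelihoods for three hypothetical keywords:
idx, score = combine_likelihoods(0.9, 0.8, [0.05, 0.7, 0.1])
```

Because every per-keyword score is gated by the speech and keyword-like likelihoods, non-speech or non-keyword-like audio yields uniformly low scores, which is the intuition behind the false-alarm reduction.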

Inventors

  • Rakshith Sharma Srinivasa
  • Yashas Malur Saidutta
  • Ching-Hua Lee
  • Chou-Chang Yang
  • Yilin Shen
  • Hongxia Jin

Assignees

  • SAMSUNG ELECTRONICS CO., LTD.

Dates

Publication Date
2026-05-12
Application Date
2023-07-14

Claims (20)

  1. A method comprising: obtaining audio data from an audio input device; providing the audio data as input to a keyword detection model; extracting, using the keyword detection model, audio features from the audio data; processing the audio features by a first layer of the keyword detection model configured to predict a first likelihood that the audio data includes speech; processing the audio features by a second layer of the keyword detection model configured to predict a second likelihood that the audio data includes keyword-like speech; processing the audio features by a third layer of the keyword detection model configured to predict a third likelihood, for each of a plurality of possible keywords, that the audio data includes the keyword; providing each of the first likelihood, the second likelihood, and the third likelihood to an inference combination layer of the keyword detection model; identifying a keyword included in the audio data based on a combination of the first likelihood, the second likelihood, and the third likelihood by the inference combination layer; and generating instructions to perform an action based at least in part on the identified keyword.
  2. The method of claim 1, further comprising: processing the audio features by the second layer of the keyword detection model in response to the first likelihood exceeding a first threshold; and processing the audio features by the third layer of the keyword detection model in response to the second likelihood exceeding a second threshold, wherein identifying the keyword includes identifying which one of the plurality of possible keywords is associated with a highest third likelihood.
  3. The method of claim 1, wherein identifying the keyword includes determining, based on the first likelihood, the second likelihood, and each of the third likelihood, that the audio data includes the keyword.
  4. The method of claim 1, wherein the audio features are extracted from the audio data using a backbone model of the keyword detection model.
  5. The method of claim 1, wherein the keyword detection model is trained using a training dataset that includes a first set of audio data samples including non-speech audio, a second set of audio data samples including non-keyword speech, and a third set of audio data samples including a keyword, and wherein each audio data sample is annotated with a speech label indicating whether the audio data sample includes speech, a keyword-like label indicating whether the audio data sample includes keyword-like speech, and a keyword label identifying which keyword, if any, is in the audio data sample.
  6. The method of claim 5, wherein the first layer of the keyword detection model is trained using the first set of audio data samples, the second set of audio data samples, and the third set of audio data samples to distinguish between speech and non-speech audio, wherein the audio data samples including non-keyword speech and the audio data samples including a keyword are pooled into a speech class.
  7. The method of claim 5, wherein the second layer of the keyword detection model is trained using the second set of audio data samples and the third set of audio data samples to distinguish between non-keyword and keyword-like speech, wherein the audio data samples including keyword-like speech are pooled into a keyword-like class.
  8. The method of claim 5, wherein the third layer of the keyword detection model is trained using the third set of audio data samples.
  9. An electronic device comprising: at least one processing device configured to: obtain audio data from an audio input device; provide the audio data as input to a keyword detection model; extract, using the keyword detection model, audio features from the audio data; process the audio features by a first layer of the keyword detection model configured to predict a first likelihood that the audio data includes speech; process the audio features by a second layer of the keyword detection model configured to predict a second likelihood that the audio data includes keyword-like speech; process the audio features by a third layer of the keyword detection model configured to predict a third likelihood, for each of a plurality of possible keywords, that the audio data includes the keyword; provide each of the first likelihood, the second likelihood, and the third likelihood to an inference combination layer of the keyword detection model; identify a keyword included in the audio data based on a combination of the first likelihood, the second likelihood, and the third likelihood by the inference combination layer; and generate instructions to perform an action based at least in part on the identified keyword.
  10. The electronic device of claim 9, wherein the at least one processing device is further configured to: process the audio features by the second layer of the keyword detection model in response to the first likelihood exceeding a first threshold; and process the audio features by the third layer of the keyword detection model in response to the second likelihood exceeding a second threshold, wherein, to identify the keyword, the at least one processing device is further configured to identify which one of the plurality of possible keywords is associated with a highest third likelihood.
  11. The electronic device of claim 9, wherein, to identify the keyword, the at least one processing device is further configured to determine, based on the first likelihood, the second likelihood, and each of the third likelihood, that the audio data includes the keyword.
  12. The electronic device of claim 9, wherein the audio features are extracted from the audio data using a backbone model of the keyword detection model.
  13. The electronic device of claim 9, wherein the keyword detection model is trained using a training dataset that includes a first set of audio data samples including non-speech audio, a second set of audio data samples including non-keyword speech, and a third set of audio data samples including a keyword, and wherein each audio data sample is annotated with a speech label indicating whether the audio data sample includes speech, a keyword-like label indicating whether the audio data sample includes keyword-like speech, and a keyword label identifying which keyword, if any, is in the audio data sample.
  14. The electronic device of claim 13, wherein the first layer of the keyword detection model is trained using the first set of audio data samples, the second set of audio data samples, and the third set of audio data samples to distinguish between speech and non-speech audio, wherein the audio data samples including non-keyword speech and the audio data samples including a keyword are pooled into a speech class.
  15. The electronic device of claim 13, wherein the second layer of the keyword detection model is trained using the second set of audio data samples and the third set of audio data samples to distinguish between non-keyword and keyword-like speech, wherein the audio data samples including keyword-like speech are pooled into a keyword-like class.
  16. The electronic device of claim 13, wherein the third layer of the keyword detection model is trained using the third set of audio data samples.
  17. A non-transitory machine readable medium containing instructions that when executed cause at least one processor of an electronic device to: obtain audio data from an audio input device; provide the audio data as input to a keyword detection model; extract, using the keyword detection model, audio features from the audio data; process the audio features by a first layer of the keyword detection model configured to predict a first likelihood that the audio data includes speech; process the audio features by a second layer of the keyword detection model configured to predict a second likelihood that the audio data includes keyword-like speech; process the audio features by a third layer of the keyword detection model configured to predict a third likelihood, for each of a plurality of possible keywords, that the audio data includes the keyword; provide each of the first likelihood, the second likelihood, and the third likelihood to an inference combination layer of the keyword detection model; identify a keyword included in the audio data based on a combination of the first likelihood, the second likelihood, and the third likelihood by the inference combination layer; and generate instructions to perform an action based at least in part on the identified keyword.
  18. The non-transitory machine readable medium of claim 17, further comprising instructions that when executed cause the at least one processor to: process the audio features by the second layer of the keyword detection model in response to the first likelihood exceeding a first threshold; and process the audio features by the third layer of the keyword detection model in response to the second likelihood exceeding a second threshold, wherein, to identify the keyword, the instructions when executed further cause the at least one processor to identify which one of the plurality of possible keywords is associated with a highest third likelihood.
  19. The non-transitory machine readable medium of claim 17, wherein to identify the keyword, the instructions when executed further cause the at least one processor to determine, based on the first likelihood, the second likelihood, and each of the third likelihood, that the audio data includes the keyword.
  20. The non-transitory machine readable medium of claim 17, wherein: the keyword detection model is trained using a training dataset that includes a first set of audio data samples including non-speech audio, a second set of audio data samples including non-keyword speech, and a third set of audio data samples including a keyword; each audio data sample is annotated with a speech label indicating whether the audio data sample includes speech, a keyword-like label indicating whether the audio data sample includes keyword-like speech, and a keyword label identifying which keyword, if any, is in the audio data sample; the first layer of the keyword detection model is trained using the first set of audio data samples, the second set of audio data samples, and the third set of audio data samples to distinguish between speech and non-speech audio, wherein the audio data samples including non-keyword speech and the audio data samples including a keyword are pooled into a speech class; the second layer of the keyword detection model is trained using the second set of audio data samples and the third set of audio data samples to distinguish between non-keyword and keyword-like speech, wherein the audio data samples including keyword-like speech are pooled into a keyword-like class; and the third layer of the keyword detection model is trained using the third set of audio data samples.
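The cascaded evaluation of claims 2, 10, and 18 can be sketched as follows. The threshold values, the lambda stand-ins for the three layers, and the early-exit return value are illustrative assumptions, not the patent's implementation.

```python
def cascade_detect(audio_features, layer1, layer2, layer3, t1=0.5, t2=0.5):
    """Sketch of the cascade: each later layer runs only when the
    previous layer's likelihood exceeds its threshold."""
    p_speech = layer1(audio_features)
    if p_speech <= t1:
        return None  # no speech detected; skip remaining layers
    p_keyword_like = layer2(audio_features)
    if p_keyword_like <= t2:
        return None  # speech, but not keyword-like
    p_keywords = layer3(audio_features)  # one likelihood per keyword
    # Identify the keyword with the highest third likelihood.
    return max(range(len(p_keywords)), key=p_keywords.__getitem__)

# Toy stand-ins for the three trained layers:
result = cascade_detect(
    audio_features=None,
    layer1=lambda f: 0.9,
    layer2=lambda f: 0.8,
    layer3=lambda f: [0.1, 0.75, 0.05],
)
```

The early exits are what save device power: most ambient audio fails the first (speech) check, so the heavier keyword-classification layer rarely runs.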
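The triple-label annotation scheme of claims 5 through 8 can be illustrated with a small helper. The dictionary keys, the sample-type names, and the example keyword are hypothetical; the claims only require that each sample carry a speech label, a keyword-like label, and a keyword label.

```python
def annotate(sample_type, keyword=None):
    """Hypothetical labeling per claims 5-8. Note the pooling: keyword
    samples count as speech (claim 6) and as keyword-like (claim 7)."""
    if sample_type == "non_speech":
        return {"speech": 0, "keyword_like": 0, "keyword": None}
    if sample_type == "non_keyword_speech":
        return {"speech": 1, "keyword_like": 0, "keyword": None}
    if sample_type == "keyword":
        return {"speech": 1, "keyword_like": 1, "keyword": keyword}
    raise ValueError(f"unknown sample type: {sample_type}")

labels = [
    annotate("non_speech"),
    annotate("non_keyword_speech"),
    annotate("keyword", keyword="hypothetical_wake_word"),
]
```

With these labels, the first layer trains on all three sets (speech vs. non-speech), the second on the latter two sets (keyword-like vs. not), and the third only on keyword samples, matching claims 6 through 8.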

Description

CROSS-REFERENCE TO RELATED APPLICATION AND PRIORITY CLAIM

This application claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Patent Application No. 63/419,268 filed on Oct. 25, 2022, which is hereby incorporated by reference in its entirety.

TECHNICAL FIELD

This disclosure relates generally to machine learning systems. More specifically, this disclosure relates to a system and method for keyword false alarm reduction.

BACKGROUND

Voice-based interaction forms one of the fundamental ways in which people interact with smart devices. Such interactions can be controlled by "keywords," which are short words or phrases associated with specific follow-up actions. One example of a keyword is a wake-word, which is used to wake up a device from sleep mode. Keyword detection systems continuously process incoming audio streams to detect these keywords. These systems generally need a low false alarm rate while maintaining a high detection rate. A high false alarm rate causes unnecessary triggering of downstream applications, leading to undesirable outcomes such as unintended recording and sharing of user audio and wasteful device power consumption. This problem is especially exacerbated in real-world systems, which often process "out-of-domain" audio on which they have not been trained.

SUMMARY

This disclosure relates to a system and method for keyword false alarm reduction. In a first embodiment, a method includes obtaining audio data from an audio input device. The method also includes providing the audio data as input to a keyword detection model. The method also includes extracting, using the keyword detection model, audio features from the audio data. The method also includes processing the audio features by a first layer of the keyword detection model configured to predict a first likelihood that the audio data includes speech.
The method also includes processing the audio features by a second layer of the keyword detection model configured to predict a second likelihood that the audio data includes keyword-like speech. The method also includes processing the audio features by a third layer of the keyword detection model configured to predict a third likelihood, for each of a plurality of possible keywords, that the audio data includes the keyword. The method also includes identifying a keyword included in the audio data. The method also includes generating instructions to perform an action based at least in part on the identified keyword. In a second embodiment, an electronic device includes at least one processing device. The at least one processing device is configured to obtain audio data from an audio input device. The at least one processing device is also configured to provide the audio data as input to a keyword detection model. The at least one processing device is also configured to extract, using the keyword detection model, audio features from the audio data. The at least one processing device is also configured to process the audio features by a first layer of the keyword detection model configured to predict a first likelihood that the audio data includes speech. The at least one processing device is also configured to process the audio features by a second layer of the keyword detection model configured to predict a second likelihood that the audio data includes keyword-like speech. The at least one processing device is also configured to process the audio features by a third layer of the keyword detection model configured to predict a third likelihood, for each of a plurality of possible keywords, that the audio data includes the keyword. The at least one processing device is also configured to identify a keyword included in the audio data. 
The at least one processing device is also configured to generate instructions to perform an action based at least in part on the identified keyword. In a third embodiment, a non-transitory machine readable medium contains instructions that when executed cause at least one processor of an electronic device to obtain audio data from an audio input device. The non-transitory machine readable medium also contains instructions that when executed cause the at least one processor to provide the audio data as input to a keyword detection model. The non-transitory machine readable medium also contains instructions that when executed cause the at least one processor to extract, using the keyword detection model, audio features from the audio data. The non-transitory machine readable medium also contains instructions that when executed cause the at least one processor to process the audio features by a first layer of the keyword detection model configured to predict a first likelihood that the audio data includes speech. The non-transitory machine readable medium also contains instructions that when executed cause the at least one processor to process the audio features by a second layer of the keyword detection model configured to predict a second likelihood that the audio data includes keyword-like speech.