US-12626690-B2 - Systems, methods, and devices for low-power audio signal detection
Abstract
Systems, methods, and devices detect wake signals included in audio signals. Methods include receiving a dataset including raw audio data, the raw audio data comprising a plurality of audio samples and associated metadata, and generating, using one or more processing elements, an augmented dataset based on the raw audio data, the augmented dataset comprising a plurality of annotations identifying types of raw audio data. Methods further include generating, using the one or more processing elements, a feature dataset by extracting features from the augmented dataset based, at least in part, on the plurality of annotations, and generating, using the one or more processing elements, a wake signal detection model based, at least in part, on the feature dataset, the wake signal detection model being a machine learning model trained based on the feature dataset.
Inventors
- Aidan Smyth
- Ashutosh Pandey
- Niall Lyons
- Ted Wada
- Robert Zopf
Assignees
- CYPRESS SEMICONDUCTOR CORPORATION
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2023-08-14
Claims (20)
- 1 . A method comprising: receiving a dataset including raw audio data, the raw audio data comprising a plurality of audio samples and associated metadata; generating, using one or more processing elements, an augmented dataset based on the raw audio data, the augmented dataset comprising a plurality of annotations identifying types of raw audio data, the plurality of annotations being generated based, at least in part, on the metadata associated with the plurality of audio samples; generating, using the one or more processing elements, a feature dataset by extracting features from the augmented dataset based, at least in part, on the plurality of annotations and using the plurality of annotations to serialize the extracted features to generate an input for a wake signal detection model; and generating, using the one or more processing elements, the wake signal detection model based, at least in part, on the feature dataset, the wake signal detection model being a machine learning model trained based on the feature dataset, the generating of the wake signal detection model further comprising reducing a number of dimensions of the machine learning model based, at least in part, on power consumption characteristics of a target audio signal processing device.
- 2 . The method of claim 1 further comprising: generating, using the one or more processing elements, training data based, at least in part, on the feature dataset.
- 3 . The method of claim 2 , wherein the generating of the training data further comprises: concatenating at least some of the extracted features included in the feature dataset; and generating an output file based on the concatenation of the at least some of the extracted features.
- 4 . The method of claim 1 , wherein the generating of the augmented dataset further comprises: classifying a plurality of phonemes included in the raw audio data; generating a plurality of tokens based on the plurality of phonemes; and generating the plurality of annotations based on the plurality of tokens.
- 5 . The method of claim 4 , wherein the classifying is performed by an automatic speech recognition model, and wherein the plurality of tokens identify whether or not speech is present in each of the plurality of phonemes.
- 6 . The method of claim 1 further comprising: testing the wake signal detection model using test data.
- 7 . The method of claim 6 further comprising: modifying one or more weights associated with the wake signal detection model based on a result of the testing.
- 8 . The method of claim 1 further comprising: generating a low-power model based on the wake signal detection model, the low-power model having the reduced number of dimensions.
- 9 . The method of claim 8 , wherein the low-power model is configured to execute on a low-power device in real-time.
- 10 . A system comprising: a communications interface configured to receive raw audio data, the raw audio data comprising a plurality of audio samples and associated metadata; one or more processing elements configured to: generate a dataset based on the raw audio data received from the communications interface; generate an augmented dataset based on the raw audio data, the augmented dataset comprising a plurality of annotations identifying types of raw audio data, the plurality of annotations being generated based, at least in part, on the metadata associated with the plurality of audio samples; generate a feature dataset by extracting features from the augmented dataset based, at least in part, on the plurality of annotations and using the plurality of annotations to serialize the extracted features to generate an input for a wake signal detection model; and generate the wake signal detection model based, at least in part, on the feature dataset, the wake signal detection model being a machine learning model trained based on the feature dataset, the generating of the wake signal detection model further comprising reducing a number of dimensions of the machine learning model based, at least in part, on power consumption characteristics of a target audio signal processing device.
- 11 . The system of claim 10 , wherein the one or more processing elements are further configured to: generate training data based, at least in part, on the feature dataset.
- 12 . The system of claim 11 , wherein the one or more processing elements are further configured to: concatenate at least some of the extracted features included in the feature dataset; and generate an output file based on the concatenation of the at least some of the extracted features.
- 13 . The system of claim 10 , wherein the one or more processing elements are further configured to: classify a plurality of phonemes included in the raw audio data, wherein the classifying is performed by an automatic speech recognition model; generate a plurality of tokens based on the plurality of phonemes, wherein the plurality of tokens identify whether or not speech is present in each of the plurality of phonemes; and generate the plurality of annotations based on the plurality of tokens.
- 14 . The system of claim 10 , wherein the one or more processing elements are further configured to: generate a low-power model based on the wake signal detection model, the low-power model having the reduced number of dimensions.
- 15 . The system of claim 14 , wherein the low-power model is configured to execute on a low-power device in real-time.
- 16 . A device comprising: one or more processing elements configured to: receive a dataset including raw audio data, the raw audio data comprising a plurality of audio samples and associated metadata; generate an augmented dataset based on the raw audio data, the augmented dataset comprising a plurality of annotations identifying types of raw audio data, the plurality of annotations being generated based, at least in part, on the metadata associated with the plurality of audio samples; generate a feature dataset by extracting features from the augmented dataset based, at least in part, on the plurality of annotations and using the plurality of annotations to serialize the extracted features to generate an input for a wake signal detection model; and generate the wake signal detection model based, at least in part, on the feature dataset, the wake signal detection model being a machine learning model trained based on the feature dataset, the generating of the wake signal detection model further comprising reducing a number of dimensions of the machine learning model based, at least in part, on power consumption characteristics of a target audio signal processing device.
- 17 . The device of claim 16 , wherein the one or more processing elements are further configured to: generate training data based, at least in part, on the feature dataset.
- 18 . The device of claim 17 , wherein the one or more processing elements are further configured to: concatenate at least some of the extracted features included in the feature dataset; and generate an output file based on the concatenation of the at least some of the extracted features.
- 19 . The device of claim 16 , wherein the one or more processing elements are further configured to: classify a plurality of phonemes included in the raw audio data, wherein the classifying is performed by an automatic speech recognition model; generate a plurality of tokens based on the plurality of phonemes, wherein the plurality of tokens identify whether or not speech is present in each of the plurality of phonemes; and generate the plurality of annotations based on the plurality of tokens.
- 20 . The device of claim 16 , wherein the one or more processing elements are further configured to: generate a low-power model based on the wake signal detection model, wherein the low-power model has the reduced number of dimensions, and wherein the low-power model is configured to execute on a low-power device in real-time.
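The annotation, feature-extraction, and serialization steps recited in claims 1, 4, and 5 can be illustrated with a toy sketch. This is a hypothetical illustration only: the function names are invented, and a simple frame-energy threshold stands in for the automatic speech recognition model that the claims use to classify phonemes.

```python
# Toy sketch of the claimed annotation/feature pipeline (illustrative only).
# A frame-energy threshold stands in for an ASR phoneme classifier; all
# names and parameters here are hypothetical, not taken from the patent.
import numpy as np

FRAME = 160  # 10 ms frames at 16 kHz (illustrative choice)

def tokenize(samples, threshold=0.01):
    """Label each frame as speech (1) or non-speech (0) by mean energy."""
    n = len(samples) // FRAME
    frames = samples[:n * FRAME].reshape(n, FRAME)
    energy = (frames ** 2).mean(axis=1)
    return (energy > threshold).astype(int)

def annotate(tokens):
    """Collapse per-frame tokens into (start, end, label) annotations."""
    anns, start = [], 0
    for i in range(1, len(tokens) + 1):
        if i == len(tokens) or tokens[i] != tokens[start]:
            anns.append((start, i, "speech" if tokens[start] else "silence"))
            start = i
    return anns

def extract_features(samples, anns):
    """Per-annotation features: [duration, mean energy, zero-crossing rate]."""
    feats = {}
    for start, end, label in anns:
        seg = samples[start * FRAME:end * FRAME]
        feats[(start, end, label)] = np.array([
            end - start,
            float((seg ** 2).mean()),
            float(np.mean(np.abs(np.diff(np.sign(seg)))) / 2),
        ])
    return feats

def serialize(feats):
    """Order features by annotation start time to form one model input."""
    keys = sorted(feats, key=lambda k: k[0])
    return np.concatenate([feats[k] for k in keys])
```

Note how the annotations drive both steps, as in claim 1: they select which segments yield features, and their time ordering fixes the serialization of those features into the model input.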
Description
TECHNICAL FIELD
This disclosure relates to low-power devices, and more specifically, to enhancement of audio signal detection performed by such low-power devices.
BACKGROUND
Audio and voice control capabilities may be applied in systems and devices in a variety of contexts, such as smart devices and smart appliances. Such smart devices may include smart assistants, also referred to as virtual assistants, that are configured to respond to voice commands. For example, a user may provide a specific word and/or phrase that may trigger activation of the smart device. Such a phrase may include one or more specific wake words that wake the smart device, and may cause the smart device to perform one or more operations. Conventional techniques for processing such wake words remain limited in their ability to identify such wake words in a power-efficient and accurate manner.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates an example of a system for low-power audio signal detection, configured in accordance with some embodiments.
FIG. 2 illustrates an example of a device for low-power audio signal detection, configured in accordance with some embodiments.
FIG. 3 illustrates an example of a method for low-power audio signal detection, performed in accordance with some embodiments.
FIG. 4 illustrates an example of another method for low-power audio signal detection, performed in accordance with some embodiments.
FIG. 5 illustrates an example of an additional method for low-power audio signal detection, performed in accordance with some embodiments.
FIG. 6 illustrates an example of another method for low-power audio signal detection, performed in accordance with some embodiments.
FIG. 7 illustrates an example of a method for low-power audio signal detection, performed in accordance with some embodiments.
FIG. 8 illustrates an example of an additional method for low-power audio signal detection, performed in accordance with some embodiments.
DETAILED DESCRIPTION
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the presented concepts. The presented concepts may be practiced without some or all of these specific details. In other instances, well-known process operations have not been described in detail so as not to unnecessarily obscure the described concepts. While some concepts will be described in conjunction with the specific examples, it will be understood that these examples are not intended to be limiting.
Systems and devices may be configured to implement voice control functionalities for a variety of purposes, such as for smart devices and smart appliances. For example, smart devices may include smart assistants, also referred to as virtual assistants, that are configured to respond to voice commands. For example, a smart device may be in a dormant state and may be in a sleep mode. In response to detecting a particular auditory input, such as an audio signal which may include a particular word and/or phrase, the smart device may wake and listen for a command or a query. Conventional techniques for identifying such voice inputs and commands are limited because they utilize components having high power consumption characteristics, or may have relatively low accuracy when implemented in a low-power context.
Embodiments disclosed herein provide audio signal detection techniques having increased accuracy in low-power operational contexts. As will be discussed in greater detail below, embodiments disclosed herein perform data augmentation, feature extraction, and annotation to generate training data used to configure a machine learning model, and a low-power version of the machine learning model is then implemented in an audio signal processing device that may be a low-power device.
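The step of deriving a low-power version of the trained model can be sketched as follows. Everything in this sketch is a hypothetical assumption for illustration: the dense linear layer standing in for the trained model, the mapping from a power budget to a dimension count, and the truncated-SVD factorization are not the patent's actual procedure.

```python
# Illustrative sketch: shrink a trained model to fit a device power budget.
# The budget-to-dimensions mapping (mw_per_dim) and the SVD truncation are
# hypothetical assumptions for illustration, not the patent's procedure.
import numpy as np

def dims_for_budget(feature_dim, budget_mw, mw_per_dim=0.5):
    """Map a device power budget (mW) to a maximum model dimensionality."""
    return max(1, min(feature_dim, int(budget_mw / mw_per_dim)))

def reduce_model(weights, n_dims):
    """Factor a dense weight matrix into two thin ones via truncated SVD."""
    u, s, vt = np.linalg.svd(weights, full_matrices=False)
    a = u[:, :n_dims] * s[:n_dims]   # shape (out_dim, n_dims)
    b = vt[:n_dims]                  # shape (n_dims, in_dim)
    return a, b                      # weights ~= a @ b, with fewer parameters

rng = np.random.default_rng(1)
full = rng.normal(size=(64, 64))          # stand-in for a trained dense layer
n = dims_for_budget(64, budget_mw=8.0)    # hypothetical 8 mW device budget
a, b = reduce_model(full, n)
```

The factored pair stores fewer parameters and needs fewer multiply-accumulates per inference than the dense matrix, which is one common way a reduced-dimension model lowers power draw on a constrained device.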
Accordingly, the model may be trained to increase the accuracy of the low-power device used for audio signal detection, and also increase overall power efficiency by reducing inadvertent and erroneous wake transitions.
FIG. 1 illustrates an example of a system for low-power audio signal detection, configured in accordance with some embodiments. Accordingly, a system, such as system 100, may include various devices which may communicate with each other via a network, such as network 104. Moreover, one or more of the devices may include a low-power circuit configured to identify a wake word or phrase. As will be discussed in greater detail below, embodiments disclosed herein are configured to increase the efficiency and accuracy of such devices when identifying such wake words and phrases. Accordingly, system 100 includes an audio signal processing device, such as audio signal processing device 102, which is configured to receive and analyze audio input to identify the presence of a particular audio signal, which may be a wake word or phrase. In one example, audio signal processing device 102 is a smart home device configured to suppo