US-12626717-B2 - Linear filtering for noise-suppressed speech detection

Abstract

Systems and methods for suppressing noise and detecting voice input in a multi-channel audio signal captured by a plurality of microphones include (i) capturing a first audio signal via a first microphone and a second audio signal via a second microphone, wherein the first and second audio signals respectively comprise first and second noise content from a noise source; (ii) identifying the first noise content in the first audio signal; (iii) using the identified first noise content to determine an estimated noise content captured by the plurality of microphones; (iv) using the estimated noise content to suppress the first and second noise content in the first and second audio signals; (v) combining the suppressed first and second audio signals into a third audio signal; and (vi) determining that the third audio signal includes a voice input comprising a wake word.

Inventors

Saeed Bagheri Sereshki
Daniele Giacobello

Assignees

SONOS, INC.

Dates

Publication Date: 20260512
Application Date: 20230608

Claims (20)

1. A network device comprising: a plurality of microphones comprising a first microphone and a second microphone; one or more processors; and tangible, non-transitory, computer-readable media storing instructions executable by the one or more processors to cause the network device to perform operations comprising: capturing (i) a first audio signal via the first microphone and (ii) a second audio signal via the second microphone, wherein the first audio signal comprises first noise content from a noise source and the second audio signal comprises second noise content from the noise source; identifying the first noise content in the first audio signal; using the identified first noise content to determine an estimated noise content captured by the plurality of microphones; using the estimated noise content to suppress the first noise content in the first audio signal and the second noise content in the second audio signal; combining the suppressed first audio signal and the suppressed second audio signal into a third audio signal; and determining a probability that the first audio signal comprises speech content, wherein the steps of (i) identifying the first noise content in the first audio signal and (ii) using the identified first noise content to determine the estimated noise content captured by the plurality of microphones are carried out based on the determined probability being below a speech threshold probability, and wherein the first and second microphones are disposed along a housing of the network device and separated from one another by a distance that is greater than about five centimeters.
2. The network device of claim 1, wherein the operations further comprise: capturing a fourth audio signal via a third microphone of the plurality of microphones, wherein the fourth audio signal comprises third noise content from the noise source; identifying the third noise content in the fourth audio signal; and using the identified third noise content to update the estimated noise content captured by the plurality of microphones.
3. The network device of claim 2, wherein the network device captures the fourth audio signal concurrently with capturing the first and second audio signals.
4. The network device of claim 2, wherein the third microphone is disposed along the housing and is separated from the first microphone and the second microphone by a distance that is greater than about five centimeters.
5. The network device of claim 1, wherein the operations further comprise: determining that the third audio signal includes a voice input comprising a wake word; and in response to the determination, transmitting a portion of the voice input after the wake word to a separate computing system for voice analysis.
6. The network device of claim 1, wherein using the estimated noise content to suppress the first noise content in the first audio signal and the second noise content in the second audio signal comprises: using a multi-channel Wiener filter (MCWF) to filter out the estimated noise content from the first audio signal and the second audio signal.
7. The network device of claim 1, wherein the operations further comprise: identifying the second noise content in the second audio signal, wherein using the identified first noise content to determine the estimated noise content captured by the plurality of microphones comprises: using the identified first noise content and the identified second noise content to determine the estimated noise content captured by the plurality of microphones.
8. Tangible, non-transitory, computer-readable media storing instructions executable by one or more processors to cause a network device to perform operations comprising: capturing, via a plurality of microphones of a network device, (i) a first audio signal via a first microphone of the plurality of microphones and (ii) a second audio signal via a second microphone of the plurality of microphones, wherein the first audio signal comprises first noise content from a noise source and the second audio signal comprises second noise content from the noise source; identifying the first noise content in the first audio signal; using the identified first noise content to determine an estimated noise content captured by the plurality of microphones; using the estimated noise content to suppress the first noise content in the first audio signal and the second noise content in the second audio signal; combining the suppressed first audio signal and the suppressed second audio signal into a third audio signal; and determining a probability that the first audio signal comprises speech content, wherein the steps of (i) identifying the first noise content in the first audio signal and (ii) using the identified first noise content to determine the estimated noise content captured by the plurality of microphones are carried out based on the determined probability being below a speech threshold probability, and wherein the first and second microphones are disposed along a housing of the network device and separated from one another by a distance that is greater than about five centimeters.
9. The tangible, non-transitory, computer-readable media of claim 8, wherein the operations further comprise: capturing a fourth audio signal via a third microphone of the plurality of microphones, wherein the fourth audio signal comprises third noise content from the noise source; identifying the third noise content in the fourth audio signal; and using the identified third noise content to update the estimated noise content captured by the plurality of microphones.
10. The tangible, non-transitory, computer-readable media of claim 9, wherein the fourth audio signal is captured concurrently with the first and second audio signals.
11. The tangible, non-transitory, computer-readable media of claim 9, wherein the third microphone is disposed along the housing and is separated from the first microphone and the second microphone by a distance that is greater than about five centimeters.
12. The tangible, non-transitory, computer-readable media of claim 8, wherein the operations further comprise: determining that the third audio signal includes a voice input comprising a wake word; and in response to the determination, transmitting a portion of the voice input after the wake word to a separate computing system for voice analysis.
13. The tangible, non-transitory, computer-readable media of claim 8, wherein using the estimated noise content to suppress the first noise content in the first audio signal and the second noise content in the second audio signal comprises: using a multi-channel Wiener filter (MCWF) to filter out the estimated noise content from the first audio signal and the second audio signal.
14. The tangible, non-transitory, computer-readable media of claim 8, wherein the operations further comprise: identifying the second noise content in the second audio signal, wherein using the identified first noise content to determine the estimated noise content captured by the plurality of microphones comprises: using the identified first noise content and the identified second noise content to determine the estimated noise content captured by the plurality of microphones.
15. A method comprising: capturing, via a plurality of microphones of a network device, (i) a first audio signal via a first microphone of the plurality of microphones and (ii) a second audio signal via a second microphone of the plurality of microphones, wherein the first audio signal comprises first noise content from a noise source and the second audio signal comprises second noise content from the noise source; identifying the first noise content in the first audio signal; using the identified first noise content to determine an estimated noise content captured by the plurality of microphones; using the estimated noise content to suppress the first noise content in the first audio signal and the second noise content in the second audio signal; combining the suppressed first audio signal and the suppressed second audio signal into a third audio signal; and determining a probability that the first audio signal comprises speech content, wherein the steps of (i) identifying the first noise content in the first audio signal and (ii) using the identified first noise content to determine the estimated noise content captured by the plurality of microphones are carried out based on the determined probability being below a speech threshold probability, and wherein the first and second microphones are disposed along a housing of the network device and separated from one another by a distance that is greater than about five centimeters.
16. The method of claim 15, further comprising: capturing a fourth audio signal via a third microphone of the plurality of microphones, wherein the fourth audio signal comprises third noise content from the noise source; identifying the third noise content in the fourth audio signal; and using the identified third noise content to update the estimated noise content captured by the plurality of microphones.
17. The method of claim 16, wherein the fourth audio signal is captured concurrently with the first and second audio signals.
18. The method of claim 16, wherein the third microphone is disposed along the housing and is separated from the first microphone and the second microphone by a distance that is greater than about five centimeters.
19. The method of claim 16, wherein using the estimated noise content to suppress the first noise content in the first audio signal and the second noise content in the second audio signal comprises: using a multi-channel Wiener filter (MCWF) to filter out the estimated noise content from the first audio signal and the second audio signal.
20. The method of claim 15, further comprising: identifying the second noise content in the second audio signal, wherein using the identified first noise content to determine the estimated noise content captured by the plurality of microphones comprises: using the identified first noise content and the identified second noise content to determine the estimated noise content captured by the plurality of microphones.
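
Illustrative sketch (not part of the claims): the independent claims gate the noise estimate on a speech probability falling below a threshold, and claims 6, 13, and 19 recite a multi-channel Wiener filter (MCWF). The Python below is one possible reading for a single frequency bin, under assumptions that are not in the claims: NumPy is used, the noise estimate is held as a spatial covariance matrix updated by recursive averaging, and the names update_noise_covariance, mcwf_weights, apply_filter, threshold, and alpha are hypothetical.

```python
import numpy as np

def update_noise_covariance(R_n, x, speech_prob, threshold=0.5, alpha=0.9):
    """Update the noise covariance for one frequency bin (assumed form).

    R_n: (M, M) current noise covariance across M microphones.
    x: (M,) complex STFT coefficients of the current frame.
    The update runs only when speech_prob is below threshold.
    """
    if speech_prob >= threshold:
        return R_n  # speech is likely present: keep the previous estimate
    # Recursive averaging of the instantaneous outer product x x^H.
    return alpha * R_n + (1.0 - alpha) * np.outer(x, x.conj())

def mcwf_weights(R_y, R_n, ref=0):
    """Textbook MCWF weights w = R_y^{-1} (R_y - R_n) e_ref.

    R_y is the covariance of the noisy input, R_n the estimated noise
    covariance, and ref the index of the reference microphone.
    """
    e_ref = np.zeros(R_y.shape[0])
    e_ref[ref] = 1.0
    R_s = R_y - R_n  # estimated speech covariance
    return np.linalg.solve(R_y, R_s @ e_ref)

def apply_filter(w, x):
    """Single-channel output w^H x for one frequency bin."""
    return np.vdot(w, x)  # np.vdot conjugates its first argument
```

With a zero noise estimate the weights reduce to a selector for the reference microphone, and with R_n equal to R_y the output is fully suppressed, which gives two quick sanity checks for any variant of this sketch.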

Description

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims priority under 35 U.S.C. § 120 to, and is a continuation of, U.S. patent application Ser. No. 16/949,973 filed on Nov. 23, 2020, entitled “Linear Filtering for Noise-Suppressed Speech Detection,” which claims priority under 35 U.S.C. § 120 to, and is a continuation of, U.S. patent application Ser. No. 15/984,073 filed on May 18, 2018, and issued as U.S. Pat. No. 10,847,178 on Nov. 24, 2020, entitled “Linear Filtering for Noise-Suppressed Speech Detection.” The content of these applications is incorporated herein by reference in its entirety.

FIELD OF THE DISCLOSURE

The disclosure is related to consumer goods and, more particularly, to methods, systems, products, features, services, and other elements directed to media playback and aspects thereof.

BACKGROUND

Options for accessing and listening to digital audio in an out-loud setting were limited until 2003, when Sonos, Inc. filed for one of its first patent applications, entitled “Method for Synchronizing Audio Playback between Multiple Network devices,” and began offering a media playback system for sale in 2005. The Sonos Wireless HiFi System enables people to experience music from many sources via one or more networked playback devices. Through a software control application installed on a smartphone, tablet, or computer, one can play what he or she wants in any room that has a networked playback device. Additionally, using the controller, for example, different songs can be streamed to each room with a playback device, rooms can be grouped together for synchronous playback, or the same song can be heard in all rooms synchronously.

Given the ever-growing interest in digital media, there continues to be a need to develop consumer-accessible technologies to further enhance the listening experience.

SUMMARY

The present disclosure describes systems and methods for, among other things, processing audio content captured by multiple networked microphones in order to suppress noise content from the captured audio and detect a voice input in the captured audio.

Some example embodiments involve capturing, via a plurality of microphones of a network device, (i) a first audio signal via a first microphone of the plurality of microphones and (ii) a second audio signal via a second microphone of the plurality of microphones. The first audio signal comprises first noise content from a noise source and the second audio signal comprises second noise content from the same noise source. The network device identifies the first noise content in the first audio signal and uses the identified first noise content to determine an estimated noise content captured by the plurality of microphones. Then the network device uses the estimated noise content to suppress the first noise content in the first audio signal and the second noise content in the second audio signal. The network device combines the suppressed first audio signal and the suppressed second audio signal into a third audio signal. Finally, the network device determines that the third audio signal includes a voice input comprising a wake word and, in response to the determination, transmits at least a portion of the voice input to a remote computing device for voice processing to identify a voice utterance different from the wake word.

Some embodiments include an article of manufacture comprising tangible, non-transitory, computer-readable media storing program instructions that, upon execution by one or more processors of a network device, cause the network device to perform operations in accordance with the example embodiments disclosed herein.

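
Illustrative sketch (not part of the disclosure): the sequence summarized above (estimate noise from the first microphone, suppress it in both signals, combine them into a third signal, then check for a wake word) can be outlined in Python. This is a simplified stand-in under stated assumptions: it uses NumPy, applies a per-channel spectral Wiener-style gain rather than the multi-channel filter described elsewhere in this document, combines by plain averaging, and treats the wake-word detector as a caller-supplied function. All names (estimate_noise_psd, suppress, process, detect_wake_word, floor) are hypothetical.

```python
import numpy as np

def estimate_noise_psd(noise_frames):
    """Average power spectrum of first-microphone frames judged noise-only.

    noise_frames: (K, N) array of K time-domain frames of N samples each.
    """
    spectra = np.fft.rfft(noise_frames, axis=-1)
    return np.mean(np.abs(spectra) ** 2, axis=0)

def suppress(frame, noise_psd, floor=1e-3):
    """Apply a per-bin Wiener-style gain to one microphone frame."""
    spec = np.fft.rfft(frame)
    power = np.abs(spec) ** 2
    # Gain near 1 where the frame dominates the noise estimate, and
    # clamped to a small floor where the noise estimate dominates.
    gain = np.maximum(1.0 - noise_psd / np.maximum(power, 1e-12), floor)
    return np.fft.irfft(gain * spec, n=len(frame))

def process(frame_1, frame_2, noise_psd, detect_wake_word):
    """Suppress noise in both frames, combine them, and run the detector."""
    clean_1 = suppress(frame_1, noise_psd)
    clean_2 = suppress(frame_2, noise_psd)
    combined = 0.5 * (clean_1 + clean_2)  # the "third audio signal"
    return combined, detect_wake_word(combined)
```

Here the noise estimate derived from the first microphone is reused for the second, mirroring the summary's use of the identified first noise content as an estimate for the plurality of microphones. A real device would more likely operate on overlapping windowed frames.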
Some embodiments include a network device comprising one or more processors, as well as tangible, non-transitory, computer-readable media storing program instructions that, upon execution by the one or more processors, cause the network device to perform operations in accordance with the example embodiments disclosed herein.

This summary overview is illustrative only and is not intended to be limiting. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features will become apparent by reference to the figures and the following detailed description.

BRIEF DESCRIPTION OF THE DRAWINGS

Features, aspects, and advantages of the presently disclosed technology may be better understood with regard to the following description, appended claims, and accompanying drawings where:

FIG. 1 shows an example media playback system configuration in which certain embodiments may be practiced;
FIG. 2 shows a functional block diagram of an example playback device;
FIG. 3 shows a functional block diagram of an example control device;
FIG. 4 shows an example controller interface;
FIG. 5 shows an example plurality of network devices;
FIG. 6 shows a functional block diagram of an example network microphone device;
FIG. 7A shows an example network device having micr