US-12626712-B2 - Speech enhancement system
Abstract
A method of suppressing noise may include receiving a sequence of audio frames representing a multi-channel audio signal. The method may include determining a likelihood of speech in a first audio frame of the sequence of audio frames based on a Gaussian mixture model. Further, the method may include generating a first audio signal based on the likelihood of speech in the first audio frame and a second audio signal representing a first speech component of a second audio frame. The second audio frame follows the first audio frame in the sequence of audio frames. The method may also include determining, using a neural network model, a likelihood of speech in the second audio frame based on the first audio signal, and filtering a noise component of the second audio frame based at least in part on the likelihood of speech in the second audio frame.
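The abstract describes a recursive interplay across consecutive frames: a GMM-based speech likelihood for frame N is combined with a speech estimate for frame N+1 to form an enhanced signal, which a neural-network VAD then scores before the noise component is filtered. The toy Python sketch below illustrates only that data flow; the GMM, the neural network, and the filter are stubbed with hypothetical placeholder functions (`gmm_speech_likelihood`, `nn_vad`), and nothing here is the patented implementation.

```python
import numpy as np

def gmm_speech_likelihood(frame):
    # Placeholder for the GMM-based prior: higher frame energy -> higher likelihood.
    energy = float(np.mean(frame ** 2))
    return energy / (energy + 1.0)

def nn_vad(enhanced):
    # Placeholder for the neural-network VAD posterior on the enhanced signal.
    energy = float(np.mean(enhanced ** 2))
    return energy / (energy + 0.5)

def suppress_noise(frames):
    """Process frames in sequence: the GMM likelihood of frame N, held in a
    one-frame delay, gates the enhanced signal scored by the VAD on frame N+1."""
    delayed_likelihood = 0.5  # delay component; neutral prior for the first frame
    out = []
    for frame in frames:
        initial = frame                            # stand-in for the initial speech estimate
        enhanced = delayed_likelihood * initial    # enhanced signal for this frame
        p_speech = nn_vad(enhanced)                # neural-network likelihood of speech
        out.append(p_speech * frame)               # crude noise filtering by soft gain
        delayed_likelihood = gmm_speech_likelihood(frame)  # stored for the next frame
    return out

frames = [np.ones(8) * g for g in (0.1, 1.0, 1.0)]
filtered = suppress_noise(frames)
```

The soft gain here is only a stand-in; the claims instead recite a spatial filter (e.g., an MVDR beamformer) driven by the speech likelihood.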
Inventors
- Saeed Mosayyebpour Kaskari
- Gandhi Namani
Assignees
- SYNAPTICS INCORPORATED
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2023-04-19
Claims (20)
- 1 . A method of suppressing noise, comprising: receiving a sequence of audio frames representing a multi-channel audio signal, the sequence of audio frames including at least a first audio frame (N) and a second audio frame (N+1) that immediately follows the first audio frame N in the sequence of audio frames; determining a likelihood of speech in the first audio frame N based on a Gaussian mixture model (GMM); generating an initial audio signal based on a speech component of the second audio frame N+1; generating an enhanced audio signal for the second audio frame N+1 based on the likelihood of speech in the first audio frame N and the initial audio signal; determining, using a neural network model, a likelihood of speech in the second audio frame N+1 based on the enhanced audio signal; and filtering a noise component of the second audio frame N+1 based at least in part on the likelihood of speech in the second audio frame N+1.
- 2 . The method of claim 1 , further comprising: determining a first voice activity detection value based on the initial audio signal and the likelihood of speech in the second audio frame N+1.
- 3 . The method of claim 2 , further comprising: determining a second voice activity detection value based at least in part on a second speech component of the second audio frame N+1 and the noise component of the second audio frame N+1.
- 4 . The method of claim 3 , further comprising: determining a set of parameters for the GMM based at least in part on the first and second voice activity detection values, wherein the likelihood of speech in the second audio frame N+1 is determined, using the GMM, based on the set of parameters.
- 5 . The method of claim 4 , further comprising: generating a second enhanced audio signal based at least in part on the likelihood of speech in the second audio frame N+1 and the second speech component of the second audio frame N+1.
- 6 . The method of claim 5 , wherein the second enhanced audio signal is determined using a single channel post-filter.
- 7 . The method of claim 6 , wherein the single channel post-filter comprises a Wiener filter.
- 8 . The method of claim 1 , further comprising: storing the likelihood of speech in the first audio frame N in a delay component prior to generating the enhanced audio signal.
- 9 . The method of claim 1 , wherein the noise component of the second audio frame N+1 is filtered using a spatial filter.
- 10 . The method of claim 9 , wherein the spatial filter comprises a minimum variance distortionless response beamformer or an independent component analysis.
- 11 . The method of claim 1 , wherein the neural network model comprises a deep neural network model.
- 12 . The method of claim 1 , wherein the GMM comprises an online GMM.
- 13 . A system, comprising: a processing system; and a memory storing instructions that, when executed by the processing system, cause the system to: receive a sequence of audio frames representing a multi-channel audio signal, the sequence of audio frames including at least a first audio frame (N) and a second audio frame (N+1) that immediately follows the first audio frame N in the sequence of audio frames; determine a likelihood of speech in the first audio frame N based on a Gaussian mixture model (GMM); generate an initial audio signal based on a speech component of the second audio frame N+1; generate an enhanced audio signal for the second audio frame N+1 based on the likelihood of speech in the first audio frame N and the initial audio signal; determine, using a neural network model, a likelihood of speech in the second audio frame N+1 based on the enhanced audio signal; and filter a noise component of the second audio frame N+1 based at least in part on the likelihood of speech in the second audio frame N+1.
- 14 . The system of claim 13 , wherein execution of the instructions further causes the system to: determine a first voice activity detection value based on the initial audio signal and the likelihood of speech in the second audio frame N+1.
- 15 . The system of claim 14 , wherein execution of the instructions further causes the system to: determine a second voice activity detection value based at least in part on a second speech component of the second audio frame N+1 and the noise component of the second audio frame N+1.
- 16 . The system of claim 15 , wherein execution of the instructions further causes the system to: determine a set of parameters for the GMM based at least in part on the first and second voice activity detection values, wherein the likelihood of speech in the second audio frame N+1 is determined, using the GMM, based on the set of parameters.
- 17 . The system of claim 16 , wherein execution of the instructions further causes the system to: generate a second enhanced audio signal based at least in part on the likelihood of speech in the second audio frame N+1 and the second speech component of the second audio frame N+1.
- 18 . The system of claim 17 , wherein the second enhanced audio signal is determined using a single channel post-filter.
- 19 . The system of claim 18 , wherein the single channel post-filter comprises a Wiener filter.
- 20 . The system of claim 13 , wherein execution of the instructions further causes the system to: store the likelihood of speech in the first audio frame N in a delay component prior to generating the enhanced audio signal.
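Claims 9 and 10 recite filtering the noise component with a spatial filter such as a minimum variance distortionless response (MVDR) beamformer. As a hedged illustration only (not code from the patent), the standard MVDR weight formula w = R^{-1} d / (d^H R^{-1} d) for a single frequency bin can be sketched as:

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """MVDR beamformer weights w = R^{-1} d / (d^H R^{-1} d) for one frequency bin.

    noise_cov: (M, M) noise spatial covariance matrix R.
    steering:  (M,) steering vector d toward the look direction.
    """
    rinv_d = np.linalg.solve(noise_cov, steering)   # R^{-1} d without explicit inverse
    return rinv_d / (steering.conj() @ rinv_d)      # normalize for distortionless response

# Toy 2-microphone example with spatially white noise and a broadside source.
R = np.eye(2, dtype=complex)
d = np.array([1.0, 1.0], dtype=complex)
w = mvdr_weights(R, d)
```

The distortionless constraint means w^H d = 1: the beamformer passes the look direction unchanged while minimizing output noise power.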
Description
TECHNICAL FIELD

The present embodiments relate generally to signal processing, and specifically to signal processing techniques for speech enhancement.

BACKGROUND OF RELATED ART

A hands-free communication device may include a microphone array configured to convert sound waves into a multi-channel audio signal, which may be transmitted over a communications channel to a receiving device. The multi-channel audio signal may be represented in the time-frequency domain as a sequence of frames, and may include speech (e.g., from a user of the communication device) and noise (e.g., from a reverberant enclosure). Before the multi-channel audio signal is transmitted to the receiving device, the communication device may employ a signal processing technique known as speech enhancement, which attempts to suppress the noise in the multi-channel audio signal while reducing or minimizing speech distortion.

Some communication devices may use a spatial filter (e.g., a beamformer) for speech enhancement. The spatial filter may utilize a Voice Activity Detector (also referred to as a “VAD”) to determine the presence or absence of speech in each frame of the multi-channel audio signal. Some VADs may be implemented using machine learning (such as a neural network based on a neural network model). However, the accuracy of such VADs may suffer due to differences between the data used to train and test the neural network model, or due to a high amount of noise in the audio signals input to the neural network. Some communication devices may also use a post-filter, such as a binary mask or Wiener-like gain, to suppress residual noise in the enhanced speech signal produced by the spatial filter. However, such post-filters do not explicitly model uncertainty in the spatial filter, and thus require a heuristic tuning hyperparameter optimized to avoid distorting the enhanced speech signal.
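The background mentions a Wiener-like gain as one kind of post-filter for residual noise. A minimal sketch of such a per-bin gain, G = xi / (1 + xi) with a gain floor, is shown below; it assumes speech and noise power spectral density estimates are already available, and the floor value is purely illustrative, not taken from the patent.

```python
import numpy as np

def wiener_gain(speech_psd, noise_psd, floor=0.05):
    """Wiener-like per-bin gain G = xi / (1 + xi), where xi is the a priori SNR.

    A gain floor limits musical-noise artifacts; 0.05 is an illustrative choice,
    a stand-in for the heuristic tuning hyperparameter the background describes.
    """
    xi = speech_psd / np.maximum(noise_psd, 1e-12)  # a priori SNR per frequency bin
    gain = xi / (1.0 + xi)                          # Wiener gain, in [0, 1)
    return np.maximum(gain, floor)

# Toy spectra: two bins dominated by speech, two dominated by noise.
speech_psd = np.array([4.0, 9.0, 0.01, 0.0])
noise_psd = np.array([1.0, 1.0, 1.0, 1.0])
g = wiener_gain(speech_psd, noise_psd)
```

Applying `g` multiplicatively to the corresponding spectral bins attenuates noise-dominated bins while leaving speech-dominated bins nearly untouched.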
SUMMARY

This Summary is provided to introduce in a simplified form a selection of concepts that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter.

One innovative aspect of the subject matter of this disclosure can be implemented in a method of suppressing noise. The method may include receiving a sequence of audio frames representing a multi-channel audio signal. The method may further include determining a likelihood of speech in a first audio frame of the sequence of audio frames based on a Gaussian mixture model. The method may include generating a first audio signal based on the likelihood of speech in the first audio frame and a second audio signal representing a first speech component of a second audio frame. The second audio frame follows the first audio frame in the sequence of audio frames. The method may also include determining, using a neural network model, a likelihood of speech in the second audio frame based on the first audio signal. The method may include filtering a noise component of the second audio frame based at least in part on the likelihood of speech in the second audio frame.

Another innovative aspect of the subject matter of this disclosure can be implemented in a system including a processing system and a memory. The memory may store instructions that, when executed by the processing system, cause the system to receive a sequence of audio frames representing a multi-channel audio signal, and determine a likelihood of speech in a first audio frame of the sequence of audio frames based on a Gaussian mixture model. Execution of the instructions may further cause the system to generate a first audio signal based on the likelihood of speech in the first audio frame and a second audio signal representing a first speech component of a second audio frame.
The second audio frame follows the first audio frame in the sequence of audio frames. Execution of the instructions may further cause the system to determine, using a neural network model, a likelihood of speech in the second audio frame based on the first audio signal, and filter a noise component of the second audio frame based at least in part on the likelihood of speech in the second audio frame.

BRIEF DESCRIPTION OF THE DRAWINGS

The present embodiments are illustrated by way of example and are not intended to be limited by the figures of the accompanying drawings.

FIG. 1 shows a block diagram of an example audio processing system, according to some embodiments.
FIG. 2 shows a block diagram of an example speech enhancement system, according to some embodiments.
FIG. 3 shows a block diagram of an example speech enhancement system, according to some embodiments.
FIG. 4 shows an illustrative flowchart depicting an example operation for processing audio signals, according to some embodiments.

DETAILED DESCRIPTION

In the following description, numerous specific details are set forth such as examples of specific