
US-12620403-B2 - Neural noise reduction with linear and nonlinear filtering for single-channel audio signals

US12620403B2

Abstract

This disclosure provides methods, devices, and systems for audio signal processing. The present implementations more specifically relate to speech enhancement techniques that combine statistical signal processing with neural network inferencing. In some aspects, a speech enhancement system may include a linear filter, a deep neural network (DNN), and a nonlinear post-filter. The linear filter and the nonlinear post-filter are configured to suppress noise in audio signals using statistical signal processing techniques. More specifically, the linear filter denoises an input audio signal based on a temporal correlation between successive frames of the audio signal. The DNN infers a speech signal and a noise signal (representing a speech component and a noise component, respectively, of the audio signal) based on the denoised audio signal. The nonlinear post-filter suppresses residual noise in the speech signal based on one or more Gaussian mixture models (GMMs) associated with the speech signal and the noise signal.
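The three-stage pipeline described in the abstract (linear temporal filter, DNN speech-probability inference, nonlinear post-filter) can be illustrated for a single frequency bin. The sketch below is a minimal toy, not the patented MF-MVDR or GMM math: the linear stage is replaced by a simple multi-frame average, and the DNN stage by a given speech probability. All function names and parameters are hypothetical.

```python
import numpy as np

def enhance_bin(frames, p_speech):
    """Toy per-bin pipeline (illustrative stand-in for the patented stages).

    frames: complex STFT values of one frequency bin over successive frames.
    p_speech: DNN-style probability of speech for the current frame (0..1).
    """
    # 1) Linear stage: exploit temporal correlation across successive frames
    #    by averaging recent frames (a crude stand-in for the multi-frame
    #    minimum variance distortionless response filter).
    y = np.mean(frames[-3:])
    # 2) Split the denoised bin into speech and noise estimates using the
    #    speech probability (stand-in for the DNN inference stage).
    s, n = p_speech * y, (1.0 - p_speech) * y
    # 3) Nonlinear post-filter: Wiener-style spectral suppression gain that
    #    attenuates residual noise in the speech estimate.
    gain = abs(s) ** 2 / (abs(s) ** 2 + abs(n) ** 2 + 1e-12)
    return gain * s

# A bin with high speech probability is mostly preserved; a low-probability
# bin is strongly suppressed by the post-filter gain.
out = enhance_bin(np.array([1.0 + 0j, 1.0 + 0j, 1.0 + 0j]), p_speech=0.9)
```

In a real system the three stages operate per time-frequency bin on an STFT, and the speech probability comes from the compact neural network model rather than being supplied by the caller.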

Inventors

  • Saeed Mosayyebpour Kaskari
  • Gandhi Namani

Assignees

  • SYNAPTICS INCORPORATED

Dates

Publication Date
2026-05-05
Application Date
2023-05-02

Claims (20)

  1. A method of speech enhancement, comprising: receiving a series of frames of an audio signal; denoising a first frame in the series of frames based at least in part on a temporal correlation between the series of frames; inferring a probability of speech associated with the denoised first frame based on a neural network model; generating a first speech signal and a first noise signal based on the probability of speech associated with the denoised first frame, the first speech signal and the first noise signal representing a speech component and a noise component, respectively, of the audio signal in the first frame; determining a first spectral suppression gain based on the first speech signal and the first noise signal; and suppressing residual noise in the first speech signal based on the first spectral suppression gain.
  2. The method of claim 1, wherein the audio signal comprises a single channel of audio data.
  3. The method of claim 1, wherein the first frame is denoised based on a multi-frame minimum variance distortionless response (MF-MVDR) beamformer that reduces a power of the noise component of the audio signal without distorting the speech component.
  4. The method of claim 1, further comprising: determining an interframe correlation (IFC) vector associated with a speech component of the audio signal based at least in part on the probability of speech associated with the denoised first frame; denoising a second frame in the series of frames based at least in part on the IFC vector; inferring a probability of speech associated with the denoised second frame based on the neural network model; generating a second speech signal and a second noise signal based on the probability of speech associated with the denoised second frame, the second speech signal and the second noise signal representing a speech component and a noise component, respectively, of the audio signal in the second frame; determining a second spectral suppression gain based on the second speech signal and the second noise signal; and suppressing residual noise in the second speech signal based on the second spectral suppression gain.
  5. The method of claim 1, wherein the speech component of the audio signal in the denoised first frame is equal to the speech component of the audio signal in the first speech signal.
  6. The method of claim 1, wherein the determining of the first spectral suppression gain comprises: determining a number (M) of voice activity detection (VAD) features that are indicative of whether speech is present in the first frame based at least in part on the first speech signal and the first noise signal; determining M probabilities of speech associated with the first speech signal based on the M VAD features, respectively; and determining a magnitude or power of the residual noise in the first speech signal based on the M probabilities of speech associated with the first speech signal.
  7. The method of claim 6, wherein the magnitude or power of the residual noise in the first speech signal is determined based only on the lowest probability of speech among the M probabilities of speech associated with the first speech signal.
  8. The method of claim 6, wherein each of the M probabilities of speech associated with the first speech signal is determined based on a respective Gaussian mixture model (GMM).
  9. The method of claim 6, wherein M>1.
  10. The method of claim 6, wherein the M VAD features include a normalized difference between the first speech signal and the first noise signal.
  11. The method of claim 6, wherein the M VAD features include at least one of a cepstral peak, a spectral entropy, or a harmonic product spectrum (HPS) associated with the first speech signal.
  12. A speech enhancement system comprising: a processing system; and a memory storing instructions that, when executed by the processing system, cause the speech enhancement system to: receive a series of frames of an audio signal; denoise a first frame in the series of frames based at least in part on a temporal correlation between the series of frames; infer a probability of speech associated with the denoised first frame based on a neural network model; generate a first speech signal and a first noise signal based on the probability of speech associated with the denoised first frame, the first speech signal and the first noise signal representing a speech component and a noise component, respectively, of the audio signal in the first frame; determine a first spectral suppression gain based on the first speech signal and the first noise signal; and suppress residual noise in the first speech signal based on the first spectral suppression gain.
  13. The speech enhancement system of claim 12, wherein the audio signal comprises a single channel of audio data.
  14. The speech enhancement system of claim 12, wherein the first frame is denoised based on a multi-frame minimum variance distortionless response (MF-MVDR) beamformer that reduces a power of the noise component of the audio signal without distorting the speech component.
  15. The speech enhancement system of claim 12, wherein execution of the instructions further causes the speech enhancement system to: determine an interframe correlation (IFC) vector associated with a speech component of the audio signal based at least in part on the probability of speech associated with the denoised first frame; denoise a second frame in the series of frames based at least in part on the IFC vector; infer a probability of speech associated with the denoised second frame based on the neural network model; generate a second speech signal and a second noise signal based on the probability of speech associated with the denoised second frame, the second speech signal and the second noise signal representing a speech component and a noise component, respectively, of the audio signal in the second frame; determine a second spectral suppression gain based on the second speech signal and the second noise signal; and suppress residual noise in the second speech signal based on the second spectral suppression gain.
  16. The speech enhancement system of claim 12, wherein the speech component of the audio signal in the denoised first frame is equal to the speech component of the audio signal in the first speech signal.
  17. The speech enhancement system of claim 12, wherein the determining of the first spectral suppression gain comprises: determining a number (M) of voice activity detection (VAD) features that are indicative of whether speech is present in the first frame based at least in part on the first speech signal and the first noise signal; determining M probabilities of speech associated with the first speech signal based on the M VAD features, respectively; and determining a magnitude or power of the residual noise in the first speech signal based on the M probabilities of speech associated with the first speech signal.
  18. The speech enhancement system of claim 17, wherein each of the M probabilities of speech associated with the first speech signal is determined based on a respective Gaussian mixture model (GMM).
  19. The speech enhancement system of claim 17, wherein M>1.
  20. The speech enhancement system of claim 17, wherein the M VAD features include at least one of a normalized difference between the first speech signal and the first noise signal, a cepstral peak of the first speech signal, a spectral entropy of the first speech signal, or a harmonic product spectrum (HPS) associated with the first speech signal.
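Claims 6–11 describe mapping M voice activity detection (VAD) features to M speech probabilities via per-feature Gaussian mixture models, with the most conservative (lowest) probability driving the residual-noise estimate (claim 7). The sketch below illustrates that combination with a hypothetical two-component 1-D mixture per feature; the feature values and mixture parameters are made up for illustration, and a real system would fit them from data.

```python
import math

def gmm_speech_prob(x, speech_mean, noise_mean, var=1.0, prior=0.5):
    """Posterior P(speech | feature x) under a two-component 1-D Gaussian
    mixture (one component for speech, one for noise).

    Parameters here are hypothetical; real systems fit them per VAD feature.
    """
    def gauss(v, mean):
        return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

    p_s = prior * gauss(x, speech_mean)
    p_n = (1.0 - prior) * gauss(x, noise_mean)
    return p_s / (p_s + p_n)

# M = 3 hypothetical VAD feature values (e.g. normalized speech/noise
# difference, cepstral peak, spectral entropy), each with its own
# (speech_mean, noise_mean) mixture parameters.
features = [0.8, 0.6, 0.9]
params = [(1.0, 0.0), (1.0, -0.5), (1.2, 0.1)]
probs = [gmm_speech_prob(x, s_m, n_m) for x, (s_m, n_m) in zip(features, params)]

# Per claim 7, the residual-noise magnitude can be driven by the lowest of
# the M speech probabilities, i.e. the most conservative detector.
p_min = min(probs)
```

Taking the minimum means a frame is treated as speech only when every feature agrees, which biases the post-filter toward suppressing residual noise rather than preserving doubtful speech.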

Description

TECHNICAL FIELD

The present implementations relate generally to signal processing, and specifically to neural noise reduction techniques with linear and nonlinear filtering for single-channel audio signals.

BACKGROUND OF RELATED ART

Many hands-free communication devices include microphones configured to convert sound waves into audio signals that can be transmitted, over a communications channel, to a receiving device. The audio signals often include a speech component (such as from a user of the communication device) and a noise component (such as from a reverberant enclosure). Speech enhancement is a signal processing technique that attempts to suppress the noise component of the received audio signals without distorting the speech component. Many existing speech enhancement techniques rely on statistical signal processing algorithms that continuously track the pattern of noise in each frame of the audio signal to model a spectral suppression gain or filter that can be applied to the received audio signal in a time-frequency domain. Some modern speech enhancement techniques implement machine learning to model a spectral suppression gain or filter that can be applied to the received audio signal in a time-frequency domain.

Machine learning, which generally includes a training phase and an inferencing phase, is a technique for improving the ability of a computer system or application to perform a certain task. During the training phase, a machine learning system is provided with one or more “answers” and a large volume of raw training data associated with the answers. The machine learning system analyzes the training data to learn a set of rules that can be used to describe each of the one or more answers. During the inferencing phase, the machine learning system may infer answers from new data using the learned set of rules.
Deep learning is a particular form of machine learning in which the inferencing (and training) phases are performed over multiple layers, producing a more abstract dataset in each successive layer. Deep learning architectures are often referred to as “artificial neural networks” due to the manner in which information is processed (similar to a biological nervous system). For example, each layer of an artificial neural network may be composed of one or more “neurons.” The neurons may be interconnected across the various layers so that the input data can be processed and passed from one layer to another. More specifically, each layer of neurons may perform a different transformation on the output data from a preceding layer so that the final output of the neural network results in a desired inference. The set of transformations associated with the various layers of the network is referred to as a “neural network model.”

The size of a neural network (such as the number of layers in the neural network or the number of neurons in each layer) generally affects the accuracy of the inferencing result. More specifically, larger neural networks tend to produce more accurate inferences than smaller or more compact neural networks. However, speech enhancement for single-channel audio is often implemented by low-power edge devices with very limited resources (such as battery-powered headsets, earbuds, and other hands-free communication devices with a single microphone input). As such, many existing single-channel speech enhancement techniques rely on compact neural network architectures that produce filtered audio signals with some amount of speech distortion or noise leakage (also referred to as “residual noise”). Thus, there is a need to improve the quality of speech in single-channel audio signals.

SUMMARY

This Summary is provided to introduce in a simplified form a selection of concepts that are further described below in the Detailed Description.
This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter.

One innovative aspect of the subject matter of this disclosure can be implemented in a method of speech enhancement. The method includes steps of receiving a series of frames of an audio signal; denoising a first frame in the series of frames based at least in part on a temporal correlation between the series of frames; inferring a probability of speech associated with the denoised first frame based on a neural network model; generating a first speech signal and a first noise signal based on the probability of speech associated with the denoised first frame, where the first speech signal and the first noise signal represent a speech component and a noise component, respectively, of the audio signal in the first frame; determining a first spectral suppression gain based on the first speech signal and the first noise signal; and suppressing residual noise in the first speech signal based on the first spectral suppression gain.

Another innovative aspect of the subject matter of this disclosure can be implemented in a speech enhancement system.
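The method steps summarized above are applied frame by frame, with each frame's denoising informed by a running interframe correlation estimate (see claim 4). The loop below is a toy illustration of that sequential structure for one frequency bin: the exponential smoothing stands in for the patented IFC/MVDR update, and the supplied probabilities stand in for the neural network inference. All names and constants are hypothetical.

```python
import numpy as np

def enhance_stream(stft_bins, probs):
    """Toy frame loop for one frequency bin.

    stft_bins: complex STFT value of the bin in each successive frame.
    probs: per-frame speech probabilities (stand-in for the DNN output).
    """
    ifc = stft_bins[0]  # running temporal-correlation estimate (toy)
    out = []
    for y, p in zip(stft_bins, probs):
        # Denoise the frame using the running estimate (crude stand-in for
        # the IFC-vector / MF-MVDR update across frames).
        ifc = 0.7 * ifc + 0.3 * y
        # Split into speech/noise estimates and apply the spectral
        # suppression gain, as in the per-frame steps of claim 1.
        s, n = p * ifc, (1.0 - p) * ifc
        gain = abs(s) ** 2 / (abs(s) ** 2 + abs(n) ** 2 + 1e-12)
        out.append(gain * s)
    return np.array(out)

frames = np.array([1.0 + 0j, 0.9 + 0j, 1.1 + 0j])
enhanced = enhance_stream(frames, probs=[0.9, 0.95, 0.9])
```

The key point the loop illustrates is the feedback between stages: each frame's denoising depends on state carried over from earlier frames, which is itself shaped by earlier speech-probability inferences.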