JP-2026076293-A - Machine learning-assisted spatial noise estimation and suppression
Abstract
[Problem] To provide a system and method for machine learning-assisted estimation and suppression of spatial noise. [Solution] The noise estimation and suppression method includes the steps of: estimating, for each band of an input audio signal, the probabilities of speech and noise using a machine learning classifier; estimating a set of speech and noise means, or a set of speech and noise means and covariances, based on the probabilities and the per-band microphone covariance using a directional model; estimating the mean and covariance of the noise power based on the probabilities and the power spectrum using a level model; determining a first noise suppression gain based on the directional model; determining a second noise suppression gain based on the level model; selecting the first noise suppression gain, the second noise suppression gain, or their sum based on the signal-to-noise ratio of the input audio signal; and scaling a time-frequency representation of the input signal by the selected noise suppression gain. [Representative Drawing] Figure 4B
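As an illustrative, non-authoritative sketch of the gain-selection step summarized above: the function names, the SNR thresholds, and the blending rule in the intermediate regime are assumptions for illustration, not details taken from this publication.

```python
# Hypothetical sketch of per-band gain selection and scaling; all names and
# threshold values (low_db, high_db) are illustrative assumptions.

def select_gain(g_directional, g_level, snr_db, low_db=5.0, high_db=20.0):
    """Pick the directional-model gain, the level-model gain, or their sum
    for a band, based on the estimated SNR of the input audio signal."""
    if snr_db >= high_db:
        return g_directional        # directional cues are reliable at high SNR
    if snr_db <= low_db:
        return g_level              # fall back to the level model at low SNR
    return g_directional + g_level  # assumed blend in the intermediate regime

def apply_gain(tf_tile, gain):
    """Scale a time-frequency tile of the input signal by the selected gain."""
    return gain * tf_tile
```

A usage example: `apply_gain(tile, select_gain(0.5, 0.3, snr_db=25.0))` would scale the tile by the directional gain, since the SNR is above the assumed high threshold.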
Inventors
- Cartwright, Richard J.
- Wang, Ning
Assignees
- Dolby Laboratories Licensing Corporation
Dates
- Publication Date: 2026-05-11
- Application Date: 2026-02-09
- Priority Date: 2020-11-05
Claims (13)
- An audio processing method comprising: receiving, using at least one processor, a representation of an input audio signal; estimating a noise suppression gain using a machine learning classifier, wherein the machine learning classifier takes the representation of the input audio signal as input; scaling, using the at least one processor, a time-frequency representation of the input audio signal by the noise suppression gain; and converting, using the at least one processor, the time-frequency representation into an output audio signal.
- A method according to claim 1, wherein the representation of the input audio signal includes the banded power spectrum of the input audio signal.
- The method according to claim 2, further comprising: receiving, using the at least one processor, an input audio signal including multiple blocks; and, for each block: transforming, using the at least one processor, the block into a plurality of subbands, each subband having a spectrum different from those of the other subbands; coupling, using the at least one processor, the subbands into bands; and determining, using the at least one processor, the banded power spectrum.
- A method according to any one of claims 1 to 3, wherein the machine learning classifier is a neural network.
- The method according to claim 4, wherein the neural network includes an input layer configured to map the representation of the input audio signal to a plurality of features.
- The method according to claim 5, wherein the neural network further comprises a plurality of layers configured to determine the noise suppression gain based on the plurality of features.
- A method according to any one of claims 4 to 6, wherein the neural network is trained by a cross-entropy loss function.
- A machine learning-based classifier for audio processing, comprising: a neural network configured to estimate a noise suppression gain based on a representation of an input audio signal, the neural network comprising a plurality of layers including at least one of an input layer, a GRU (gated recurrent unit) layer, and/or a dense layer.
- A machine learning-based classifier according to claim 8, wherein the neural network includes the input layer, and the input layer is configured to map the representation of the input audio signal to a plurality of features.
- A machine learning-based classifier according to claim 9, wherein the plurality of layers are configured to determine the noise suppression gain based on the plurality of features.
- A machine learning-based classifier according to any one of claims 8 to 10, wherein the neural network is trained using a cross-entropy loss function.
- A system comprising: one or more computer processors; and a non-transitory computer-readable storage medium storing instructions that, when executed by the one or more computer processors, cause the one or more computer processors to perform the method according to any one of claims 1 to 7.
- A non-transitory computer-readable storage medium storing instructions that, when executed by one or more computer processors, cause the one or more computer processors to perform the method according to any one of claims 1 to 7.
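The banded power spectrum recited in claims 2 and 3 (blocks transformed into subbands, subbands coupled into bands, per-band power determined) can be sketched as follows. This is a minimal illustration under assumptions not in the claims: a fixed block size, an FFT as the subband transform, and contiguous grouping of FFT bins into bands.

```python
import numpy as np

# Illustrative sketch only; block_size, num_bands, and the FFT-based subband
# split are assumptions, not the claimed implementation.

def banded_power_spectrum(signal, block_size=256, num_bands=8):
    """Split the signal into blocks, transform each block into subbands
    (FFT bins), couple the subbands into bands, and return per-band power."""
    num_blocks = len(signal) // block_size
    bands = np.zeros((num_blocks, num_bands))
    for b in range(num_blocks):
        block = signal[b * block_size:(b + 1) * block_size]
        spectrum = np.fft.rfft(block)              # subbands: one bin per frequency
        power = np.abs(spectrum) ** 2              # per-subband power
        groups = np.array_split(power, num_bands)  # couple subbands into bands
        bands[b] = [g.sum() for g in groups]       # banded power for this block
    return bands
```

For a constant (DC) input, all of the energy lands in the lowest band of each block, which makes the grouping easy to verify.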
Description
[0001] Cross-Reference to Related Applications: This application claims priority to U.S. Provisional Application No. 63/110,228, filed 5 November 2020, and U.S. Provisional Application No. 63/210,215, filed 14 June 2021, both of which are incorporated by reference in their entirety.

[0002] Technical Field: This disclosure relates generally to audio signal processing, and in particular to noise estimation and suppression in voice communications.

[0003] Noise suppression algorithms for voice communications are typically implemented in edge devices such as telephones, laptops, and conference systems. A common problem in two-way voice communications is that the background noise at each user's location is transmitted along with that user's speech signal. If the signal-to-noise ratio (SNR) of the combined signal received by the edge device is too low, the intelligibility of the reconstructed speech is degraded, resulting in a poor user experience.

[0004] An implementation for machine learning-assisted spatial noise estimation and suppression is described. 
In some embodiments, the audio processing method comprises: receiving bands of a power spectrum of an input audio signal and a microphone covariance for each band, wherein the microphone covariance is based on the arrangement of the microphones used to capture the input audio signal; for each band: estimating the probabilities of speech and noise, respectively, using a machine learning classifier; estimating a set of speech and noise means, or a set of speech and noise means and covariances, based on the probabilities and the microphone covariance for the band, using a directional model; estimating the mean and covariance of the noise power based on the probabilities and the power spectrum, using a level model; determining a first noise suppression gain based on a first output of the directional model; determining a second noise suppression gain based on a second output of the level model; selecting one of the first noise suppression gain, the second noise suppression gain, or the sum of the first and second noise suppression gains, based on the signal-to-noise ratio of the input audio signal; scaling a time-frequency representation of the input signal by the first or second noise suppression gain selected for the band; and converting the time-frequency representation into an output audio signal.

[0005] In some embodiments, the method further comprises: receiving an input audio signal comprising multiple blocks/frames using at least one processor; and, for each block/frame: converting the block/frame into multiple subbands using the at least one processor, each subband having a spectrum different from the others; combining the subbands into bands using the at least one processor; and determining the banded power using the at least one processor.

[0006] In some embodiments, the machine learning classifier is a neural network comprising an input layer, an output layer, and one or more hidden layers. 
For example, the neural network is a deep neural network comprising three or more layers, and preferably more than three.

[0007] In some embodiments, the microphone covariance is represented as a normalized vector.

[0008] In some embodiments, determining the first noise suppression gain further comprises: calculating the probability of speech for the band; setting the first noise suppression gain equal to a maximum suppression gain if the probability of speech for the band is less than a threshold; and setting the first noise suppression gain based on a gain ramp if the calculated probability of speech for the band is greater than the threshold.

[0009] In some embodiments, the probability of speech is calculated using the set of speech and noise means and covariances estimated by the directional model.

[0010] In some embodiments, the speech probability is calculated using a covariance vector, the set of speech and noise means estimated by the directional model, and a multivariate joint Gaussian density function.

[0011] In some embodiments, determining the second noise suppression gain further comprises: setting the second noise suppression gain equal to the maximum suppression gain if the band power is less than a first threshold; setting the second noise suppression gain to zero if the band power is between the first threshold and a second threshold, the second threshold being higher than the first threshold; and setting the second noise suppression gain based on a gain ramp if the band power is higher than the second threshold.

[0012] In some embodiments, the estimation step using the directional model utilizes time-frequency tiles classified as speech and noise, excluding those classified as reverberation.

[0013] In some embodiments, the method further comprises: estimating the mean of the speech based on the speech probability and