CN-116324985-B - Adaptive noise estimation

Abstract

In some embodiments, a method includes: dividing, using at least one processor, an audio input into speech segments and non-speech segments; estimating, for each frame in each non-speech segment, a time-varying noise spectrum of the non-speech segment; estimating, for each frame in each speech segment, a speech spectrum of the speech segment; and, for each frame in each speech segment, identifying one or more non-speech frequency components in the speech spectrum, comparing the one or more non-speech frequency components to one or more corresponding frequency components in a plurality of estimated noise spectra, and selecting an estimated noise spectrum from the plurality of estimated noise spectra based on a result of the comparison.

Inventors

D. Skaini
YE ZONGXIN
G. Jean Calais
M. D. deberg

Assignees

Dolby Laboratories Licensing Corporation
Dolby International AB

Dates

Publication Date: 2026-05-05
Application Date: 2021-09-21
Priority Date: 2020-09-23

Claims (20)

1. An adaptive noise estimation method, comprising: dividing, using at least one processor, an audio input into speech segments and non-speech segments; estimating, for each frame in each non-speech segment, a time-varying noise spectrum of the non-speech segment using the at least one processor; estimating, for each frame in each speech segment, a speech spectrum of the speech segment using the at least one processor; and, for each frame in each speech segment: identifying one or more non-speech frequency components in the speech spectrum; comparing the one or more non-speech frequency components with one or more corresponding frequency components in a plurality of estimated noise spectra to obtain a distance measure for each noise spectrum; and selecting the estimated noise spectrum with the smallest distance measure as the estimated noise spectrum of the speech segment.
2. The method of claim 1, wherein the plurality of estimated noise spectra includes an estimated noise spectrum of a past non-speech segment and an estimated noise spectrum of a future non-speech segment.
3. The method of claim 1 or 2, further comprising: reducing, using the at least one processor, noise in the audio input using the selected estimated noise spectrum.
4. The method of claim 1 or 2, further comprising obtaining a probability of speech in each frame of the audio input, and identifying a frame containing speech based on the probability.
5. The method of claim 1 or 2, wherein the time-varying noise spectrum is estimated by calculating a moving average of the power spectrum of the non-speech segments and averaging the power spectra of the current non-speech segment and at least one past non-speech segment.
6. The method of claim 1 or 2, wherein during the non-speech segments, a time-varying estimated noise spectrum is fed to a noise reduction unit configured to reduce noise in the audio input using the selected estimated noise spectrum.
7. The method of claim 2, wherein for each speech segment, an estimated noise spectrum most likely to represent noise in a current speech segment is determined using a past estimated noise spectrum before the speech segment, a future estimated noise spectrum after the speech segment, and a current speech frame.
8. The method of claim 7, wherein determining the estimated noise spectrum most likely to represent noise in the current speech segment further comprises: obtaining an average noise spectrum from a past noise spectrum of a past non-speech segment preceding the speech segment and from a future noise spectrum of a future non-speech segment following the speech segment, respectively; determining an upper frequency limit for the past noise spectrum and for the future noise spectrum; determining a cut-off frequency as the lower of the two upper frequency limits; calculating a distance measure between frequency components in the speech spectrum and frequency components in the noise spectrum; and selecting, as the estimated noise spectrum of the audio input, whichever of the past noise spectrum and the future noise spectrum has the smallest distance measure up to the cut-off frequency.
9. The method of claim 8, wherein the distance measure is averaged over a set of speech frames in a speech segment.
10. The method of claim 1 or 2, wherein speech components are estimated in the speech segments of the audio input and then subtracted from the actual speech components to obtain a residual spectrum as the estimated non-speech frequency components.
11. A non-transitory computer-readable storage medium having instructions stored thereon that, when executed by one or more processors, cause the one or more processors to perform the operations of any of the preceding method claims 1-9.
12. An audio processor, comprising: a divider unit configured to divide an audio input into speech segments and non-speech segments; an averaging unit configured to estimate a speech spectrum for each speech segment and a time-varying noise spectrum for each non-speech segment; and a similarity measurement unit configured to: identify one or more non-speech frequency components in the speech spectrum; compare the one or more non-speech frequency components with one or more corresponding frequency components in a plurality of estimated noise spectra to obtain a distance measure for each noise spectrum; and select the estimated noise spectrum with the smallest distance measure as the estimated noise spectrum of the speech segment.
13. The audio processor of claim 12, wherein the plurality of estimated noise spectra includes an estimated noise spectrum of a past non-speech segment and an estimated noise spectrum of a future non-speech segment.
14. The audio processor of claim 12 or 13, further comprising: a noise reduction unit configured to reduce noise in the audio input using the selected estimated noise spectrum.
15. The audio processor of claim 14, wherein, during the non-speech segments, the noise reduction unit is configured to receive the non-speech segments and reduce noise in the audio input using the selected estimated noise spectrum.
16. The audio processor of claim 14, wherein the noise reduction unit is configured to reduce noise in the audio input using the selected estimated noise spectrum by comparing the spectrum of the audio input to the selected estimated noise spectrum and applying gain reduction to a frequency band where the energy of the audio input is less than the energy of the noise spectrum plus a predefined threshold.
17. The audio processor of claim 12 or 13, wherein a Voice Activity Detector (VAD) is configured to obtain probabilities of speech in each frame of the audio input and to identify frames containing speech based on the probabilities.
18. The audio processor of claim 12 or 13, wherein the averaging unit is configured to estimate the time-varying noise spectrum by calculating a moving average of the power spectra of the non-speech segments and averaging the power spectra of the current non-speech segment and at least one past non-speech segment.
19. The audio processor of claim 12 or 13, wherein, for each speech segment, the similarity measurement unit is configured to determine an estimated noise spectrum most likely to represent noise in a current speech segment based on a past estimated noise spectrum before the speech segment, a future estimated noise spectrum after the speech segment, and a current speech frame.
20. The audio processor of claim 19, wherein the similarity measurement unit is configured to determine the estimated noise spectrum most likely to represent noise in the current speech segment by: obtaining an average noise spectrum from a past noise spectrum of a past non-speech segment preceding the speech segment and from a future noise spectrum of a future non-speech segment following the speech segment, respectively; determining an upper frequency limit for the past noise spectrum and for the future noise spectrum; determining a cut-off frequency as the lower of the two upper frequency limits; calculating a distance measure between frequency components in the speech spectrum and frequency components in the noise spectrum; and selecting, as the estimated noise spectrum of the audio input, whichever of the past noise spectrum and the future noise spectrum has the smallest distance measure up to the cut-off frequency.
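
As an illustration only, the selection step recited in claims 8, 9 and 20 and the gain rule recited in claim 16 might be sketched in Python with NumPy as follows. The claims do not fix several details, so the sketch assumes them: the upper frequency limit is taken to be the highest bin whose level exceeds a floor, the distance measure is taken to be a mean absolute difference in dB, and the function names, `floor_db`, `threshold_db` and `reduction_db` are hypothetical. The input `residual_frames` stands for the non-speech frequency components already identified in each speech frame.

```python
import numpy as np

EPS = 1e-12  # avoids log(0); corresponds to -120 dB


def to_db(power):
    """Convert a power spectrum to decibels."""
    return 10.0 * np.log10(np.maximum(power, EPS))


def upper_bin(noise_power, floor_db=-90.0):
    """Assumed definition of the upper frequency limit:
    one past the highest bin whose level exceeds floor_db."""
    active = np.nonzero(to_db(np.asarray(noise_power, dtype=float)) > floor_db)[0]
    return int(active[-1]) + 1 if active.size else 0


def select_noise_spectrum(residual_frames, past_noise, future_noise):
    """Pick the past or future noise spectrum closest to the residual
    (non-speech) components of the speech frames, up to the cut-off bin."""
    residual_frames = np.asarray(residual_frames, dtype=float)
    candidates = {
        "past": np.asarray(past_noise, dtype=float),
        "future": np.asarray(future_noise, dtype=float),
    }
    # Cut-off is the lower of the two upper frequency limits.
    cutoff = min(upper_bin(noise) for noise in candidates.values())
    if cutoff == 0:
        raise ValueError("no usable frequency bins below the cut-off")
    residual_db = to_db(residual_frames[:, :cutoff])
    distances = {}
    for name, noise in candidates.items():
        # Assumed distance: mean absolute dB difference, averaged over
        # all speech frames and all bins below the cut-off.
        distances[name] = float(np.mean(np.abs(residual_db - to_db(noise[:cutoff]))))
    best = min(distances, key=distances.get)
    return best, candidates[best], distances


def band_power_gains(frame_power, noise_power, threshold_db=6.0, reduction_db=-12.0):
    """Return a power gain per band: attenuate bands whose energy is below
    the noise energy plus a threshold, and pass the others unchanged."""
    frame_db = to_db(np.asarray(frame_power, dtype=float))
    noise_db = to_db(np.asarray(noise_power, dtype=float))
    gains_db = np.where(frame_db < noise_db + threshold_db, reduction_db, 0.0)
    return 10.0 ** (gains_db / 10.0)
```

In this sketch, `select_noise_spectrum` returns the label of the chosen spectrum, the spectrum itself, and both distances, so a caller can inspect how close the decision was. The gains from `band_power_gains` are power gains; an implementation working on amplitude spectra would apply their square roots.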

Description

Adaptive noise estimation

Cross Reference to Related Applications

The present application claims priority from U.S. provisional application No. 63/120,253, filed on December 2, 2020, U.S. provisional application No. 63/168,998, filed on March 31, 2021, and Spanish patent application No. P202030960, filed on September 23, 2020, each of which is incorporated herein by reference in its entirety.

Technical Field

The present disclosure relates generally to audio signal processing, and in particular to estimating the noise floor in an audio signal for noise reduction.

Background

Noise estimation is typically used to reduce stationary noise in audio recordings. Typically, the noise estimate is obtained by analyzing the energy in each frequency band of a segment of the audio recording that contains only noise. However, in some audio recordings, stationary noise may vary smoothly and/or abruptly over time. Some examples of such abrupt changes include audio recordings where the background ambient noise abruptly changes over time (e.g., fans in a room are turned on or off), and audio content obtained by editing together different audio recordings each having a different noise floor (such as a podcast containing a series of interviews recorded at different locations). In addition, noise variations typically do not occur during sufficiently long non-speech segments, and thus noise variations may not be detected and estimated early in the audio recording.

Some existing methods use a single estimate of the noise floor, obtained from a segment of the audio recording that contains only noise. Other prior methods analyze the entire audio recording and converge to a single underlying noise floor. However, a disadvantage of both methods is that they cannot accommodate varying noise levels or spectra. Other existing methods estimate the minimum envelope of energy in each band and track the estimated minimum envelope over time (e.g., by smoothing the estimated minimum envelope using an appropriate time constant).
However, these existing methods are typically used in real-time online audio signal processing architectures and cannot accurately react to sudden changes in noise in the audio recording.

Disclosure of Invention

Embodiments for adaptive noise estimation are disclosed. In some embodiments, an adaptive noise estimation method includes: dividing, using at least one processor, an audio input into speech segments and non-speech segments; estimating, for each frame in each non-speech segment, a time-varying noise spectrum of the non-speech segment using the at least one processor; estimating, for each frame in each speech segment, a speech spectrum of the speech segment using the at least one processor; and, for each frame in each speech segment, identifying one or more non-speech frequency components in the speech spectrum, comparing the one or more non-speech frequency components to one or more corresponding frequency components in a plurality of estimated noise spectra, and selecting an estimated noise spectrum from the plurality of estimated noise spectra based on a result of the comparison.

In an embodiment, the method further comprises reducing, using the at least one processor, noise in the audio input using the selected estimated noise spectrum. In some embodiments, the method further comprises obtaining a probability of speech in each frame of the audio input and identifying a frame as containing speech based on the probability. In some embodiments, the time-varying noise spectrum is estimated by calculating a moving average of the power spectrum of the non-speech segments and averaging the power spectra of the current non-speech segment and at least one past non-speech segment. In some embodiments, during the non-speech segments, a time-varying estimated noise spectrum is fed to a noise reduction unit configured to reduce noise in the audio input using the selected estimated noise spectrum.
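
For illustration, the segmentation by speech probability and the moving-average noise estimate described above might look like the following Python sketch using NumPy. It is a sketch under stated assumptions, not the disclosed implementation: the probability threshold of 0.5, the exponential form of the moving average, the smoothing factor `alpha`, and both function names are hypothetical. The per-frame speech probabilities are assumed to come from a separate voice activity detector.

```python
import numpy as np


def split_segments(speech_prob, threshold=0.5):
    """Group consecutive frames into (is_speech, start, end) runs.
    `end` is exclusive. The threshold value is an assumption."""
    is_speech = np.asarray(speech_prob) >= threshold
    segments = []
    start = 0
    for i in range(1, len(is_speech) + 1):
        # Close the current run at the end of the input or on a label change.
        if i == len(is_speech) or is_speech[i] != is_speech[start]:
            segments.append((bool(is_speech[start]), start, i))
            start = i
    return segments


def track_noise_spectrum(power_frames, segments, alpha=0.9):
    """Return a per-frame noise estimate (frames x bins).

    The estimate is an exponential moving average (an assumed form of the
    moving average) updated only on non-speech frames. Its state carries
    over from one non-speech segment to the next, so the current segment
    is averaged with past ones. Speech frames are left as NaN, to be
    filled in by a later selection step."""
    power_frames = np.asarray(power_frames, dtype=float)
    estimates = np.full_like(power_frames, np.nan)
    noise = None
    for is_speech, start, end in segments:
        if is_speech:
            continue
        for t in range(start, end):
            if noise is None:
                noise = power_frames[t].copy()  # first noise-only frame seen
            else:
                noise = alpha * noise + (1.0 - alpha) * power_frames[t]
            estimates[t] = noise
    return estimates
```

A smaller `alpha` makes the estimate follow changes in the noise faster at the cost of more frame-to-frame variance; the value used here is only a placeholder.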
In some embodiments, for each speech segment, an estimated noise spectrum most likely to represent noise in the current speech segment is determined using a past estimated noise spectrum before the speech segment, a future estimated noise spectrum after the speech segment, and a current speech frame. In some embodiments, determining the estimated noise spectrum most likely to represent noise of the current speech segment further comprises obtaining an average noise spectrum from a past noise spectrum of a past non-speech segment before the speech segment and a future noise spectrum of a future non-speech segment after the speech segment, respectively, determining an upper frequency limit for the past noise spectrum and the future noise spectrum, determining a cut-off frequency as the lowest of the two upper frequency limits, calculating a distance measure between frequency components in the speech spectrum and frequency components in the noise spectrum, and selecting as the estimated noise spectrum of the audio input a noise spectrum having the smallest distance measure up to th