US-12626710-B2 - Method and device for processing a binaural recording


Abstract

The present invention relates to a method and device for processing a first and a second audio signal representing an input binaural audio signal acquired by a binaural recording device. The present invention further relates to a method for rendering a binaural audio signal on a speaker system. The method for processing a binaural signal comprises extracting audio information from the first audio signal, computing band gains for reducing noise in the first audio signal, and applying the band gains to respective frequency bands of the first audio signal in accordance with a dynamic scaling factor to provide a first output audio signal. The dynamic scaling factor has a value between zero and one and is selected so as to reduce quality degradation for the first audio signal.

Inventors

Zhiwei Shuang
Yuanxing MA
Yang Liu
Ziyu YANG
Giulio CENGARLE

Assignees

DOLBY LABORATORIES LICENSING CORPORATION
DOLBY INTERNATIONAL AB

Dates

Publication Date: 20260512
Application Date: 20210915
Priority Date: 20200915

Claims (20)

1. A method for processing a first and a second audio signal representing an input binaural audio signal acquired by a binaural recording device, the method comprising: extracting audio information from the first audio signal, the audio information comprising a plurality of frequency bands representing the first audio signal; computing, for each frequency band of the first audio signal, a respective band gain for reducing noise in the first audio signal; computing, for each frequency band of the first audio signal, a Voice Activity Detection, VAD, probability; applying said band gains to respective frequency bands of the first audio signal in accordance with a respective dynamic scaling factor, to provide a first output audio signal, wherein said dynamic scaling factor has a value between zero and one, where a value of zero indicates that a full band gain is applied, and a value of one indicates that no band gain is applied, and wherein said dynamic scaling factor, for each frequency band, is based on the band gains associated with a corresponding frequency band of a current time frame and previous time frames of the first audio signal having a VAD probability exceeding a predetermined VAD probability threshold; performing noise reduction processing of the second audio signal to obtain a second output audio signal; and determining a binaural output audio signal based on the first and second output audio signals, wherein the binaural output audio signal is configured for playback by multiple speakers.
2. The method according to claim 1, wherein the noise reduction processing of the second audio signal comprises separate processing steps corresponding to the processing steps of the first audio signal.
3. The method according to claim 1, wherein providing the first output audio signal comprises: computing a noise reduced audio signal by applying said band gains to respective frequency bands of the first audio signal; and mixing each frequency band of the first audio signal with a corresponding frequency band of the noise reduced audio signal with a mixing ratio equal to the dynamic scaling factor to provide the first output audio signal.
4. The method according to claim 1, wherein providing the first output audio signal comprises: computing for each band a dynamic band gain as (k + (1 − k)·Bgain), where k is the dynamic scaling factor and Bgain is the computed band gain; and applying the dynamic band gain for each band of the first audio signal to provide the first output audio signal.
5. The method according to claim 1, wherein the dynamic scaling factor of each frequency band is based on band gains of corresponding frequency bands of the current and previous time frames that exceed a predetermined threshold gain.
6. The method according to claim 1, wherein the dynamic scaling factor is based on a weighted sum of band gains, said weighted sum including band gains from previous time frames, said method further comprising: determining whether the band gain of a specific frequency band of the current time frame exceeds a predetermined threshold gain; if the band gain associated with the specific frequency band of the current frame exceeds the predetermined threshold gain, calculating a current weighted sum as a weighted sum of the band gain of the current time frame and the weighted sum including band gains from previous time frames; and if the band gain associated with the specific frequency band of the current frame is below the predetermined threshold gain, calculating the current weighted sum as the weighted sum including band gains from previous time frames.
7. The method according to claim 1, wherein the dynamic scaling factor is determined as 1 − G, where G is a weighted sum of band gains including at least band gains from frequency bands of previous time frames.
8. The method according to claim 1, wherein determining the dynamic scaling factor for each frequency band is performed offline and each dynamic scaling factor is based on the band gain associated with corresponding frequency bands of all time frames of the first audio signal.
9. The method according to claim 8, further comprising determining a dynamic scaling factor for each frequency band of the first audio signal based on an average band gain from all frames where: the band gain exceeds a predetermined threshold gain and the VAD probability exceeds a predetermined probability threshold.
10. The method according to claim 1, wherein said first and second audio signals are a left channel audio signal and a right channel audio signal and said method further comprises: estimating the first audio signal as a middle channel audio signal, the middle signal being computed from a sum of the left and right signal; estimating the second audio signal as a side channel audio signal, the side signal being computed from a difference between the left and right signal; and determining the binaural output audio signal by: estimating a left output audio signal as a sum of the middle output signal and side output signal; and estimating a right output audio signal as a difference of the middle output signal and side output signal.
11. The method according to claim 1, further comprising processing an additional audio signal from an additional recording device, wherein said first and second audio signals are a left and a right audio signal, said method further comprising: synchronizing the additional audio signal with the binaural audio signals; and mixing the additional audio signal with the left and right audio signal.
12. The method according to claim 11, further comprising processing a bone vibration sensor signal acquired by a bone vibration sensor, said method further comprising: synchronizing the bone vibration sensor signal with the binaural audio signals; and controlling a gain of the additional audio signal based on the bone vibration sensor signal.
13. The method according to claim 12, further comprising processing a bone vibration sensor signal acquired by a bone vibration sensor of the binaural recording device, said method further comprising: synchronizing the bone vibration sensor signal with the binaural audio signals; extracting a VAD probability of the additional audio signal; determining, based on the VAD probability and the bone vibration sensor signal, a source of a detected voice; if the source is a wearer of the binaural recording device with the bone vibration sensor, processing the additional audio signal with a first audio processing scheme adapted to suppress a noise of a channel between the wearer of the binaural recording device and the additional recording device; and if the source is other than the wearer of the binaural recording device with the bone vibration sensor, processing the additional audio signal with a second audio processing scheme adapted to suppress a noise of a channel between the other source and the additional recording device.
14. The method according to claim 13, wherein the first and second audio processing schemes implement different signal gains for the additional audio signal.
15. The method according to claim 1, wherein the audio information further comprises one or more of: an SNR of the first audio signal, a fundamental frequency of the first audio signal, the VAD probability of the first audio signal, a bone vibration sensor signal acquired by a bone vibration sensor, a fundamental frequency extracted from a bone vibration sensor signal acquired by a bone vibration sensor, and a VAD probability extracted from a bone vibration sensor signal acquired by a bone vibration sensor.
16. The method according to claim 15, further comprising: controlling a gain of said first audio signal based on said VAD probability extracted from the bone vibration sensor signal.
17. The method according to claim 1, wherein computing band gains for each frequency band in the first audio signal comprises predicting the band gains from the audio information with a trained neural network.
18. The method of claim 1, wherein the binaural recording device includes the multiple speakers.
19. A non-transitory computer-readable storage medium comprising a sequence of instructions which, when executed by one or more processors, cause the one or more processors to perform the method according to claim 1.
20. An audio processing device comprising: a receiver configured to receive an input binaural audio signal acquired by a binaural recording device, the input binaural audio signal comprising a first and a second audio signal; an extraction unit configured to receive the first audio signal from the receiver and extract audio information from the first audio signal, the audio information comprising a plurality of frequency bands representing the first audio signal; a processing device configured to receive the audio information and compute, for each frequency band of the first audio signal, a band gain for reducing noise in the first audio signal and a Voice Activity Detection, VAD, probability; an application unit configured to apply said band gains to respective frequency bands of the first audio signal in accordance with a dynamic scaling factor, to provide a first output audio signal, wherein said dynamic scaling factor has a value between zero and one, where a value of zero indicates that a full band gain is applied, and a value of one indicates that no band gain is applied, and wherein said dynamic scaling factor, for each frequency band, is based on the band gain associated with a corresponding frequency band of a current time frame and previous time frames of the first audio signal having a VAD probability exceeding a predetermined VAD probability threshold; an additional processing module configured to perform noise reduction processing of the second audio signal to obtain a second output audio signal; and an output stage configured to determine a binaural output audio signal based on the first and second output audio signals, wherein the binaural output audio signal is configured for playback by multiple speakers.
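
Illustrative note (not part of the claims): the per-band behaviour recited in claims 1, 3, 4, 6 and 7 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the patented implementation. The names `BandState`, `update_weighted_sum`, `scaling_factor`, `dynamic_band_gain` and `mix`, the exponential weighting `alpha`, the threshold values, and the starting value of the weighted sum are all hypothetical choices made for illustration; the claims only require some weighted sum G and k = 1 − G.

```python
from dataclasses import dataclass


@dataclass
class BandState:
    """Running weighted sum G of band gains for one frequency band."""
    weighted_sum: float = 1.0  # assumed start value, so k = 1 - G starts at 0


def update_weighted_sum(state, band_gain, vad_prob,
                        vad_threshold=0.5, gain_threshold=0.3, alpha=0.9):
    """Fold the current band gain into G only for voice-like frames (claim 1)
    whose band gain exceeds the threshold gain (claim 6). Otherwise G keeps
    the value carried over from previous frames. All numeric defaults and the
    exponential weighting are assumptions."""
    if vad_prob > vad_threshold and band_gain > gain_threshold:
        state.weighted_sum = (alpha * state.weighted_sum
                              + (1.0 - alpha) * band_gain)
    return state.weighted_sum


def scaling_factor(weighted_sum):
    """Claim 7: k = 1 - G, clamped to the range [0, 1]."""
    return min(1.0, max(0.0, 1.0 - weighted_sum))


def dynamic_band_gain(k, band_gain):
    """Claim 4: k + (1 - k) * Bgain. k = 0 yields Bgain, k = 1 yields unity."""
    return k + (1.0 - k) * band_gain


def mix(k, band_value, band_gain):
    """Claim 3: mix the unprocessed band with the noise-reduced band."""
    noise_reduced = band_gain * band_value
    return k * band_value + (1.0 - k) * noise_reduced


# Usage: one band over three frames of (band_gain, vad_prob).
state = BandState()
for band_gain, vad_prob in [(0.8, 0.9), (0.1, 0.2), (0.6, 0.95)]:
    g = update_weighted_sum(state, band_gain, vad_prob)
    k = scaling_factor(g)
    effective_gain = dynamic_band_gain(k, band_gain)
```

For a real-valued band sample, `mix(k, x, Bgain)` and `dynamic_band_gain(k, Bgain) * x` give the same result, which is why claims 3 and 4 can be read as two formulations of the same operation.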

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a U.S. National Stage of International Application No. PCT/US2021/050534, filed Sep. 15, 2021, which claims the benefit of priority from U.S. Provisional Patent Application No. 63/177,771, filed Apr. 21, 2021, U.S. Provisional Patent Application No. 63/117,717, filed Nov. 24, 2020, and Spanish Patent Application No. P202030934, filed Sep. 15, 2020, each of which is hereby incorporated by reference in its entirety.

TECHNICAL FIELD OF THE INVENTION

The present invention relates to a method and device for processing a binaural audio signal.

BACKGROUND

In the area of both user generated content (UGC) and professionally generated content (PGC), binaural capture devices are often used for capturing audio. Binaural audio is, for example, recorded by a pair of microphones wherein each microphone is provided on an earbud of a pair of earphones worn by a user. A binaural capture device thus captures the sound at each respective ear of the user wearing the binaural capture device. Accordingly, binaural capture devices are generally good at capturing the voice of the user or the audio perceived by the user, and are often used for recording podcasts, interviews or conferences. A drawback of binaural capture devices is that they are very sensitive to environmental noise, which results in a poor playback experience when the captured binaural signal is rendered. Another drawback is that audio sources of interest besides the voice of the user wearing the binaural capture device are picked up with very low signal strength, high noise and high reverberation. As a result, the intelligibility of other audio sources of interest featured in a captured binaural audio signal is decreased.
To circumvent these drawbacks, previous solutions rely on complex audio processing algorithms that are computationally expensive, which makes them especially difficult to realize for low-latency communication or for UGC, where complex audio processing is difficult to implement.

GENERAL DISCLOSURE OF THE INVENTION

Based on the above, it is therefore an object of the present invention to provide a method and device for more efficient processing of a binaural audio signal, alongside a method for rendering the processed binaural audio signal.

According to a first aspect of the invention there is provided a method for processing a first and a second audio signal representing an input binaural audio signal, the binaural audio signal being acquired by a binaural recording device. The method comprises extracting audio information from the first audio signal, wherein the audio information comprises at least a plurality of frequency bands representing the first audio signal, and computing for each frequency band a band gain for reducing noise in the first audio signal. Moreover, the method comprises applying the band gains to respective frequency bands of the first audio signal in accordance with a dynamic scaling factor to provide a first output audio signal. The dynamic scaling factor has a value between zero and one, wherein a value of zero indicates that a full band gain is applied and a value of one indicates that no band gain is applied. The dynamic scaling factor is selected so as to reduce quality degradation for the first audio signal. The method further comprises providing a second output audio signal based on the second audio signal and determining a binaural output audio signal based on the first and second output audio signals.
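
One way to relate the first and second audio signals to the captured left and right channels is the mid/side arrangement recited in claim 10. The sketch below is only an illustration under stated assumptions: the function names and the two denoiser callbacks are hypothetical, and the factor 0.5 is an assumed normalization chosen so that the left and right outputs are recovered as the plain sum and difference of the mid and side outputs. Claim 10 only states that the mid and side signals are computed from a sum and a difference.

```python
def to_mid_side(left, right):
    """Mid from the sum and side from the difference of left and right.
    The 0.5 factor is an assumed normalization."""
    mid = [0.5 * (l + r) for l, r in zip(left, right)]
    side = [0.5 * (l - r) for l, r in zip(left, right)]
    return mid, side


def to_left_right(mid, side):
    """Left output as mid + side, right output as mid - side (claim 10)."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


def process_binaural(left, right, denoise_first, denoise_second):
    """Treat mid as the first signal and side as the second signal,
    process each with its own (hypothetical) noise reduction callback,
    then rebuild the binaural output."""
    mid, side = to_mid_side(left, right)
    mid_out = denoise_first(mid)
    side_out = denoise_second(side)
    return to_left_right(mid_out, side_out)


# Usage: with pass-through callbacks the input channels are reproduced.
left_in = [1.0, 0.5]
right_in = [0.0, -0.5]
left_out, right_out = process_binaural(
    left_in, right_in, lambda x: x, lambda x: x)
```

Processing the sum and difference signals rather than the raw channels means that identical gains applied to both leave the left/right balance of the recording unchanged.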
The invention according to the first aspect is at least partly based on the understanding that dynamically scaling the band gains of the frequency bands may decrease the quality degradation of the output audio signal. Regardless of the type of noise reduction method employed to compute the noise reduction band gains, an audio signal with the band gains applied will contain undesirable audio artefacts introduced by the noise reduction processing. To mitigate these audio artefacts, the band gains are applied dynamically in accordance with a dynamic scaling factor. A static or predetermined scaling factor will fail to reduce the quality degradation for a majority of possible audio signals, because it applies the band gains either to such a high extent that audio artefacts emerge or to such a low extent that the noise reduction becomes ineffective. The selection of the dynamic scaling factor may be based on the audio information and/or band gains of the audio signal, which enables use of a dynamic (non-static) scaling factor tailored to the particular audio signal being processed. In some implementations the dynamic scaling factor for each frequency band is based on the band gain associated with a corresponding frequency band of a current time frame and previous time frames of the first audio signal. With a time frame it is meant a partial time segment of the first