US-12626713-B2 - Dynamic voice nullformer

Abstract

A voice capture system including a first and second voice beamformer, a voice mixer, a voice rejected noise beamformer, a noise beamformer adjustor, a jammer suppressor, and a speech enhancer is provided. The first and second voice beamformer and the voice mixer generate a voice enhanced reference signal based on a first and second frequency domain microphone signal. The voice rejected noise beamformer includes filter weights and generates a noise reference signal based on the first and second frequency domain microphone signal. The noise beamformer adjustor adjusts the one or more filter weights of the voice rejected noise beamformer to account for fit variation. The jammer suppressor generates a jammer suppressed signal based on the voice enhanced reference signal and the noise reference signal. The speech enhancer dynamically generates an output voice signal by applying a dynamic noise suppression signal to each frequency bin of the jammer suppressed signal.

Inventors

Yang Liu
Abinaya Subramaniam
Trevor Caldwell
Douglas George MORTON

Assignees

BOSE CORPORATION

Dates

Publication Date: 2026-05-12
Application Date: 2022-08-11

Claims (15)

1. A voice capture system, comprising: a minimum variance distortionless response (MVDR) beamformer configured to generate a first voice beamformer signal based on a first frequency domain microphone signal and a second frequency domain microphone signal; a delay and sum beamformer configured to generate a second voice beamformer signal based on the first frequency domain microphone signal and the second frequency domain microphone signal; a voice mixer configured to blend the first voice beamformer signal with the second voice beamformer signal to generate a voice enhanced reference signal, wherein an amount of the first beamformer signal as compared to the second beamformer signal in the voice enhanced reference signal is dynamically adjusted based on amplitudes of the first and the second beamformer signals; a voice rejected noise beamformer comprising one or more filter weights, the voice rejected noise beamformer configured to generate a noise reference signal based on the first frequency domain microphone signal and the second frequency domain microphone signal; a noise beamformer adjustor configured to adjust the one or more filter weights of the voice rejected noise beamformer to account for fit variation, wherein the filter weights are adjusted based on the first frequency domain microphone signal, the second frequency domain microphone signal, and the noise reference signal; a jammer suppressor configured to generate a jammer suppressed signal based on the voice enhanced reference signal and the noise reference signal; and a speech enhancer configured to generate an output voice signal based on the jammer suppressed signal, the noise reference signal, and a voice detection signal.
2. The voice capture system of claim 1, further comprising a filter bank configured to: generate the first frequency domain microphone signal based on a first time domain microphone signal; and generate the second frequency domain microphone signal based on a second time domain microphone signal.
3. The voice capture system of claim 2, further comprising: a first microphone configured to generate the first time domain microphone signal; and a second microphone configured to generate the second time domain microphone signal.
4. The voice capture system of claim 1, wherein the voice detection signal is generated by a voice activity detector based on the voice enhanced reference signal and the noise reference signal.
5. The voice capture system of claim 1, wherein the voice rejected noise beamformer is a Wiener delay and subtract noise beamformer.
6. The voice capture system of claim 1, wherein the one or more filter weights of the voice rejected noise beamformer correspond to a stock voice direction or a wearer-specific voice direction.
7. The voice capture system of claim 1, wherein the noise beamformer adjustor is configured to: generate a signal-to-noise ratio (SNR) quality check signal based on the second frequency domain microphone signal; generate, via a quality check voice activity detector, a voice detection quality check signal; store, via a first data accumulator, first voice data corresponding to a relationship between the first frequency domain microphone signal and the second frequency domain microphone signal; store, via a second data accumulator, second voice data corresponding to an energy level of the first frequency domain microphone signal; and dynamically update, if the SNR quality check signal exceeds an SNR quality threshold, the voice detection quality check signal exceeds a voice detection quality threshold, and the first voice data or the second voice data exceeds a storage threshold, the one or more filter weights of the voice rejected noise beamformer based on the first frequency domain microphone signal and the second frequency domain microphone signal.
8. The voice capture system of claim 1, where the speech enhancer is configured to generate the output voice signal by: determining a series of speech signal-to-noise ratios (SNR) corresponding to a series of frequency bins based on the jammer suppressed signal and the noise reference signal; comparing the speech SNRs of each frequency bin to a set of speech enhancer thresholds; applying a noise suppression signal to each frequency bin of the jammer suppressed signal, wherein an amplitude of the noise suppression signal applied to a frequency bin of the jammer suppressed signal is related to the SNR corresponding to the frequency bin.
9. A wearable audio device comprising: a first microphone configured to generate a first time domain microphone signal; a second microphone configured to generate a second time domain microphone signal; a filter bank configured to generate a first frequency domain microphone signal based on the first time domain microphone signal and a second frequency domain microphone signal based on the second time domain microphone signal; a minimum variance distortionless response (MVDR) beamformer configured to generate a first voice beamformer signal based on the first frequency domain microphone signal and the second frequency domain microphone signal; a delay and sum beamformer configured to generate a second voice beamformer signal based on the first frequency domain microphone signal and the second frequency domain microphone signal; a voice rejected noise beamformer comprising one or more filter weights, the voice rejected noise beamformer configured to generate a noise reference signal based on the first frequency domain microphone signal and the second frequency domain microphone signal; a noise beamformer adjustor configured to adjust the one or more filter weights of the voice rejected noise beamformer to account for fit variation, wherein the filter weights are adjusted based on the first frequency domain microphone signal, the second frequency domain microphone signal, and the noise reference signal; a voice mixer configured to blend the first voice beamformer signal with the second voice beamformer signal to generate a voice enhanced reference signal, wherein an amount of the first beamformer signal as compared to the second beamformer signal in the voice enhanced reference signal is dynamically adjusted based on amplitudes of the first and the second beamformer signals; a voice activity detector configured to generate a voice detection signal based on the voice enhanced reference signal and the noise reference signal; a jammer suppressor configured to generate a jammer suppressed signal based on the voice enhanced reference signal and the noise reference signal; and a speech enhancer configured to generate an output voice signal based on the jammer suppressed signal, the noise reference signal, and the voice detection signal.
10. The wearable audio device of claim 9, wherein the wearable audio device is a single side wearable device.
11. The wearable audio device of claim 9, wherein the noise beamformer adjustor is configured to: generate a signal-to-noise ratio (SNR) quality check signal based on the second frequency domain microphone signal; generate, via a quality check voice activity detector, a voice detection quality check signal; store, via a first data accumulator, first voice data corresponding to a relationship between the first frequency domain microphone signal and the second frequency domain microphone signal; store, via a second data accumulator, second voice data corresponding to an energy level of the first frequency domain microphone signal; and dynamically update, if the SNR quality check signal exceeds an SNR quality threshold, the voice detection quality check signal exceeds a voice detection quality threshold, and the first voice data or the second voice data exceeds a storage threshold, the one or more filter weights of the voice rejected noise beamformer based on the first frequency domain microphone signal and the second frequency domain microphone signal.
12. The wearable audio device of claim 9, where the speech enhancer is configured to generate the output voice signal by: determining a series of speech signal-to-noise ratios (SNR) corresponding to a series of frequency bins based on the jammer suppressed signal and the noise reference signal; comparing the speech SNRs of each frequency bin to a set of speech enhancer thresholds; applying a noise suppression signal to each frequency bin of the jammer suppressed signal, wherein an amplitude of the noise suppression signal applied to a frequency bin of the jammer suppressed signal is related to the SNR corresponding to the frequency bin.
13. A method for voice capture, comprising: generating, via a minimum variance distortionless response (MVDR) beamformer, a first voice beamformer signal based on a first frequency domain microphone signal and a second frequency domain microphone signal; generating, via a delay and sum beamformer, a second voice beamformer signal based on the first frequency domain microphone signal and the second frequency domain microphone signal; blending, via a voice mixer, the first voice beamformer signal with the second voice beamformer signal to generate a voice enhanced reference signal, wherein an amount of the first beamformer signal as compared to the second beamformer signal in the voice enhanced reference signal is dynamically adjusted based on amplitudes of the first and the second beamformer signals; generating, via a voice rejected noise beamformer, a noise reference signal based on the first frequency domain microphone signal and the second frequency domain microphone signal; adjusting, via a noise beamformer adjustor, one or more filter weights of a voice rejected noise beamformer to account for fit variation, wherein the filter weights are adjusted based on the first frequency domain microphone signal, the second frequency domain microphone signal, and the noise reference signal; generating, via a jammer suppressor, a jammer suppressed signal based on the voice enhanced reference signal and the noise reference signal; and generating, via a speech enhancer, an output voice signal based on the jammer suppressed signal, the noise reference signal, and a voice detection signal.
14. The method of claim 13, further comprising: generating a signal-to-noise ratio (SNR) quality check signal based on the second frequency domain microphone signal; generating, via a quality check voice activity detector, a voice detection quality check signal based on a frequency domain feedback microphone signal or the second frequency domain microphone signal; storing, via a first data accumulator, first voice data corresponding to a relationship between the first frequency domain microphone signal and the second frequency domain microphone signal; storing, via a second data accumulator, second voice data corresponding to an energy level of the first frequency domain microphone signal; and dynamically updating, if the SNR quality check signal exceeds an SNR quality threshold, the voice detection quality check signal exceeds a voice detection quality threshold, and the first voice data or the second voice data exceeds a storage threshold, the one or more filter weights of the voice rejected noise beamformer based on the first frequency domain microphone signal and the second frequency domain microphone signal.
15. The method of claim 13, where the speech enhancer is configured to generate the output voice signal by: determining a series of speech signal-to-noise ratios (SNR) corresponding to a series of frequency bins based on the jammer suppressed signal and the noise reference signal; comparing the speech SNRs of each frequency bin to a set of speech enhancer thresholds; applying a noise suppression signal to each frequency bin of the jammer suppressed signal, wherein an amplitude of the noise suppression signal applied to a frequency bin of the jammer suppressed signal is related to the SNR corresponding to the frequency bin.
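
The per-bin speech enhancement recited in claims 8, 12, and 15 can be illustrated with a minimal Python sketch. It is not the patented implementation. The two-threshold linear ramp, the threshold values (`low_db`, `high_db`), the gain floor (`min_gain`), and all function names are assumptions chosen for illustration, and the voice detection signal recited in the independent claims is omitted for brevity.

```python
import math


def bin_snr_db(jammer_suppressed, noise_reference, floor=1e-12):
    """Estimate a speech SNR in dB for each frequency bin.

    Both inputs are sequences of complex spectral values, one per bin.
    `floor` is a hypothetical regularizer that avoids division by zero.
    """
    snrs = []
    for s, n in zip(jammer_suppressed, noise_reference):
        signal_power = abs(s) ** 2
        noise_power = abs(n) ** 2
        snrs.append(10.0 * math.log10((signal_power + floor) / (noise_power + floor)))
    return snrs


def suppression_gain(snr_db, low_db=0.0, high_db=15.0, min_gain=0.1):
    """Compare one bin's SNR to two thresholds and return a gain.

    At or below low_db the bin gets min_gain (maximum suppression); at or
    above high_db it passes unchanged; in between the gain ramps linearly.
    """
    if snr_db <= low_db:
        return min_gain
    if snr_db >= high_db:
        return 1.0
    frac = (snr_db - low_db) / (high_db - low_db)
    return min_gain + frac * (1.0 - min_gain)


def enhance(jammer_suppressed, noise_reference, **thresholds):
    """Apply an SNR-dependent gain to every bin of the jammer suppressed signal."""
    snrs = bin_snr_db(jammer_suppressed, noise_reference)
    return [s * suppression_gain(snr, **thresholds)
            for s, snr in zip(jammer_suppressed, snrs)]
```

In this sketch a bin dominated by noise is attenuated toward `min_gain`, while a bin with a high speech SNR is left untouched, so the amount of suppression applied to each bin depends on that bin's SNR.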

Description

FIELD OF THE DISCLOSURE

The present disclosure is generally directed to a dynamic voice capture system for wearable audio devices.

BACKGROUND

One important aspect of a wearable audio device is the ability to capture voice audio from the wearer. Whether the captured speech is in the context of a voice call with another person, or entering a voice audio command in an electronic system, the clarity of the voice audio is important to the use of the device. In many cases, these wearable devices may have a wide range of in-ear or on-ear fitting variations for both an individual wearer, as well as across a variety of different wearers. In other cases, the fit of the wearable audio device may change while being worn, such as due to sweat or other factors. When the fit of the wearable audio device is different than anticipated by the manufacturer, voice capture performance may suffer due to the preprogrammed directionality of aspects of the voice capture system. Accordingly, there is a need for a voice capture system capable of dynamically adjusting according to fit variations.

SUMMARY

The present disclosure is generally directed to a dynamic voice capture system for wearable audio devices.

Generally, in one aspect, a voice capture system is provided. The voice capture system includes a voice enhanced reference signal. The voice enhanced reference signal is based on a first frequency domain microphone signal and a second frequency domain microphone signal.

The voice capture system further includes a voice rejected noise beamformer. The voice rejected noise beamformer includes one or more filter weights. The voice rejected noise beamformer is configured to generate a noise reference signal. The noise reference signal is based on the first frequency domain microphone signal and the second frequency domain microphone signal. According to an example, the voice rejected noise beamformer may be a Wiener delay and subtract noise beamformer.
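
As a rough illustration of how a delay and subtract beamformer can reject voice, the Python sketch below filters one microphone spectrum with per-bin weights and subtracts the result from the other. This is a simplified sketch under stated assumptions, not the disclosed design: the choice of which microphone is filtered, the pure-delay model of the voice path, and the names `noise_reference` and `delay_weights` are all hypothetical.

```python
import cmath


def delay_weights(num_bins, delay_samples, fft_size):
    """Per-bin weights modelling a pure delay: w[k] = exp(-j*2*pi*k*d/N).

    Assumes the wearer's voice reaches microphone 2 `delay_samples` after
    microphone 1; a real device would need a measured or adapted response.
    """
    return [cmath.exp(-2j * cmath.pi * k * delay_samples / fft_size)
            for k in range(num_bins)]


def noise_reference(x1, x2, weights):
    """Delay and subtract: filter microphone 1 and subtract it from microphone 2.

    x1 and x2 are complex spectra (one value per bin). When the weights match
    the voice path between the microphones, the voice component cancels and
    what remains is dominated by sound arriving from other directions.
    """
    return [b - w * a for a, b, w in zip(x1, x2, weights)]
```

If the weights no longer match the voice path, for example because the device sits differently in the ear, voice leaks into the noise reference. That mismatch is the motivation for adjusting the filter weights dynamically.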
According to a further example, the one or more filter weights of the voice rejected noise beamformer correspond to a stock voice direction or a wearer-specific voice direction.

The voice capture system further includes a noise beamformer adjustor. The noise beamformer adjustor is configured to adjust the one or more filter weights of the voice rejected noise beamformer to account for fit variation.

The voice capture system further includes a jammer suppressor. The jammer suppressor is configured to generate a jammer suppressed signal. The jammer suppressed signal is based on the voice enhanced reference signal and the noise reference signal.

The voice capture system further includes a speech enhancer. The speech enhancer is configured to generate an output voice signal. The output voice signal is based on the jammer suppressed signal, the noise reference signal, and a voice detection signal. According to an example, the voice detection signal is generated by a voice activity detector based on the voice enhanced reference signal and the noise reference signal.

According to an example, the noise beamformer adjustor is configured to generate a signal-to-noise ratio (SNR) quality check signal. The SNR quality check signal is based on the second frequency domain microphone signal. The noise beamformer adjustor is further configured to generate, via a quality check voice activity detector, a voice detection quality check signal. The noise beamformer adjustor is further configured to store, via a first data accumulator, first voice data corresponding to a relationship between the first frequency domain microphone signal and the second frequency domain microphone signal. The noise beamformer adjustor is further configured to store, via a second data accumulator, second voice data corresponding to an energy level of the first frequency domain microphone signal.
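
The quality checks and accumulators described above, combined with the gated weight update recited in claim 7, can be sketched in Python as follows. Several details are assumptions made for illustration only: the "relationship" between the microphone signals is modelled as a per-bin cross-spectrum, the storage threshold is modelled as a frame count, frames that fail a quality check are simply discarded, and the class name, method names, and default thresholds are hypothetical.

```python
class NullformerAdjustor:
    """Accumulates voice statistics and emits new filter weights when gated checks pass."""

    def __init__(self, num_bins, snr_threshold=10.0, vad_threshold=0.5, min_frames=50):
        self.snr_threshold = snr_threshold
        self.vad_threshold = vad_threshold
        self.min_frames = min_frames          # stand-in for the storage threshold
        self.num_bins = num_bins
        self._reset()

    def _reset(self):
        self.cross = [0j] * self.num_bins     # first accumulator: sum of x2 * conj(x1)
        self.energy = [0.0] * self.num_bins   # second accumulator: sum of |x1|^2
        self.frames = 0

    def observe(self, x1, x2, snr_check, vad_check):
        """Process one frame of complex spectra.

        Returns a list of per-bin weights once enough good frames are stored,
        otherwise None. The weights are the least-squares (Wiener-style)
        estimate w = sum(x2 * conj(x1)) / sum(|x1|^2).
        """
        # Discard frames that fail either quality check (assumed behaviour).
        if snr_check <= self.snr_threshold or vad_check <= self.vad_threshold:
            return None
        for k, (a, b) in enumerate(zip(x1, x2)):
            self.cross[k] += b * a.conjugate()
            self.energy[k] += abs(a) ** 2
        self.frames += 1
        if self.frames < self.min_frames:
            return None
        weights = [c / e if e > 0.0 else 0j
                   for c, e in zip(self.cross, self.energy)]
        self._reset()
        return weights
```

A caller would pass each frame's spectra and check signals to `observe` and, whenever it returns a list rather than `None`, install those weights in the voice rejected noise beamformer.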
The noise beamformer adjustor is further configured to dynamically update, if the SNR quality check signal exceeds an SNR quality threshold, the voice detection quality check signal exceeds a voice detection quality threshold, and the first voice data or the second voice data exceeds a storage threshold, the one or more filter weights of the voice rejected noise beamformer based on the first frequency domain microphone signal and the second frequency domain microphone signal.

According to an example, the speech enhancer is configured to generate the output voice signal by: (1) determining a series of speech SNRs corresponding to a series of frequency bins based on the jammer suppressed signal and the noise reference signal; (2) comparing the speech SNRs of each frequency bin to a set of speech enhancer thresholds; and (3) applying a noise suppression signal to each frequency bin of the jammer suppressed signal, wherein an amplitude of the noise suppression signal applied to a frequency bin of the jammer suppressed signal is related to the SNR corresponding to the frequency bin.

According to an example, the voice enhanced reference signal is ge