JP-2026514356-A - Low latency noise suppression


Abstract

The device includes one or more processors configured to acquire audio data representing one or more audio signals. The audio data includes a first segment and a second segment following the first segment. The processors are configured to perform one or more transformation operations on the first segment to generate frequency-domain audio data. The processors are configured to provide input data based on the frequency-domain audio data as input to one or more machine learning models to generate a noise-suppressed output. The processors are configured to perform one or more inverse transformation operations on the noise-suppressed output to generate time-domain filter coefficients. The processors are configured to perform time-domain filtering on the second segment using the time-domain filter coefficients to generate a noise-suppressed output signal.
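The abstract's pipeline (transform a segment, run a model over the frequency bins, inverse-transform the result into filter taps, then filter the *next* segment in the time domain) can be sketched as below. This is a minimal illustration, not the patented implementation: the machine-learning model is replaced by a hypothetical Wiener-like gain, and the segment and filter lengths are arbitrary assumptions.

```python
import numpy as np

def toy_noise_gain(spectrum):
    """Hypothetical stand-in for the ML model: a per-bin gain in [0, 1]."""
    mag = np.abs(spectrum)
    return mag / (mag + np.median(mag) + 1e-12)

def process_segments(first_seg, second_seg, n_taps=32):
    # 1. Transform the first segment to the frequency domain.
    spectrum = np.fft.rfft(first_seg)
    # 2. The "model" produces a noise-suppression output over frequency bins.
    gain = toy_noise_gain(spectrum)
    # 3. Inverse-transform that output into time-domain filter coefficients,
    #    truncated to a short FIR filter.
    coeffs = np.fft.irfft(gain)[:n_taps]
    # 4. Filter the *following* segment entirely in the time domain.
    return np.convolve(second_seg, coeffs, mode="same")

rng = np.random.default_rng(0)
seg1 = rng.standard_normal(512)
seg2 = rng.standard_normal(512)
out = process_segments(seg1, seg2)
```

Because step 4 is a short time-domain convolution, the current segment is never buffered for a block transform, which is the source of the low latency the title refers to.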

Inventors

  • Jacob John Bean
  • Rogerio Guedes Alves
  • Vahid Montazeri
  • Erik Visser

Assignees

  • Qualcomm Incorporated

Dates

Publication Date
2026-05-11
Application Date
2024-03-21
Priority Date
2024-03-20

Claims (20)

  1. A device comprising one or more processors, wherein the one or more processors are configured to: acquire audio data representing one or more audio signals, wherein the audio data includes a first segment and a second segment following the first segment; perform one or more transformation operations on the first segment to generate frequency-domain audio data; provide input data based on the frequency-domain audio data as input to one or more machine learning models to generate a noise-suppressed output; perform one or more inverse transformation operations on the noise-suppressed output to generate time-domain filter coefficients; and perform time-domain filtering on the second segment using the time-domain filter coefficients to generate a noise-suppressed output signal.
  2. The device according to claim 1, wherein the input data includes the frequency-domain audio data.
  3. The device according to claim 1, wherein the one or more machine learning models are configured to generate an output including a frequency mask representing an estimated magnitude of noise in the frequency-domain audio data for each frequency bin of a plurality of frequency bins, and the noise-suppressed output includes the frequency mask.
  4. The device according to claim 1, wherein the one or more machine learning models are configured to generate an output including noise-suppressed audio data, the one or more processors are configured to determine, based on the noise-suppressed audio data, a frequency mask representing an estimated magnitude of noise in the frequency-domain audio data for each frequency bin of a plurality of frequency bins, and the noise-suppressed output includes the frequency mask.
  5. The device according to claim 1, wherein, to generate the noise-suppressed output, the one or more processors are configured to perform a beamforming operation on the frequency-domain audio data to determine beamformed audio data that distinguishes portions of the audio data from a target audio source from portions of the audio data from a non-target audio source, and the input data includes the beamformed audio data.
  6. The device according to claim 1, wherein, to process the frequency-domain audio data to generate the noise-suppressed output, the one or more processors are configured to perform a speech enhancement operation to determine speech-enhanced audio data, and the input data includes the speech-enhanced audio data.
  7. The device according to claim 1, wherein, to process the frequency-domain audio data to generate the noise-suppressed output, the one or more processors are configured to perform a source-separation operation to determine source-separated audio data, and the input data includes the source-separated audio data.
  8. The device according to claim 1, wherein the time-domain filter coefficients include linear phase finite impulse response (FIR) filter coefficients, minimum phase FIR filter coefficients, autoregressive filter coefficients, infinite impulse response (IIR) filter coefficients, or all-pole filter coefficients.
  9. The device according to claim 1, wherein the one or more processors are integrated into a wearable device.
  10. The device according to claim 1, further comprising one or more microphones, wherein the one or more audio signals are received from the one or more microphones.
  11. The device according to claim 10, further comprising an adaptive noise cancellation filter coupled to at least one of the one or more microphones.
  12. The device according to claim 1, further comprising one or more speakers and one or more microphones coupled to the one or more processors and integrated into a wearable device, wherein the one or more microphones include at least one external microphone configured to generate the audio data and at least one feedback microphone configured to generate a feedback signal based on the sound generated by the one or more speakers in response to the noise-suppressed output signal.
  13. A method comprising: acquiring audio data representing one or more audio signals, wherein the audio data includes a first segment and a second segment following the first segment; performing one or more transformation operations on the first segment to generate frequency-domain audio data; providing input data based on the frequency-domain audio data as input to one or more machine learning models to generate a noise-suppressed output; performing one or more inverse transformation operations on the noise-suppressed output to generate time-domain filter coefficients; and performing time-domain filtering on the second segment using the time-domain filter coefficients to generate a noise-suppressed output signal.
  14. The method according to claim 13, wherein the input data includes the frequency-domain audio data.
  15. The method according to claim 13, wherein the one or more machine learning models are configured to generate an output including a frequency mask representing an estimated magnitude of noise in the frequency-domain audio data for each frequency bin of a plurality of frequency bins, and the noise-suppressed output includes the frequency mask.
  16. The method according to claim 13, wherein the one or more machine learning models are configured to generate an output including noise-suppressed audio data, the method further includes determining, based on the noise-suppressed audio data, a frequency mask representing an estimated magnitude of noise in the frequency-domain audio data for each frequency bin of a plurality of frequency bins, and the noise-suppressed output includes the frequency mask.
  17. The method according to claim 13, further comprising performing a beamforming operation on the frequency-domain audio data to determine beamformed audio data that distinguishes between a portion of the audio data from a target audio source and a portion of the audio data from a non-target audio source, wherein the input data includes the beamformed audio data.
  18. The method according to claim 13, further comprising performing a speech enhancement operation to determine speech-enhanced audio data, wherein the input data includes the speech-enhanced audio data.
  19. The method according to claim 13, further comprising performing a source separation operation to determine source-separated audio data, wherein the input data includes the source-separated audio data.
  20. A non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the one or more processors to: acquire audio data representing one or more audio signals, wherein the audio data includes a first segment and a second segment following the first segment; perform one or more transformation operations on the first segment to generate frequency-domain audio data; provide input data based on the frequency-domain audio data as input to one or more machine learning models to generate a noise-suppressed output; perform one or more inverse transformation operations on the noise-suppressed output to generate time-domain filter coefficients; and perform time-domain filtering on the second segment using the time-domain filter coefficients to generate a noise-suppressed output signal.
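Claim 8 lists several admissible coefficient types. As one concrete illustration (an assumption for exposition, not the patent's procedure), a real-valued frequency mask can be turned into *linear-phase* FIR coefficients by taking its zero-phase inverse FFT, centering the impulse response, and windowing it; the resulting taps are symmetric, which is exactly the linear-phase property:

```python
import numpy as np

def mask_to_linear_phase_fir(mask, n_taps):
    """Real mask -> symmetric (hence linear-phase) FIR coefficients."""
    h = np.fft.irfft(mask)                # zero-phase impulse response (real, even)
    h = np.roll(h, n_taps // 2)[:n_taps]  # center the main lobe in the tap window
    return h * np.hamming(n_taps)         # taper to reduce truncation ripple

# Toy low-pass-like mask over 129 bins (i.e., a 256-point real FFT grid).
mask = np.linspace(1.0, 0.2, 129)
coeffs = mask_to_linear_phase_fir(mask, n_taps=33)
```

Minimum-phase FIR, IIR, autoregressive, or all-pole coefficients (also recited in claim 8) would use different spectral-factorization or model-fitting steps in place of the symmetric truncation shown here.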

Description

(Cross-Reference to Related Applications) This application claims priority from U.S. Provisional Patent Application No. 63/493,158, filed on March 30, 2023, and U.S. Non-Provisional Patent Application No. 18/611,308, filed on March 20, 2024, both owned by the same applicant, the entire contents of which are expressly incorporated herein by reference.

This disclosure relates generally to low-latency noise suppression.

Various types of hearing-related problems affect a significant number of people. For example, even people with relatively normal hearing may find it difficult to hear speech in noisy environments, and this problem can be considerably worse for people with hearing loss. For some individuals, speech is easily understandable only when the signal-to-noise ratio of speech relative to ambient noise exceeds a certain level. Wearable devices (e.g., earbuds, headphones, hearing aids) can be used in many situations to improve hearing, situational awareness, speech clarity, and other aspects. Generally, such devices apply relatively simple noise-suppression processes that remove as much ambient noise as possible. While such processes can improve the signal-to-noise ratio enough to make speech understandable, they may also reduce the user's situational awareness: because they simply try to remove as much noise as possible, they can eliminate important environmental cues such as traffic noise. More complex noise-suppression processes, however, can introduce significant latency, and latency in processing real-time speech can lead to user dissatisfaction.

According to one implementation of this disclosure, a device includes one or more processors configured to acquire audio data representing one or more audio signals. The audio data includes a first segment and a second segment following the first segment.
The one or more processors are configured to perform one or more transformation operations on the first segment to generate frequency-domain audio data. The one or more processors are configured to provide input data based on the frequency-domain audio data as input to one or more machine learning models to generate a noise-suppressed output. The one or more processors are configured to perform one or more inverse transformation operations on the noise-suppressed output to generate time-domain filter coefficients. The one or more processors are configured to perform time-domain filtering on the second segment using the time-domain filter coefficients to generate a noise-suppressed output signal.

According to another implementation of this disclosure, a method includes acquiring audio data representing one or more audio signals. The audio data includes a first segment and a second segment following the first segment. The method includes performing one or more transformation operations on the first segment to generate frequency-domain audio data. The method includes providing input data based on the frequency-domain audio data as input to one or more machine learning models to generate a noise-suppressed output. The method includes performing one or more inverse transformation operations on the noise-suppressed output to generate time-domain filter coefficients. The method includes performing time-domain filtering on the second segment using the time-domain filter coefficients to generate a noise-suppressed output signal.

According to another implementation of this disclosure, a non-transitory computer-readable medium stores instructions executable by one or more processors to cause the one or more processors to acquire audio data representing one or more audio signals. The audio data includes a first segment and a second segment following the first segment.
The instructions are executable to cause the one or more processors to perform one or more transformation operations on the first segment to generate frequency-domain audio data. The instructions are executable to cause the one or more processors to provide input data based on the frequency-domain audio data as input to one or more machine learning models to generate a noise-suppressed output. The instructions are executable to cause the one or more processors to perform one or more inverse transformation operations on the noise-suppressed output to generate time-domain filter coefficients. The instructions are executable to cause the one or more processors to perform time-domain filtering on the second segment using the time-domain filter coefficients to generate a noise-suppressed output signal.

According to another implementation of this disclosure, an apparatus includes means for performing one or more transformation operations on a first segment of audio data to generate frequency-domain audio data, wherein the audio data includes the first segment and a second segment following the first segment. The apparatus also includes means for processing input data based on the frequency-domain audio data as input to one or more machine learning models to generate a noise-suppressed output.
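The key structural idea running through each implementation above (filter coefficients derived from the preceding segment are applied to the current one, so no lookahead buffering is needed) can be shown as a streaming loop. This is a toy sketch under stated assumptions: the segment length and tap count are arbitrary, and the per-segment gain is a hypothetical placeholder for the machine-learning model.

```python
import numpy as np

def stream_denoise(samples, seg_len=256, n_taps=16):
    coeffs = np.zeros(n_taps)
    coeffs[0] = 1.0                      # pass-through until the first update
    out = np.empty_like(samples)
    for start in range(0, len(samples) - seg_len + 1, seg_len):
        seg = samples[start:start + seg_len]
        # Filter the current segment with coefficients from the previous one,
        # so each output sample requires no future input (low latency).
        out[start:start + seg_len] = np.convolve(seg, coeffs)[:seg_len]
        # Derive the next segment's coefficients from this segment:
        # transform, apply a toy gain (placeholder for the ML model),
        # and inverse-transform into short time-domain taps.
        spectrum = np.fft.rfft(seg)
        gain = np.abs(spectrum) / (np.abs(spectrum) + 1.0)
        coeffs = np.fft.irfft(gain)[:n_taps]
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal(1024)            # length chosen as a multiple of seg_len
y = stream_denoise(x)
```

Because the first segment is processed with the pass-through filter, the output's opening segment equals the input exactly; every later segment is shaped by coefficients learned from the segment before it.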