US-20260128055-A1 - HYBRID AHS: A HYBRID OF KALMAN FILTER AND DEEP LEARNING FOR ACOUSTIC HOWLING SUPPRESSION


Abstract

A method, an apparatus, and a non-transitory storage medium for hybrid acoustic howling suppression based on a frequency filter model and a deep neural network are provided. The method may include receiving a speech signal, the speech signal including target speech, feedback, and noise, and inputting the speech signal into a trained hybrid neural-network based howling suppression model, wherein the trained hybrid neural-network based howling suppression model is trained using a training speech signal and pre-processed acoustic feedback from a first frequency filter model. The method may also include generating an enhanced speech signal with suppressed howling as an output of the trained hybrid neural-network based howling suppression model, wherein the enhanced speech signal is used to update parameters of the first frequency filter model.

Inventors

Hao Zhang
Meng Yu
Dong Yu

Assignees

Tencent America LLC

Dates

Publication Date: 2026-05-07
Application Date: 2025-12-30

Claims (20)

1. A method of hybrid acoustic howling suppression, the method being executed by at least one processor, the method comprising: receiving a speech signal, the speech signal including target speech, feedback, and noise; inputting the speech signal into a trained hybrid neural-network based howling suppression model, wherein the trained hybrid neural-network based howling suppression model is trained using pre-processed acoustic feedback from a first frequency filter model; and generating an enhanced speech signal with suppressed howling as an output of the trained hybrid neural-network based howling suppression model.
2. The method of claim 1, wherein training the hybrid neural-network based howling suppression model comprises: generating a teacher speech signal, the teacher speech signal comprising a modified microphone signal, wherein the modified microphone signal comprises a target speech signal, a training noise signal, and a one-time playback signal, wherein the one-time playback signal is based on the target speech signal, and wherein the one-time playback signal replaces feedback in an initial microphone signal; and training the hybrid neural-network based howling suppression model for speech separation using the teacher speech signal and the pre-processed acoustic feedback from the first frequency filter model.
3. The method of claim 2, wherein training the hybrid neural-network based howling suppression model for speech separation is based on a combined loss function, the combined loss function comprising a first component based on scale-invariance signal-to-distortion ratio and a second component based on a mean absolute error of spectrum magnitude in a frequency domain.
4. The method of claim 2, wherein training the hybrid neural-network based howling suppression model for speech separation comprises: generating at least two reference signals, the at least two reference signals comprising a first intermediate signal based on the teacher speech signal and a second intermediate signal based on the pre-processed acoustic feedback from the first frequency filter model; and training the hybrid neural-network based howling suppression model for speech separation using the teacher speech signal, the pre-processed acoustic feedback from the first frequency filter model, the first intermediate signal, and the second intermediate signal.
5. The method of claim 2, wherein the pre-processed acoustic feedback from the first frequency filter model is used only for training the hybrid neural-network based howling suppression model.
6. The method of claim 2, wherein the pre-processed acoustic feedback from the first frequency filter model is not used for generating the enhanced speech signal with suppressed howling as the output of the trained hybrid neural-network based howling suppression model.
7. The method of claim 1, wherein the trained hybrid neural-network based howling suppression model is trained in an offline manner.
8. The method of claim 1, wherein the first frequency filter model is based on a Kalman Filter.
9. An apparatus for hybrid acoustic howling suppression, the apparatus comprising: at least one memory configured to store program code; and at least one processor configured to read the program code and operate as instructed by the program code, the program code including: first receiving code configured to cause the at least one processor to receive a speech signal, the speech signal comprising target speech, feedback, and noise; first inputting code configured to cause the at least one processor to input the speech signal into a trained hybrid neural-network based howling suppression model, wherein the trained hybrid neural-network based howling suppression model is trained using pre-processed acoustic feedback from a first frequency filter model; and first generating code configured to cause the at least one processor to generate an enhanced speech signal with suppressed howling as an output of the trained hybrid neural-network based howling suppression model.
10. The apparatus of claim 9, wherein training the hybrid neural-network based howling suppression model comprises: second generating code configured to cause the at least one processor to generate a teacher speech signal, the teacher speech signal comprising a modified microphone signal, wherein the modified microphone signal comprises a target speech signal, a training noise signal, and a one-time playback signal, wherein the one-time playback signal is based on the target speech signal, and wherein the one-time playback signal replaces feedback in an initial microphone signal; and first training code configured to cause the at least one processor to train the hybrid neural-network based howling suppression model for speech separation using the teacher speech signal and the pre-processed acoustic feedback from the first frequency filter model.
11. The apparatus of claim 10, wherein training the hybrid neural-network based howling suppression model for speech separation is based on a combined loss function, the combined loss function comprising a first component based on scale-invariance signal-to-distortion ratio and a second component based on a mean absolute error of spectrum magnitude in a frequency domain.
12. The apparatus of claim 10, wherein training the hybrid neural-network based howling suppression model for speech separation is based on a combined loss function, the combined loss function comprising a first component based on scale-invariance signal-to-distortion ratio and a second component based on a mean absolute error of spectrum magnitude in a frequency domain.
13. The apparatus of claim 10, wherein training the hybrid neural-network based howling suppression model for speech separation comprises: third generating code configured to cause the at least one processor to generate at least two reference signals, the at least two reference signals comprising a first intermediate signal based on the teacher speech signal and a second intermediate signal based on the pre-processed acoustic feedback from the first frequency filter model; and second training code configured to cause the at least one processor to train the hybrid neural-network based howling suppression model for speech separation using the teacher speech signal, the pre-processed acoustic feedback from the first frequency filter model, the first intermediate signal, and the second intermediate signal.
14. The apparatus of claim 10, wherein the pre-processed acoustic feedback from the first frequency filter model is used only for training the hybrid neural-network based howling suppression model.
15. The apparatus of claim 10, wherein the pre-processed acoustic feedback from the first frequency filter model is not used for generating the enhanced speech signal with suppressed howling as the output of the trained hybrid neural-network based howling suppression model.
16. A non-transitory computer-readable medium storing instructions, the instructions comprising: one or more instructions that, when executed by one or more processors of a device for hybrid acoustic howling suppression, cause the one or more processors to: receive a speech signal, the speech signal comprising target speech, feedback, and noise; input the speech signal into a trained hybrid neural-network based howling suppression model, wherein the trained hybrid neural-network based howling suppression model is trained using pre-processed acoustic feedback from a first frequency filter model; and generate an enhanced speech signal with suppressed howling as an output of the trained hybrid neural-network based howling suppression model.
17. The non-transitory computer-readable medium of claim 16, wherein training the hybrid neural-network based howling suppression model comprises: generating a teacher speech signal, the teacher speech signal comprising a modified microphone signal, wherein the modified microphone signal comprises a target speech signal, a training noise signal, and a one-time playback signal, wherein the one-time playback signal is based on the target speech signal, and wherein the one-time playback signal replaces feedback in an initial microphone signal; and training the hybrid neural-network based howling suppression model for speech separation using the teacher speech signal and the pre-processed acoustic feedback from the first frequency filter model.
18. The non-transitory computer-readable medium of claim 17, wherein training the hybrid neural-network based howling suppression model for speech separation is based on a combined loss function, the combined loss function comprising a first component based on scale-invariance signal-to-distortion ratio and a second component based on a mean absolute error of spectrum magnitude in a frequency domain.
19. The non-transitory computer-readable medium of claim 17, wherein training the hybrid neural-network based howling suppression model for speech separation comprises: generating at least two reference signals, the at least two reference signals comprising a first intermediate signal based on the teacher speech signal and a second intermediate signal based on the pre-processed acoustic feedback from the first frequency filter model; and training the hybrid neural-network based howling suppression model for speech separation using the teacher speech signal, the pre-processed acoustic feedback from the first frequency filter model, the first intermediate signal, and the second intermediate signal.
20. The non-transitory computer-readable medium of claim 17, wherein the pre-processed acoustic feedback from the first frequency filter model is not used for generating the enhanced speech signal with suppressed howling as the output of the trained hybrid neural-network based howling suppression model.
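
Illustrative note (not part of the claims): the teacher speech signal of claims 2, 10, and 17 and the combined loss of claims 3, 11, 12, and 18 can be sketched in NumPy as below. This is a minimal sketch under stated assumptions, not the applicants' implementation. The claims say only that the one-time playback signal is "based on the target speech signal"; forming it as a gain times a convolution with an impulse response is an assumption. The function names, the frame length, the hop size, the loss weight, the use of negative SI-SDR as the minimized term, and the choice of clean target speech as the loss reference are likewise hypothetical.

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to remove any scale mismatch.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    residual = estimate - projection
    return 10.0 * np.log10((np.sum(projection**2) + eps) / (np.sum(residual**2) + eps))

def magnitude_spectrogram(signal, frame_len=512, hop=256):
    """Magnitude of a Hann-windowed short-time Fourier transform (assumed sizes)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def combined_loss(estimate, target, weight=1.0):
    """Negative SI-SDR plus weighted MAE of spectrum magnitude (weight is assumed)."""
    mae = np.mean(np.abs(magnitude_spectrogram(estimate) - magnitude_spectrogram(target)))
    return -si_sdr(estimate, target) + weight * mae

def teacher_signal(target, noise, impulse_response, gain):
    """Modified microphone signal: target + noise + one-time playback.

    Assumption: the one-time playback is the target sent once through a
    hypothetical amplifier gain and acoustic path, with no recursion.
    """
    playback = gain * np.convolve(target, impulse_response)[:len(target)]
    return target + noise + playback

# Synthetic stand-ins for speech and noise, for illustration only.
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
noise = 0.05 * rng.standard_normal(16000)
path = np.zeros(64)
path[10] = 0.5  # hypothetical acoustic path: a single delayed, attenuated tap
mixture = teacher_signal(target, noise, path, gain=1.0)
better = target + 0.01 * rng.standard_normal(16000)  # stand-in for a model output

print("loss, unprocessed mixture:", combined_loss(mixture, target))
print("loss, cleaner estimate:   ", combined_loss(better, target))
```

Because the playback term is computed once from the target speech and is not re-amplified, the teacher signal does not depend on the model's own output, so a signal of this kind can be generated offline.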

Description

CROSS-REFERENCE TO RELATED APPLICATION

This application is a continuation of U.S. application Ser. No. 18/318,910, filed on May 17, 2023, the contents of which are incorporated by reference herein in its entirety.

BACKGROUND

Acoustic howling arises when sound from the speaker's end is captured by the microphone on the same end, leading to a feedback loop that amplifies the sound until it becomes unbearable. Acoustic howling has become a crucial problem in video/audio conference and acoustic amplification systems.

Several additional methods have been proposed, including passive methods like physical isolation of microphones and speakers, and active methods such as gain reduction, notch filters, and adaptive filtering. Among these methods, adaptive filtering may dynamically adjust the signal in real-time to prevent the feedback loop and lead to relatively better speech quality. However, the adaptive filter can be sensitive to control parameters and interferences and fails to address non-linear distortions introduced by amplifiers and loudspeakers.

In related art, deep learning has been recently introduced for efficient acoustic howling suppression (AHS). However, the recurrent nature of howling creates a mismatch between offline training and streaming inference, limiting the quality of enhanced speech.

As stated above, acoustic howling is a phenomenon that arises in sound reinforcement systems where the sound emitted from speakers is picked up by a microphone and re-amplified recursively in a feedback loop, resulting in an unpleasant high-pitched sound. This can occur in different settings such as concerts, presentations, public address systems, and hearing aids. AHS refers to the process of reducing or eliminating the occurrence of acoustic howling.

Therefore, it is crucial to have robust and effective solutions that can address this discrepancy between training the deep learning model and inferring from the deep learning model for acoustic howling suppression (AHS) in a joint manner, taking into account the complex acoustics of video/audio conference and acoustic amplification systems.

SUMMARY

According to embodiments, a method for hybrid acoustic howling suppression based on a frequency filter model and a deep neural network may be provided. The method may include receiving a speech signal, the speech signal including target speech, feedback, and noise; inputting the speech signal into a trained hybrid neural-network based howling suppression model, wherein the trained hybrid neural-network based howling suppression model is trained using a training speech signal and pre-processed acoustic feedback from a first frequency filter model; and generating an enhanced speech signal with suppressed howling as an output of the trained hybrid neural-network based howling suppression model, wherein the enhanced speech signal is used to update parameters of the first frequency filter model.

According to embodiments, an apparatus for hybrid acoustic howling suppression based on a frequency filter model and a deep neural network may be provided. The apparatus may include at least one memory configured to store program code; and at least one processor configured to read the program code and operate as instructed by the program code.
The program code may include first receiving code configured to cause the at least one processor to receive a speech signal, the speech signal comprising target speech, feedback, and noise; first inputting code configured to cause the at least one processor to input the speech signal into a trained hybrid neural-network based howling suppression model, wherein the trained hybrid neural-network based howling suppression model is trained using a training speech signal and pre-processed acoustic feedback from a first frequency filter model; and first generating code configured to cause the at least one processor to generate an enhanced speech signal with suppressed howling as an output of the trained hybrid neural-network based howling suppression model, wherein the enhanced speech signal is used to update parameters of the first frequency filter model.

According to embodiments, a non-transitory computer-readable medium storing instructions may be provided. The instructions, when executed by at least one processor for hybrid acoustic howling suppression based on a frequency filter model and a deep neural network, may cause the at least one processor to receive a speech signal, the speech signal comprising target speech, feedback, and noise; input the speech signal into a trained hybrid neural-network based howling suppression model, wherein the trained hybrid neural-network based howling suppression model is trained using a training speech signal and pre-processed acoustic feedback from a first frequency filter model; and generate an enhanced speech signal with suppressed howling as an output of the trained hybrid neural-network based howling suppression model, wherein the enhanced speech signal is used to