CN-121999793-A - Real-time voice enhancement method and system based on adaptive deep neural network


Abstract

The invention provides a real-time voice enhancement method and system based on an adaptive deep neural network, and belongs to the technical field of voice signal processing. The method comprises the following steps: collecting multi-channel voice signals through a multi-microphone array and suppressing environmental noise through a beamforming algorithm; processing the voice signals through a hybrid noise reduction method that combines wavelet transform and adaptive filtering; constructing a dynamic neural network model that dynamically adjusts its network weights according to noise environment classification results and enhances the voice signals; performing frequency band energy compensation on the enhanced voice in combination with a psychoacoustic model to reduce auditory fatigue; and reconstructing a time domain voice signal through inverse short-time Fourier transform and post-processing it to output the final enhanced voice. The method offers strong adaptability, good real-time performance, and good voice quality, and is suitable for scenarios such as voice communication, voice recognition, and intelligent voice assistants.
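As a rough illustration of the beamforming stage mentioned above, the following minimal sketch shows delay-and-sum beamforming in the time domain with integer sample delays. It is a sketch under stated assumptions, not the patented implementation: the record describes spatial filtering on frequency domain signals, and it does not say how the per-channel delays are obtained. The function name `delay_and_sum` and its parameters are hypothetical.

```python
def delay_and_sum(channels, delays):
    """Align each microphone channel by its delay, then average.

    channels: list of equal-length sample lists, one per microphone.
    delays:   arrival delay of the target source at each microphone,
              in whole samples (assumed known; hypothetical input).
    """
    if len(channels) != len(delays):
        raise ValueError("need one delay per channel")
    n = len(channels[0])
    out = [0.0] * n
    for samples, d in zip(channels, delays):
        for i in range(n):
            j = i + d  # sample that reached this microphone d samples late
            if 0 <= j < n:
                out[i] += samples[j]
    # Averaging keeps the aligned target at its original level, while
    # uncorrelated noise on the channels partially cancels.
    return [v / len(channels) for v in out]
```

For example, if the same pulse reaches the second microphone two samples later than the first, `delay_and_sum([[0, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0]], [0, 2])` realigns the two copies before averaging them.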

Inventors

WANG YI
XU QING
XU YAOHUA
JIANG FANG
FANG HONGYU

Assignees

安徽大学未来产业创新研究院

Dates

Publication Date: 20260508
Application Date: 20260211

Claims (8)

1. A method for real-time speech enhancement based on an adaptive deep neural network, the method comprising: collecting multi-channel original time domain voice signals by using a multi-microphone array; framing and windowing the multi-channel original time domain voice signals, and obtaining multi-channel frequency domain signals through short-time Fourier transform; performing spatial filtering on the multi-channel frequency domain signals based on a beamforming algorithm so as to enhance a target sound source signal and suppress environmental noise, and reconstructing the spatially filtered single-channel frequency domain signal into a time domain voice signal based on inverse short-time Fourier transform; performing multi-scale transform decomposition and adaptive filtering on the time domain voice signal, and reconstructing the filtering result and extracting a frequency domain signal from it to obtain a frequency domain amplitude spectrum and a phase spectrum; inputting the frequency domain amplitude spectrum into an adaptive deep neural network, dynamically adjusting network parameters according to a noise environment classification result, and outputting an enhanced frequency domain amplitude spectrum; performing frequency band energy adjustment on the enhanced frequency domain amplitude spectrum based on a psychoacoustic model, adjusting the energy of different frequency bands according to the auditory masking effect to obtain an energy-adjusted frequency domain amplitude spectrum; and combining the energy-adjusted frequency domain amplitude spectrum with the phase spectrum, and reconstructing an enhanced time domain voice signal through inverse short-time Fourier transform.
2. The method for real-time speech enhancement based on an adaptive deep neural network according to claim 1, wherein the beamforming algorithm comprises delay-and-sum beamforming and minimum variance distortionless response (MVDR) beamforming.
3. The method for real-time speech enhancement based on an adaptive deep neural network according to claim 1, wherein the step of performing multi-scale transform decomposition and adaptive filtering on the time domain voice signal, and reconstructing the filtering result and extracting a frequency domain signal from it to obtain a frequency domain amplitude spectrum and a phase spectrum, comprises: performing wavelet transform decomposition on the time domain voice signal to obtain sub-band signals at multiple scales; performing noise estimation on each sub-band signal based on an adaptive filtering algorithm, and filtering the estimated noise out of the corresponding sub-band signal to obtain filtered sub-band signals; performing inverse wavelet transform on the filtered sub-band signals to reconstruct a denoised time domain voice signal; and performing short-time Fourier transform on the denoised time domain voice signal to obtain the corresponding frequency domain amplitude spectrum and phase spectrum.
4. The method for real-time speech enhancement based on adaptive deep neural network according to claim 1, wherein the adaptive deep neural network is a dynamic neural network, and is capable of dynamically adjusting network parameters including convolution kernel weights and activation function parameters according to noise environment classification results.
5. The method for real-time speech enhancement based on an adaptive deep neural network according to claim 1, wherein the step of performing frequency band energy adjustment on the enhanced frequency domain amplitude spectrum based on a psychoacoustic model, and adjusting the energy of different frequency bands according to the auditory masking effect to obtain the energy-adjusted frequency domain amplitude spectrum, comprises: dividing the enhanced frequency domain amplitude spectrum into a plurality of frequency band components based on critical band theory, and calculating the auditory masking threshold corresponding to each frequency band component; calculating an actual energy value of each frequency band component, and performing energy adjustment on the frequency band component according to the actual energy value and the corresponding auditory masking threshold, wherein the energy adjustment is energy enhancement or energy suppression; and combining the energy adjustment results of the frequency band components to obtain the energy-adjusted frequency domain amplitude spectrum.
6. The method of claim 5, wherein the step of calculating the actual energy value of each frequency band component and performing energy adjustment on the frequency band component according to the actual energy value and the corresponding auditory masking threshold comprises: calculating the actual energy value of the frequency band component and comparing it with the corresponding auditory masking threshold; if the actual energy value is greater than or equal to the corresponding auditory masking threshold, performing energy suppression on the frequency band component; and if the actual energy value is smaller than the corresponding auditory masking threshold, performing energy enhancement on the frequency band component.
7. The method for real-time speech enhancement based on an adaptive deep neural network according to claim 1, wherein after the energy-adjusted frequency domain amplitude spectrum and the phase spectrum are combined and the time domain voice signal is reconstructed by inverse short-time Fourier transform, the method further comprises: performing direct current component removal and gain adjustment on the reconstructed time domain voice signal to obtain a final time domain voice signal.
8. A real-time speech enhancement system based on an adaptive deep neural network, the system comprising: a signal acquisition module, configured to collect multi-channel original time domain voice signals by using a multi-microphone array; a preprocessing module, configured to frame and window the multi-channel original time domain voice signals and obtain multi-channel frequency domain signals through short-time Fourier transform; a beamforming module, configured to perform spatial filtering on the multi-channel frequency domain signals based on a beamforming algorithm so as to enhance a target sound source signal and suppress environmental noise, and to reconstruct the spatially filtered single-channel frequency domain signal into a time domain voice signal based on inverse short-time Fourier transform; a filtering module, configured to perform multi-scale transform decomposition and adaptive filtering on the time domain voice signal, and to reconstruct the filtering result and extract a frequency domain signal from it to obtain a frequency domain amplitude spectrum and a phase spectrum; an adaptive module, configured to input the frequency domain amplitude spectrum into the adaptive deep neural network, dynamically adjust network parameters according to a noise environment classification result, and output an enhanced frequency domain amplitude spectrum; an adjusting module, configured to perform frequency band energy adjustment on the enhanced frequency domain amplitude spectrum based on a psychoacoustic model, adjusting the energy of different frequency bands according to the auditory masking effect to obtain an energy-adjusted frequency domain amplitude spectrum; and an enhancement module, configured to combine the energy-adjusted frequency domain amplitude spectrum with the phase spectrum and reconstruct an enhanced time domain voice signal through inverse short-time Fourier transform.
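
The comparison rule in claims 5 and 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the band edges and masking thresholds are taken as given inputs (the claims do not specify how the thresholds are computed), band energy is assumed to be the sum of squared magnitudes, and the gain values `boost` and `cut` are illustrative placeholders. The function name `adjust_band_energy` is hypothetical.

```python
def adjust_band_energy(magnitudes, band_edges, thresholds, boost=1.25, cut=0.8):
    """Scale each band of one frame's magnitude spectrum per claim 6.

    magnitudes: magnitude spectrum of a single frame (list of floats).
    band_edges: bin indices [e0, e1, ..., eK]; band k covers bins
                e_k up to but not including e_(k+1).
    thresholds: one auditory masking threshold per band, expressed in
                the same energy units (assumed precomputed).
    boost, cut: gains for enhancement and suppression; illustrative
                values that do not come from the source.
    """
    if len(thresholds) != len(band_edges) - 1:
        raise ValueError("need one threshold per band")
    adjusted = list(magnitudes)
    for k, threshold in enumerate(thresholds):
        lo, hi = band_edges[k], band_edges[k + 1]
        # Assumed energy measure: sum of squared magnitudes in the band.
        energy = sum(m * m for m in magnitudes[lo:hi])
        # Claim 6: suppress when energy >= threshold, otherwise enhance.
        gain = cut if energy >= threshold else boost
        for i in range(lo, hi):
            adjusted[i] = magnitudes[i] * gain
    return adjusted
```

With two bands of two bins each, `adjust_band_energy([2.0, 2.0, 1.0, 1.0], [0, 2, 4], [4.0, 4.0], boost=2.0, cut=0.5)` attenuates the first band, whose energy of 8.0 meets the threshold, and amplifies the second, whose energy of 2.0 falls below it.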

Description

Real-time voice enhancement method and system based on adaptive deep neural network

Technical Field

The invention relates to the technical field of voice signal processing, and in particular to a real-time voice enhancement method and system based on an adaptive deep neural network.

Background

Voice enhancement technology extracts a clear target voice signal from a noisy voice signal in order to improve voice quality and intelligibility. With the rapid development of technologies such as voice communication, voice recognition, and intelligent voice assistants, the requirements on the robustness and real-time performance of voice enhancement in noisy environments keep increasing. However, conventional voice enhancement methods still face a number of challenges when dealing with complex noise environments.

First, conventional voice enhancement methods have inherent limitations. They mainly comprise spectral subtraction, Wiener filtering, adaptive filtering, and the like. These methods perform well in certain scenarios, but have limited effectiveness when handling non-stationary noise, signals with a low signal-to-noise ratio (SNR), and complex acoustic environments. Their shortcomings include inaccurate noise estimation, speech distortion, and insufficient real-time performance.

Second, in recent years deep neural networks (DNN) have made significant progress in the field of voice enhancement. A DNN can effectively improve voice enhancement performance by learning the mapping between large amounts of noisy speech and clean speech. However, the speech enhanced by existing DNN-based methods may sound distorted or unnatural, which affects the user experience.
Disclosure of Invention

The invention provides a real-time voice enhancement method and system based on an adaptive deep neural network, which are used for solving the technical problem that enhanced voice in a complex noise environment is distorted or unnatural.

The invention provides a real-time voice enhancement method based on an adaptive deep neural network, which comprises the steps of: collecting multi-channel original time domain voice signals by using a multi-microphone array; framing and windowing the multi-channel original time domain voice signals, and obtaining multi-channel frequency domain signals through short-time Fourier transform; performing spatial filtering on the multi-channel frequency domain signals based on a beamforming algorithm to enhance a target sound source signal and suppress environmental noise, and reconstructing the spatially filtered single-channel frequency domain signal into a time domain voice signal based on inverse short-time Fourier transform; performing multi-scale transform decomposition and adaptive filtering on the time domain voice signal, and reconstructing the filtering result and extracting a frequency domain signal from it to obtain a frequency domain amplitude spectrum and a phase spectrum; inputting the frequency domain amplitude spectrum into the adaptive deep neural network, dynamically adjusting network parameters according to a noise environment classification result, and outputting an enhanced frequency domain amplitude spectrum; performing frequency band energy adjustment on the enhanced frequency domain amplitude spectrum based on a psychoacoustic model, adjusting the energy of different frequency bands according to the auditory masking effect to obtain an energy-adjusted frequency domain amplitude spectrum; and combining the energy-adjusted frequency domain amplitude spectrum with the phase spectrum, and reconstructing an enhanced time domain voice signal through inverse short-time Fourier transform.

In one embodiment of the invention, the beamforming algorithm includes delay-and-sum beamforming and minimum variance distortionless response (MVDR) beamforming.

In one embodiment of the invention, the step of performing multi-scale transform decomposition and adaptive filtering on the time domain voice signal, and reconstructing the filtering result and extracting a frequency domain signal from it to obtain a frequency domain amplitude spectrum and a phase spectrum, comprises: performing wavelet transform decomposition on the time domain voice signal to obtain sub-band signals at multiple scales; performing noise estimation on each sub-band signal based on an adaptive filtering algorithm, and filtering the estimated noise out of the corresponding sub-band signal to obtain filtered sub-band signals; performing inverse wavelet transform on the filtered sub-band signals to reconstruct a denoised time domain voice signal; and performing short-time Fourier transform on the time domain voice si