CN-122024753-A - Real-time voice enhancement method and system based on double-stage spectrum subtraction and double-mask fusion
Abstract
The invention provides a real-time voice enhancement method and system based on dual-stage spectral subtraction and dual-mask fusion, and relates to the technical field of voice signal processing. The method comprises: performing framing, windowing and short-time Fourier transform on a left channel mixed signal and a right channel noise reference signal, respectively, to obtain a complex frequency spectrum and an amplitude spectrum of the left channel signal and an amplitude spectrum of the right channel noise reference signal; and, based on the amplitude spectrum of the right channel noise reference signal, performing noise estimation using a first noise multiplication factor to obtain a noise estimation spectrum, and performing constrained spectral subtraction on the amplitude spectrum of the left channel signal to obtain a first-stage voice amplitude estimation. By combining framing, windowing and frequency-domain conversion with over-estimated spectral subtraction and dual-mask fusion, and by applying gain in two stages followed by time-domain reconstruction, the invention achieves effective suppression of TTS noise and real-time voice enhancement.
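As a rough illustration of the preprocessing and first-stage subtraction described in the abstract, the Python sketch below windows one frame of each channel, takes its FFT, and applies constrained spectral subtraction. The function names, the values of `noise_factor`, `keep_ratio` and `floor`, and the exact form of the constraint (a lower bound at the larger of a retained fraction of the input amplitude and a fixed floor) are assumptions made for this example; the abstract does not specify them.

```python
import numpy as np

def stft_frame(frame, window):
    """Window one time-domain frame and return its one-sided complex spectrum."""
    return np.fft.rfft(frame * window)

def constrained_spectral_subtraction(mix_mag, ref_mag, noise_factor=2.0,
                                     keep_ratio=0.1, floor=1e-3):
    """Over-estimate noise from the reference channel, subtract, then constrain.

    noise_factor, keep_ratio and floor are illustrative values (assumptions).
    """
    noise_est = noise_factor * ref_mag      # noise estimation spectrum
    prelim = mix_mag - noise_est            # preliminary voice amplitude spectrum
    # Assumed constraint: keep at least a fraction of the input, and never
    # fall below an absolute spectral floor.
    lower = np.maximum(keep_ratio * mix_mag, floor)
    return np.maximum(prelim, lower), noise_est

# Synthetic single frame: a 250 Hz tone stands in for the talker and a
# 1000 Hz tone stands in for the TTS playback picked up by both channels.
frame_len, fs = 512, 16000
t = np.arange(frame_len) / fs
noise = 0.5 * np.sin(2 * np.pi * 1000 * t)
speech = 0.8 * np.sin(2 * np.pi * 250 * t)
window = np.hanning(frame_len)

left_spec = stft_frame(speech + noise, window)   # left channel: mixture
left_mag = np.abs(left_spec)
right_mag = np.abs(stft_frame(noise, window))    # right channel: noise reference

voice_mag, noise_est = constrained_spectral_subtraction(left_mag, right_mag)
```

In this toy frame the bin holding the reference tone is pushed down to the retained fraction of its input amplitude, while the bin holding the talker tone is left almost unchanged.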
Inventors
- LIU JUN
- LIU BO
- WANG XUEYU
Assignees
- 北京智子新星科技有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20260203
Claims (10)
- 1. A real-time speech enhancement method based on dual-stage spectral subtraction and dual-mask fusion, the method comprising: Step 100, performing framing, windowing and short-time Fourier transform on a left channel mixed signal and a right channel noise reference signal, respectively, to obtain a complex frequency spectrum and an amplitude spectrum of the left channel signal and an amplitude spectrum of the right channel noise reference signal; Step 200, based on the amplitude spectrum of the right channel noise reference signal, performing noise estimation using a first noise multiplication factor to obtain a noise estimation spectrum, and performing constrained spectral subtraction on the amplitude spectrum of the left channel signal to obtain a first-stage voice amplitude estimation; Step 300, calculating an amplitude ratio mask and a Wiener mask from the first-stage voice amplitude estimation and the noise estimation spectrum, and fusing them according to a first mask mixing proportion to generate a first-stage frequency domain gain function; Step 400, applying the first-stage frequency domain gain function to the complex frequency spectrum of the left channel signal, and obtaining a first-stage enhanced time domain signal through inverse short-time Fourier transform; Step 500, performing short-time Fourier transform again on the first-stage enhanced time domain signal to obtain its amplitude spectrum, and, based on that amplitude spectrum and the amplitude spectrum of the right channel noise reference signal, performing second-stage noise estimation, constrained spectral subtraction, and dual-mask calculation and fusion using a second noise multiplication factor and a second mask mixing proportion to generate a second-stage frequency domain gain function; Step 600, performing convolution smoothing on the second-stage frequency domain gain function, applying the smoothed gain function to the complex frequency spectrum corresponding to the first-stage enhanced time domain signal, and performing inverse short-time Fourier transform and overlap-add reconstruction to obtain the enhanced real-time voice signal.
- 2. The real-time speech enhancement method based on dual-stage spectral subtraction and dual-mask fusion according to claim 1, wherein step 100 comprises: storing the input left channel mixed signal and the right channel noise reference signal in corresponding ring buffers; when the amount of data in the ring buffers reaches a preset frame length, extracting data segments of that length from the left and right channel buffers, respectively, to form current processing frames; applying a window function to the current processing frames to obtain windowed processing frames; and performing short-time Fourier transform on the windowed processing frames, respectively, to obtain the complex frequency spectrum and the amplitude spectrum of the left channel signal and the amplitude spectrum of the right channel noise reference signal.
- 3. The real-time speech enhancement method based on dual-stage spectral subtraction and dual-mask fusion according to claim 2, wherein step 200 comprises: amplifying the amplitude spectrum of the right channel noise reference signal by the first noise multiplication factor to obtain the noise estimation spectrum; subtracting the noise estimation spectrum from the amplitude spectrum of the left channel signal to obtain a preliminary voice amplitude spectrum; and constraining the preliminary voice amplitude spectrum based on a preset retention ratio and a spectral noise floor limit to obtain the first-stage voice amplitude estimation.
- 4. The real-time speech enhancement method based on dual-stage spectral subtraction and dual-mask fusion according to claim 3, wherein step 300 comprises: calculating the ratio of the first-stage voice amplitude estimation to the amplitude spectrum of the left channel signal to obtain the amplitude ratio mask; squaring the first-stage voice amplitude estimation and the noise estimation spectrum, respectively, to obtain a voice power spectrum and a noise power spectrum; adding the voice power spectrum and the noise power spectrum to obtain a total power spectrum; calculating the ratio of the voice power spectrum to the total power spectrum to obtain the Wiener mask; and performing weighted linear fusion of the amplitude ratio mask and the Wiener mask using the first mask mixing proportion to generate the first-stage frequency domain gain function.
- 5. The real-time speech enhancement method based on dual-stage spectral subtraction and dual-mask fusion according to claim 4, wherein step 400 comprises: multiplying the first-stage frequency domain gain function by the amplitude spectrum of the complex frequency spectrum of the left channel signal to obtain a corrected amplitude spectrum; combining the corrected amplitude spectrum with the phase spectrum of the left channel signal to form a corrected complex frequency spectrum; performing inverse short-time Fourier transform on the corrected complex frequency spectrum to obtain a preliminary enhanced time domain frame; and inputting the preliminary enhanced time domain frame into an overlap-add buffer as the first-stage enhanced time domain signal.
- 6. The real-time speech enhancement method based on dual-stage spectral subtraction and dual-mask fusion according to claim 5, wherein step 500 comprises: performing short-time Fourier transform on the first-stage enhanced time domain signal to obtain a complex frequency spectrum of the second-stage input signal, and extracting the amplitude spectrum of that complex frequency spectrum; amplifying the amplitude spectrum of the right channel noise reference signal by the second noise multiplication factor to obtain a second-stage noise estimation spectrum; subtracting the second-stage noise estimation spectrum from the amplitude spectrum of the second-stage input signal to obtain a second-stage preliminary voice amplitude spectrum; and constraining the second-stage preliminary voice amplitude spectrum based on a preset retention ratio and a spectral noise floor limit to obtain a second-stage voice amplitude estimation.
- 7. The real-time speech enhancement method based on dual-stage spectral subtraction and dual-mask fusion according to claim 6, wherein step 500 further comprises: calculating the ratio of the second-stage voice amplitude estimation to the amplitude spectrum of the second-stage input signal to obtain a second-stage amplitude ratio mask; squaring the second-stage voice amplitude estimation and the second-stage noise estimation spectrum, respectively, to obtain a second-stage voice power spectrum and a second-stage noise power spectrum; adding the second-stage voice power spectrum and the second-stage noise power spectrum to obtain a second-stage total power spectrum; calculating the ratio of the second-stage voice power spectrum to the second-stage total power spectrum to obtain a second-stage Wiener mask; and performing weighted linear fusion of the second-stage amplitude ratio mask and the second-stage Wiener mask using the second mask mixing proportion to generate the second-stage frequency domain gain function.
- 8. The real-time speech enhancement method based on dual-stage spectral subtraction and dual-mask fusion according to claim 7, wherein step 600 comprises: performing convolution smoothing on the second-stage frequency domain gain function along the frequency axis to obtain a smoothed gain function; multiplying the smoothed gain function by the amplitude spectrum of the complex frequency spectrum of the second-stage input signal to obtain a final corrected amplitude spectrum; combining the final corrected amplitude spectrum with the phase spectrum of the complex frequency spectrum of the second-stage input signal to form a final corrected complex frequency spectrum; performing inverse short-time Fourier transform on the final corrected complex frequency spectrum to obtain a final enhanced time domain frame; and overlap-adding the final enhanced time domain frames and normalizing the result to obtain a continuous enhanced voice signal.
- 9. A real-time speech enhancement system based on dual-stage spectral subtraction and dual-mask fusion, the system implementing the method of any one of claims 1 to 8 and comprising: a preprocessing module, configured to perform framing, windowing and short-time Fourier transform on the left channel mixed signal and the right channel noise reference signal, respectively, to obtain a complex frequency spectrum and an amplitude spectrum of the left channel signal and an amplitude spectrum of the right channel noise reference signal; a first stage processing module, configured to perform noise estimation using a first noise multiplication factor based on the amplitude spectrum of the right channel noise reference signal to obtain a noise estimation spectrum, and to perform constrained spectral subtraction on the amplitude spectrum of the left channel signal to obtain a first-stage voice amplitude estimation; a first stage application module, configured to apply the first-stage frequency domain gain function to the complex frequency spectrum of the left channel signal and to obtain a first-stage enhanced time domain signal through inverse short-time Fourier transform; a second stage processing module, configured to perform short-time Fourier transform again on the first-stage enhanced time domain signal to obtain its amplitude spectrum, and, based on that amplitude spectrum and the amplitude spectrum of the right channel noise reference signal, to perform second-stage noise estimation, constrained spectral subtraction, and dual-mask calculation and fusion using a second noise multiplication factor and a second mask mixing proportion to generate a second-stage frequency domain gain function; and an enhancement processing module, configured to perform convolution smoothing on the second-stage frequency domain gain function, to apply the smoothed gain function to the complex frequency spectrum corresponding to the first-stage enhanced time domain signal, and to obtain the enhanced real-time voice signal through inverse short-time Fourier transform and overlap-add reconstruction.
- 10. A computing device, comprising: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1 to 8.
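The dual-mask fusion of claims 4 and 7 and the gain smoothing of claim 8 can be sketched in Python for a single frame as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function names, all parameter values, the clipping of the amplitude ratio mask to [0, 1], the `EPS` guard, the moving-average kernel, and the form of the subtraction constraint are illustrative choices that the claims do not specify. For brevity the first-stage frame is re-windowed directly instead of passing through the overlap-add buffer of claim 5, and the final overlap-add and normalization are omitted.

```python
import numpy as np

EPS = 1e-12  # guard against division by zero (assumption)

def stage_gain(in_mag, ref_mag, noise_factor, mix, keep_ratio=0.1, floor=1e-3):
    """One stage: constrained spectral subtraction, then dual-mask fusion."""
    noise_est = noise_factor * ref_mag
    # Assumed constraint: retained fraction of the input plus an absolute floor.
    voice = np.maximum(in_mag - noise_est,
                       np.maximum(keep_ratio * in_mag, floor))
    ratio_mask = np.clip(voice / (in_mag + EPS), 0.0, 1.0)     # amplitude ratio mask
    wiener_mask = voice**2 / (voice**2 + noise_est**2 + EPS)   # Wiener mask
    return mix * ratio_mask + (1.0 - mix) * wiener_mask        # weighted linear fusion

def smooth_gain(gain, width=3):
    """Moving-average convolution along the frequency axis (odd width)."""
    kernel = np.ones(width) / width
    padded = np.pad(gain, width // 2, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

def enhance_frame(left, right, window, f1=2.0, f2=1.5, mix1=0.5, mix2=0.5):
    """Two-stage enhancement of one frame; returns the frame and both gains."""
    left_spec = np.fft.rfft(left * window)
    ref_mag = np.abs(np.fft.rfft(right * window))
    # Stage 1: a real, non-negative gain scales the amplitude and keeps the phase.
    g1 = stage_gain(np.abs(left_spec), ref_mag, f1, mix1)
    stage1 = np.fft.irfft(g1 * left_spec, n=len(left))
    # Stage 2: re-analyse the stage-1 output against the same noise reference.
    spec2 = np.fft.rfft(stage1 * window)
    g2 = smooth_gain(stage_gain(np.abs(spec2), ref_mag, f2, mix2))
    return np.fft.irfft(g2 * spec2, n=len(left)), g1, g2

# Synthetic frame: 250 Hz tone as the talker, 1000 Hz tone as TTS playback.
n, fs = 512, 16000
t = np.arange(n) / fs
tts = 0.5 * np.sin(2 * np.pi * 1000 * t)
talker = 0.8 * np.sin(2 * np.pi * 250 * t)
out, g1, g2 = enhance_frame(talker + tts, tts, np.hanning(n))
```

Because both masks lie in [0, 1], any mixing proportion between 0 and 1 keeps the fused gain in that range, and averaging neighbouring bins preserves the bound. In this toy frame both stages attenuate the reference-tone bin far more than the talker-tone bin.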
Description
Real-time voice enhancement method and system based on double-stage spectrum subtraction and double-mask fusion
Technical Field
The invention relates to the technical field of voice signal processing, and in particular to a real-time voice enhancement method and system based on dual-stage spectral subtraction and dual-mask fusion.
Background
In embedded applications such as intelligent voice interaction, separating human voice from noise in mixed audio in real time is a major challenge, especially when the noise source is text-to-speech (TTS) synthesized sound. Existing voice enhancement methods have the following shortcomings at the data-processing level and struggle to balance enhancement quality against the strict constraints of embedded devices.
First, deep-learning-based models rely on complex networks for end-to-end mapping and must execute a large number of matrix multiplications and nonlinear activations. A single inference often takes more than 100 milliseconds, which cannot meet the low-latency requirement of real-time interaction. Their parameter counts and memory footprints are large and exceed the resource limits of embedded devices, and their generalization is bounded by the training data, so they adapt poorly to unseen TTS variants or acoustic environments.
Second, conventional single-channel spectral subtraction subtracts directly on the amplitude spectrum. Because the spectrum of TTS voice is highly similar to that of real voice, the method cannot resolve deeper features such as harmonics and periodicity: over-suppression may distort the voice, under-suppression may leave residual noise, and performance deteriorates further at low signal-to-noise ratios.
Finally, multi-microphone beamforming relies on spatial processing of array signals and must complete complex operations such as covariance matrix calculation and eigendecomposition in real time, which imposes a heavy computational load. It also places extremely high demands on array calibration and sound source localization accuracy, and when array size and cost are constrained, the limited array aperture and spatial resolution make effective filtering difficult.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a real-time voice enhancement method and system based on dual-stage spectral subtraction and dual-mask fusion, which combine framing, windowing and frequency-domain conversion with over-estimated spectral subtraction and dual-mask fusion, and which achieve effective suppression of TTS noise and real-time voice enhancement through two-stage gain application and time-domain reconstruction.
To solve the above technical problem, the technical solution of the invention is as follows:
In a first aspect, a real-time speech enhancement method based on dual-stage spectral subtraction and dual-mask fusion, the method comprising:
Performing framing, windowing and short-time Fourier transform on a left channel mixed signal and a right channel noise reference signal, respectively, to obtain a complex frequency spectrum and an amplitude spectrum of the left channel signal and an amplitude spectrum of the right channel noise reference signal;
Based on the amplitude spectrum of the right channel noise reference signal, performing noise estimation using a first noise multiplication factor to obtain a noise estimation spectrum, and performing constrained spectral subtraction on the amplitude spectrum of the left channel signal to obtain a first-stage voice amplitude estimation;
Calculating an amplitude ratio mask and a Wiener mask from the first-stage voice amplitude estimation and the noise estimation spectrum, and fusing them according to a first mask mixing proportion to generate a first-stage frequency domain gain function;
Applying the first-stage frequency domain gain function to the complex frequency spectrum of the left channel signal, and obtaining a first-stage enhanced time domain signal through inverse short-time Fourier transform;
Performing short-time Fourier transform again on the first-stage enhanced time domain signal to obtain its amplitude spectrum, and, based on that amplitude spectrum and the amplitude spectrum of the right channel noise reference signal, performing second-stage noise estimation, constrained spectral subtraction, and dual-mask calculation and fusion using a second noise multiplication factor and a second mask mixing proportion to generate a second-stage frequency domain gain function;
And performing convolution smoothing on the second-stage frequency domain gain function, applying the smoothed gain function to the complex frequency spectrum corresponding to the first-stage enhanced time domain signal, and performing inverse short-time Fourier transform and overlap-add reconstruction