CN-120636439-B - Improved ELM voice enhancement method and device for noise reduction


Abstract

The invention discloses an improved ELM (extreme learning machine) voice enhancement method and device for noise reduction, in the field of speech signal processing. The method first obtains noisy speech signals and clean speech signals and constructs training samples from them. It then constructs an improved ELM speech enhancement network, trains it with the training samples and stores the network weight matrices. Finally, a new noisy speech signal is input into the trained network, which outputs the noise-reduced enhanced speech signal. The device comprises an audio acquisition module, an offline training module, an online operation module and an audio playing module. The audio acquisition module acquires the speech signal samples used for training. The offline training module uses these samples to train the improved ELM speech enhancement network and stores its weight matrices. The online operation module takes a newly acquired noisy speech signal and directly outputs the noise-reduced enhanced speech signal, which the audio playing module then plays. The invention reduces training overhead and also achieves lower processing delay.

Inventors

LIU JIANBING
FENG BO
FU XIAOWEI
SHANG YINZHONG
GAO FENG
ZHU HAIBO
JIANG RUI
SONG JUPO
LIU YONGHUI

Assignees

北京方位智联科技有限公司 (Beijing Fangwei Zhilian Technology Co., Ltd.)

Dates

Publication Date: 2026-05-12
Application Date: 2025-08-07

Claims (4)

1. An improved ELM voice enhancement method for noise reduction, characterized by comprising the following specific steps:
Step one: collecting noisy speech signals and clean speech signals from open-source data, and constructing a training sample set.
Step two: constructing an improved ELM speech enhancement network. The network architecture consists of 1 input layer, 1 statistical prior layer, 1 hidden layer and 1 output layer, each with a specified number of nodes. The weight matrix of the statistical prior layer is initialized to all zeros. The hidden-layer weight matrix and bias vector are generated from a Gaussian distribution with a specified mean and variance, and the bias vector is expanded into a bias matrix. The layer dimensions are expressed in terms of the number of frames and the number of frequency bins obtained after the speech signal is transformed into the time-frequency domain.
Step three: training the improved ELM speech enhancement network with the training sample set and storing the network weight matrices. The specific training process is as follows:
Step 301: the clean-speech amplitude spectrum and the noise amplitude spectrum of the training sample set are input into the improved ELM speech enhancement network, and the weight matrix of the statistical prior layer is calculated element by element. The element in row i, column j of this weight matrix is computed from the element in row i, column j of the clean-speech amplitude spectrum and the element in row i, column j of the noise amplitude spectrum.
Step 302: the output of the statistical prior layer is calculated from this weight matrix and the noisy-speech amplitude spectrum, again element by element over row i, column j.
Step 303: the output of the statistical prior layer is input to the hidden layer. The hidden-layer output of the improved ELM speech enhancement network is then computed with the sigmoid activation function from the prior-layer output, the hidden-layer weight matrix and the bias matrix.
Step 304: the weight matrix of the output layer is calculated from the hidden-layer output and the clean-speech amplitude spectrum, using the Moore-Penrose pseudo-inverse of the hidden-layer output.
Step 305: finally, the resulting weight matrices are stored as the network weight matrices.
Step four: acquiring a new noisy speech signal and inputting it into the trained improved ELM speech enhancement network for testing, to obtain the noise-reduced enhanced speech signal.
2. The method according to claim 1, wherein the training sample set in step one is constructed as follows. First, the noise signal is calculated from the acquired noisy speech signal and clean speech signal as their difference at each sample index, over all sample points of the speech signal. Then, the noisy speech signal, the clean speech signal and the noise signal are each transformed into the time-frequency domain, and the amplitude spectra of the noisy speech signal, the noise signal and the clean speech signal are calculated. Finally, these three amplitude spectra compose the training sample set.
3. The method according to claim 1, wherein step four specifically comprises:
Step 401: transforming the acquired new noisy speech signal into the time-frequency domain to obtain its amplitude spectrum, and also obtaining the phase of the noisy speech signal. The amplitude spectrum has the number of frames produced by this transform.
Step 402: judging whether the number of frames of this amplitude spectrum equals the corresponding dimension of the trained weight matrix. If it does, proceed to step 403; otherwise, proceed to step 404.
Step 403: calculating the enhanced amplitude spectrum for testing from the amplitude spectrum, using element-wise (Hadamard) multiplication of matrices, then proceeding to step 406.
Step 404: updating the weight matrix. The updated matrix is computed from all elements of a column of the original weight matrix.
Step 405: calculating the enhanced amplitude spectrum for testing from the amplitude spectrum and the updated weight matrix, frame by frame. Each frame is processed using all elements of the corresponding column of the test amplitude spectrum. Then proceed to step 406.
Step 406: obtaining the enhanced speech signal from the phase and the enhanced amplitude spectrum for testing by the inverse short-time Fourier transform. The enhanced speech signal has a specified number of sample points.
4. An improved ELM voice enhancement device for noise reduction using the method according to claim 1, comprising an audio acquisition module, an improved ELM speech enhancement network offline training module, an improved ELM speech enhancement network online operation module and an audio playing module. The audio acquisition module acquires the speech signal sample sets used to train and test the improved ELM network. The training sample set is input into the offline training module, which trains the improved ELM speech enhancement network, stores the trained weight matrices and tests the network with the test sample set. A newly acquired noisy speech signal is input into the online operation module, which directly outputs the noise-reduced enhanced speech signal, and the audio playing module plays this signal.
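The training steps 301 to 305 of claim 1 and the magnitude-domain part of step four in claim 3 (steps 402 to 405) can be sketched in Python with NumPy as below. The claim text above does not spell out the formulas, so several pieces are assumptions and are marked as such in the comments:

- Spectra are arrays of shape (frames, bins).
- The statistical prior weight is taken as a Wiener-like gain S²/(S²+D²).
- The prior-layer output is the element-wise product of that gain with the noisy spectrum.
- The frame-count mismatch update of step 404 is replaced by a hypothetical per-bin average over the training frames.

The function names `train_improved_elm` and `enhance_magnitude` are also hypothetical.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def train_improved_elm(Y, D, S, n_hidden=256, mu=0.0, var=1.0, seed=0, eps=1e-8):
    """Sketch of steps 301-305. Y, D, S: noisy, noise and clean amplitude
    spectra, each of shape (frames, bins) (layout assumed)."""
    rng = np.random.default_rng(seed)
    n_bins = Y.shape[1]
    # Step 301 (ASSUMED formula): Wiener-like prior gain, one value per element.
    Wp = S**2 / (S**2 + D**2 + eps)
    # Step 302 (ASSUMED element-wise form): prior-layer output.
    P = Wp * Y
    # Hidden layer: Gaussian weights and bias (numpy takes the std, not the variance).
    W = rng.normal(mu, np.sqrt(var), size=(n_bins, n_hidden))
    b = rng.normal(mu, np.sqrt(var), size=n_hidden)
    # Step 303: broadcasting b over all frames plays the role of the bias matrix.
    H = sigmoid(P @ W + b)
    # Step 304: output weights from the Moore-Penrose pseudo-inverse of H.
    beta = np.linalg.pinv(H) @ S
    # Step 305: keep every matrix needed at test time.
    return {"Wp": Wp, "W": W, "b": b, "beta": beta}


def enhance_magnitude(Y_test, net):
    """Sketch of steps 402-405: enhanced amplitude spectrum of a test signal."""
    Wp = net["Wp"]
    if Y_test.shape[0] == Wp.shape[0]:
        # Step 403: frame counts match, so apply the stored prior weights element-wise.
        P = Wp * Y_test
    else:
        # Steps 404-405 (HYPOTHETICAL update): average each frequency bin's prior
        # weights over the training frames and apply them to every test frame.
        w_bar = Wp.mean(axis=0)
        P = w_bar[None, :] * Y_test
    H = sigmoid(P @ net["W"] + net["b"])
    return H @ net["beta"]


# Toy usage with random "spectra" (real magnitudes are not exactly additive).
rng = np.random.default_rng(1)
S = np.abs(rng.normal(size=(100, 129)))
D = 0.3 * np.abs(rng.normal(size=(100, 129)))
Y = S + D
net = train_improved_elm(Y, D, S, n_hidden=64)
S_hat_same = enhance_magnitude(Y, net)        # step 403 path (100 frames)
S_hat_other = enhance_magnitude(Y[:60], net)  # steps 404-405 path (60 frames)
print(S_hat_same.shape, S_hat_other.shape)
```

Step 406 is omitted from this sketch. That step combines the enhanced magnitude with the noisy phase and applies the inverse STFT. All training happens in the single `pinv` call, which reflects the low training cost claimed for the ELM approach.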

Description

Improved ELM voice enhancement method and device for noise reduction

Technical Field

The invention belongs to the field of speech signal processing, and particularly relates to an improved ELM (extreme learning machine) speech enhancement method and device for noise reduction.

Background

Classical speech enhancement methods generally rely on strong assumptions and handle nonlinear noise poorly. Their performance often depends on the accuracy of the ambient noise estimate. In recent years, deep-learning-based speech enhancement methods have achieved excellent noise reduction thanks to the strong learning capability of neural networks. However, they require long training times and complex parameter tuning, and they are difficult to deploy on embedded devices. Compared with deep neural networks, the ELM is a single-hidden-layer network that needs no backward gradient propagation to update its weight parameters, and its hidden-layer weights and biases can be generated randomly. ELM networks therefore offer fast training and low processing delay. To exploit these advantages, the ELM network is introduced into speech enhancement. To further improve the ELM network's ability to suppress nonlinear noise, a statistical prior layer is added to it, forming the improved ELM speech enhancement network. The statistical prior layer suppresses part of the noise and also guides the optimization of the improved ELM speech enhancement network.
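The ELM training property described above (random hidden weights, no backward gradient propagation) can be illustrated with a minimal sketch of a plain ELM regressor in Python with NumPy. This is a generic ELM, not the improved network of this invention. The toy target (a sine curve), the hidden-layer size and the standard-normal initialization are illustrative assumptions.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def elm_fit(X, T, n_hidden, rng):
    """Train a single-hidden-layer ELM: random hidden layer, closed-form output layer."""
    W = rng.normal(0.0, 1.0, size=(X.shape[1], n_hidden))  # random, never updated
    b = rng.normal(0.0, 1.0, size=n_hidden)                 # random, never updated
    H = sigmoid(X @ W + b)                                  # hidden-layer output
    beta = np.linalg.pinv(H) @ T                            # least-squares output weights
    return W, b, beta


def elm_predict(X, W, b, beta):
    return sigmoid(X @ W + b) @ beta


rng = np.random.default_rng(42)
X = np.linspace(-3.0, 3.0, 200).reshape(-1, 1)
T = np.sin(X)
W, b, beta = elm_fit(X, T, n_hidden=100, rng=rng)
mse = float(np.mean((elm_predict(X, W, b, beta) - T) ** 2))
print(f"training MSE: {mse:.2e}")
```

Training reduces to one pseudo-inverse with no iterative gradient steps. This is the source of the fast training and low tuning effort cited above.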
Disclosure of Invention

The invention provides an improved ELM speech enhancement method and device for noise reduction. Compared with classical speech enhancement methods, it achieves a better enhancement effect. Compared with deep speech enhancement methods based on amplitude-spectrum mapping, it reduces both training overhead and processing delay, which makes it easier to deploy on low-power devices. The improved ELM speech enhancement method for noise reduction comprises the following steps:
Step one: collecting noisy speech signals and clean speech signals from open-source data and constructing a training sample set. The training sample set is constructed as follows. First, the noise signal is calculated from the acquired noisy speech signal and clean speech signal as their difference at each sample index, over all sample points of the speech signal. Then, the noisy speech signal, the clean speech signal and the noise signal are each transformed into the time-frequency domain, and the amplitude spectra of the noisy speech signal, the noise signal and the clean speech signal are calculated. The two dimensions of each amplitude spectrum are the number of frames and the number of frequency bins obtained after the transform.
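The noise-signal and amplitude-spectrum computations of step one can be sketched in Python with NumPy as follows. Several choices here are illustrative assumptions, not taken from the patent: the additive noise model, the periodic Hann window, the frame length of 256 samples, the hop of 128 samples and the (frames, bins) layout. The helper names `stft` and `build_training_sample` are also hypothetical.

```python
import numpy as np


def stft(x: np.ndarray, n_fft: int = 256, hop: int = 128) -> np.ndarray:
    """Complex STFT with a periodic Hann window; shape (frames, n_fft // 2 + 1).
    Assumes len(x) >= n_fft; trailing samples that do not fill a frame are dropped."""
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[t * hop : t * hop + n_fft] * window for t in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def build_training_sample(noisy, clean, n_fft=256, hop=128):
    """Return the noisy, noise and clean amplitude spectra (Y, D, S)."""
    assert noisy.shape == clean.shape
    noise = noisy - clean  # ASSUMED additive model: noise = noisy - clean
    Y = np.abs(stft(noisy, n_fft, hop))
    D = np.abs(stft(noise, n_fft, hop))
    S = np.abs(stft(clean, n_fft, hop))
    return Y, D, S


# Toy usage: one second of a 440 Hz tone at 16 kHz with white noise added.
rng = np.random.default_rng(0)
fs = 16000
t = np.arange(fs) / fs
clean = 0.5 * np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.1 * rng.standard_normal(fs)
Y, D, S = build_training_sample(noisy, clean)
print(Y.shape)  # (124, 129)
```

The STFT is linear, so the complex spectrum of the noisy signal equals the sum of the clean and noise spectra. The magnitudes, however, do not add exactly.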
Finally, the amplitude spectra of the noisy speech signal, the noise signal and the clean speech signal compose the training sample set.
Step two: constructing an improved ELM speech enhancement network. The network architecture consists of 1 input layer, 1 statistical prior layer, 1 hidden layer and 1 output layer, each with a specified number of nodes. The weight matrix of the statistical prior layer is initialized to all zeros. The hidden-layer weight matrix and bias vector are generated from a Gaussian distribution with a specified mean and variance, and the bias vector is expanded into a bias matrix.
Step three: training the improved ELM speech enhancement network with the training sample set and storing the network weight matrices. The specific training process is as follows:
Step 301: the clean-speech amplitude spectrum and the noise amplitude spectrum of the training sample set are input into the improved ELM speech enhancement network, and the weight matrix of the statistical prior layer is calculated element by element. The element in row i, column j of this weight matrix is computed from the element in row i, column j of the clean-speech amplitude spectrum and the element in row i, column j of the noise amplitude spectrum.
Step 302: the output of the statistical prior layer is calculated from this weight matrix and the element in row i, column j of the noisy-speech amplitude spectrum.
Step 303: the output of the statistical prior layer is input to the hidden layer, and the hidden-layer output of the improved ELM speech enhancement network is computed with the sigmoid activation function.
Step 304: from the hidden-layer output and the clean-speech amplitude spectrum, the weight matrix of the output layer is calculated; Expre