CN-116153323-B - Training method and device of voice noise reduction network, electronic equipment and storage medium
Abstract
The application provides a training method and device for a voice noise reduction network, an electronic device, and a computer readable storage medium. The method comprises: performing short-time Fourier transform on sample voice data in a sample data set to obtain sample time-frequency domain characteristics, wherein the sample voice data comprises noise voice data and clean voice data, and the sample time-frequency domain characteristics comprise noise time-frequency domain characteristics and clean time-frequency domain characteristics; calculating the noise time-frequency domain characteristics through a neural network model to obtain predicted time-frequency domain characteristics; evaluating the difference between the predicted time-frequency domain characteristics and the clean time-frequency domain characteristics through a loss function to obtain a function value; judging whether the function value is smaller than a preset loss threshold, wherein the loss function is switched in stages during training; and, if so, determining that the neural network model has converged, thereby obtaining the voice noise reduction network. With this scheme, a voice noise reduction network that balances noise reduction amount and voice fidelity is obtained through training without increasing the model parameters.
Inventors
- CHEN JINMING
- LI QIAN
Assignees
- 恒玄科技(上海)股份有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20230220
Claims (9)
- 1. A method for training a voice noise reduction network, comprising: performing short-time Fourier transform on sample voice data in a sample data set to obtain sample time-frequency domain characteristics, wherein the sample voice data comprises noise voice data and clean voice data corresponding to the noise voice data, and the sample time-frequency domain characteristics comprise noise time-frequency domain characteristics obtained by transforming the noise voice data and clean time-frequency domain characteristics obtained by transforming the clean voice data; calculating the noise time-frequency domain characteristics through a neural network model to obtain predicted time-frequency domain characteristics; evaluating the difference between the predicted time-frequency domain characteristics and the clean time-frequency domain characteristics through a loss function to obtain a function value, and adjusting model parameters of the neural network model based on the function value; judging whether the function value is smaller than a preset loss threshold, wherein the loss function is switched in stages during training; and if so, determining that the neural network model has converged, thereby obtaining a voice noise reduction network; wherein switching the loss function in stages during training comprises sequentially selecting an MAE loss, an MSE loss, and a difference fourth power loss during training.
- 2. The method according to claim 1, wherein the method further comprises: if the function value is not smaller than the loss threshold, returning to the step of calculating the noise time-frequency domain characteristics through the neural network model to obtain predicted time-frequency domain characteristics.
- 3. The method of claim 1, wherein prior to said performing short-time Fourier transform on the sample voice data in the sample data set to obtain the sample time-frequency domain characteristics, the method further comprises: acquiring a plurality of pieces of clean voice data, and generating corresponding noise voice data for each piece of clean voice data, wherein the noise voice data comprises the clean voice data and noise data; taking each piece of clean voice data as a sample label of the corresponding noise voice data; and constructing the sample data set based on the plurality of pieces of noise voice data carrying sample labels.
- 4. The method of claim 1, wherein the neural network model comprises a first fully connected layer, a first feature processing module, a second feature processing module, a third feature processing module, and a second fully connected layer, wherein the first fully connected layer is connected to the first feature processing module; the first feature processing module is connected to the second feature processing module by a residual connection; the first fully connected layer, the first feature processing module, and the second feature processing module are connected to the third feature processing module by a residual connection; and the third feature processing module is connected to the second fully connected layer.
- 5. The method of claim 4, wherein the first feature processing module, the second feature processing module, and the third feature processing module are recurrent neural networks.
- 6. The method of claim 1, wherein after obtaining the voice noise reduction network, the method further comprises: performing short-time Fourier transform on voice data to be processed to obtain time-frequency domain characteristics to be processed; calculating the time-frequency domain characteristics to be processed through the voice noise reduction network to obtain noise-reduced time-frequency domain characteristics; and performing inverse Fourier transform on the noise-reduced time-frequency domain characteristics to obtain noise-reduced voice data.
- 7. A training device for a speech noise reduction network, comprising: a conversion module, configured to perform short-time Fourier transform on sample voice data in a sample data set to obtain sample time-frequency domain characteristics, wherein the sample voice data comprises noise voice data and clean voice data corresponding to the noise voice data, and the sample time-frequency domain characteristics comprise noise time-frequency domain characteristics obtained by transforming the noise voice data and clean time-frequency domain characteristics obtained by transforming the clean voice data; a calculation module, configured to calculate the noise time-frequency domain characteristics through a neural network model to obtain predicted time-frequency domain characteristics; an adjusting module, configured to evaluate the difference between the predicted time-frequency domain characteristics and the clean time-frequency domain characteristics through a loss function to obtain a function value, and to adjust model parameters of the neural network model based on the function value; a judging module, configured to judge whether the function value is smaller than a preset loss threshold, wherein the loss function is switched in stages during training; and a determining module, configured to determine, if so, that the neural network model has converged, thereby obtaining a voice noise reduction network; wherein the judging module is further configured to sequentially select an MAE loss, an MSE loss, and a difference fourth power loss during training.
- 8. An electronic device, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to perform the training method of the speech noise reduction network of any of claims 1-6.
- 9. A computer readable storage medium, characterized in that the storage medium stores a computer program executable by a processor to perform the training method of the speech noise reduction network of any of claims 1-6.
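The staged loss selection recited in claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the switching points in `stage_boundaries`, and the use of flat lists of real-valued features are all hypothetical. The claims only state that an MAE loss, an MSE loss, and a difference fourth power loss are selected in that order.

```python
def mae_loss(pred, clean):
    """Mean absolute error between predicted and clean features."""
    return sum(abs(p - c) for p, c in zip(pred, clean)) / len(pred)


def mse_loss(pred, clean):
    """Mean squared error between predicted and clean features."""
    return sum((p - c) ** 2 for p, c in zip(pred, clean)) / len(pred)


def fourth_power_loss(pred, clean):
    """Mean of the fourth power of the difference."""
    return sum((p - c) ** 4 for p, c in zip(pred, clean)) / len(pred)


# Order taken from claim 1: MAE first, then MSE, then fourth power.
STAGED_LOSSES = (mae_loss, mse_loss, fourth_power_loss)


def select_loss(step, stage_boundaries=(1000, 2000)):
    """Return the loss function for a given training step.

    stage_boundaries is an assumption for illustration: steps below the
    first boundary use MAE, steps below the second use MSE, and all later
    steps use the fourth power loss.
    """
    stage = sum(step >= b for b in stage_boundaries)
    return STAGED_LOSSES[stage]


# Example: evaluate the same prediction under each stage's loss.
pred = [0.5, 1.0, 2.0]
clean = [0.0, 1.0, 0.0]
for step in (0, 1500, 5000):
    loss_fn = select_loss(step)
    print(step, loss_fn.__name__, loss_fn(pred, clean))
```

Raising the exponent from one stage to the next makes large deviations dominate the function value more strongly than small ones, which is one plausible reading of why the order matters; the claims themselves do not give the rationale.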
Description
Training method and device of voice noise reduction network, electronic equipment and storage medium
Technical Field
The present application relates to the field of audio processing technologies, and in particular to a training method and apparatus for a voice noise reduction network, an electronic device, and a computer readable storage medium.
Background
With the development of technology, voice noise reduction algorithms implemented with neural network models have been widely applied. However, when the neural network model is small, a voice noise reduction network obtained by direct end-to-end training cannot achieve both low voice distortion and a large noise reduction amount. In other words, if the network is to keep the voice distortion after noise reduction small, its noise reduction effect is poor; if it is to achieve a large noise reduction amount, the voice distortion is severe. This problem can generally be solved by increasing the model parameters, mainly by increasing the depth and width of the neural network model. Doing so enlarges the model and increases its computation and memory usage. On devices with limited hardware resources, the voice noise reduction process can then degrade the device's operating state. In view of this, a scheme is needed for training a voice noise reduction network without adding model parameters.
Disclosure of Invention
An object of an embodiment of the present application is to provide a training method and apparatus for a voice noise reduction network, an electronic device, and a computer readable storage medium, for training a voice noise reduction network that balances noise reduction and voice fidelity without increasing model parameters.
In one aspect, the present application provides a training method for a voice noise reduction network, including: performing short-time Fourier transform on sample voice data in a sample data set to obtain sample time-frequency domain characteristics, wherein the sample voice data comprises noise voice data and clean voice data corresponding to the noise voice data, and the sample time-frequency domain characteristics comprise noise time-frequency domain characteristics obtained by transforming the noise voice data and clean time-frequency domain characteristics obtained by transforming the clean voice data; calculating the noise time-frequency domain characteristics through a neural network model to obtain predicted time-frequency domain characteristics; evaluating the difference between the predicted time-frequency domain characteristics and the clean time-frequency domain characteristics through a loss function to obtain a function value, and adjusting model parameters of the neural network model based on the function value; judging whether the function value is smaller than a preset loss threshold, wherein the loss function is switched in stages during training; and if so, determining that the neural network model has converged, thereby obtaining the voice noise reduction network.
In an embodiment, the method further comprises: if the function value is not smaller than the loss threshold, returning to the step of calculating the noise time-frequency domain characteristics through the neural network model to obtain predicted time-frequency domain characteristics.
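The control flow described above (forward pass, loss evaluation, parameter adjustment, threshold check, and return to the forward pass when the threshold is not met) can be sketched with a toy stand-in model. Everything specific here is an assumption made for illustration: the "network" is a single gain `w`, the features are short lists of real numbers rather than STFT outputs, the update is plain gradient descent, and the switching points in `schedule` are hypothetical. The loss is written as the mean of `|pred - clean| ** p`, so `p = 1`, `2`, and `4` stand for the successive stages.

```python
def power_loss(pred, clean, p):
    """Mean of |pred - clean| ** p (p=1 is MAE, p=2 is MSE)."""
    return sum(abs(a - b) ** p for a, b in zip(pred, clean)) / len(pred)


def power_loss_grad(noisy, pred, clean, p):
    """Gradient of power_loss with respect to the toy gain w (pred = w * noisy)."""
    total = 0.0
    for x, a, b in zip(noisy, pred, clean):
        d = a - b
        sign = (d > 0) - (d < 0)
        total += p * abs(d) ** (p - 1) * sign * x
    return total / len(pred)


def train(noisy, clean, schedule, loss_threshold, lr=0.03, max_iters=1000):
    """Train the toy gain until the function value drops below the threshold.

    schedule is a list of (first_iteration, p) pairs sorted by iteration and
    starting at 0; these switching points are hypothetical.
    Returns (w, function_value, iterations_run, converged).
    """
    w = 0.0                      # single trainable parameter of the stand-in model
    value = float("inf")
    for it in range(max_iters):
        # Use the exponent of the latest stage whose start has been reached.
        p = [q for start, q in schedule if it >= start][-1]
        pred = [w * x for x in noisy]           # forward pass
        value = power_loss(pred, clean, p)      # function value
        if value < loss_threshold:              # convergence check
            return w, value, it, True
        # Not converged: adjust the parameter and return to the forward pass.
        w -= lr * power_loss_grad(noisy, pred, clean, p)
    return w, value, max_iters, False


# Toy data: the clean features are the noisy features scaled by 0.5.
noisy = [1.0, 2.0, 3.0, 4.0]
clean = [0.5 * x for x in noisy]
w, value, iters, converged = train(
    noisy, clean, schedule=[(0, 1), (10, 2), (13, 4)], loss_threshold=1e-4
)
print(w, value, iters, converged)
```

Because the three losses have different scales, a single preset threshold interacts with the points at which the loss is switched; the passage above does not state how either is chosen, so both are left as parameters in this sketch.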
In an embodiment, before performing short-time Fourier transform on the sample voice data in the sample data set to obtain the sample time-frequency domain characteristics, the method further includes: acquiring a plurality of pieces of clean voice data, and generating corresponding noise voice data for each piece of clean voice data, wherein the noise voice data comprises the clean voice data and noise data; taking each piece of clean voice data as a sample label of the corresponding noise voice data; and constructing the sample data set based on the plurality of pieces of noise voice data carrying sample labels.
In an embodiment, the neural network model comprises a first fully connected layer, a first feature processing module, a second feature processing module, a third feature processing module, and a second fully connected layer, wherein the first fully connected layer is connected with the first feature processing module; the first fully connected layer, the first feature processing module, and the second feature processing module are in residual connection; the first fully connected layer, the first feature processing module, the second feature processing module, and the third feature processing module are in residual connection; and the third feature processing module is connected with the second fully connected layer.
In an embodiment, the first feature processing module, the second feature processing module, and the third feature processing module are recurrent