CN-116343810-B - Voice noise reduction method and device, storage medium and equipment

Abstract

The application discloses a voice noise reduction method, device, storage medium and equipment. The method first obtains the target voice to be noise-reduced, then generates amplitude spectrum features of the target voice, and then inputs these features into a pre-constructed voice noise reduction model, which identifies the noise voice and the clean voice in the target voice. The voice noise reduction model is obtained by joint training on noisy voice data of different types and/or scenes with a first objective function, a second objective function and a third objective function. Because the model is built from noisy voice data of different types and/or scenes together with the three objective functions, feeding the amplitude spectrum features of the target voice into it allows the noise in the signal to be estimated in a more targeted way and then removed from the target voice to obtain clean voice, which effectively improves the noise reduction effect.

Inventors

ZHAO XIANG
LIANG MENG
FU ZHONGHUA

Assignees

西安讯飞超脑信息科技有限公司

Dates

Publication Date: 20260505
Application Date: 20221230

Claims (8)

1. A voice noise reduction method, comprising: acquiring target voice to be noise-reduced; generating amplitude spectrum features of the target voice; and inputting the amplitude spectrum features of the target voice into a pre-constructed voice noise reduction model to identify noise voice and clean voice in the target voice; wherein the voice noise reduction model comprises an encoding layer, a convolution layer, an attention layer and a decoding layer, and is obtained by jointly training on noisy voice data of different types and/or scenes with a first objective function, a second objective function and a third objective function; wherein the voice noise reduction model is constructed as follows: acquiring noise sample voices of different types and/or scenes; superposing the noise sample voice and clean sample voice to obtain sample voice, and numbering the sample voice according to the classification of the noise sample voice; generating amplitude spectrum features of the sample voice; and training an initial voice noise reduction model by using the amplitude spectrum features of the sample voice, the first objective function, the second objective function, the third objective function and the clean sample voice to generate the voice noise reduction model; wherein the first objective function is a mean square loss function that reduces the error between the clean voice output by the voice noise reduction model and the clean sample voice; the second objective function is a cosine similarity loss function that acts on the decoding layer and is used to reduce the similarity of noise sample voices of different types and/or scenes and to increase the similarity of noise sample voices of the same type and/or scene, so that each sample voice is optimized in a direction in which channel information cannot be distinguished; and the third objective function is an average absolute error loss function that acts on the attention layer and is used to assign the weighting coefficients generated by the attention layer to the output features of the correct convolution layer.
2. The method of claim 1, wherein before the superposing of the noise sample voice and the clean sample voice to obtain the sample voice, the method further comprises: performing format unification on the noise sample voice and the clean sample voice to obtain a preprocessed noise sample voice and a preprocessed clean sample voice; and wherein the superposing of the noise sample voice and the clean sample voice to obtain the sample voice comprises: superposing the preprocessed noise sample voice and the preprocessed clean sample voice to obtain the sample voice.
3. The method of claim 1, wherein the method further comprises: acquiring noise verification voices of different types and/or scenes; superposing the noise verification voice and clean verification voice to obtain verification voice, and numbering the verification voice according to the classification of the noise verification voice; generating amplitude spectrum features of the verification voice; inputting the amplitude spectrum features of the verification voice into the voice noise reduction model to obtain a noise voice prediction result and a clean voice prediction result for the verification voice; and when the clean voice prediction result for the verification voice is inconsistent with the clean verification voice, using the verification voice as sample voice again and updating the voice noise reduction model.
4. The method of any one of claims 1-3, wherein the signal-to-noise ratio of the clean sample voice is not lower than 25 dB.
5. The method of claim 1, wherein the inputting of the amplitude spectrum features of the target voice into a pre-constructed voice noise reduction model to identify the noise voice and the clean voice in the target voice comprises: inputting the amplitude spectrum features of the target voice into the encoding layer of the voice noise reduction model to obtain a high-dimensional voice characterization vector of the target voice; inputting the high-dimensional voice characterization vector of the target voice into the convolution layer of the voice noise reduction model to obtain feature vectors of the target voice under different types and/or scenes; inputting the high-dimensional voice characterization vector of the target voice and the feature vectors of the target voice under different types and/or scenes into the attention layer of the voice noise reduction model for weighting, so as to obtain a weighted voice characterization vector; and inputting the weighted voice characterization vector into the decoding layer of the voice noise reduction model for decoding to obtain the amplitude spectrum features of the noise voice in the target voice, and determining the amplitude spectrum features of the clean voice in the target voice from the amplitude spectrum features of the target voice and the amplitude spectrum features of the noise voice, so as to determine the noise voice and the clean voice in the target voice.
6. A voice noise reduction device, comprising: a first acquisition unit, configured to acquire target voice to be noise-reduced; a first generation unit, configured to generate amplitude spectrum features of the target voice; and a noise reduction unit, configured to input the amplitude spectrum features of the target voice into a pre-constructed voice noise reduction model to identify noise voice and clean voice in the target voice; wherein the voice noise reduction model comprises an encoding layer, a convolution layer, an attention layer and a decoding layer, and is obtained by jointly training on noisy voice data of different types and/or scenes with a first objective function, a second objective function and a third objective function; the device further comprising: a second acquisition unit, configured to acquire noise sample voices of different types and/or scenes and to acquire clean sample voices; a first superposition unit, configured to superpose the noise sample voice and the clean sample voice to obtain sample voice, and to number the sample voice according to the classification of the noise sample voice; a second generation unit, configured to generate amplitude spectrum features of the sample voice; and a training unit, configured to train an initial voice noise reduction model by using the amplitude spectrum features of the sample voice, the first objective function, the second objective function, the third objective function and the clean sample voice to generate the voice noise reduction model; wherein the first objective function is a mean square loss function that reduces the error between the clean voice output by the voice noise reduction model and the clean sample voice; the second objective function is a cosine similarity loss function that acts on the decoding layer and is used to reduce the similarity of noise sample voices of different types and/or scenes and to increase the similarity of noise sample voices of the same type and/or scene, so that each sample voice is optimized in a direction in which channel information cannot be distinguished; and the third objective function is an average absolute error loss function that acts on the attention layer and is used to assign the weighting coefficients generated by the attention layer to the output features of the correct convolution layer.
7. A voice noise reduction device, comprising a processor, a memory and a system bus; wherein the processor and the memory are connected through the system bus; and the memory is configured to store one or more programs, the one or more programs comprising instructions which, when executed by the processor, cause the processor to perform the method of any one of claims 1-5.
8. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein instructions, which when run on a terminal device, cause the terminal device to perform the method of any of claims 1-5.
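
To make the three objective functions recited in claims 1 and 6 more concrete, the following is a minimal NumPy sketch rather than the patented implementation. The claims name the loss types and the layers they act on but give no formulas, so several things here are illustrative assumptions: the pairwise form of the cosine similarity loss, the one-hot target for the attention weights (which assumes one convolution branch per numbered noise class), the magnitude subtraction used to form the clean estimate, the equal loss weights, and all function and parameter names.

```python
import numpy as np

def mse_loss(clean_est, clean_ref):
    """First objective: mean square error between the estimated and
    reference clean magnitude spectra."""
    return float(np.mean((clean_est - clean_ref) ** 2))

def cosine_class_loss(dec_out, noise_ids, eps=1e-8):
    """Second objective (assumed pairwise form): raise cosine similarity
    between decoder outputs of the same noise class and lower it between
    decoder outputs of different noise classes."""
    n = len(dec_out)
    if n < 2:
        return 0.0  # no pairs to compare
    flat = dec_out.reshape(n, -1)
    unit = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + eps)
    cos = unit @ unit.T
    same = noise_ids[:, None] == noise_ids[None, :]
    # Same class: penalise low similarity. Different class: penalise
    # positive similarity only.
    per_pair = np.where(same, 1.0 - cos, np.maximum(cos, 0.0))
    off_diag = ~np.eye(n, dtype=bool)  # ignore each sample paired with itself
    return float(per_pair[off_diag].mean())

def attention_mae_loss(attn_weights, noise_ids):
    """Third objective (assumed form): average absolute error between the
    attention weights and a one-hot vector marking the branch that matches
    the sample's noise class."""
    target = np.eye(attn_weights.shape[1])[noise_ids]
    return float(np.mean(np.abs(attn_weights - target)))

def total_loss(noisy_mag, noise_pred, clean_ref, dec_out, attn_weights,
               noise_ids, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three objectives (the weights are an assumption)."""
    # Assumption: the clean magnitude is the noisy magnitude minus the
    # predicted noise magnitude, floored at zero.
    clean_est = np.maximum(noisy_mag - noise_pred, 0.0)
    return (weights[0] * mse_loss(clean_est, clean_ref)
            + weights[1] * cosine_class_loss(dec_out, noise_ids)
            + weights[2] * attention_mae_loss(attn_weights, noise_ids))
```

In a training loop, `noise_ids` would be an integer array holding the class numbers assigned to the sample voices when they were constructed, and `attn_weights` would have one column per convolution branch.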

Description

Voice noise reduction method and device, storage medium and equipment

Technical Field

The present application relates to the field of speech processing technologies, and in particular to a voice noise reduction method, apparatus, storage medium and device.

Background

With continuing breakthroughs in artificial intelligence and the growing popularity of intelligent terminal devices, human-computer interaction occurs more and more frequently in daily work and life. Voice interaction, as a next-generation mode of human-computer interaction, can bring great convenience to people's lives. At the same time, noise environments are becoming more complex, and the demand for noise reduction of voice signals is becoming more urgent.

Traditional voice noise reduction generally falls into two approaches. The first is based on traditional statistical signal processing and focuses on estimating the spectral characteristics of the noise. It performs well on stationary noise such as white noise, and its algorithms are controllable and highly stable, but it performs poorly on non-stationary burst noise. The second is supervised regression training based on a neural network, which treats noise reduction as a regression problem. However the model structure changes, its final objective rests on signal-level information, namely minimizing the second-order variance between the output estimated signal and the target signal. Therefore, as models become more complex, the noise reduction effect of such methods becomes increasingly difficult to optimize and improve.

Disclosure of Invention

The main purpose of the embodiments of the application is to provide a voice noise reduction method, device, storage medium and equipment that can effectively improve the noise reduction effect when voice noise reduction is performed.

An embodiment of the application provides a voice noise reduction method, which comprises the following steps: acquiring target voice to be noise-reduced; generating amplitude spectrum features of the target voice; and inputting the amplitude spectrum features of the target voice into a pre-constructed voice noise reduction model to identify noise voice and clean voice in the target voice. The voice noise reduction model comprises an encoding layer, a convolution layer, an attention layer and a decoding layer, and is obtained by jointly training on noisy voice data of different types and/or scenes with a first objective function, a second objective function and a third objective function.

In a possible implementation, the voice noise reduction model is constructed as follows: acquiring noise sample voices of different types and/or scenes; superposing the noise sample voice and clean sample voice to obtain sample voice, and numbering the sample voice according to the classification of the noise sample voice; generating amplitude spectrum features of the sample voice; and training an initial voice noise reduction model by using the amplitude spectrum features of the sample voice, the first objective function, the second objective function, the third objective function and the clean sample voice to generate the voice noise reduction model.
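
The data-preparation part of the construction steps above (superposing, numbering, and generating amplitude spectrum features) can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the applicant's implementation: the text does not specify a mixing signal-to-noise ratio, frame length, hop size or window, so `snr_db`, `frame_len`, `hop`, the Hann window, and the names `mix_at_snr`, `magnitude_spectrum` and `build_samples` are all hypothetical choices.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Superpose noise onto clean speech at a chosen SNR in dB.
    Controlling the SNR is an assumption; the text only says 'superposing'."""
    noise = np.resize(noise, clean.shape)  # repeat or trim noise to match length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # guard against silent noise clips
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

def magnitude_spectrum(wave, frame_len=512, hop=256):
    """Amplitude (magnitude) spectrum from a Hann-windowed short-time FFT.
    Expects len(wave) >= frame_len; returns shape (n_frames, frame_len // 2 + 1)."""
    n_frames = 1 + (len(wave) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([wave[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def build_samples(clean_waves, noise_bank, snr_db=5.0):
    """Build numbered training samples.
    noise_bank maps a noise class name (type and/or scene) to a list of
    noise waveforms; each class gets an integer number in sorted-name order."""
    samples = []
    for class_id, (_, noises) in enumerate(sorted(noise_bank.items())):
        for noise in noises:
            for clean in clean_waves:
                noisy = mix_at_snr(clean, noise, snr_db)
                samples.append({
                    "noise_id": class_id,                      # sample number by noise class
                    "noisy_mag": magnitude_spectrum(noisy),    # model input
                    "clean_mag": magnitude_spectrum(clean),    # training target
                })
    return samples
```

Each returned entry pairs the noisy amplitude spectrum (the model input) with the clean amplitude spectrum (the training target) and carries the noise class number that the training objectives can use.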
In a possible implementation, the first objective function is a mean square loss function, which reduces the error between the clean voice output by the voice noise reduction model and the clean sample voice. The second objective function is a cosine similarity loss function, which acts on the decoding layer and is used to reduce the similarity of noise sample voices of different types and/or scenes and to increase the similarity of noise sample voices of the same type and/or scene, so that each sample voice is optimized in a direction in which channel information cannot be distinguished. The third objective function is an average absolute error loss function, which acts on the attention layer and is used to assign the weighting coefficients generated by the attention layer to the output features of the correct convolution layer.

In a possible implementation, before the noise sample voice and the clean sample voice are superposed to obtain the sample voice, the method further includes: performing format unification on the noise sample voice and the clean sample voice to obtain a preprocessed noise sample voice and a preprocessed clean sample voice. The superposing of the noise sample voice and the clean sample voice to obtain the sample voice then includes: superposing the preprocessed noise sample voice and the clean sample voice to obta