CN-121999791-A - Speech processing method, device, apparatus, storage medium and program product


Abstract

The application provides a speech processing method, apparatus, device, storage medium, and program product, and relates to the technical field of artificial intelligence. The method comprises: inputting acoustic features of a speech signal into an input module of a gain prediction model to obtain a first intermediate feature; inputting the first intermediate feature into a first recurrent network module of the gain prediction model to obtain a second intermediate feature; inputting the second intermediate feature into a second recurrent network module of the gain prediction model to obtain a third intermediate feature, wherein the output value range of the activation function of the first recurrent network module is larger than both the output value range of the activation function of the input module and the output value range of the activation function of the second recurrent network module; and inputting the third intermediate feature into an output module of the gain prediction model to obtain a denoising gain.
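The data flow described in the abstract can be sketched as a single-frame forward pass. This is a minimal illustration under stated assumptions, not the applicant's implementation: the layer sizes (24, 48, 48), the 22 output bands, the sigmoid on the output layer, the random weights, and all class and function names are hypothetical choices made for the example. The tanh and ReLU activations, the GRU modules, the 44-dimensional input, and the concatenation of the first and second intermediate features follow the dependent claims.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a trained model would load learned values."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def relu(v):
    return max(0.0, v)


class Dense:
    """Fully connected layer with a configurable activation."""

    def __init__(self, n_in, n_out, act, rng):
        self.W = rand_matrix(n_out, n_in, rng)
        self.b = [0.0] * n_out
        self.act = act

    def __call__(self, x):
        return [self.act(z + b) for z, b in zip(matvec(self.W, x), self.b)]


class GRUCell:
    """GRU cell whose candidate-state activation is configurable."""

    def __init__(self, n_in, n_hidden, act, rng):
        self.act = act
        # One input/recurrent weight pair per gate: update z, reset r, candidate c.
        self.Wx = {g: rand_matrix(n_hidden, n_in, rng) for g in "zrc"}
        self.Wh = {g: rand_matrix(n_hidden, n_hidden, rng) for g in "zrc"}

    def _gate(self, name, x, h, act):
        pre = zip(matvec(self.Wx[name], x), matvec(self.Wh[name], h))
        return [act(a + b) for a, b in pre]

    def __call__(self, x, h):
        z = self._gate("z", x, h, sigmoid)
        r = self._gate("r", x, h, sigmoid)
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        c = self._gate("c", x, reset_h, self.act)
        return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, c)]


class GainPredictor:
    """Input module -> first GRU (ReLU) -> second GRU (tanh) -> output module."""

    def __init__(self, n_feat=44, n_dense=24, n_gru1=48, n_gru2=48, n_bands=22, seed=0):
        rng = random.Random(seed)
        self.inp = Dense(n_feat, n_dense, math.tanh, rng)   # bounded to (-1, 1)
        self.gru1 = GRUCell(n_dense, n_gru1, relu, rng)     # unbounded above
        self.gru2 = GRUCell(n_dense + n_gru1, n_gru2, math.tanh, rng)
        self.out = Dense(n_gru2, n_bands, sigmoid, rng)     # assumed gain range (0, 1)
        self.h1 = [0.0] * n_gru1
        self.h2 = [0.0] * n_gru2

    def step(self, features):
        f1 = self.inp(features)                      # first intermediate feature
        self.h1 = self.gru1(f1, self.h1)             # second intermediate feature
        self.h2 = self.gru2(f1 + self.h1, self.h2)   # concatenated input -> third
        return self.out(self.h2)                     # one gain per band


model = GainPredictor()
gains = model.step([0.1] * 44)  # one frame of placeholder acoustic features
print(len(gains))
```

Calling `step` once per frame keeps the two hidden states between frames, which is how a recurrent model carries context across a speech signal.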

Inventors

BAO FENG

Assignees

Tencent Technology (Shenzhen) Co., Ltd. (腾讯科技（深圳）有限公司)

Dates

Publication Date: 20260508
Application Date: 20241107

Claims (17)

1. A speech processing method, performed by a computer device, the method comprising: inputting acoustic features of a speech signal into an input module of a gain prediction model to obtain a first intermediate feature output by the input module; inputting the first intermediate feature into a first recurrent network module of the gain prediction model to obtain a second intermediate feature output by the first recurrent network module; inputting the second intermediate feature into a second recurrent network module of the gain prediction model to obtain a third intermediate feature output by the second recurrent network module, wherein the output value range of the activation function of the first recurrent network module is larger than the output value range of the activation function of the input module, and the output value range of the activation function of the first recurrent network module is larger than the output value range of the activation function of the second recurrent network module; inputting the third intermediate feature into an output module of the gain prediction model, and outputting a denoising gain through the output module; and denoising the speech signal using the denoising gain to obtain a denoised speech signal.
2. The method of claim 1, wherein inputting the second intermediate feature into the second recurrent network module of the gain prediction model to obtain the third intermediate feature output by the second recurrent network module comprises: inputting the first intermediate feature and the second intermediate feature into the second recurrent network module to obtain the third intermediate feature output by the second recurrent network module.
3. The method of claim 2, wherein inputting the first intermediate feature and the second intermediate feature into the second recurrent network module to obtain the third intermediate feature output by the second recurrent network module comprises: concatenating the first intermediate feature and the second intermediate feature end to end, and inputting the concatenated feature into the second recurrent network module to obtain the third intermediate feature output by the second recurrent network module.
4. The method of any one of claims 1 to 3, wherein the activation function of the input module and the activation function of the second recurrent network module are each the hyperbolic tangent function (tanh); and the activation function of the first recurrent network module is the rectified linear unit (ReLU).
5. The method of any one of claims 1 to 4, wherein the input module comprises a first fully connected layer, and the number of nodes of the first fully connected layer is less than the dimension of the acoustic features.
6. The method of claim 5, wherein the dimension of the acoustic features is 44 and the number of nodes of the first fully connected layer is in the range [20, 43].
7. The method of any one of claims 1 to 6, wherein the first recurrent network module and the second recurrent network module each comprise a gated recurrent unit (GRU).
8. The method of claim 7, wherein the number of output nodes of the GRU in the first recurrent network module is in the range [10, 100]; and the number of output nodes of the GRU in the second recurrent network module is in the range [10, 100].
9. The method of any one of claims 1 to 8, wherein the output module comprises a second fully connected layer, the number of nodes of the second fully connected layer is equal to the number of critical bands into which the speech signal is divided in the frequency domain, and denoising the speech signal using the denoising gain to obtain the denoised speech signal comprises: multiplying the value of each dimension of the denoising gain by the signal of the corresponding critical band of the speech signal, in one-to-one correspondence, to obtain the denoised speech signal.
10. The method of any one of claims 1 to 9, further comprising: inputting an acoustic feature sample into the input module to obtain a first intermediate feature sample output by the input module; inputting the first intermediate feature sample into the first recurrent network module to obtain a second intermediate feature sample output by the first recurrent network module; inputting the first intermediate feature sample and the second intermediate feature sample into the second recurrent network module to obtain a third intermediate feature sample output by the second recurrent network module; inputting the third intermediate feature sample into the output module to obtain a predicted denoising gain output by the output module; and updating model parameters of the gain prediction model according to the difference between the predicted denoising gain and a denoising gain sample.
11. A speech processing method, performed by a computer device, the method comprising: inputting an acoustic feature sample into an input module of a gain prediction model to obtain a first intermediate feature sample output by the input module; inputting the first intermediate feature sample into a first recurrent network module of the gain prediction model to obtain a second intermediate feature sample output by the first recurrent network module; inputting the first intermediate feature sample and the second intermediate feature sample into a second recurrent network module of the gain prediction model to obtain a third intermediate feature sample output by the second recurrent network module, wherein the output value range of the activation function of the first recurrent network module is larger than the output value range of the activation function of the input module, and the output value range of the activation function of the first recurrent network module is larger than the output value range of the activation function of the second recurrent network module; inputting the third intermediate feature sample into an output module of the gain prediction model to obtain a predicted denoising gain output by the output module; and updating model parameters of the gain prediction model according to the difference between the predicted denoising gain and a denoising gain sample; wherein the gain prediction model whose model parameters have been updated to convergence is used for processing acoustic features of a speech signal and outputting a denoising gain, and the denoising gain is used for denoising the speech signal.
12. A speech processing apparatus, the apparatus comprising: an input unit, configured to input acoustic features of a speech signal into an input module of a gain prediction model to obtain a first intermediate feature output by the input module; a first recurrent network processing unit, configured to input the first intermediate feature into a first recurrent network module of the gain prediction model to obtain a second intermediate feature output by the first recurrent network module; a second recurrent network processing unit, configured to input the second intermediate feature into a second recurrent network module of the gain prediction model to obtain a third intermediate feature output by the second recurrent network module, wherein the output value range of the activation function of the first recurrent network module is larger than the output value range of the activation function of the input module, and the output value range of the activation function of the first recurrent network module is larger than the output value range of the activation function of the second recurrent network module; a gain output unit, configured to input the third intermediate feature into an output module of the gain prediction model and output a denoising gain through the output module; and a denoising unit, configured to denoise the speech signal using the denoising gain to obtain a denoised speech signal.
13. A speech processing apparatus, the apparatus comprising: an input unit, configured to input an acoustic feature sample into an input module of a gain prediction model to obtain a first intermediate feature sample output by the input module; a first recurrent network processing unit, configured to input the first intermediate feature sample into a first recurrent network module of the gain prediction model to obtain a second intermediate feature sample output by the first recurrent network module; a second recurrent network processing unit, configured to input the first intermediate feature sample and the second intermediate feature sample into a second recurrent network module of the gain prediction model to obtain a third intermediate feature sample output by the second recurrent network module, wherein the output value range of the activation function of the first recurrent network module is larger than the output value range of the activation function of the input module, and the output value range of the activation function of the first recurrent network module is larger than the output value range of the activation function of the second recurrent network module; an output unit, configured to input the third intermediate feature sample into an output module of the gain prediction model to obtain a predicted denoising gain output by the output module; and a parameter updating unit, configured to update model parameters of the gain prediction model according to the difference between the predicted denoising gain and a denoising gain sample; wherein the gain prediction model whose model parameters have been updated to convergence is used for processing acoustic features of a speech signal and outputting a denoising gain, and the denoising gain is used for denoising the speech signal.
14. A computer device comprising a processor and a memory having instructions stored therein, the instructions being executable by the processor to implement the speech processing method of any of claims 1 to 11.
15. A computer readable storage medium having stored therein instructions for execution by a processor of a computer device to implement the speech processing method of any one of claims 1 to 11.
16. A computer program product comprising computer instructions stored on a computer readable storage medium, the computer instructions being read and executed by a processor of a computer device to implement the speech processing method of any one of claims 1 to 11.
17. A speech processing method, performed by a computer device, the method comprising: inputting acoustic features of a speech signal into a gain prediction model to obtain a denoising gain output by the gain prediction model; and denoising the speech signal using the denoising gain to obtain a denoised speech signal; wherein the gain prediction model comprises an input module, a first recurrent network module, a second recurrent network module, and an output module connected in sequence, the output value range of the activation function of the first recurrent network module is larger than the output value range of the activation function of the input module, and the output value range of the activation function of the first recurrent network module is larger than the output value range of the activation function of the second recurrent network module.
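The band-wise multiplication recited in claim 9 can be illustrated with a short sketch. It assumes the speech frame is already available as a list of frequency-domain bins and that each critical band covers a contiguous run of bins; the function name `apply_band_gains`, the `band_edges` layout, and the example values are hypothetical and do not come from the application.

```python
def apply_band_gains(spectrum, band_edges, gains):
    """Scale every frequency bin by the gain of the band that contains it.

    spectrum   : list of (complex or real) frequency-domain bins for one frame
    band_edges : assumed layout, band b covers bins band_edges[b] .. band_edges[b + 1] - 1
    gains      : one denoising gain per band, in the same order as the bands
    """
    if len(gains) != len(band_edges) - 1:
        raise ValueError("expected exactly one gain per band")
    denoised = list(spectrum)
    for band, gain in enumerate(gains):
        for k in range(band_edges[band], band_edges[band + 1]):
            denoised[k] = spectrum[k] * gain
    return denoised


# Toy frame: 6 bins split into 3 bands of widths 2, 1 and 3.
frame = [1 + 1j, 2, 3, 4, 5, 6]
print(apply_band_gains(frame, [0, 2, 3, 6], [0.5, 1.0, 0.0]))
```

A gain near 1 leaves a band untouched and a gain near 0 suppresses it, so the output layer needs exactly as many nodes as there are bands.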

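Claims 10 and 11 update the model parameters according to the difference between the predicted denoising gain and a denoising gain sample. The sketch below illustrates that idea on the output layer only. It assumes a mean squared error loss, plain gradient descent, and a sigmoid output layer, none of which is specified in this section; full training would also propagate the error back through both recurrent modules and the input module. All names and values are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def predict(W, b, feat):
    """Output layer: one predicted gain per band from an intermediate feature."""
    return [sigmoid(sum(w * f for w, f in zip(row, feat)) + bi)
            for row, bi in zip(W, b)]


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def sgd_step(W, b, feat, target, lr=0.5):
    """Apply one gradient-descent update in place; return the loss before the update."""
    pred = predict(W, b, feat)
    n = len(pred)
    for j, (p, t) in enumerate(zip(pred, target)):
        # d(MSE)/d(pre-activation j) = 2 * (p - t) / n * sigmoid'(.) with sigmoid' = p * (1 - p)
        delta = 2.0 * (p - t) / n * p * (1.0 - p)
        for i, f in enumerate(feat):
            W[j][i] -= lr * delta * f
        b[j] -= lr * delta
    return mse(pred, target)


# Toy example: 3-dimensional intermediate feature sample, 2 bands.
W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
b = [0.0, 0.0]
feature_sample = [0.5, -0.2, 0.1]
gain_sample = [0.9, 0.1]

losses = [sgd_step(W, b, feature_sample, gain_sample) for _ in range(50)]
print(losses[0] > losses[-1])
```

Repeating the step until the loss stops decreasing corresponds to the "updated to convergence" condition in claim 11.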
Description

Speech processing method, device, apparatus, storage medium and program product

Technical Field

The embodiments of the application relate to the technical field of artificial intelligence (AI), and in particular to a speech processing method, apparatus, device, storage medium, and program product.

Background

With the continuous development of artificial intelligence technology, machine learning models are being applied more and more widely.

In the related art, in a speech noise reduction scenario, a computer device may input a speech signal into a pre-trained Recurrent Neural Network for Audio Noise Reduction (RNNoise) model, which processes the speech signal to obtain a denoised speech signal.

However, the RNNoise algorithm has high computational complexity, which reduces the efficiency of denoising the speech signal.

Disclosure of Invention

The embodiments of the application provide a speech processing method, apparatus, device, storage medium, and program product, which can improve the efficiency of denoising a speech signal. The technical solution is as follows:

In one aspect, there is provided a speech processing method, performed by a computer device, the method comprising:
inputting acoustic features of a speech signal into an input module of a gain prediction model to obtain a first intermediate feature output by the input module;
inputting the first intermediate feature into a first recurrent network module of the gain prediction model to obtain a second intermediate feature output by the first recurrent network module;
inputting the second intermediate feature into a second recurrent network module of the gain prediction model to obtain a third intermediate feature output by the second recurrent network module, wherein the output value range of the activation function of the first recurrent network module is larger than the output value range of the activation function of the input module, and the output value range of the activation function of the first recurrent network module is larger than the output value range of the activation function of the second recurrent network module;
inputting the third intermediate feature into an output module of the gain prediction model, and outputting a denoising gain through the output module; and
denoising the speech signal using the denoising gain to obtain a denoised speech signal.

In another aspect, there is provided a speech processing apparatus, the apparatus comprising:
an input unit, configured to input acoustic features of a speech signal into an input module of a gain prediction model to obtain a first intermediate feature output by the input module;
a first recurrent network processing unit, configured to input the first intermediate feature into a first recurrent network module of the gain prediction model to obtain a second intermediate feature output by the first recurrent network module;
a second recurrent network processing unit, configured to input the second intermediate feature into a second recurrent network module of the gain prediction model to obtain a third intermediate feature output by the second recurrent network module, wherein the output value range of the activation function of the first recurrent network module is larger than the output value range of the activation function of the input module, and the output value range of the activation function of the first recurrent network module is larger than the output value range of the activation function of the second recurrent network module;
a gain output unit, configured to input the third intermediate feature into an output module of the gain prediction model and output a denoising gain through the output module; and
a denoising unit, configured to denoise the speech signal using the denoising gain to obtain a denoised speech signal.

In a possible implementation, the second recurrent network processing unit is configured to input the first intermediate feature and the second intermediate feature into the second recurrent network module to obtain the third intermediate feature output by the second recurrent network module.

In a possible implementation, the second recurrent network processing unit is configured to concatenate the first intermediate feature and the second intermediate feature end to end and input the concatenated feature into the second recurrent network module to obtain the third intermediate feature output by the second recurrent network module.

In a possible implementation, the activation function of the input module and the activation function of the second recurrent network module are each the hyperbolic tangent function (tanh); the activation function of the first recurrent netwo