CN-121999761-A - Model training method, voice processing method, electronic device and storage medium
Abstract
The application discloses a model training method, a speech processing method, an electronic device, and a storage medium, relating to the technical fields of large models and speech processing. The method comprises: obtaining a noise audio signal; performing speech enhancement processing on the noise audio signal with an initial speech processing model to obtain an enhanced audio signal, wherein the initial speech processing model performs dual-path compression encoding on the noise audio signal to obtain an encoding result and decodes the enhanced audio signal from the encoding result; and training model parameters of the initial speech processing model based on the enhanced audio signal to generate a target speech processing model. The method addresses the technical problems of large computation amount and high computation cost of speech enhancement models that arise from the dual-path modeling approach adopted in the related art.
Inventors
- WANG HAOXU
- TIAN BIAO
Assignees
- Alibaba (China) Co., Ltd. (阿里巴巴(中国)有限公司)
Dates
- Publication Date: 2026-05-08
- Application Date: 2024-11-01
Claims (20)
- 1. A model training method, comprising: acquiring a noise audio signal; performing speech enhancement processing on the noise audio signal with an initial speech processing model to obtain an enhanced audio signal, wherein the initial speech processing model performs dual-path compression encoding on the noise audio signal to obtain an encoding result and decodes the enhanced audio signal from the encoding result; and training model parameters of the initial speech processing model based on the enhanced audio signal to generate a target speech processing model, wherein the target speech processing model performs speech enhancement processing on a voice-interaction audio signal to be processed to generate a target audio signal.
- 2. The model training method of claim 1, wherein performing dual-path compression encoding on the noise audio signal to obtain the encoding result and decoding the enhanced audio signal from the encoding result comprises: downsampling the noise audio signal and performing dual-path compression encoding in the time-domain dimension and the frequency-domain dimension respectively to obtain a target hidden-layer feature corresponding to the noise audio signal; and decoding the enhanced audio signal from the target hidden-layer feature.
- 3. The model training method of claim 1, wherein the initial speech processing model comprises a transformation part, an encoding part, a decoding part, and a reconstruction part, and wherein performing speech enhancement processing on the noise audio signal with the initial speech processing model to obtain the enhanced audio signal comprises: performing time-frequency analysis on the noise audio signal with the transformation part to obtain a feature to be encoded; downsampling the feature to be encoded with the encoding part and performing dual-path compression encoding in the time-domain dimension and the frequency-domain dimension respectively to obtain a target hidden-layer feature; decoding the amplitude and the phase of the target hidden-layer feature in parallel with the decoding part to obtain a target amplitude spectrum and a target phase spectrum; and reconstructing the enhanced audio signal from the target amplitude spectrum and the target phase spectrum with the reconstruction part.
- 4. The model training method of claim 3, wherein performing time-frequency analysis on the noise audio signal with the transformation part to obtain the feature to be encoded comprises: performing a Fourier transform on the noise audio signal with the transformation part to obtain an initial amplitude spectrum and an initial phase spectrum; performing power-law compression on the initial amplitude spectrum according to a preset scale factor to obtain a compressed amplitude spectrum; and combining the compressed amplitude spectrum and the initial phase spectrum to generate the feature to be encoded.
- 5. The model training method of claim 3, wherein the encoding part comprises an encoder and a dual-path compression encoding block, and wherein downsampling the feature to be encoded with the encoding part and performing dual-path compression encoding in the time-domain dimension and the frequency-domain dimension respectively to obtain the target hidden-layer feature comprises: performing convolutional modeling on the feature to be encoded with the encoder to obtain an initial hidden-layer feature; and performing downsampled dual-path compression encoding on the initial hidden-layer feature in the time-domain dimension and the frequency-domain dimension with the dual-path compression encoding block to obtain the target hidden-layer feature.
- 6. The model training method of claim 5, wherein the dual-path compression encoding block comprises a downsampling module, a frequency-domain compression encoding module, a time-domain compression encoding module, and an upsampling module, and wherein performing downsampled dual-path compression encoding on the initial hidden-layer feature in the time-domain dimension and the frequency-domain dimension with the dual-path compression encoding block comprises: downsampling the initial hidden-layer feature with the downsampling module to obtain a downsampled feature; performing frequency-domain modeling on the downsampled feature in the frequency-domain dimension with the frequency-domain compression encoding module to obtain a first intermediate hidden-layer feature; performing time-domain modeling on the first intermediate hidden-layer feature in the time-domain dimension with the time-domain compression encoding module to obtain a second intermediate hidden-layer feature; and upsampling the second intermediate hidden-layer feature with the upsampling module to obtain the target hidden-layer feature.
- 7. The model training method of claim 6, wherein the downsampling module comprises at least one of a time-domain downsampling module and a frequency-domain downsampling module, and wherein downsampling the initial hidden-layer feature with the downsampling module to obtain the downsampled feature comprises: performing time-domain downsampling on the initial hidden-layer feature in the time-domain dimension with the time-domain downsampling module according to a preset sampling ratio, and/or performing frequency-domain downsampling on the initial hidden-layer feature in the frequency-domain dimension with the frequency-domain downsampling module, to obtain the downsampled feature, wherein the preset sampling ratio is greater than a preset threshold.
- 8. The model training method of claim 7, wherein a plurality of the dual-path compression encoding blocks employ different preset sampling ratios.
- 9. The model training method of claim 7, wherein the upsampling module comprises at least one of a time-domain upsampling module matched with the time-domain downsampling module and a frequency-domain upsampling module matched with the frequency-domain downsampling module, and wherein upsampling the second intermediate hidden-layer feature with the upsampling module to obtain the target hidden-layer feature comprises: performing time-domain upsampling on the second intermediate hidden-layer feature in the time-domain dimension with the time-domain upsampling module, and/or performing frequency-domain upsampling on the second intermediate hidden-layer feature in the frequency-domain dimension with the frequency-domain upsampling module, to obtain the target hidden-layer feature.
- 10. The model training method of claim 6, wherein the frequency-domain compression encoding module comprises a multi-head attention weight module, a nonlinear attention module, a plurality of self-attention modules, and a bias normalization module, and wherein performing frequency-domain modeling on the downsampled feature in the frequency-domain dimension with the frequency-domain compression encoding module to obtain the first intermediate hidden-layer feature comprises: calculating attention weights with the multi-head attention weight module; converting, with the nonlinear attention module, the three-dimensional hidden-layer feature corresponding to the downsampled feature into a linear-layer output feature by multiplexing the attention weights; performing attention modeling on the linear-layer output feature with the plurality of self-attention modules by multiplexing the attention weights to obtain a target feature representation, wherein the target feature representation captures global dependencies in the linear-layer output feature; and normalizing the target feature representation with the bias normalization module to obtain the first intermediate hidden-layer feature.
- 11. The model training method of claim 3, wherein the decoding part comprises an amplitude spectrum decoder and a phase spectrum decoder, and wherein decoding the amplitude and the phase of the target hidden-layer feature in parallel with the decoding part to obtain the target amplitude spectrum and the target phase spectrum comprises: performing amplitude decoding on the target hidden-layer feature with the amplitude spectrum decoder to obtain the target amplitude spectrum; and performing phase decoding on the target hidden-layer feature with the phase spectrum decoder to obtain the target phase spectrum.
- 12. The model training method of claim 3, wherein reconstructing the enhanced audio signal from the target amplitude spectrum and the target phase spectrum with the reconstruction part comprises: performing an inverse Fourier transform on the target amplitude spectrum and the target phase spectrum with the reconstruction part to obtain the enhanced audio signal.
- 13. The model training method of any one of claims 1 to 12, wherein training model parameters of the initial speech processing model based on the enhanced audio signal to generate the target speech processing model comprises: determining a plurality of loss terms based on the enhanced audio signal, wherein the plurality of loss terms comprises at least some of a discrimination loss, a transform consistency loss, an amplitude loss, a complex loss, a phase loss, and a time-sample loss between the enhanced audio signal and a real audio signal; linearly combining the plurality of loss terms to obtain a target loss; and training the model parameters of the initial speech processing model according to the target loss to generate the target speech processing model.
- 14. The model training method of claim 13, further comprising: in the process of training the model parameters of the initial speech processing model according to the target loss, controlling the update magnitude of the model parameters with a preset optimizer based on model parameter scale and a preset learning-rate scheduling strategy.
- 15. A speech processing method, comprising: acquiring a voice-interaction audio signal to be processed; and performing speech enhancement processing on the voice-interaction audio signal with a target speech processing model to generate a target audio signal, wherein the target speech processing model is generated by training model parameters of an initial speech processing model based on an enhanced audio signal, the enhanced audio signal is obtained by performing speech enhancement processing on a noise audio signal with the initial speech processing model, and the initial speech processing model performs dual-path compression encoding on the noise audio signal to obtain an encoding result and decodes the enhanced audio signal from the encoding result.
- 16. A speech processing method, comprising: acquiring an online conference audio signal to be processed, wherein the online conference audio signal comprises a speaking-object audio signal and a background noise audio signal; and performing speech enhancement processing on the online conference audio signal with a target speech processing model to separate the speaking-object audio signal from the online conference audio signal, wherein the target speech processing model is generated by training model parameters of an initial speech processing model based on an enhanced audio signal, the enhanced audio signal is obtained by performing speech enhancement processing on a noise audio signal with the initial speech processing model, and the initial speech processing model performs dual-path compression encoding on the noise audio signal to obtain an encoding result and decodes the enhanced audio signal from the encoding result.
- 17. A speech processing method, comprising: acquiring a speech processing request through a first application programming interface, wherein request data carried in the speech processing request comprises a voice-interaction audio signal; and returning a speech processing response through a second application programming interface, wherein response data carried in the speech processing response comprises a target audio signal, the target audio signal is generated by performing speech enhancement processing on the voice-interaction audio signal with a target speech processing model, the target speech processing model is generated by training model parameters of an initial speech processing model based on an enhanced audio signal, the enhanced audio signal is obtained by performing speech enhancement processing on a noise audio signal with the initial speech processing model, and the initial speech processing model performs dual-path compression encoding on the noise audio signal to obtain an encoding result and decodes the enhanced audio signal from the encoding result.
- 18. A speech processing method, comprising: acquiring a currently input speech processing dialogue request, wherein request data carried in the speech processing dialogue request comprises a voice-interaction audio signal; returning a speech processing dialogue reply in response to the speech processing dialogue request, wherein information carried in the speech processing dialogue reply comprises a target audio signal, the target audio signal is generated by performing speech enhancement processing on the voice-interaction audio signal with a target speech processing model, the target speech processing model is generated by training model parameters of an initial speech processing model based on an enhanced audio signal, the enhanced audio signal is obtained by performing speech enhancement processing on a noise audio signal with the initial speech processing model, and the initial speech processing model performs dual-path compression encoding on the noise audio signal to obtain an encoding result and decodes the enhanced audio signal from the encoding result; and playing the target audio signal within a graphical user interface.
- 19. An electronic device, comprising: a memory storing an executable program; and a processor configured to execute the program, wherein the program, when executed, performs the model training method of any one of claims 1 to 14 or the speech processing method of any one of claims 15 to 18.
- 20. A computer-readable storage medium comprising a stored executable program, wherein the executable program, when run, controls a device in which the computer-readable storage medium is located to perform the model training method of any one of claims 1 to 14 or the speech processing method of any one of claims 15 to 18.
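Two of the simpler steps recited in the claims lend themselves to a short sketch: the power-law compression of the initial magnitude spectrum (claim 4) and the linear combination of loss terms into a target loss (claim 13). The following is a minimal numpy illustration; the exponent value, loss values, and weights are assumptions for demonstration, not values from the patent.

```python
import numpy as np

def power_law_compress(magnitude, scale_factor=0.3):
    """Power-law compression of a magnitude spectrum (claim 4).

    The exponent is an illustrative assumption; the claim only
    recites a 'preset scale factor'."""
    return np.power(magnitude, scale_factor)

def target_loss(loss_terms, weights):
    """Linearly combine individual loss terms into the target loss (claim 13)."""
    if len(loss_terms) != len(weights):
        raise ValueError("one weight per loss term")
    return float(sum(w * l for w, l in zip(weights, loss_terms)))

# Toy magnitude spectrum and loss values (illustrative only).
mag = np.array([0.0, 1.0, 4.0, 9.0])
compressed = power_law_compress(mag, scale_factor=0.5)  # square-root compression
total = target_loss([0.2, 0.5, 0.1], [1.0, 0.5, 2.0])
```

Compressing the dynamic range of the magnitude spectrum before encoding keeps low-energy spectral detail from being dominated by high-energy components, which is the usual motivation for this kind of preprocessing in enhancement front-ends.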
Description
Model training method, voice processing method, electronic device and storage medium

Technical Field

The application relates to the technical fields of large models and speech processing, and in particular to a model training method, a speech processing method, an electronic device, and a storage medium.

Background

In recent years, with the development of deep learning, end-to-end neural-network speech enhancement has become a mainstream technology in the field of speech processing. Current speech enhancement models have a large computation amount and high computation cost, which limits their wide application in practical scenarios. No effective solution to this problem has yet been proposed.

Disclosure of Invention

The embodiments of the application provide a model training method, a speech processing method, an electronic device, and a storage medium, which at least solve the technical problems of large computation amount and high computation cost of speech enhancement models in the related art caused by the dual-path modeling approach. According to one aspect of the embodiments of the application, a model training method is provided, comprising: obtaining a noise audio signal; performing speech enhancement processing on the noise audio signal with an initial speech processing model to obtain an enhanced audio signal, wherein the initial speech processing model performs dual-path compression encoding on the noise audio signal to obtain an encoding result and decodes the enhanced audio signal from the encoding result; and training model parameters of the initial speech processing model based on the enhanced audio signal to generate a target speech processing model, wherein the target speech processing model performs speech enhancement processing on a voice-interaction audio signal to be processed to generate a target audio signal.
According to another aspect of the embodiments of the application, a speech processing method is provided, comprising: obtaining a voice-interaction audio signal to be processed; and performing speech enhancement processing on the voice-interaction audio signal with a target speech processing model to generate a target audio signal, wherein the target speech processing model is generated by training model parameters of an initial speech processing model based on an enhanced audio signal, the enhanced audio signal is obtained by performing speech enhancement processing on a noise audio signal with the initial speech processing model, and the initial speech processing model performs dual-path compression encoding on the noise audio signal to obtain an encoding result and decodes the enhanced audio signal from the encoding result. According to another aspect of the embodiments of the application, a speech processing method is provided, comprising: obtaining an online conference audio signal to be processed, wherein the online conference audio signal comprises a speaking-object audio signal and a background noise audio signal; and performing speech enhancement processing on the online conference audio signal with a target speech processing model to separate the speaking-object audio signal from the online conference audio signal, wherein the target speech processing model is generated by training model parameters of an initial speech processing model based on an enhanced audio signal, the enhanced audio signal is obtained by performing speech enhancement processing on a noise audio signal with the initial speech processing model, and the initial speech processing model performs dual-path compression encoding on the noise audio signal to obtain an encoding result and decodes the enhanced audio signal from the encoding result.
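The transformation and reconstruction parts described in the disclosure amount to a magnitude/phase analysis-synthesis pair around the enhancement network. A minimal single-frame sketch with numpy's FFT routines (the frame length and the sinusoidal test signal are illustrative; the patent operates on full time-frequency spectrograms):

```python
import numpy as np

def analyze(frame):
    """Transformation step: FFT of one audio frame into an initial
    magnitude spectrum and an initial phase spectrum."""
    spec = np.fft.rfft(frame)
    return np.abs(spec), np.angle(spec)

def reconstruct(magnitude, phase):
    """Reconstruction step: combine a (possibly enhanced) magnitude
    spectrum with a phase spectrum and apply the inverse FFT."""
    return np.fft.irfft(magnitude * np.exp(1j * phase))

frame = np.sin(2 * np.pi * np.arange(64) / 8.0)  # toy 64-sample frame
mag, phase = analyze(frame)
restored = reconstruct(mag, phase)  # round-trips to the input frame
```

In the patent's pipeline the decoders produce the target magnitude and phase spectra separately, and the reconstruction part performs exactly this kind of inverse transform to recover the enhanced waveform.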
According to another aspect of the embodiments of the application, a speech processing method is further provided, comprising: obtaining a speech processing request through a first application programming interface, wherein request data carried in the speech processing request comprises a voice-interaction audio signal; and returning a speech processing response through a second application programming interface, wherein response data carried in the speech processing response comprises a target audio signal, the target audio signal is generated by performing speech enhancement processing on the voice-interaction audio signal with a target speech processing model, the target speech processing model is generated by training model parameters of an initial speech processing model based on an enhanced audio signal, the enhanced audio signal is obtained by performing speech enhancement processing on a noise audio signal with the initial speech processing model, and the initial speech processing model performs dual-path compression encoding on the noise audio signal to obtain an encoding result and decodes the enhanced audio signal from the encoding result. According to another aspect of