
CN-119763548-B - Training method of speech processing model, speech processing method, device and apparatus

CN119763548B

Abstract

The disclosure provides a training method for a speech processing model, a speech processing method, an apparatus, and a device, and belongs to the field of computer technology. The method comprises: performing speech coding on a sample speech signal to obtain a semantic embedded representation and an acoustic embedded representation of the sample speech signal; performing phoneme extraction and phoneme encoding on a reference speech text of the sample speech signal to obtain a phoneme embedded representation of the reference speech text; and training a speech processing model based on the semantic embedded representation, the acoustic embedded representation, and the phoneme embedded representation, wherein the speech processing model is used for real-time speech synthesis of input speech. Because semantic and acoustic information are introduced during model training, the model can learn cleaner semantic information, more paralinguistic information can be retained when the speech is processed, and the naturalness of the synthesized speech is improved.

Inventors

  • Qiang Chunyu
  • Zhang Chen

Assignees

  • 北京达佳互联信息技术有限公司 (Beijing Dajia Internet Information Technology Co., Ltd.)

Dates

Publication Date
2026-05-12
Application Date
2024-12-20

Claims (12)

  1. A method of training a speech processing model, the method comprising: performing speech coding on a sample speech signal to obtain a semantic embedded representation and an acoustic embedded representation of the sample speech signal, wherein the sample speech signal contains the content, intonation, and timbre of a speaker, the semantic embedded representation is a digital representation of the semantic information contained in the sample speech signal, and the acoustic embedded representation comprises information on the frequency, amplitude, timbre, and prosody of the sound; performing phoneme extraction and phoneme encoding on a reference speech text of the sample speech signal to obtain a phoneme embedded representation of the reference speech text; determining a first loss based on the semantic embedded representation and the phoneme embedded representation, the first loss being a similarity loss between the semantic embedded representation and the phoneme embedded representation; determining a predicted mel-spectrogram based on the semantic embedded representation and the acoustic embedded representation; determining a second loss based on the predicted mel-spectrogram and the mel-spectrogram of the sample speech signal, the second loss being the mean square error between the two; and training a speech processing model based on the first loss and the second loss, the speech processing model being used for real-time speech synthesis of input speech.
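A minimal sketch of the two training losses named in claim 1, assuming frame-aligned embeddings of equal shape. The claim specifies a "similarity loss" without fixing the metric, so cosine distance is used here as one plausible choice; the weighted combination of the two losses is likewise an assumption, since the claim only says the model is trained "based on the first loss and the second loss".

```python
import numpy as np

def similarity_loss(semantic_emb, phoneme_emb):
    """First loss: similarity loss between the semantic and phoneme
    embedded representations. Cosine distance is an assumed choice;
    the patent does not specify the exact metric."""
    dot = np.sum(semantic_emb * phoneme_emb, axis=-1)
    norms = (np.linalg.norm(semantic_emb, axis=-1)
             * np.linalg.norm(phoneme_emb, axis=-1) + 1e-8)
    return float(np.mean(1.0 - dot / norms))

def mel_mse_loss(pred_mel, target_mel):
    """Second loss: mean square error between the predicted
    mel-spectrogram and that of the sample speech signal."""
    return float(np.mean((pred_mel - target_mel) ** 2))

def total_loss(semantic_emb, phoneme_emb, pred_mel, target_mel, alpha=1.0):
    # Weighted sum is an assumed combination rule.
    return (similarity_loss(semantic_emb, phoneme_emb)
            + alpha * mel_mse_loss(pred_mel, target_mel))
```

Identical embeddings give a near-zero first loss, and a matching spectrogram gives a zero second loss, so both terms push the semantic branch toward the text content while the decoder reconstructs the signal.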
  2. The method of claim 1, wherein performing speech coding on the sample speech signal to obtain the semantic embedded representation and the acoustic embedded representation of the sample speech signal comprises: extracting a mel-spectrogram of the sample speech signal; performing speech coding on the mel-spectrogram to obtain intermediate speech features; performing semantic embedding on the intermediate speech features to obtain the semantic embedded representation; and performing acoustic embedding on the intermediate speech features to obtain the acoustic embedded representation.
  3. The method of claim 1, wherein performing phoneme extraction and phoneme encoding on the reference speech text of the sample speech signal to obtain the phoneme embedded representation of the reference speech text comprises: performing phoneme extraction on the reference speech text of the sample speech signal to obtain a phoneme sequence corresponding to the reference speech text; and performing phoneme encoding on the phoneme sequence to obtain the phoneme embedded representation of the reference speech text.
  4. The method of training a speech processing model according to claim 3, the method further comprising: normalizing the length of the phoneme sequence to obtain a phoneme sequence of a target length.
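The length normalization of claim 4 can be sketched as a pad-or-truncate step, assuming the phoneme sequence is a list of integer phoneme IDs. The pad/truncate policy and the pad symbol are assumptions; the claim only requires that the output have the target length.

```python
def normalize_length(phoneme_ids, target_len, pad_id=0):
    """Normalize a phoneme sequence to a fixed target length:
    truncate sequences that are too long, pad short ones with pad_id.
    Both choices are assumed realizations of claim 4."""
    if len(phoneme_ids) >= target_len:
        return phoneme_ids[:target_len]
    return phoneme_ids + [pad_id] * (target_len - len(phoneme_ids))
```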
  5. The method of training a speech processing model according to claim 1, wherein determining the predicted mel-spectrogram based on the semantic embedded representation and the acoustic embedded representation comprises: summing the semantic embedded representation and the acoustic embedded representation to obtain an intermediate embedded representation; and decoding the intermediate embedded representation to obtain the predicted mel-spectrogram.
  6. The method of training a speech processing model according to any one of claims 1-5, further comprising: aligning the time dimension of the sample speech signal and the reference speech text so that the morphemes in the reference speech text are consistent in length with the corresponding speech segments in the sample speech signal.
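One common way to realize the time alignment of claim 6 is to expand each phoneme embedding to the number of frames its speech segment spans, so the text-side sequence matches the signal's time dimension. Duration-based frame repetition is an assumption here; the claim does not fix the alignment mechanism.

```python
import numpy as np

def align_to_frames(phoneme_emb, durations):
    """Repeat each phoneme embedding (one row per phoneme) for the
    number of frames its speech segment occupies, yielding a sequence
    aligned with the sample signal's time dimension."""
    return np.repeat(phoneme_emb, durations, axis=0)
```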
  7. A speech processing method, the method comprising: performing speech coding on a mel-spectrogram of a speech signal to be processed based on a speech processing model to obtain intermediate speech features, wherein the speech processing model is trained according to the method of any one of claims 1-6; performing semantic embedding on the intermediate speech features based on the speech processing model to obtain a semantic embedded representation of the speech signal; performing acoustic embedding on the intermediate speech features based on the speech processing model to obtain an acoustic embedded representation of the speech signal; processing the semantic embedded representation and the acoustic embedded representation based on the speech processing model to obtain a predicted mel-spectrogram; and decoding the predicted mel-spectrogram based on the speech processing model to obtain a speech processing result.
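The inference pipeline of claim 7, combined with the summation-and-decode step of claim 5, can be sketched with stand-in linear layers. All weight matrices and dimensions below are hypothetical placeholders; in the actual model they would come from the training method of claims 1-6.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: 80 mel bins, hidden size 64, embedding size 32.
D_MEL, D_HID, D_EMB = 80, 64, 32
W_enc = rng.standard_normal((D_MEL, D_HID)) * 0.1  # speech encoder (stand-in)
W_sem = rng.standard_normal((D_HID, D_EMB)) * 0.1  # semantic embedding head
W_acu = rng.standard_normal((D_HID, D_EMB)) * 0.1  # acoustic embedding head
W_dec = rng.standard_normal((D_EMB, D_MEL)) * 0.1  # mel decoder (stand-in)

def process(mel):
    """Claim 7 pipeline: encode the input mel-spectrogram into
    intermediate speech features, branch into semantic and acoustic
    embeddings, sum them (claim 5), and decode the sum back into a
    predicted mel-spectrogram."""
    hidden = np.tanh(mel @ W_enc)       # intermediate speech features
    semantic = hidden @ W_sem           # semantic embedded representation
    acoustic = hidden @ W_acu           # acoustic embedded representation
    intermediate = semantic + acoustic  # summation per claim 5
    return intermediate @ W_dec         # predicted mel-spectrogram

pred = process(rng.standard_normal((50, D_MEL)))  # 50 input frames
```

The output spectrogram keeps the input's frame count, which is what allows the scheme to run frame-by-frame for real-time synthesis.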
  8. A training apparatus for a speech processing model, the apparatus comprising: a speech encoding unit configured to perform speech coding on a sample speech signal to obtain a semantic embedded representation and an acoustic embedded representation of the sample speech signal, the sample speech signal containing the content, intonation, and timbre of a speaker, the semantic embedded representation being a digital representation of the semantic information contained in the sample speech signal, and the acoustic embedded representation comprising information on the frequency, amplitude, timbre, and prosody of the sound; a phoneme encoding unit configured to perform phoneme extraction and phoneme encoding on a reference speech text of the sample speech signal to obtain a phoneme embedded representation of the reference speech text; and a training unit configured to determine a first loss based on the semantic embedded representation and the phoneme embedded representation, the first loss being a similarity loss between the semantic embedded representation and the phoneme embedded representation, determine a predicted mel-spectrogram based on the semantic embedded representation and the acoustic embedded representation, determine a second loss based on the predicted mel-spectrogram and the mel-spectrogram of the sample speech signal, the second loss being the mean square error between the two, and train a speech processing model based on the first loss and the second loss, the speech processing model being used for real-time speech synthesis of input speech.
  9. A speech processing apparatus, the apparatus comprising: a speech encoding unit configured to perform speech coding on a mel-spectrogram of a speech signal to be processed based on a speech processing model to obtain intermediate speech features, the speech processing model being trained according to the method of any one of claims 1-6; a semantic embedding unit configured to perform semantic embedding on the intermediate speech features based on the speech processing model to obtain a semantic embedded representation of the speech signal; an acoustic embedding unit configured to perform acoustic embedding on the intermediate speech features based on the speech processing model to obtain an acoustic embedded representation of the speech signal; and a speech decoding unit configured to process the semantic embedded representation and the acoustic embedded representation based on the speech processing model to obtain a predicted mel-spectrogram, the speech decoding unit being further configured to decode the predicted mel-spectrogram based on the speech processing model to obtain a speech processing result.
  10. An electronic device, comprising: one or more processors; and a memory for storing program code executable by the processors; wherein the processors are configured to execute the program code to implement the training method of a speech processing model of any one of claims 1-6 or the speech processing method of claim 7.
  11. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the training method of a speech processing model according to any one of claims 1-6 or the speech processing method according to claim 7.
  12. A computer program product comprising a computer program which, when executed by a processor, implements the training method of a speech processing model according to any one of claims 1-6 or the speech processing method according to claim 7.

Description

Training Method of Speech Processing Model, Speech Processing Method, Device, and Apparatus

Technical Field

The present application relates to the field of computer technology, and in particular to a training method for a speech processing model, a speech processing method, a device, and an apparatus.

Background

A voice dialogue system is a technical system that realizes human-machine voice interaction, and the speech processing scheme is a core part of such a system. Conventional speech processing schemes typically concatenate automatic speech recognition, a large language model, and text-to-speech conversion, using text as the intermediate modality. However, because text is used as the intermediate modality, paralinguistic information in the speech, such as emotion and mood, cannot be retained, so the naturalness of the synthesized speech is low.

Disclosure of Invention

The disclosure provides a training method for a speech processing model, a speech processing method, a device, and an apparatus. Because semantic and acoustic information are introduced during model training, the model can learn cleaner semantic information, more paralinguistic information can be retained when the speech is processed, and the naturalness of the synthesized speech is improved.
According to an aspect of an embodiment of the present disclosure, there is provided a training method for a speech processing model, the method comprising: performing speech coding on a sample speech signal to obtain a semantic embedded representation and an acoustic embedded representation of the sample speech signal; performing phoneme extraction and phoneme encoding on a reference speech text of the sample speech signal to obtain a phoneme embedded representation of the reference speech text; and training a speech processing model based on the semantic embedded representation, the acoustic embedded representation, and the phoneme embedded representation, the speech processing model being used for real-time speech synthesis of input speech.

According to another aspect of the embodiments of the present disclosure, there is provided a speech processing method, the method comprising: performing speech coding on a mel-spectrogram of a speech signal to be processed based on a speech processing model to obtain intermediate speech features, the speech processing model being trained by the above training method; performing semantic embedding on the intermediate speech features based on the speech processing model to obtain a semantic embedded representation of the speech signal; performing acoustic embedding on the intermediate speech features based on the speech processing model to obtain an acoustic embedded representation of the speech signal; processing the semantic embedded representation and the acoustic embedded representation based on the speech processing model to obtain a predicted mel-spectrogram; and decoding the predicted mel-spectrogram based on the speech processing model to obtain a speech processing result.
According to another aspect of the embodiments of the present disclosure, there is provided a training apparatus for a speech processing model, the apparatus comprising: a speech encoding unit configured to perform speech coding on a sample speech signal to obtain a semantic embedded representation and an acoustic embedded representation of the sample speech signal; a phoneme encoding unit configured to perform phoneme extraction and phoneme encoding on a reference speech text of the sample speech signal to obtain a phoneme embedded representation of the reference speech text; and a training unit configured to train a speech processing model for real-time speech synthesis of input speech based on the semantic embedded representation, the acoustic embedded representation, and the phoneme embedded representation. In some embodiments, the speech encoding unit is configured to extract a mel-spectrogram of the sample speech signal, perform speech coding on the mel-spectrogram to obtain intermediate speech features, perform semantic embedding on the intermediate speech features to obtain the semantic embedded representation, and perform acoustic embedding on the intermediate speech features to obtain the acoustic embedded representation. In some embodiments, the phoneme encoding unit is configured to perform phoneme extraction on the reference speech text of the sample speech signal to obtain a phoneme sequence corresponding to the reference speech text, and perform phoneme encoding on the phoneme sequence to obtain the phoneme embedded representation of the reference speech text. In some embodiments, the phoneme encoding unit is further configured to normalize the length of the phoneme sequence to obtain a phoneme sequence of a target length. In some embodiments, the training unit is configured to determine a first loss based on the semantic embedded representation and the