CN-114708849-B - Speech processing method, device, computer equipment and computer readable storage medium


Abstract

The embodiments of the application disclose a speech processing method, an apparatus, computer equipment and a computer-readable storage medium. A speech synthesis model and a non-parallel voice conversion model can be constructed in advance. A target text is synthesized into intermediate speech with a specified timbre by the speech synthesis model; after the user speech of a target user is acquired, the specified timbre of the intermediate speech is converted directly into the timbre of the user speech by the parallel voice conversion model, yielding the target synthesized speech. Voice cloning can therefore be performed quickly and with simple operation on the user's part, which effectively improves the efficiency of voice cloning, and multiple users can share one speech synthesis model and one non-parallel voice conversion model for their user speech.

Inventors

ZHANG YANG
ZHAN HAOYUE
LIN YUE

Assignees

网易（杭州）网络有限公司

Dates

Publication Date: 20260512
Application Date: 20220427

Claims (12)

1. A speech processing method, comprising: obtaining language content features and prosody features from user speech of a target user through a non-parallel voice conversion model, wherein the prosody features are a vector into which a prosody feature extraction module converts a prosody feature representation of the user speech, the non-parallel voice conversion model is used to generate, from the user speech and specified timbre information, specified converted speech of a specified timbre for training a voice conversion model, and the specified converted speech of the specified timbre is speech that has the specified timbre and whose semantic content and prosody are consistent with those of the user speech; performing voice conversion processing based on the language content features, the prosody features and the specified timbre information through the non-parallel voice conversion model to obtain the specified converted speech of the specified timbre, wherein the specified timbre information is timbre information determined from a plurality of pieces of preset timbre information; training a voice conversion model according to the user speech and the specified converted speech to obtain a target voice conversion model; inputting a target text of the speech to be synthesized and the specified timbre information into a speech synthesis model to generate intermediate speech of the specified timbre; and performing voice conversion processing on the intermediate speech through the target voice conversion model to generate target synthesized speech matching the timbre of the target user.
2. The method according to claim 1, further comprising, before inputting the target text of the speech to be synthesized and the specified timbre information into the speech synthesis model to generate the intermediate speech of the specified timbre: acquiring a sample speech, the text of the sample speech and sample timbre information; adjusting model parameters of a preset speech model based on the sample speech, the text of the sample speech and the sample timbre information to obtain an adjusted preset speech model; and continuing to acquire the next sample speech, the text of the next sample speech and the sample timbre information from a training sample speech set, and repeating the step of adjusting the model parameters of the preset speech model based on the sample speech, the text of the sample speech and the sample timbre information until the training state of the adjusted speech model meets a model training end condition, the trained preset speech model being taken as the speech synthesis model.
3. The method according to claim 1, wherein training the voice conversion model according to the user speech and the specified converted speech to obtain a target voice conversion model comprises: adjusting model parameters of a parallel voice conversion model based on the user speech and the specified converted speech until a model training end condition of the parallel voice conversion model is met, the trained parallel voice conversion model being taken as the target voice conversion model.
4. The speech processing method according to claim 1, wherein before performing voice conversion processing based on the language content features, the prosody features and the specified timbre information to obtain the specified converted speech of the specified timbre, the method further comprises: acquiring a training speech pair and preset timbre information corresponding to the training speech, wherein the training speech pair comprises an original speech and an output speech, the original speech and the output speech are the same speech, and all speech in the training speech pair is speech from a training sample speech set; and adjusting model parameters of the non-parallel voice conversion model based on the original speech, the output speech and the preset timbre information until a model training end condition of the non-parallel voice conversion model is met, the trained non-parallel voice conversion model being taken as a target non-parallel voice conversion model.
5. The method according to claim 4, wherein adjusting the model parameters of the non-parallel voice conversion model based on the original speech, the preset timbre information and the output speech comprises: performing language content extraction on the original speech through a language feature processor of the non-parallel voice conversion model to obtain language content features of the original speech; performing prosody extraction on the original speech through a prosody feature processor of the non-parallel voice conversion model to obtain prosody features of the original speech; and adjusting the model parameters of the non-parallel voice conversion model based on the language content features of the original speech, the prosody features of the original speech, the preset timbre information and the output speech.
6. The speech processing method according to claim 4, wherein performing language content extraction on the original speech through the language feature processor of the non-parallel voice conversion model to obtain the language content features of the original speech comprises: performing language information screening on the original speech, determining the language information corresponding to the original speech, generating a vector of a first specified length based on the language information, and taking the vector of the first specified length as the language content features.
7. The method according to claim 4, wherein performing prosody extraction on the original speech through the prosody feature processor of the non-parallel voice conversion model to obtain the prosody features of the original speech comprises: performing prosody information screening on the original speech, determining the prosody information corresponding to the original speech, generating a vector of a second specified length based on the prosody information, and taking the vector of the second specified length as the prosody features.
8. The speech processing method according to claim 4, wherein obtaining the language content features and the prosody features from the user speech of the target user comprises: performing language content extraction on the user speech through a language feature processor of the target non-parallel voice conversion model to obtain the language content features of the user speech; and performing prosody extraction on the user speech through a prosody feature processor of the target non-parallel voice conversion model to obtain the prosody features of the user speech.
9. The speech processing method according to claim 8, wherein performing voice conversion processing based on the language content features, the prosody features and the specified timbre information to obtain the specified converted speech of the specified timbre comprises: inputting the language content features of the user speech, the prosody features of the user speech and the specified timbre information into the target non-parallel voice conversion model to generate the specified converted speech of the specified timbre.
10. A speech processing apparatus, comprising: a first acquisition subunit configured to obtain language content features and prosody features from user speech of a target user through a non-parallel voice conversion model, wherein the prosody features are a vector into which a prosody feature extraction module converts a prosody feature representation of the user speech; a first processing unit configured to perform voice conversion processing based on the language content features, the prosody features and specified timbre information through the non-parallel voice conversion model to obtain specified converted speech of a specified timbre, wherein the specified timbre information is timbre information determined from a plurality of pieces of preset timbre information, and the specified converted speech is speech that has the specified timbre and whose language content and prosody correspond to those of the user speech; a training unit configured to train a voice conversion model according to the user speech and the specified converted speech to obtain a target voice conversion model; a generation unit configured to input a target text of the speech to be synthesized and the specified timbre information into a speech synthesis model to generate intermediate speech of the specified timbre; and a second processing unit configured to perform voice conversion processing on the intermediate speech through the target voice conversion model to generate target synthesized speech matching the timbre of the target user.
11. A computer device, characterized in that it comprises a memory in which a computer program is stored and a processor which performs the steps in the speech processing method according to any one of claims 1 to 9 by calling the computer program stored in the memory.
12. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program adapted to be loaded by a processor for performing the steps in the speech processing method according to any of claims 1 to 9.
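
Claims 4 to 7 describe training the non-parallel voice conversion model on pairs in which the original speech and the output speech are the same utterance, with language content and prosody each reduced to a vector of a specified length. The following is a minimal Python sketch of how one such training example might be assembled. The claims do not say how the feature processors compute their vectors, so `to_fixed_length` (simple chunk averaging) is only a placeholder, and `TrainingExample`, `make_example`, `timbre_id` and the default lengths are hypothetical names and values introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


def to_fixed_length(values: Sequence[float], length: int) -> List[float]:
    """Pool a variable-length sequence into `length` values by chunk averaging.

    Placeholder only: the claims do not specify how the language feature
    processor or the prosody feature processor produce their vectors.
    """
    if not values or length <= 0:
        raise ValueError("need a non-empty sequence and a positive length")
    n = len(values)
    pooled = []
    for i in range(length):
        start = i * n // length
        end = max(start + 1, (i + 1) * n // length)  # never an empty chunk
        chunk = values[start:end]
        pooled.append(sum(chunk) / len(chunk))
    return pooled


@dataclass
class TrainingExample:
    content_vec: List[float]  # vector of the first specified length (claim 6)
    prosody_vec: List[float]  # vector of the second specified length (claim 7)
    timbre_id: int            # hypothetical index into the preset timbre information
    target: List[float]       # output speech, identical to the original (claim 4)


def make_example(original: Sequence[float], timbre_id: int,
                 content_len: int = 4, prosody_len: int = 2) -> TrainingExample:
    """Build one reconstruction-style example: input and target are the same speech."""
    return TrainingExample(
        content_vec=to_fixed_length(original, content_len),
        prosody_vec=to_fixed_length(original, prosody_len),
        timbre_id=timbre_id,
        target=list(original),
    )
```

Because the target is the original utterance itself, such examples can be built from any speech in the training sample set without a second recording of the same sentence by another speaker, which is what makes the model non-parallel.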

Description

Speech processing method, device, computer equipment and computer readable storage medium

Technical Field

The embodiments of the application relate to the technical field of information processing, and in particular to a speech processing method, a speech processing apparatus, computer equipment and a computer-readable storage medium.

Background

With the continuous development of information technology, computer devices such as smartphones, tablet computers and notebook computers have become widespread and are developing toward diversification and personalization. Such devices can synthesize speech comparable to a human voice, which enriches human-computer interaction. Common speech processing technologies today include speech synthesis, voice conversion and voice cloning. Voice cloning refers to a technique in which a machine extracts timbre information from speech provided by a user and synthesizes speech using the user's timbre. Voice cloning is an extension of speech synthesis: traditional speech synthesis converts text to speech for a fixed speaker, whereas voice cloning additionally specifies the speaker's timbre. Voice cloning currently has many practical applications, such as voice navigation and audiobooks: a user can customize a voice pack by uploading speech and then have navigation prompts or novels read in that voice, which makes the application more engaging to use. In the prior art, a user who wants such personalized customization through voice cloning generally has to provide a recording of their own speech together with the text corresponding to that speech.
However, in voice cloning usage scenarios, the recording provided by the user may not match the text it is supposed to read aloud, so a cleaning and correction operation is needed before the speech model is trained. Recordings that match the read-aloud content are therefore hard to obtain, and the demands placed on the user during recording are high, which degrades the user experience.

Disclosure of Invention

The embodiments of the application provide a speech processing method, an apparatus, computer equipment and a computer-readable storage medium. A target text is synthesized into intermediate speech with a specified timbre, and after the user speech of a target user is acquired, the specified timbre of the intermediate speech is converted directly into the timbre of the user speech to obtain the target synthesized speech. Voice cloning can thus be performed quickly, the operation required of the user is simple, and the efficiency of voice cloning is effectively improved. In addition, the structure of the voice conversion model can be simplified, which makes the model lightweight and reduces its storage consumption on the computer equipment.
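
The flow summarized above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the application's implementation: `Speech` replaces audio with three labels, and `non_parallel_convert`, `train_target_converter`, `synthesize` and `clone_voice` are hypothetical stand-ins for the models. The pair direction used for training (specified-timbre speech as input, user speech as target) is an inference from how the trained model is later applied, since the text only says the model is trained according to both.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Speech:
    """Toy stand-in for an audio clip: three labels instead of a waveform."""
    content: str
    prosody: str
    timbre: str


def non_parallel_convert(user_speech: Speech, specified_timbre: str) -> Speech:
    """Keep content and prosody, swap in the specified timbre."""
    return Speech(user_speech.content, user_speech.prosody, specified_timbre)


def train_target_converter(
    pairs: List[Tuple[Speech, Speech]],
) -> Callable[[Speech], Speech]:
    """Toy 'training': remember the timbre on the user side of the pairs."""
    if not pairs:
        raise ValueError("at least one (converted, user) pair is required")
    user_timbre = pairs[0][1].timbre

    def convert(speech: Speech) -> Speech:
        return Speech(speech.content, speech.prosody, user_timbre)

    return convert


def synthesize(text: str, specified_timbre: str) -> Speech:
    """Toy text-to-speech that speaks `text` in the specified timbre."""
    return Speech(text, "neutral", specified_timbre)


def clone_voice(user_clips: List[Speech], specified_timbre: str,
                target_text: str) -> Speech:
    # Parallel pairs come for free: each converted clip shares content and
    # prosody with the user clip it was derived from.
    pairs = [(non_parallel_convert(clip, specified_timbre), clip)
             for clip in user_clips]
    converter = train_target_converter(pairs)
    intermediate = synthesize(target_text, specified_timbre)
    return converter(intermediate)
```

Note that no transcript of the user's clips appears anywhere in `clone_voice`: the text is needed only for the intermediate speech, which is the point of the approach relative to the prior art described in the Background.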
The embodiments of the application provide a speech processing method, comprising: performing voice conversion processing based on user speech of a target user and specified timbre information to obtain specified converted speech of a specified timbre, wherein the specified timbre information is timbre information determined from a plurality of pieces of preset timbre information, and the specified converted speech is user speech having the specified timbre; training a voice conversion model according to the user speech and the specified converted speech to obtain a target voice conversion model; inputting a target text of the speech to be synthesized and the specified timbre information into a speech synthesis model to generate intermediate speech of the specified timbre; and performing voice conversion processing on the intermediate speech through the target voice conversion model to generate target synthesized speech matching the timbre of the target user.

Correspondingly, the embodiments of the application also provide a speech processing apparatus, comprising: a first processing unit configured to perform voice conversion processing based on user speech of a target user and specified timbre information to obtain specified converted speech of a specified timbre, wherein the specified timbre information is timbre information determined from a plurality of pieces of preset timbre information, and the specified converted speech is user speech having the specified timbre; a training unit configured to train a voice conversion model according to the user speech and the specified converted speech to obtain a target voice conversion model; a generation unit configured to input a target text of the speech to be synthesized and the specified timbre information into a speech synthesis model to generate intermediate speech of the specified timbre; A