
CN-121983028-A - Voice generation method and device and electronic equipment


Abstract

The embodiments of the invention disclose a voice generation method, a voice generation device and electronic equipment. A text to be converted is determined from description information through a large language model; an intermediate phoneme sequence corresponding to the text to be converted and intermediate speech information corresponding to the intermediate phoneme sequence are generated; the intermediate speech information is analyzed using an adversarial classifier and mutual information minimization to obtain a corresponding emotion feature vector and timbre feature vector; a parameterized curve of the sentence head and/or sentence tail of at least one sentence in the intermediate speech information is determined based on at least one of fundamental frequency, duration and energy; and finally the parameterized curve, the intermediate phoneme sequence, the emotion feature vector and the timbre feature vector are combined to generate target speech information. In this way, the prosodic expression of the speech can be optimized, the separation of feature dimensions realized, and the overall output effect of the synthesized speech improved.
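The abstract above describes a multi-stage pipeline. The following is a minimal sketch of that flow; every function body is an illustrative stub (the function names and toy frame values are this sketch's own assumptions, not the patented implementation):

```python
# Minimal sketch of the generation pipeline described in the abstract.
# All function bodies are illustrative stubs, not the patented system.

def llm_generate_text(description: str) -> str:
    # Stand-in for the large-language-model step that turns a
    # description into the text to be converted.
    return f"Text derived from: {description}"

def text_to_phonemes(text: str) -> list[str]:
    # Stand-in grapheme-to-phoneme step producing the intermediate
    # phoneme sequence (here: naive per-letter tokens).
    return [c for c in text.lower() if c.isalpha()]

def synthesize_intermediate(phonemes: list[str]) -> dict:
    # Intermediate speech info: toy per-phoneme f0/duration/energy frames.
    return {"frames": [{"f0": 120.0, "dur": 0.08, "energy": 0.5}
                       for _ in phonemes]}

def decouple_features(speech: dict) -> tuple[list[float], list[float]]:
    # Placeholder for the adversarial-classifier / mutual-information
    # step that separates emotion and timbre feature vectors.
    emotion = [0.1, 0.2]
    timbre = [0.3, 0.4]
    return emotion, timbre

def prosody_curve(speech: dict, tail_frames: int = 3) -> list[float]:
    # Parameterized sentence-tail curve: linearly lower a scale factor
    # over the last few frames (one of many possible parameterizations).
    n = len(speech["frames"])
    head = [1.0] * max(0, n - tail_frames)
    tail = [1.0 - 0.1 * (i + 1) for i in range(min(tail_frames, n))]
    return head + tail

def generate(description: str) -> dict:
    text = llm_generate_text(description)
    phonemes = text_to_phonemes(text)
    speech = synthesize_intermediate(phonemes)
    emotion, timbre = decouple_features(speech)
    curve = prosody_curve(speech)
    return {"phonemes": phonemes, "emotion": emotion,
            "timbre": timbre, "curve": curve}

result = generate("calm weather report")
print(len(result["phonemes"]), result["curve"][-1])
```

The sketch only shows the data handed between stages; the actual encoders, classifier and vocoder are described in the claims below.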

Inventors

  • TIAN ZHEN
  • ZHANG JIAN

Assignees

  • 北京比特易湃信息技术有限公司

Dates

Publication Date
2026-05-05
Application Date
2026-01-21

Claims (10)

  1. A method of speech generation, the method comprising: determining a text to be converted according to description information through a large language model; determining an intermediate phoneme sequence corresponding to the text to be converted and intermediate speech information corresponding to the intermediate phoneme sequence; determining an emotion feature vector and a timbre feature vector corresponding to the intermediate speech information through an adversarial classifier and a mutual information minimization constraint; determining a parameterized curve corresponding to the sentence head and/or sentence tail of at least one sentence of the intermediate speech information according to at least one of fundamental frequency, duration and energy; and determining target speech information according to the parameterized curve, the intermediate phoneme sequence, the emotion feature vector and the timbre feature vector.
  2. The method according to claim 1, further comprising: determining a first text corresponding to the intermediate speech information through an automatic speech recognition model; determining a matching degree between the first text and a preset glossary; and in response to the matching degree being less than a preset threshold, adjusting the text to be converted according to the preset glossary to obtain an adjusted text to be converted.
  3. The method according to claim 2, wherein after adjusting the text to be converted, the method further comprises: determining the intermediate phoneme sequence corresponding to the adjusted text to be converted and the intermediate speech information corresponding to the intermediate phoneme sequence.
  4. The method according to claim 1, further comprising: determining a detection result of the target speech information, wherein the detection result comprises at least one of a term accuracy rate, a mean opinion score, a real-time factor and a loudness units full scale (LUFS) value.
  5. The method according to claim 4, further comprising: in response to the detection result not meeting a predetermined condition, transmitting the target speech information to a target terminal; and in response to receiving speech information sent by the target terminal, taking the speech information sent by the target terminal as the target speech information.
  6. The method according to claim 5, further comprising: adjusting a domain dictionary and/or a large language model prompt library according to the speech information sent by the target terminal.
  7. The method according to claim 1, wherein determining the target speech information according to the parameterized curve, the intermediate phoneme sequence, the emotion feature vector and the timbre feature vector comprises: determining a Mel spectrum sequence and alignment information according to the parameterized curve, the intermediate phoneme sequence, the emotion feature vector and the timbre feature vector; and determining the target speech information according to the Mel spectrum sequence and the alignment information, wherein the alignment information represents the temporal correspondence between the phoneme sequence and the Mel spectrum sequence.
  8. The method according to claim 1, wherein determining the emotion feature vector and the timbre feature vector corresponding to the intermediate speech information through the adversarial classifier and the mutual information minimization constraint comprises: determining a hybrid feature vector through a feature extractor and a shared encoder; determining a content feature vector, an emotion feature vector and a timbre feature vector from the hybrid feature vector through branch encoders; and improving the purity of the emotion feature vector and the timbre feature vector through the adversarial classifier and the mutual information minimization constraint.
  9. A speech generating apparatus, the apparatus comprising: a text determining unit, configured to determine a text to be converted according to description information through a large language model; an intermediate information determining unit, configured to determine an intermediate phoneme sequence corresponding to the text to be converted and intermediate speech information corresponding to the intermediate phoneme sequence; a decoupling unit, configured to determine an emotion feature vector and a timbre feature vector corresponding to the intermediate speech information through an adversarial classifier and a mutual information minimization constraint; a parameterized curve determining unit, configured to determine a parameterized curve corresponding to the sentence head and/or sentence tail of at least one sentence of the intermediate speech information according to at least one of fundamental frequency, duration and energy; and a target speech generating unit, configured to determine target speech information according to the parameterized curve, the intermediate phoneme sequence, the emotion feature vector and the timbre feature vector.
  10. An electronic device, comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, and the one or more computer program instructions are executed by the processor to implement the method according to any one of claims 1-8.
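Claim 8 describes decoupling the hybrid features into separate branches under an adversarial classifier and a mutual information minimization constraint. A toy numerical sketch of that constraint follows; the random projections standing in for the branch encoders, and the cross-correlation penalty standing in for the mutual-information term, are this sketch's own simplifications (the adversarial classifier itself is not modeled):

```python
import numpy as np

# Toy illustration of the decoupling constraint in claim 8: a shared
# encoding is split into emotion and timbre vectors, and a squared
# cross-correlation penalty stands in for mutual-information
# minimization between the two branches.

rng = np.random.default_rng(0)

def split_features(mixed: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    # Branch "encoders" as fixed random projections (illustrative only;
    # in practice these would be trained networks).
    d = mixed.shape[1]
    w_emotion = rng.standard_normal((d, 4))
    w_timbre = rng.standard_normal((d, 4))
    return mixed @ w_emotion, mixed @ w_timbre

def mi_proxy(a: np.ndarray, b: np.ndarray) -> float:
    # Mean squared cross-covariance between the two feature sets;
    # driving this toward zero is a common practical stand-in for
    # minimizing statistical dependence between the branches.
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    cov = a.T @ b / (len(a) - 1)
    return float(np.mean(cov ** 2))

mixed = rng.standard_normal((64, 8))   # batch of hybrid feature vectors
emotion, timbre = split_features(mixed)
print(round(mi_proxy(emotion, timbre), 4))
```

In a trained system this penalty would be one loss term alongside the adversarial classifier's loss, pushing emotion-specific information out of the timbre branch and vice versa.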

Description

Voice generation method and device and electronic equipment

Technical Field

The present invention relates to the field of speech synthesis technologies, and in particular to a speech generation method, apparatus, and electronic device.

Background

In voice generation scenarios such as information broadcasting and responsive voice, the widely adopted technical scheme uses general speech synthesis as its core, assisted by manual writing and manual voice repair. Specifically, a text is generated by manual writing or by a large language model and then input into a speech synthesis system for synthesis, where a single emotion label is used during the synthesis process; finally, whether the synthesized speech has terminology or style problems is judged manually to ensure output quality. The prior art also provides an end-to-end scheme combining voice cloning and style migration: using a few samples of a target speaker, timbre cloning is completed through a speaker embedder, a style encoder and an end-to-end acoustic-vocoder model, and emotion style codes are superimposed in the same latent space to generate speech.

However, the prior art speech generation methods have a number of drawbacks. On the one hand, they include no prosodic modulation for the beginning and end of a sentence, so the expressive intent of the speech cannot be conveyed through prosody. On the other hand, they cannot decouple the emotion and timbre of speech, so these dimensions cannot be controlled independently. In summary, speech generated by prior art methods sounds unnatural and offers poor flexibility for personalized speech generation.
Disclosure of Invention

In view of this, embodiments of the present invention provide a speech generation method, apparatus, electronic device and storage medium, which can optimize the prosodic expression of speech, realize separation of feature dimensions, and improve the overall output effect of synthesized speech.

In a first aspect, an embodiment of the present invention provides a speech generation method, where the method includes: determining a text to be converted according to description information through a large language model; determining an intermediate phoneme sequence corresponding to the text to be converted and intermediate speech information corresponding to the intermediate phoneme sequence; determining an emotion feature vector and a timbre feature vector corresponding to the intermediate speech information through an adversarial classifier; determining a parameterized curve of the sentence head and/or sentence tail of at least one sentence of the intermediate speech information according to at least one of fundamental frequency, duration and energy; and determining target speech information according to the parameterized curve, the intermediate phoneme sequence, the emotion feature vector and the timbre feature vector.
In a second aspect, an embodiment of the present invention provides a speech generating apparatus, including: a text determining unit, configured to determine a text to be converted according to description information through a large language model; an intermediate information determining unit, configured to determine an intermediate phoneme sequence corresponding to the text to be converted and intermediate speech information corresponding to the intermediate phoneme sequence; a decoupling unit, configured to determine an emotion feature vector and a timbre feature vector corresponding to the intermediate speech information through an adversarial classifier; a parameterized curve determining unit, configured to determine a parameterized curve of the sentence head and/or sentence tail of at least one sentence of the intermediate speech information according to at least one of fundamental frequency, duration and energy; and a target speech generating unit, configured to determine target speech information according to the parameterized curve, the intermediate phoneme sequence, the emotion feature vector and the timbre feature vector.

In a third aspect, embodiments of the present invention provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method according to the first aspect.

In a fourth aspect, an embodiment of the present invention provides an electronic device comprising a memory and a processor, the memory storing one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method according to the first aspect.

According to the technical scheme, a text to be converted is determined according to description information through
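Claim 4 names several detection metrics for the generated speech, including the real-time factor and a LUFS loudness value. The sketch below computes the real-time factor and a crude RMS level in dBFS as a stand-in; true LUFS measurement requires the K-weighting filters and gating of ITU-R BS.1770, which are omitted here, and the example signal and timing values are invented for illustration:

```python
import numpy as np

# Hedged sketch of two detection metrics named in claim 4:
# real-time factor (synthesis time / audio duration) and a crude
# loudness estimate in dBFS. True LUFS requires the K-weighting
# and gating defined in ITU-R BS.1770, omitted in this toy version.

def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    # RTF < 1.0 means the system synthesizes faster than real time.
    return synthesis_seconds / audio_seconds

def dbfs(samples: np.ndarray) -> float:
    # RMS level relative to full scale, for samples in [-1, 1].
    rms = np.sqrt(np.mean(samples ** 2))
    return 20.0 * np.log10(max(rms, 1e-12))

# One second of a 220 Hz tone at half amplitude as stand-in audio.
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 220.0 * t)

rtf = real_time_factor(synthesis_seconds=0.25, audio_seconds=1.0)
print(round(rtf, 2), round(dbfs(tone), 1))
```

Metrics such as term accuracy rate and mean opinion score, by contrast, require a reference glossary and human (or learned) raters respectively, so they cannot be reduced to a signal-level formula like the two shown here.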