CN-121999757-A - Speech synthesis method, device, equipment and medium

Abstract

The invention relates to the technical field of data processing and discloses a speech synthesis method, apparatus, device, and medium. The method comprises: obtaining target prompt audio and target text to be synthesized; determining the language category for speech synthesis of the target text; counting the total number of language units contained in the target text that correspond to that language category; processing the target prompt audio at the granularity of those language units to obtain a corresponding pronunciation rate; determining, from the total number and the pronunciation rate, the total duration of the target speech to be obtained by synthesizing the target text; and inputting the total duration, the target prompt audio, and the target text into a target speech synthesis model, which outputs a speech synthesis result corresponding to the target speech. The invention can be applied in the financial technology field and improves the accuracy of speech synthesis in scenarios where the prompt audio has no transcription.

Inventors

  • SHI YAN
  • CHEN MINCHUAN

Assignees

  • Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)

Dates

Publication Date
2026-05-08
Application Date
2026-01-08

Claims (10)

  1. A speech synthesis method, comprising: acquiring target prompt audio and target text to be synthesized, determining a language category for speech synthesis of the target text, and counting the total number of language units contained in the target text that correspond to the language category; processing the target prompt audio at the granularity of the language units to obtain a corresponding pronunciation rate, and determining, from the total number and the pronunciation rate, the total duration of the target speech obtained by speech synthesis of the target text; and inputting the total duration, the target prompt audio, and the target text into a target speech synthesis model, and outputting a speech synthesis result corresponding to the target speech.
  2. The method according to claim 1, wherein counting the total number of language units contained in the target text that correspond to the language category comprises: if the language category is Chinese, determining that the language unit corresponding to the language category is the syllable, and counting the total number of syllables contained in the target text; and if the language category is English, determining that the language unit corresponding to the language category is the phoneme, and counting the total number of phonemes contained in the target text.
  3. The method of claim 1, wherein processing the target prompt audio at the granularity of the language units to obtain a corresponding pronunciation rate comprises: inputting the target prompt audio into an encoder of a target rate predictor, and extracting features of the target prompt audio through the encoder to obtain audio features; and inputting the audio features and the language units into a decoder of the target rate predictor, decoding the audio features according to the language units through the decoder, and outputting the pronunciation rate.
  4. The method of claim 1, wherein inputting the total duration, the target prompt audio, and the target text into the target speech synthesis model, and outputting a speech synthesis result corresponding to the target speech, comprises: inputting the total duration, the target prompt audio, and the target text into a flow matching synthesis module of the target speech synthesis model, and obtaining, through the flow matching synthesis module, a Mel spectrogram corresponding to the target speech according to the total duration, the target prompt audio, and the target text; and inputting the Mel spectrogram into a vocoder of the target speech synthesis model, performing speech synthesis according to the Mel spectrogram through the vocoder, and outputting a speech synthesis result corresponding to the target speech.
  5. The speech synthesis method according to claim 1, further comprising: acquiring sample prompt audio clips and the transcribed text corresponding to each sample prompt audio clip; for any sample prompt audio clip, dividing it at a preset dividing point to obtain a first sample prompt audio segment located before the dividing point and a second sample prompt audio segment located after the dividing point; determining the duration of the second sample prompt audio segment, taking the second sample prompt audio segment as a label, and associating the second sample prompt audio segment, the transcribed text of the second sample prompt audio segment, and the duration with one another to form a first training sample; and training an initial target speech synthesis model according to the first training sample to obtain the target speech synthesis model.
  6. The method according to claim 5, wherein, after dividing the sample prompt audio at the preset dividing point to obtain the first sample prompt audio segment located before the dividing point and the second sample prompt audio segment located after the dividing point, the method further comprises: determining word boundary information of the second sample prompt audio segment according to the transcribed text of the second sample prompt audio segment, wherein the word boundary information comprises the timestamp, within the second sample prompt audio segment, of each word in that transcribed text; taking the second sample prompt audio segment as a label, and associating it with the first sample prompt audio segment, the transcribed text of the second sample prompt audio segment, the duration, and the word boundary information to form a second training sample; and training the initial target speech synthesis model according to the second training sample to obtain the target speech synthesis model.
  7. The method of claim 1, wherein determining, from the total number and the pronunciation rate, the total duration of the target speech obtained by speech synthesis of the target text comprises: dividing the total number by the pronunciation rate to obtain the total duration.
  8. A speech synthesis apparatus, comprising: a first acquisition module, configured to acquire target prompt audio and target text to be synthesized, determine a language category for speech synthesis of the target text, and count the total number of language units contained in the target text that correspond to the language category; a duration prediction module, configured to process the target prompt audio at the granularity of the language units to obtain a corresponding pronunciation rate, and determine, from the total number and the pronunciation rate, the total duration of the target speech obtained by speech synthesis of the target text; and a speech synthesis module, configured to input the total duration, the target prompt audio, and the target text into a target speech synthesis model and output a speech synthesis result corresponding to the target speech.
  9. A computer device, comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor implements the speech synthesis method according to any one of claims 1 to 7 when executing the computer program.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the speech synthesis method according to any one of claims 1 to 7.
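The training-sample construction of claims 5 and 6 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name, the representation of word boundaries as `(start, end)` second offsets, and the convention of re-referencing timestamps to the start of the second segment are all assumptions introduced here.

```python
# Hypothetical sketch of the sample-splitting step in claims 5-6.
def build_training_samples(audio, sample_rate, transcript_words,
                           word_timestamps, split_point_s):
    """Split one prompt clip at a preset dividing point.

    audio            -- sequence of samples for one prompt clip
    transcript_words -- words of the clip's transcription
    word_timestamps  -- (start_s, end_s) per word, aligned to the audio
    split_point_s    -- preset dividing point, in seconds
    """
    split_idx = int(split_point_s * sample_rate)
    first_segment = audio[:split_idx]    # segment before the dividing point
    second_segment = audio[split_idx:]   # segment after it, used as the label
    duration_s = len(second_segment) / sample_rate

    # Keep only the transcript words (and their boundaries) that fall in
    # the second segment, re-referenced to that segment's start.
    second_words, second_boundaries = [], []
    for word, (start_s, end_s) in zip(transcript_words, word_timestamps):
        if start_s >= split_point_s:
            second_words.append(word)
            second_boundaries.append((start_s - split_point_s,
                                      end_s - split_point_s))

    # Claim 5: second segment as label + its transcript + its duration.
    first_sample = {
        "label_audio": second_segment,
        "transcript": second_words,
        "duration_s": duration_s,
    }
    # Claim 6: additionally the first segment (as prompt) and word boundaries.
    second_sample = dict(first_sample,
                         prompt_audio=first_segment,
                         word_boundaries=second_boundaries)
    return first_sample, second_sample
```

Either sample variant would then be used to train the initial target speech synthesis model, with the second segment serving as the supervision label.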

Description

Speech synthesis method, device, equipment and medium

Technical Field

The present invention relates to the field of data processing technologies, and in particular to a method, an apparatus, a device, and a medium for speech synthesis.

Background

In recent years, speech synthesis technology has seen growing application in the insurance and financial technology fields. From intelligent customer service, spoken product descriptions, and compliance return-visit calls to personalized financial-advisor interaction, high-quality voice interaction has become a key link in improving customer experience, optimizing operational efficiency, and enhancing service reliability. Non-autoregressive, flow-matching-based speech synthesis models are regarded as a key technology for deploying industrial speech applications at scale, owing to their high-quality speech output and efficient inference speed. By applying probabilistic flow learning to the noise-to-speech generation process, these techniques greatly simplify the lengthy processing pipeline of traditional speech synthesis and offer the potential for real-time, natural speech interaction. However, existing state-of-the-art flow-matching zero-shot voice cloning systems face a serious challenge in practical deployment: their core bottleneck is a strong dependence on the text transcription corresponding to the audio prompt.
When such a system imitates the voice style of a segment of customer or customer-service-representative audio, it must know the complete text content of that example audio in advance, so that the synthesis duration of new content can be estimated from the ratio of text lengths. Once the prompt audio has no transcription, duration estimation based on text length is no longer reliable; the synthesized speech is then compressed or stretched in time, seriously degrading intelligibility and naturalness. Another practical problem is the burden of data and labels: large-scale multilingual training sets do not have high-quality transcriptions for every audio clip, and transcription-dependent systems are therefore limited in practical deployment, especially for low-resource or unseen languages. How to improve the accuracy of speech synthesis when the prompt audio has no transcription is thus a problem to be solved.

Disclosure of Invention

The embodiments of the invention provide a speech synthesis method, apparatus, device, and medium, which address the problem of improving the accuracy of speech synthesis when the prompt audio has no transcription.
In a first aspect, a speech synthesis method includes: acquiring target prompt audio and target text to be synthesized, determining a language category for speech synthesis of the target text, and counting the total number of language units contained in the target text that correspond to the language category; processing the target prompt audio at the granularity of the language units to obtain a corresponding pronunciation rate, and determining, from the total number and the pronunciation rate, the total duration of the target speech obtained by speech synthesis of the target text; and inputting the total duration, the target prompt audio, and the target text into a target speech synthesis model, and outputting a speech synthesis result corresponding to the target speech.

In a second aspect, a speech synthesis apparatus is provided, comprising: a first acquisition module, configured to acquire target prompt audio and target text to be synthesized, determine a language category for speech synthesis of the target text, and count the total number of language units contained in the target text that correspond to the language category; a duration prediction module, configured to process the target prompt audio at the granularity of the language units to obtain a corresponding pronunciation rate, and determine, from the total number and the pronunciation rate, the total duration of the target speech obtained by speech synthesis of the target text; and a speech synthesis module, configured to input the total duration, the target prompt audio, and the target text into a target speech synthesis model and output a speech synthesis result corresponding to the target speech.

In a third aspect, a computer device is provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the above speech synthesis method when executing the computer program.
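The duration estimation of the first aspect (counting language units per claim 2 and dividing by the pronunciation rate per claim 7) can be sketched as below. The unit-counting rules are deliberate simplifications introduced for illustration: real systems would use a grapheme-to-phoneme front end rather than character counting, and the pronunciation rate is assumed here to be expressed in language units per second, with the rate predictor of claim 3 abstracted as an externally supplied value.

```python
# Hedged sketch of the first-aspect duration estimation; not the patent's code.
def count_language_units(text: str, language: str) -> int:
    """Count language units per claim 2 (simplified stand-ins)."""
    if language == "zh":
        # Chinese: one syllable per Han character (simplifying assumption).
        return sum(1 for ch in text if '\u4e00' <= ch <= '\u9fff')
    if language == "en":
        # English: phonemes, crudely approximated by letter count here,
        # standing in for a real phonemizer.
        return sum(1 for ch in text if ch.isalpha())
    raise ValueError(f"unsupported language category: {language}")

def estimate_total_duration(text: str, language: str,
                            rate_units_per_s: float) -> float:
    """Claim 7: total duration = total unit count / pronunciation rate."""
    return count_language_units(text, language) / rate_units_per_s
```

For example, with an assumed English pronunciation rate of 10 units per second, `estimate_total_duration("hello world", "en", 10.0)` yields 1.0 second for the 10 letters of the text; that duration would then be passed, together with the prompt audio and the text, to the target speech synthesis model.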
In a fourth aspect, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the above speech synthesis method. According to the voice