CN-122024692-A - Text-to-speech method, device, equipment and medium based on multiple rewarding mechanism
Abstract
The invention relates to the technical field of artificial intelligence and provides a text-to-speech method, device, equipment, and medium based on a multiple rewarding mechanism, applied to financial and medical health-care business scenes. The method can input a plurality of text samples into an inference-type large language model to obtain a prosody template, providing a clear supervision signal for the prosody alignment reward; generate candidate speech samples with a single-codebook speech generation model, avoiding the influence of single-sample randomness on model optimization; construct a semantic consistency reward, a voiceprint consistency reward, a speech length control reward, and a prosody alignment reward to realize multidimensional constraints; construct intra-group relative advantages according to these rewards; and update the model with an intra-group relative policy optimization algorithm to realize cooperative optimization of the multiple rewards, fundamentally improving model performance at the level of the generation strategy rather than only performing surface parameter fine-tuning, thereby helping to generate higher-quality audio.
Inventors
- SHI JIN
- CHEN MINCHUAN
Assignees
- 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Dates
- Publication Date
  - 2026-05-12
- Application Date
  - 2026-02-10
Claims (10)
- 1. A text-to-speech method based on a multiple rewarding mechanism, characterized by comprising the following steps: constructing a training sample, wherein the training sample comprises an initial text of a speech to be generated and a reference audio; preprocessing the initial text and the reference audio to obtain a plurality of text samples corresponding to the initial text and an audio sample corresponding to the reference audio; inputting the plurality of text samples into an inference-type large language model to obtain a prosody template; inputting the plurality of text samples into a single-codebook speech generation model to obtain a group of candidate speech samples corresponding to each text sample; constructing a semantic consistency reward, a voiceprint consistency reward, a speech length control reward, and a prosody alignment reward by using each candidate speech sample and the audio sample; for each candidate sample in the same group, constructing an intra-group relative advantage according to the semantic consistency reward, the voiceprint consistency reward, the speech length control reward, and the prosody alignment reward; adopting an intra-group relative policy optimization algorithm, taking the intra-group relative advantage as a guidance signal and the expectation of maximizing the intra-group relative advantage as the optimization objective, and updating the parameters of the single-codebook speech generation model through back propagation to obtain a target model; and in response to a text-to-speech instruction triggered based on a target text and a target reference audio, processing the target text and the target reference audio with the target model to obtain a target speech.
- 2. The text-to-speech method based on the multiple rewarding mechanism of claim 1, wherein, before inputting the text samples into the inference-type large language model, the method further comprises: acquiring a pre-trained large language model and training data with prosodic feature labels; adopting an adversarial self-supervised learning mechanism and fine-tuning the pre-trained large language model with the training data; and stopping training when the matching degree between the output of the pre-trained large language model and the prosodic features labeled in the training data is greater than a preset threshold, and determining the model obtained by the current training as the inference-type large language model; wherein the inference-type large language model is used for analyzing the grammatical structure and semantic logic of the text samples based on a prosody knowledge base and generating the prosody template containing pause positions and pause duration grades.
- 3. The text-to-speech method based on the multiple rewarding mechanism of claim 1, wherein constructing the semantic consistency reward, the voiceprint consistency reward, the speech length control reward, and the prosody alignment reward by using each candidate speech sample and the audio sample comprises: for each candidate speech sample, invoking a speech recognition model to convert the candidate speech sample into text, obtaining a transcription text; calculating the edit distance between the transcription text and the corresponding text sample; acquiring the total character length of the text sample; calculating the quotient of the edit distance and the total character length to obtain a first intermediate value; and calculating the difference between 1 and the first intermediate value to obtain the semantic consistency reward corresponding to the candidate speech sample.
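The semantic consistency reward of claim 3 reduces to 1 − edit_distance / character_length. A minimal sketch, assuming a character-level Levenshtein distance and taking the ASR transcription as given (the speech recognition model itself is out of scope here):

```python
# Sketch of the semantic-consistency reward in claim 3.

def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def semantic_consistency_reward(transcript: str, text_sample: str) -> float:
    """Reward = 1 - edit_distance / total character length of the text sample."""
    first_intermediate = edit_distance(transcript, text_sample) / len(text_sample)
    return 1.0 - first_intermediate
```

An exact transcription yields a reward of 1; every character-level error lowers it in proportion to the text length.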
- 4. The text-to-speech method based on the multiple rewarding mechanism of claim 1, wherein constructing the semantic consistency reward, the voiceprint consistency reward, the speech length control reward, and the prosody alignment reward further comprises: for each candidate speech sample, invoking a voiceprint extraction model to extract high-dimensional voiceprint vectors of the audio sample and the candidate speech sample, obtaining a first voiceprint vector corresponding to the audio sample and a second voiceprint vector corresponding to the candidate speech sample; calculating the dot product of the first voiceprint vector and the second voiceprint vector to obtain a first value; calculating a first L2 norm of the first voiceprint vector and a second L2 norm of the second voiceprint vector; calculating the product of the first L2 norm and the second L2 norm to obtain a second value; and calculating the quotient of the first value and the second value to obtain the voiceprint consistency reward corresponding to the candidate speech sample.
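The voiceprint consistency reward of claim 4 is the cosine similarity between the two voiceprint vectors. A minimal sketch, assuming the voiceprint extraction model has already produced the embeddings:

```python
# Sketch of the voiceprint-consistency reward in claim 4: dot product of the
# two voiceprint vectors divided by the product of their L2 norms.
import math

def voiceprint_consistency_reward(v_ref, v_cand):
    """Cosine similarity between the reference and candidate voiceprints."""
    first_value = sum(x * y for x, y in zip(v_ref, v_cand))       # dot product
    second_value = (math.sqrt(sum(x * x for x in v_ref))          # ||v_ref||_2
                    * math.sqrt(sum(y * y for y in v_cand)))      # ||v_cand||_2
    return first_value / second_value
```

Identical speaker embeddings score 1; orthogonal (unrelated) embeddings score 0.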
- 5. The text-to-speech method based on the multiple rewarding mechanism of claim 3, wherein constructing the semantic consistency reward, the voiceprint consistency reward, the speech length control reward, and the prosody alignment reward further comprises: for each candidate speech sample, acquiring a reference audio speech rate and an allowable error range; calculating the quotient of the character length of the corresponding text sample and the reference audio speech rate to obtain an estimated speech duration; calculating the actual duration of the candidate speech sample; generating a speech duration range according to the actual duration and the allowable error range; when the estimated speech duration is within the speech duration range, determining that the corresponding speech length control reward is 1; or, when the estimated speech duration is not within the speech duration range, calculating the difference between the actual duration and the estimated speech duration, calculating the quotient of the difference and the estimated speech duration as a second intermediate value, and calculating the difference between 1 and the second intermediate value as the corresponding speech length control reward.
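The speech length control reward of claim 5 can be sketched as follows. Two details are interpretive assumptions: the allowable error range is read as a relative band around the actual duration, and the difference is taken as an absolute value so that both over- and under-length candidates are penalized:

```python
# Sketch of the speech-length control reward in claim 5.

def length_control_reward(num_chars: int, ref_speech_rate: float,
                          actual_duration: float, tolerance: float = 0.1) -> float:
    """Reward 1.0 if the estimated duration (characters / characters-per-second)
    falls inside the tolerance band around the actual duration; otherwise
    1 - |actual - estimated| / estimated."""
    estimated = num_chars / ref_speech_rate          # estimated duration, seconds
    low = actual_duration * (1 - tolerance)
    high = actual_duration * (1 + tolerance)
    if low <= estimated <= high:
        return 1.0
    second_intermediate = abs(actual_duration - estimated) / estimated
    return 1.0 - second_intermediate
```

For example, 50 characters at 5 characters/second gives an estimated 10 s; an actual duration of 12 s (band 10.8–13.2 s with 10 % tolerance) falls outside, yielding 1 − 2/10 = 0.8.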
- 6. The text-to-speech method based on the multiple rewarding mechanism of claim 1, wherein constructing the semantic consistency reward, the voiceprint consistency reward, the speech length control reward, and the prosody alignment reward further comprises: for each candidate speech sample, performing pause extraction on the candidate speech sample, identifying pause positions and pause durations in the candidate speech sample with an endpoint detection algorithm to generate an actual pause sequence; obtaining a standard pause sequence from the prosody template and obtaining the number of pause positions in the standard pause sequence; obtaining the number of pause positions matched between the actual pause sequence and the standard pause sequence; and calculating the quotient of the matched number and the number of pause positions in the standard pause sequence to obtain the prosody alignment reward corresponding to the candidate speech sample.
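The prosody alignment reward of claim 6 is the fraction of template pauses matched by detected pauses. A minimal sketch, assuming pause positions are given as times in seconds and a template pause counts as matched when a detected pause lies within a small window of it (the matching criterion is an assumption; the claim only specifies the counting):

```python
# Sketch of the prosody-alignment reward in claim 6: matched template pauses
# divided by the total number of pause positions in the template.

def prosody_alignment_reward(actual_pauses, template_pauses, window=0.05):
    """actual_pauses: pause positions detected by endpoint detection (seconds).
    template_pauses: standard pause positions from the prosody template."""
    if not template_pauses:
        return 1.0  # nothing to align against (assumed edge-case handling)
    matched = sum(
        any(abs(t - a) <= window for a in actual_pauses)
        for t in template_pauses
    )
    return matched / len(template_pauses)
```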
- 7. The text-to-speech method based on the multiple rewarding mechanism of claim 1, wherein constructing the intra-group relative advantage according to the semantic consistency rewards, the voiceprint consistency rewards, the speech length control rewards, and the prosody alignment rewards comprises: for each candidate sample in the group, calculating the reward sum of the semantic consistency reward, the voiceprint consistency reward, the speech length control reward, and the prosody alignment reward, and calculating the average of the reward sums over the group; and calculating the difference between each reward sum and the average to perform mean removal on the reward sum corresponding to each candidate sample in the group, thereby obtaining the intra-group relative advantage.
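The intra-group relative advantage of claim 7 is the mean-removed sum of the four rewards over the candidates in one group, as in group-relative policy optimization. A minimal sketch:

```python
# Sketch of the intra-group relative advantage in claim 7: sum the four
# rewards per candidate, then subtract the group mean (mean removal).

def group_relative_advantage(reward_tuples):
    """reward_tuples: one (semantic, voiceprint, length, prosody) tuple per
    candidate in a group. Returns the mean-removed reward sums, whose sign
    indicates whether a candidate is better or worse than the group average."""
    sums = [sum(r) for r in reward_tuples]
    mean = sum(sums) / len(sums)
    return [s - mean for s in sums]
```

Candidates above the group average receive a positive advantage and are reinforced during the policy update; those below receive a negative one.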
- 8. A text-to-speech apparatus based on a multiple rewarding mechanism, the apparatus comprising: a construction unit, configured to construct a training sample, wherein the training sample comprises an initial text of a speech to be generated and a reference audio; a preprocessing unit, configured to preprocess the initial text and the reference audio to obtain a plurality of text samples corresponding to the initial text and an audio sample corresponding to the reference audio; an input unit, configured to input the plurality of text samples into an inference-type large language model to obtain a prosody template; the input unit being further configured to input the plurality of text samples into a single-codebook speech generation model to obtain a group of candidate speech samples corresponding to each text sample; the construction unit being further configured to construct a semantic consistency reward, a voiceprint consistency reward, a speech length control reward, and a prosody alignment reward by using each candidate speech sample and the audio sample; the construction unit being further configured to construct, for each candidate sample in the same group, an intra-group relative advantage according to the semantic consistency reward, the voiceprint consistency reward, the speech length control reward, and the prosody alignment reward; an updating unit, configured to adopt an intra-group relative policy optimization algorithm, take the intra-group relative advantage as a guidance signal and the expectation of maximizing the intra-group relative advantage as the optimization objective, and update the parameters of the single-codebook speech generation model through back propagation to obtain a target model; and a processing unit, configured to respond to a text-to-speech instruction triggered based on a target text and a target reference audio, and process the target text and the target reference audio with the target model to obtain a target speech.
- 9. A computer device, comprising: a memory storing at least one instruction; and a processor executing the instructions stored in the memory to implement the text-to-speech method based on the multiple rewarding mechanism of any of claims 1 to 7.
- 10. A computer-readable storage medium having at least one instruction stored therein, the at least one instruction being executed by a processor in a computer device to implement the text-to-speech method based on the multiple rewarding mechanism of any of claims 1 to 7.
Description
Text-to-speech method, device, equipment and medium based on multiple rewarding mechanism

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a text-to-speech method, device, equipment, and medium based on a multiple rewarding mechanism.

Background

Currently, many fields involve the task of converting text into speech in the style of a given reference speech. For example, in the financial field, a bank's account management system can use professional, steady voice parameters to convert users' monthly bill texts into speech, replacing traditional SMS notifications by automatically reading bill details to users; in the medical and health-care field, a hospital's intelligent triage system can use mild, clear voice parameters to convert consultation-process texts into speech, guiding patients through the consultation procedure. Existing Text-to-Speech (TTS) models, especially TTS large models based on a single-codebook architecture, are widely applied to multi-language and zero-shot speech generation owing to their compact structure and strong streaming capability. However, such models still have various performance drawbacks in practical applications, and it is difficult for them to meet the generation requirements of high stability, high naturalness, and high prosody consistency.
Specifically, current single-codebook TTS models mainly face the following common problems:
(1) Insufficient prosody control and unstable rhythm: because the single codebook simultaneously carries semantic and acoustic information, its decoding strategy is prone to fluctuation in prosodic aspects such as rhythm and pauses, so the generated speech may exhibit dragging rhythm, misplaced pauses, and unnatural intonation; this is more obvious in long texts or sentences with complex grammatical structures.
(2) Insufficient speaker consistency (speaker drift): in zero-shot or cross-text speech generation, the voice may deviate between different sentences or paragraphs, manifested as timbre changes, speech-rate drift, or an unstable acoustic style, affecting the overall perceptual consistency of the generated speech.
(3) Unstable text-speech alignment, with missed readings, repetitions, and early termination: existing autoregressive TTS large language models may converge early or terminate late due to insufficient policy learning during generation, so that the generated length does not match the target sentence. This length instability is one of the technical pain points currently prevalent in the industry.
(4) Lack of a scalable prosody supervision mechanism: current models rely on implicitly learned prosodic structure and lack explicit prosody and pause template guidance, so prosody alignment cannot be optimized continuously and robustly. Traditional methods lack a structured source of prosodic labels, and manual labeling is extremely costly, which is unfavorable for large-scale training.
In view of the above, how to generate high-quality audio from text has become a problem to be solved.
Disclosure of Invention

In view of the foregoing, it is desirable to provide a text-to-speech method, apparatus, device, and medium based on a multiple rewarding mechanism, aiming to solve the problem that high-quality audio cannot be generated from text. A text-to-speech method based on a multiple rewarding mechanism comprises: constructing a training sample, wherein the training sample comprises an initial text of a speech to be generated and a reference audio; preprocessing the initial text and the reference audio to obtain a plurality of text samples corresponding to the initial text and an audio sample corresponding to the reference audio; inputting the plurality of text samples into an inference-type large language model to obtain a prosody template; inputting the plurality of text samples into a single-codebook speech generation model to obtain a group of candidate speech samples corresponding to each text sample; constructing a semantic consistency reward, a voiceprint consistency reward, a speech length control reward, and a prosody alignment reward by using each candidate speech sample and the audio sample; for each candidate sample in the same group, constructing an intra-group relative advantage according to the semantic consistency reward, the voiceprint consistency reward, the speech length control reward, and the prosody alignment reward; adopting an intra-group relative policy optimization algorithm, taking the intra-group relative advantage as a guidance signal, taking the expectation of maximizing the relative a