CN-122024704-A - Context self-adaption-based multi-round interactive emotion voice synthesis method and system
Abstract
The invention provides a context-adaptive multi-round interactive emotional speech synthesis method and system, belonging to the technical field of speech synthesis. The method comprises: acquiring, in an instant multi-round conversation that takes adjacent single-round historical speech as the context window, the historical speech signals and the current text to be synthesized; deconstructing and predicting emotion and multidimensional paralinguistic acoustic features directly from the historical speech signals through a context-adaptive feature predictor trained in two stages; converting the predicted features into usable standardized control parameters through a feature-parameter mapping mechanism built on preset standardized templates; and, based on the standardized parameters and the text to be synthesized, driving a speech synthesis model to generate natural, coherent target speech. The system comprises a data acquisition module, a context-adaptive feature prediction module, a feature-parameter mapping module and a speech synthesis module. The invention jointly optimizes emotion accuracy and speech naturalness in multi-round interaction scenarios, and markedly improves both the acoustic quality and the emotional expressiveness of the synthesized speech.
Inventors
- WEI QINGLAN
- ZHANG YUAN
- XUE RUIQI
- ZHANG PEIJUE
Assignees
- Communication University of China (中国传媒大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-02-26
Claims (10)
- 1. A context-adaptive multi-round interactive emotional speech synthesis method, characterized by comprising the following steps: Step S1, in an instant multi-round dialogue scene that takes adjacent single-round historical speech as the context window, acquiring the historical speech signal sequence and the current text to be synthesized in the multi-round dialogue, wherein the historical speech signal sequence does not need to undergo speech-to-text (ASR) processing; Step S2, selecting a basic audio large model and constructing a context-adaptive feature predictor through two-stage training, which adapts to the context, predicts the emotion and paralinguistic features that the reply speech should carry, and constructs and outputs an initial acoustic feature set, wherein the two-stage training specifically comprises: first-stage training, which constructs a manually annotated data set covering acoustic features, specifically basic features, emotional features and paralinguistic features, formulates quantitative annotation standards for standardized annotation of each feature so that all annotations are presented in natural-language form, and fine-tunes the audio large model on the manually annotated data set until the model can deconstruct the feature information of a single utterance, expressed mathematically as: $D = M_1(X)$, where $D$ is the manually annotated natural-language description, $X$ is the input speech signal, and $M_1$ is the first-stage model for deconstructing the feature information of a single utterance; and second-stage training, which feeds the speech of an incremental dialogue data set into the audio large model trained in the first stage and deconstructs its feature information to obtain the acoustic features of each utterance, then constructs a historical-speech/reply-speech feature-pair data set with adjacent single rounds as context windows, and incrementally trains the audio large model on this data set until it can predict the initial acoustic features of the reply speech from the historical speech, expressed mathematically as: $F_{t+1} = M_2(X_t)$, where $F_{t+1}$ is the initial acoustic feature set of the reply speech, $M_2$ is the second-stage model for context prediction, $X_{t+1}$ is the reply speech, and $X_t$ is the round of speech preceding $X_{t+1}$, i.e. the historical speech for the reply speech $X_{t+1}$; Step S3, establishing a feature-parameter mapping mechanism comprising a pre-constructed mapping rule template and feature-field parsing, inputting the initial acoustic feature set into the mechanism, and outputting standardized acoustic control parameters, specifically comprising: Step S31, performing parameter optimization of the feature-parameter mapping mechanism: generating multiple semantically equivalent natural-language prompt variants with a frozen large text model, screening and supplementing them with manual assistance, taking Mel Cepstral Distortion (MCD) as the quantitative evaluation criterion to score and rank the similarity between the speech samples generated by the different prompt variants and the original speech, and determining the optimal prompt after at least one round of testing; Step S32, constructing a mapping rule template from the optimal prompt and using it to define a standardized slot-mapping paradigm comprising a global acoustic state template, a local paralinguistic-feature event anchoring template, and a stress control template; Step S33, extracting the key fields of the input initial acoustic features, namely the feature variable, its occurrence position and its duration, with a feature-field parsing module, structurally mapping these fields to the predefined slots of the optimal mapping rule template to generate the control signal governing subsequent speech synthesis, and outputting the standardized acoustic control parameters; and Step S4, inputting the standardized acoustic control parameters and the reply text into a speech synthesis model to generate emotionally coherent, natural speech that fits the context.
- 2. The context-adaptive multi-round interactive emotional speech synthesis method of claim 1, wherein in step S2 the basic audio large model is Qwen-Audio and it is trained with the LoRA fine-tuning method (an illustrative LoRA sketch follows the claims).
- 3. The context-adaptive multi-round interactive emotional speech synthesis method of claim 1, wherein the data set of the first training stage of step S2 is manually annotated, and the acoustic features and quantitative annotation criteria are as follows (the thresholds are illustrated in a sketch after the claims): among the basic features, speaking rate is graded by words per minute (WPM), slow at ≤ 130 WPM, medium from 131 to 160 WPM, and fast at ≥ 161 WPM; pitch is graded by mean fundamental frequency, for male speakers low at ≤ 100 Hz, medium from 101 to 140 Hz, and high at ≥ 141 Hz, and for female speakers low at ≤ 180 Hz, medium from 181 to 240 Hz, and high at ≥ 241 Hz; the emotional features are quantitatively defined on 5-level scales of valence ($V$), arousal ($A$) and dominance ($D$), i.e. $V, A, D \in \{1, 2, 3, 4, 5\}$, and every label must satisfy an inter-annotator consistency of at least 70%; the paralinguistic features are assigned time-frequency detection thresholds: laughter lasts ≥ 200 ms with a distinct harmonic or burst structure, breath sounds last ≥ 150 ms as distinct unvoiced inhalation or exhalation segments, and sighs last ≥ 300 ms accompanied by marked energy decay and a falling fundamental-frequency trajectory; the start and end positions of the paralinguistic features are annotated, with duration precision kept to 0.001 s.
- 4. The context-adaptive multi-round interactive emotional speech synthesis method of claim 1, wherein in step S31 the MCD reflects the difference in spectral features by computing the Euclidean distance between the synthesized speech and the reference speech over the mel-frequency cepstral coefficients (MFCCs); the lower the value, the closer the generated speech is to the original speech in timbre and prosody. The calculation formula is: $\mathrm{MCD} = \frac{1}{T}\sum_{t=1}^{T}\frac{10}{\ln 10}\sqrt{2\sum_{d=1}^{D}\left(c_{t,d}-\hat{c}_{t,d}\right)^{2}}$, where $c_{t,d}$ is the $d$-th mel-cepstral coefficient of the original speech at frame $t$, $\hat{c}_{t,d}$ is the $d$-th mel-cepstral coefficient of the synthesized speech at frame $t$, $D$ is the number of feature dimensions, $T$ is the total number of frames, and the factor $10/\ln 10$ converts the units to decibels (a NumPy rendering follows the claims).
- 5. The context-adaptive multi-round interactive emotional speech synthesis method of claim 1, wherein in step S32 the global acoustic state template adopts the fixed sentence pattern "The speed of speech is [·], the pitch is [·], and the emotion is [·]", carrying the speaking-rate, pitch and emotion-type parameters; the local paralinguistic-feature event anchoring template adopts "The [·] comes after the word [·], duration: [·] seconds", mapping the trigger positions and durations of laughter, breath sounds and the like; and the stress control template is "Emphasize the word [·]", explicitly specifying the stress position.
- 6. The context-adaptive multi-round interactive emotional speech synthesis method of claim 1, wherein in step S33 the mathematical expression of the mapping layer is: $C = R(E(F))$, where $R$ is the mapping rule, $E$ is the feature-field extraction process, $F$ is the initial acoustic feature set, and $C$ is the control signal governing subsequent speech synthesis (a slot-filling sketch follows the claims).
- 7. The context-adaptive multi-round interactive emotional speech synthesis method of claim 1, wherein in step S4 the speech synthesis model is CosyVoice2 with the pre-trained model version CosyVoice2-0.5B, and the CosyVoice2-0.5B pre-trained model is prompt-fine-tuned so that natural-language prompts and the standardized acoustic control parameters jointly drive speech generation.
- 8. A context-adaptive multi-round interactive emotional speech synthesis system for implementing the context-adaptive multi-round interactive emotional speech synthesis method of any one of claims 1 to 7, characterized by comprising a data acquisition module, a context-adaptive feature prediction module, a feature-parameter mapping module and a speech synthesis module; the data acquisition module comprises a historical speech acquisition unit and a reply text receiving unit; the context-adaptive feature prediction module comprises a first-stage training unit and a second-stage training unit; the feature-parameter mapping module comprises a prompt optimization and evaluation unit, a mapping rule template construction unit, and a feature-field parsing and structured mapping unit; and the speech synthesis module comprises an input fusion unit, an emotion-prosody regulation unit and a speech generation unit.
- 9. The context-adaptive multi-round interactive emotional speech synthesis system of claim 8, wherein the data acquisition module is communicatively connected to the context-adaptive feature prediction module and the speech synthesis module, the context-adaptive feature prediction module is communicatively connected to the feature-parameter mapping module, and the feature-parameter mapping module is communicatively connected to the speech synthesis module.
- 10. The context-adaptive multi-round interactive emotional speech synthesis system of claim 8, wherein the speech synthesis module is communicatively connected to a conversation robot and outputs the target speech through the conversation robot.
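Claim 2 names Qwen-Audio as the base audio large model and LoRA as the fine-tuning method. The following Python sketch attaches low-rank adapters with Hugging Face peft; the checkpoint id, target modules and ranks are assumptions for illustration, not parameters disclosed by the patent.

```python
# Hedged sketch: LoRA adapters on an audio-capable causal LM (claim 2).
# The checkpoint id and target_modules below are assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-Audio-Chat",        # assumed checkpoint; the patent only says Qwen-Audio
    trust_remote_code=True,
)
config = LoraConfig(
    r=8,                                  # low-rank dimension (assumed)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```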
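The grading rules of claim 3 are plain threshold tables, so they translate directly into code. A minimal sketch with hypothetical function names; the numeric boundaries are taken from the claim:

```python
# Quantitative annotation thresholds from claim 3 (function names are illustrative).

def label_speaking_rate(wpm: float) -> str:
    """Three-level speaking-rate label from words per minute."""
    if wpm <= 130:
        return "slow"
    if wpm <= 160:
        return "medium"
    return "fast"  # >= 161 WPM

def label_pitch(mean_f0_hz: float, sex: str) -> str:
    """Three-level pitch label from mean fundamental frequency, with sex-specific bands."""
    low, high = (100.0, 140.0) if sex == "male" else (180.0, 240.0)
    if mean_f0_hz <= low:
        return "low"
    if mean_f0_hz <= high:
        return "medium"
    return "high"

print(label_speaking_rate(150))       # medium
print(label_pitch(250.0, "female"))   # high
```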
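The MCD formula of claim 4 has a direct NumPy rendering. The sketch below assumes the original and synthesized mel-cepstra are already time-aligned (for example by dynamic time warping), a step the claim does not spell out:

```python
import numpy as np

def mel_cepstral_distortion(ref: np.ndarray, syn: np.ndarray) -> float:
    """Frame-averaged MCD in dB; ref and syn are (T, D) mel-cepstral matrices."""
    assert ref.shape == syn.shape, "inputs must be time-aligned frame by frame"
    k = 10.0 / np.log(10.0)  # converts the log-spectral distance to decibels
    per_frame = k * np.sqrt(2.0 * np.sum((ref - syn) ** 2, axis=1))
    return float(per_frame.mean())

# Toy check: a slightly perturbed copy should give a small distortion.
rng = np.random.default_rng(0)
ref = rng.normal(size=(100, 13))
syn = ref + 0.1 * rng.normal(size=(100, 13))
print(f"MCD = {mel_cepstral_distortion(ref, syn):.2f} dB")
```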
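Claims 5 and 6 describe the mapping layer as template filling over extracted key fields, i.e. $C = R(E(F))$. The sketch below fills the three fixed-sentence templates from a feature dictionary; the dictionary schema and field names are illustrative assumptions, not the patent's actual data format:

```python
# Hedged sketch of the slot-mapping paradigm (claims 5 and 6).
GLOBAL_TMPL = "The speed of speech is {rate}, the pitch is {pitch}, and the emotion is {emotion}."
EVENT_TMPL = "The {event} comes after the word '{anchor}', duration: {duration:.3f} seconds."
STRESS_TMPL = "Emphasize the word '{word}'."

def extract_fields(features: dict) -> dict:
    """E: pull the key fields (feature variable, position, duration) out of
    the initial acoustic feature set F."""
    return {
        "rate": features["rate"],
        "pitch": features["pitch"],
        "emotion": features["emotion"],
        "events": features.get("events", []),
        "stress": features.get("stress", []),
    }

def map_to_control(features: dict) -> list[str]:
    """R(E(F)): render the standardized control signal C as template strings."""
    f = extract_fields(features)
    control = [GLOBAL_TMPL.format(**f)]
    control += [EVENT_TMPL.format(**e) for e in f["events"]]
    control += [STRESS_TMPL.format(word=w) for w in f["stress"]]
    return control

print("\n".join(map_to_control({
    "rate": "medium", "pitch": "high", "emotion": "happy",
    "events": [{"event": "laughter", "anchor": "really", "duration": 0.42}],
    "stress": ["really"],
})))
```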
Description
Context self-adaption-based multi-round interactive emotion voice synthesis method and system

Technical Field

The invention relates to the technical field of speech synthesis, and in particular to a context-adaptive multi-round interactive emotional speech synthesis method and system.

Background

Text-to-speech (TTS) synthesis is a core enabler of human-machine voice interaction and is widely applied in scenarios such as intelligent assistants, intelligent customer service and in-vehicle terminals; its development focuses on improving speech naturalness and the fidelity of emotional expression. In static single-sentence synthesis, with advances in end-to-end modeling and the fusion of large language models, representative models such as FastSpeech and VALL-E achieve high naturalness, and some models have gained preliminary emotion and style control by introducing feature-control mechanisms. In the typical interactive scenario of multi-round dialogue, however, speech synthesis still struggles to exploit historical information, which hurts contextual fit. Current research on multi-round interactive speech synthesis follows two main ideas. The first transcribes historical speech into text with automatic speech recognition (ASR), extracts emotional features from the text with sentiment analysis models, and blends them into subsequent synthesis, as in conversational TTS systems such as AudioGPT; but any speech-to-text (ASR) scheme inevitably loses the fine-grained paralinguistic features of the original speech, such as intonation, pauses and stress, which are the core carriers of emotional expression and contextual fit, so the transcription step tends to lose or blur emotional information and degrades both context understanding and the coherence of the synthesized speech across rounds. The second relies on feature-controllable speech synthesis, regulating the emotion and prosody of the synthesized speech through text prompts, reference-audio embedding vectors and the like, as in models such as Mellotron and PromptTTS. In addition, existing multi-round interactive speech synthesis systems use acoustic features inefficiently: even when some models try to introduce historical information, the lack of effective feature prediction and standardized mapping mechanisms makes it hard to convert the emotional and paralinguistic features of the historical context into control parameters that the synthesis system can use directly, further limiting the naturalness and emotional appropriateness of the synthesized speech.
Disclosure of Invention

The invention aims to provide a context-adaptive multi-round interactive emotional speech synthesis method and system that realize speech synthesis with dynamically adaptive emotion and prosody under multi-round interaction, regulating the acoustic features from the memory of the historical dialogue, so as to solve the problems of the prior art: the loss of fine-grained emotional information, the parameter disorder caused by the lack of standardization in feature-parameter mapping, and the insufficient inheritance of emotional and paralinguistic features across multi-round contexts.

To this end, the invention provides a context-adaptive multi-round interactive emotional speech synthesis method comprising the following steps:

Step S1, in an instant multi-round dialogue scene that takes adjacent single-round historical speech as the context window, acquiring the historical speech signal sequence and the current text to be synthesized in the multi-round dialogue, wherein the historical speech signal sequence does not need to undergo speech-to-text (ASR) processing;

Step S2, selecting a basic audio large model and constructing a context-adaptive feature predictor through two-stage training (a toy sketch of the second-stage pair construction follows this passage), which adapts to the context, predicts the emotion and paralinguistic features that the reply speech should carry, and constructs and outputs an initial acoustic feature set, wherein the two-stage training specifically comprises: first-stage training, which constructs a manually annotated data set covering acoustic features, specifically basic features, emotional features and paralinguistic features, formulates quantitative annotation standards so that every feature is annotated in natural-language form, and fine-tunes the audio large model on the manually annotated data set until the model can deconstruct the feature information of a single utterance
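To make the second training stage concrete, the toy sketch below builds the historical-speech/reply-feature pairs from one dialogue, using adjacent rounds as the context window; deconstruct is a hypothetical stub standing in for the fine-tuned stage-1 model $M_1$:

```python
# Hedged sketch: stage-2 training-pair construction (history X_t -> features of reply X_{t+1}).

def deconstruct(wav_path: str) -> str:
    """Stub for the stage-1 model M1: a natural-language description of one
    utterance's basic, emotional and paralinguistic features."""
    return f"<feature description of {wav_path}>"

def build_stage2_pairs(dialogue: list[str]) -> list[tuple[str, str]]:
    """Pair each utterance X_t with the deconstructed features of its reply X_{t+1}."""
    return [(dialogue[t], deconstruct(dialogue[t + 1]))
            for t in range(len(dialogue) - 1)]

turns = ["round1.wav", "round2.wav", "round3.wav"]
for history, reply_features in build_stage2_pairs(turns):
    print(history, "->", reply_features)
```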