
CN-121983021-A - Self-adaptive voice synthesis method based on multi-mode emotion feature fusion

CN 121983021 A

Abstract

The invention relates to the technical field of speech synthesis, and in particular to a self-adaptive speech synthesis method based on multi-modal emotion feature fusion. The method first obtains a target text sequence and an initial reference emotion audio signal and preprocesses them into a target phoneme sequence and a target reference emotion audio signal. A Transformer text encoder based on rotary position encoding extracts a target text feature sequence, while a style encoder extracts a time-varying style feature sequence capturing emotional intonation, prosody and timbre. A cross-attention mechanism accurately aligns the two modalities, a gating network performs deep fusion to obtain a multi-modal fusion condition feature, a duration predictor expands this into frame-level condition features, a conditional flow matching decoder generates a Mel spectrogram, and a vocoder finally converts the Mel spectrogram into a speech signal. The scheme addresses the pain points of stiff emotional expression, inaccurate feature alignment and insufficient real-time performance in the prior art; it achieves fine-grained emotional expression and efficient inference while preserving high-fidelity sound quality and natural prosody, is suitable for scenarios such as intelligent customer service and virtual digital humans, and markedly improves the human-machine interaction experience.

Inventors

  • QI JING
  • ZHOU PENG
  • LI TANG

Assignees

  • 深圳智驿未来科技有限公司

Dates

Publication Date
2026-05-05
Application Date
2026-02-11

Claims (10)

  1. A self-adaptive speech synthesis method based on multi-modal emotion feature fusion, characterized by comprising the following steps: S1, acquiring a target text sequence to be synthesized and an initial reference emotion audio signal, converting the target text sequence into a target phoneme sequence, resampling the initial reference emotion audio signal to a preset sampling rate, and removing silent segments to obtain a target reference emotion audio signal; S2, inputting the target phoneme sequence into a Transformer text encoder based on rotary position encoding to obtain a target text feature sequence, and inputting the target reference emotion audio signal into a style encoder to obtain a target reference emotion style feature sequence; S3, aligning and fusing the target text feature sequence and the target reference emotion style feature sequence through a cross-attention mechanism to obtain an aligned emotion feature sequence; S4, inputting the target text feature sequence and the aligned emotion feature sequence into a gating network for fusion to obtain a multi-modal fusion condition feature sequence; S5, splicing the multi-modal fusion condition feature sequence and the target text feature sequence along the channel dimension, inputting the result into a duration predictor to obtain the number of consecutive frames of each phoneme, and replicating each phoneme feature according to its number of frames to obtain frame-level condition features; S6, generating a Mel spectrogram from the frame-level condition features using a conditional flow matching decoder, and converting the generated Mel spectrogram into a speech signal through a vocoder.
  2. The method of claim 1, wherein inputting the target phoneme sequence into the Transformer text encoder based on rotary position encoding for encoding comprises: S201, mapping the target phoneme sequence into initial word vectors through an embedding layer, and performing preliminary feature extraction on the initial word vectors through a pre-net consisting of 3 one-dimensional convolution layers, each followed by a ReLU activation function and layer normalization, to obtain local context texture features; S202, inputting the local context texture features into the Transformer encoder based on rotary position encoding to obtain the target text feature sequence, wherein in each Transformer encoding layer, before multi-head attention is applied to the input features, the query and key vectors are rotated in the complex domain, q'_m = q_m · e^(i·m·θ) and k'_m = k_m · e^(i·m·θ), where θ is a preset rotation angle parameter, m is the position index of the feature in the sequence, q_m and k_m are the query vector and key vector at the m-th position before rotation, and q'_m and k'_m are the query vector and key vector at the m-th position after rotation; multi-head attention is then performed on the input features using the rotated query and key vectors (an illustrative sketch follows the claims).
  3. The adaptive speech synthesis method based on multi-modal emotion feature fusion of claim 1, wherein the style encoder comprises a pre-trained Wav2Vec2 style encoder, the Wav2Vec2 style encoder comprising a feature extractor composed of 7 one-dimensional convolution layers whose kernel sizes and strides decrease layer by layer; the target reference emotion audio signal is input into the feature extractor, which down-samples the time-domain waveform into a latent acoustic representation; the latent acoustic representation is input into a context network composed of 12 Transformer encoding layers, and the target reference emotion style feature sequence is finally output through a linear projection layer (an illustrative sketch follows the claims).
  4. The adaptive speech synthesis method based on multi-modal emotion feature fusion of claim 1, wherein aligning and fusing the target text feature sequence and the target reference emotion style feature sequence through the cross-attention mechanism comprises: S31, when the number M of features in the target reference emotion style feature sequence is smaller than the number N of features in the target text feature sequence, making the feature at the n-th position of the target text feature sequence correspond to the feature at the (n mod M)-th position of the target reference emotion style feature sequence, so that the target reference emotion style feature sequence is extended from M frames to N frames by cyclic tiling, where N denotes the number of features in the target text feature sequence and mod denotes the modulo operation; S32, mapping the target text feature sequence into a query vector Q through a linear layer, and mapping the extended target reference emotion style feature sequence into a key vector K and a value vector V through linear layers; S33, computing a correlation matrix A between the target text feature sequence and the target reference emotion style feature sequence from the query vector Q and the key vector K through the attention mechanism, A = softmax(Q·Kᵀ/√d_k), where A denotes the correlation matrix, √d_k denotes the scaling factor and ᵀ denotes the transpose; and S34, weighting the value vector V by dot product according to the correlation matrix to obtain the aligned emotion feature sequence (an illustrative sketch follows the claims).
  5. The adaptive speech synthesis method according to claim 1, wherein inputting the target text feature sequence and the aligned emotion feature sequence into the gating network for fusion comprises: S41, splicing the target text feature sequence and the aligned emotion feature sequence in the channel dimension to obtain a combined feature sequence; S42, inputting the combined feature sequence into the gating network to predict an adjustment coefficient matrix, wherein the gating network comprises a first linear layer, a SiLU activation function, a second linear layer and a Sigmoid activation function cascaded in sequence; S43, fusing the target text feature sequence and the aligned emotion feature sequence in a residual-connection manner according to the adjustment coefficient matrix to obtain the multi-modal fusion condition feature sequence, C = LayerNorm(H + G ⊙ A), where C denotes the multi-modal fusion condition feature sequence, H denotes the target text feature sequence, G denotes the adjustment coefficient matrix, A denotes the aligned emotion feature sequence, ⊙ denotes element-wise multiplication and LayerNorm denotes layer normalization (an illustrative sketch follows the claims).
  6. The adaptive speech synthesis method according to claim 1, wherein step S5 comprises: S51, splicing the multi-modal fusion condition feature sequence and the target text feature sequence in the channel dimension to obtain a duration prediction input tensor; S52, inputting the duration prediction input tensor into the duration predictor to predict the duration of each phoneme in the logarithmic domain, performing an exponential operation and rounding up on the predicted log-domain duration of each phoneme to obtain the target frame number of each phoneme, and replicating the feature corresponding to each phoneme in the target text feature sequence according to its target frame number to obtain the frame-level condition features, wherein the duration predictor comprises a feature extraction network and a fully connected network, the feature extraction network consists of 2 one-dimensional convolution blocks, each followed by a layer normalization layer and a Dropout layer, the output features of the feature extraction network serve as the input features of the fully connected network, and the fully connected network outputs the predicted duration of each phoneme in the logarithmic domain (an illustrative sketch follows the claims).
  7. The adaptive speech synthesis method based on multi-modal emotion feature fusion according to claim 1, wherein step S6 comprises: S61, sampling from a standard Gaussian noise distribution to generate an initial noise sample and initializing the time step; S62, inputting the current noise sample, the current time step and the frame-level condition features into the conditional flow matching decoder to predict the current velocity field vector; S63, computing the next noise sample according to the current velocity field vector; S64, repeatedly executing steps S62-S63 until the current time step reaches its end value, taking the finally obtained noise sample as the Mel spectrogram, and converting the generated Mel spectrogram into a speech signal through a vocoder (an illustrative sketch follows the claims).
  8. The adaptive speech synthesis method based on multi-modal emotion feature fusion of claim 7, wherein the conditional flow matching decoder comprises a noise mapping module, a time embedding module and N Transformer decoding layers, each Transformer decoding layer being correspondingly provided with an adaptive layer normalization module; the noise mapping module is used for mapping the noise sample into a high-dimensional hidden-layer feature through a linear layer; the time embedding module is used for converting the current time step into a high-dimensional vector through sinusoidal position encoding and then passing it through a multi-layer perceptron consisting of two linear layers and a SiLU activation function to obtain a timing-sensitive time embedding vector; the adaptive layer normalization module is used for extending the time embedding vector to the same sequence length as the frame-level condition features, adding it element-wise to the frame-level condition features, inputting the result through a SiLU activation into a linear regression layer, and outputting a scaling factor and a translation factor; the input features of each Transformer decoding layer are normalized, and the normalized input features are affine-transformed with the scaling factor and the translation factor to obtain modulated features, which are input into that Transformer decoding layer for decoding, wherein the input feature of the first Transformer decoding layer is the high-dimensional hidden-layer feature; the output features of the last Transformer decoding layer are mapped back to the Mel spectrogram dimension through layer normalization and a linear projection layer to obtain the velocity field vector (an illustrative sketch follows the claims).
  9. An adaptive speech synthesis system based on multi-modal emotion feature fusion, wherein the system comprises a memory and a processor, the memory is used for storing an application program, and the processor is used for running the application program and executing the adaptive speech synthesis method based on multi-modal emotion feature fusion according to any one of claims 1 to 8.
  10. A computer storage medium, wherein a remote monitoring program is stored on the computer storage medium, and when the remote monitoring program is executed by a processor, the adaptive speech synthesis method based on multi-modal emotion feature fusion according to any one of claims 1 to 8 is implemented.
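
The rotation described in claim 2 treats each channel pair of the query and key vectors as a complex number and multiplies it by e^(i·m·θ). Below is a minimal PyTorch sketch of that operation; the frequency schedule and the function name `apply_rope` are assumptions of this sketch, not details taken from the patent.

```python
# Minimal sketch of the rotary position encoding described in claim 2 (assumed details:
# PyTorch, adjacent channel pairs treated as complex numbers, Wav2Vec2-style frequency base).
import torch

def apply_rope(x: torch.Tensor, theta_base: float = 10000.0) -> torch.Tensor:
    """Rotate query/key vectors x of shape (batch, seq_len, dim) in the complex domain."""
    batch, seq_len, dim = x.shape
    half = dim // 2
    theta = theta_base ** (-torch.arange(half, dtype=torch.float32) / half)  # per-pair angles
    m = torch.arange(seq_len, dtype=torch.float32)                           # position indices
    angles = torch.outer(m, theta)                                           # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    # Complex rotation: (a + ib) * (cos + i*sin).
    rot_even = x_even * cos - x_odd * sin
    rot_odd = x_even * sin + x_odd * cos
    return torch.stack((rot_even, rot_odd), dim=-1).reshape(batch, seq_len, dim)

# Rotated queries/keys then feed standard multi-head attention:
# q, k = apply_rope(q), apply_rope(k)
```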
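Claim 3 uses a pre-trained Wav2Vec2 style encoder: a stack of one-dimensional convolutions that down-samples the waveform, followed by a Transformer context network and a linear projection. The sketch below shows only the convolutional feature extractor; the kernel sizes and strides follow the common Wav2Vec2 configuration and are assumptions of this sketch, since the patent only states that they decrease layer by layer.

```python
# Minimal sketch of the 7-layer convolutional feature extractor described in claim 3.
import torch
import torch.nn as nn

class ConvFeatureExtractor(nn.Module):
    def __init__(self, hidden: int = 512):
        super().__init__()
        kernels = (10, 3, 3, 3, 3, 2, 2)   # assumed Wav2Vec2-style settings
        strides = (5, 2, 2, 2, 2, 2, 2)
        layers, in_ch = [], 1
        for k, s in zip(kernels, strides):
            layers += [nn.Conv1d(in_ch, hidden, kernel_size=k, stride=s), nn.GELU()]
            in_ch = hidden
        self.conv = nn.Sequential(*layers)
        self.proj = nn.Linear(hidden, hidden)  # linear projection to the style feature space

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> latent acoustic representation (batch, frames, hidden)
        z = self.conv(wav.unsqueeze(1)).transpose(1, 2)
        # A 12-layer Transformer context network (per the claim) would be applied here.
        return self.proj(z)

style_feats = ConvFeatureExtractor()(torch.randn(1, 16000))  # ~1 s of 16 kHz audio
```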
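For claim 4, the style features are cyclically tiled to the text length by modulo indexing and then attended to by the text features. The following single-head sketch omits multi-head splitting and output projections; variable names and dimensions are illustrative.

```python
# Minimal sketch of the cyclic tiling and cross-attention alignment in claim 4.
import torch
import torch.nn as nn

def align(text_feats: torch.Tensor, style_feats: torch.Tensor, d_model: int = 256):
    # text_feats: (N, d), style_feats: (M, d); tile style features to N frames by modulo.
    N, M = text_feats.size(0), style_feats.size(0)
    idx = torch.arange(N) % M                        # position n -> n mod M (cyclic tiling)
    style_tiled = style_feats[idx]                   # (N, d)

    to_q, to_k, to_v = (nn.Linear(d_model, d_model) for _ in range(3))
    q, k, v = to_q(text_feats), to_k(style_tiled), to_v(style_tiled)

    attn = torch.softmax(q @ k.t() / d_model ** 0.5, dim=-1)  # correlation matrix (N, N)
    return attn @ v                                  # aligned emotion feature sequence (N, d)

aligned = align(torch.randn(100, 256), torch.randn(40, 256))
```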
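The gated residual fusion of claim 5 can be sketched as below, assuming features of shape (batch, seq_len, d); module and variable names are illustrative.

```python
# Minimal sketch of the gated residual fusion in claim 5.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, d: int = 256):
        super().__init__()
        # Gating network: Linear -> SiLU -> Linear -> Sigmoid, as listed in the claim.
        self.gate = nn.Sequential(nn.Linear(2 * d, d), nn.SiLU(), nn.Linear(d, d), nn.Sigmoid())
        self.norm = nn.LayerNorm(d)

    def forward(self, text_feats: torch.Tensor, aligned_emotion: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([text_feats, aligned_emotion], dim=-1)  # channel-dim concatenation
        g = self.gate(joint)                                      # adjustment coefficient matrix
        # Residual fusion: layer-normalised (text + gate * emotion).
        return self.norm(text_feats + g * aligned_emotion)

fused = GatedFusion()(torch.randn(2, 100, 256), torch.randn(2, 100, 256))
```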
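The duration prediction and length regulation of claim 6 amount to predicting a log-domain duration per phoneme, converting it to a frame count by exponentiation and rounding up, and repeating each phoneme feature accordingly. Channel sizes, kernel size and dropout rate below are assumptions of this sketch.

```python
# Minimal sketch of the duration predictor and length regulation in claim 6.
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    def __init__(self, d_in: int = 512, d_hidden: int = 256, dropout: float = 0.1):
        super().__init__()
        self.conv1 = nn.Conv1d(d_in, d_hidden, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(d_hidden, d_hidden, kernel_size=3, padding=1)
        self.norm1, self.norm2 = nn.LayerNorm(d_hidden), nn.LayerNorm(d_hidden)
        self.drop = nn.Dropout(dropout)
        self.fc = nn.Linear(d_hidden, 1)  # predicted duration per phoneme, in the log domain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_in) -> log-duration (batch, seq_len)
        h = self.drop(self.norm1(self.conv1(x.transpose(1, 2)).transpose(1, 2)))
        h = self.drop(self.norm2(self.conv2(h.transpose(1, 2)).transpose(1, 2)))
        return self.fc(h).squeeze(-1)

def length_regulate(text_feats: torch.Tensor, log_dur: torch.Tensor) -> torch.Tensor:
    # exp + ceil gives the target frame count; each phoneme feature is repeated that many times.
    frames = torch.clamp(torch.ceil(torch.exp(log_dur)), min=1).long()
    return torch.repeat_interleave(text_feats, frames, dim=0)

text, cond = torch.randn(1, 20, 256), torch.randn(1, 20, 256)
log_dur = DurationPredictor()(torch.cat([cond, text], dim=-1))
frame_level = length_regulate(text[0], log_dur[0])  # (total_frames, 256)
```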
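The sampling loop of claim 7 (steps S61 to S64) integrates the learned velocity field from Gaussian noise to a Mel spectrogram. The sketch below assumes a plain Euler update and a fixed number of steps, which are consistent with standard conditional flow matching but not stated explicitly in the text.

```python
# Minimal sketch of the conditional flow matching sampling loop in claim 7.
import torch

@torch.no_grad()
def sample_mel(decoder, cond, n_frames: int, n_mels: int = 80, n_steps: int = 10):
    """decoder(x, t, cond) is assumed to return the velocity field for sample x at time t."""
    x = torch.randn(1, n_frames, n_mels)   # S61: initial sample from a standard Gaussian
    dt = 1.0 / n_steps
    t = torch.zeros(1)
    for _ in range(n_steps):
        v = decoder(x, t, cond)            # S62: predict the current velocity field
        x = x + dt * v                     # S63: Euler update of the current sample
        t = t + dt                         # S64: repeat until the time step reaches 1
    return x                               # the final sample is taken as the Mel spectrogram

# mel = sample_mel(cfm_decoder, frame_level_cond, n_frames=frame_level_cond.size(0))
# wav = vocoder(mel)  # the vocoder converts the Mel spectrogram into a speech signal
```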
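The adaptive layer normalization of claim 8 derives a scaling factor and a translation factor from the time embedding and the frame-level condition features, then modulates the normalized input of each decoding layer. Hidden sizes and the exact arrangement of the conditioning layers below are assumptions of this sketch.

```python
# Minimal sketch of the adaptive layer normalization described in claim 8.
import torch
import torch.nn as nn

class AdaptiveLayerNorm(nn.Module):
    def __init__(self, d: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(d, elementwise_affine=False)
        # SiLU followed by a linear (regression) layer producing scale and shift.
        self.to_scale_shift = nn.Sequential(nn.SiLU(), nn.Linear(d, 2 * d))

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x, cond: (batch, frames, d); t_emb: (batch, d), broadcast to every frame.
        c = cond + t_emb.unsqueeze(1)                        # extend time embedding and add
        scale, shift = self.to_scale_shift(c).chunk(2, dim=-1)
        return scale * self.norm(x) + shift                  # affine modulation of normalized input

h = AdaptiveLayerNorm()(torch.randn(1, 120, 256), torch.randn(1, 256), torch.randn(1, 120, 256))
```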

Description

Self-adaptive voice synthesis method based on multi-mode emotion feature fusion

Technical Field

The invention belongs to the technical field of speech synthesis, and particularly relates to a self-adaptive speech synthesis method based on multi-modal emotion feature fusion.

Background

Speech synthesis is one of the core technologies of natural human-machine interaction and is applied in many key scenarios such as intelligent customer service, virtual digital humans, audiobooks and assistive communication devices. As users' demands for naturalness and personalization of the interaction experience keep rising, basic text-to-speech conversion is no longer sufficient; adaptive speech synthesis with fine-grained emotional expression that can accurately match text semantics and scene atmosphere has become a key direction for breaking through the bottleneck of the prior art and improving user experience. Emotional speech not only strengthens the expressiveness of information delivery but also reduces the sense of detachment in human-machine interaction, so upgrading this technology is of great practical significance for expanding the application boundary of speech synthesis.

Existing speech synthesis technology has obvious limitations in emotion modeling and multi-modal fusion. In emotional expression, most schemes model emotion with a global style vector, i.e. they extract the overall emotional characteristics of the reference audio and apply them globally to the synthesis of the whole sentence, implicitly assuming that the emotion of the whole sentence is constant; they therefore cannot capture the dynamic emotional fluctuations along the time dimension that occur in real communication, such as changes of stress and gradual shifts of emotion. In multi-modal feature fusion, there is a natural time-scale mismatch between the discrete phoneme features of the text and the continuous acoustic features of the reference audio; the prior art fuses them by simple concatenation or direct addition, which makes it difficult to achieve an accurate phoneme-to-acoustic-frame correspondence and easily leads to disordered prosody and a disconnection between emotion and semantics in the synthesized speech. In the trade-off between sound quality and real-time performance, diffusion-model-based schemes can improve synthesis quality, but their reverse denoising process requires hundreds of iteration steps, so inference is slow and cannot meet the requirements of scenarios such as real-time response in intelligent customer service and interaction with virtual digital humans. These problems cause defects such as stiff emotional expression, unnatural prosody and insufficient real-time performance in existing synthesized speech, seriously affecting user experience. Therefore, an adaptive speech synthesis scheme is needed that achieves dynamic emotion modeling and precise alignment of multi-modal features while combining high-fidelity sound quality with real-time inference efficiency, so as to solve the pain points of the prior art.
Disclosure of Invention

In order to solve the problems in the background art, one aspect of the present invention provides an adaptive speech synthesis method based on multi-modal emotion feature fusion, which comprises the following steps: S1, acquiring a target text sequence to be synthesized and an initial reference emotion audio signal, converting the target text sequence into a target phoneme sequence, resampling the initial reference emotion audio signal to a preset sampling rate, and removing silent segments to obtain a target reference emotion audio signal; S2, inputting the target phoneme sequence into a Transformer text encoder based on rotary position encoding to obtain a target text feature sequence, and inputting the target reference emotion audio signal into a style encoder to obtain a target reference emotion style feature sequence; S3, aligning and fusing the target text feature sequence and the target reference emotion style feature sequence through a cross-attention mechanism to obtain an aligned emotion feature sequence; S4, inputting the target text feature sequence and the aligned emotion feature sequence into a gating network for fusion to obtain a multi-modal fusion condition feature sequence; S5, splicing the multi-modal fusion condition feature sequence and the target text feature sequence along the channel dimension, inputting the result into a duration predictor to obtain the number of consecutive frames of each phoneme, and replicating each phoneme feature according to its number of frames to obtain frame-level condition features; S6, generating a Mel spectrogram from the frame-level condition features using a conditional flow matching decoder, and converting the generated Mel spectrogram into a speech signal through a vocoder. An illustrative end-to-end sketch of these steps is given below.
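
To make the sequence of steps S1 to S6 concrete, the following sketch strings the components described above together. Every module name here is a placeholder standing in for the encoders, fusion blocks, predictor, decoder and vocoder described in the claims; it is not an implementation provided by the patent.

```python
# Illustrative end-to-end forward pass for steps S1-S6; all modules are placeholders.
def synthesize(text, ref_audio, frontend, text_encoder, style_encoder,
               cross_attention, gated_fusion, duration_predictor, cfm_decoder, vocoder):
    phonemes = frontend.to_phonemes(text)                  # S1: text -> phoneme sequence
    ref = frontend.resample_and_trim(ref_audio)            # S1: resample, remove silence
    h_text = text_encoder(phonemes)                        # S2: RoPE Transformer text encoder
    s_style = style_encoder(ref)                           # S2: time-varying style features
    a_align = cross_attention(h_text, s_style)             # S3: cross-attention alignment
    c_fused = gated_fusion(h_text, a_align)                # S4: gated multi-modal fusion
    c_frame = duration_predictor.regulate(c_fused, h_text) # S5: frame-level condition features
    mel = cfm_decoder.sample(c_frame)                      # S6: conditional flow matching
    return vocoder(mel)                                    # S6: Mel spectrogram -> waveform
```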