
CN-121983020-A - Description-guided zero-shot speech synthesis method and system


Abstract

The invention relates to the technical field of speech synthesis, and in particular to a description-guided zero-shot speech synthesis method and system. Zero-shot speech synthesis is performed based on conditional flow matching and a diffusion transformer: a target acoustic portrait is constructed from a text description and used to precisely guide the speech sampling process, so that high-fidelity speech generation is achieved without fine-tuning on data from any specific speaker.

Inventors

  • Chai Xiunan
  • Zheng Junjie
  • Chen Zihao
  • Ding Chaofan

Assignees

  • 巨人移动技术有限公司 (Giant Mobile Technology Co., Ltd.)

Dates

Publication Date
2026-05-05
Application Date
2026-02-06

Claims (9)

  1. A description-guided zero-shot speech synthesis method, comprising the steps of: S1, receiving an externally input text sequence to be synthesized, an audio prompt, a description prompt, and a generation-duration parameter; S2, extracting a timbre feature vector from the audio prompt and a description-hint embedding vector from the description prompt; S3, decoding the description-hint embedding vector into an acoustic portrait vector by means of an acoustic portrait mapping module, the acoustic portrait vector defining an explicit acoustic parameter space for the generated speech; S4, using the description-hint embedding vector as a query matrix in a multi-modal fusion module to retrieve matching acoustic details from the timbre feature vector and generate an intent-first joint condition vector; S5, initializing a sampling canvas according to the generation-duration parameter, and performing feature rendering with a diffusion transformer during conditional-flow-matching sampling, combining the joint condition vector and the acoustic portrait vector under a distribution-anchor constraint; S6, during the sampling iterations, correcting in real time, through a sampling-trajectory dynamic calibration mechanism, the deviation of the intermediate latent variable's attribute distribution from the description hint; and S7, completing full-path sampling according to the generation-duration parameter, and reconstructing and outputting the target speech waveform to complete speech synthesis.
  2. The description-guided zero-shot speech synthesis method of claim 1, wherein the description-hint embedding vector learns an acoustic prior in advance through internalization distillation: in the training stage, an external acoustic teacher model extracts a global acoustic representation of the target speech online as a teacher signal, and the encoding branch corresponding to the description prompt is forced to learn the mapping from the text description to the acoustic representation space.
  3. The description-guided zero-shot speech synthesis method of claim 2, wherein the processing logic of the multi-modal fusion module comprises: using a gated cross-attention mechanism with the description-hint embedding vector as the query matrix to retrieve matching acoustic details from the timbre feature vector, adjusting the fusion weight between the description prompt and the audio prompt through a learnable gating coefficient, and establishing the execution priority of the description prompt over the regulation of global speech attributes (see the fusion sketch following the claims).
  4. The description-guided zero-shot speech synthesis method of claim 3, wherein the acoustic representation vector comprises at least a fundamental-frequency mean prior, and the acoustic portrait vector injects an acoustic bias into the pitch-prediction branch inside the generator, so that the per-frame pitch generation trajectory is forced by the distribution-anchor constraint to stay locked within a preset acoustic frequency interval.
  5. The description-guided zero-shot speech synthesis method of claim 4, wherein the sampling-trajectory dynamic calibration mechanism is implemented as follows: at each step of the sampling iteration, a consistency score between the intermediate latent variable at the current moment and the description-hint embedding vector is computed; if an acoustic attribute is detected to deviate from the acoustic portrait vector, the gradient of the score with respect to the current latent variable is computed and used as a correction displacement to apply a path correction to the sampling trajectory (see the sampling sketch following the claims).
  6. The description-guided zero-shot speech synthesis method of claim 5, wherein the generator employs a long-sequence architecture based on rotary position encoding, adapted to the variable-length sampling canvas defined by the externally input generation-duration parameter (see the RoPE sketch following the claims).
  7. The description-guided zero-shot speech synthesis method of claim 6, wherein the generation-duration parameter directly defines the noise sequence length in the latent space, and the system maintains consistency between the acoustic characteristics and the description intent over that length range.
  8. A description-guided zero-shot speech synthesis system, characterized in that it is built using the description-guided zero-shot speech synthesis method of any one of claims 1-7, the system comprising: an input processing module configured to extract timbre features, encode description semantics, and internalize acoustic-feature perception; an acoustic portrait mapping module configured to convert description semantics into an explicit acoustic parameter prior that guides the generation process; a multi-modal fusion module configured to retrieve audio details with the description intent as a guide and to arbitrate the control priority of the bimodal information; a conditional-flow-matching generation module configured to perform flow-matching sampling within a specified-duration canvas under the constraint of the acoustic portrait vector; a sampling-trajectory calibration module configured to monitor and correct acoustic-attribute drift in real time during the sampling iterations; and a waveform reconstruction module configured to restore the generated spectral features into target speech conforming to the described portrait.
  9. A sampling-trajectory calibration method for generative speech synthesis, characterized in that the calibration is performed using the description-guided zero-shot speech synthesis system of claim 8, the method comprising the steps of: extracting an instantaneous acoustic feature map of the current intermediate latent variable at each discrete sampling step of the ordinary differential equation solver; computing a semantic-consistency gradient between the instantaneous acoustic feature map and a preset description condition; and applying a displacement compensation to the intermediate latent variable using that gradient, so as to intervene immediately in the evolution direction of the sampling trajectory and ensure that the sampling path converges to an acoustic cluster conforming to the preset description condition.
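
The gated cross-attention fusion of claim 3 can be pictured with a minimal PyTorch sketch. This is an illustrative reading, not the patent's implementation: the module name, dimensions, and the tanh-bounded scalar gate are all assumptions.

```python
# A minimal sketch of gated cross-attention fusion (claim 3), assuming
# PyTorch; names and the gating form are illustrative assumptions.
import torch
import torch.nn as nn

class GatedCrossAttentionFusion(nn.Module):
    """Fuses a description embedding (query) with timbre features (key/value)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learnable gating coefficient, initialised near zero so training
        # starts description-dominant (intent-first priority).
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, desc_emb: torch.Tensor, timbre: torch.Tensor) -> torch.Tensor:
        # desc_emb: (B, T_d, dim) description-hint embedding, used as query
        # timbre:   (B, T_a, dim) timbre feature vectors, used as key/value
        retrieved, _ = self.attn(query=desc_emb, key=timbre, value=timbre)
        # The bounded gate scales how much retrieved audio detail is mixed
        # in; the description intent remains the backbone of the condition.
        return desc_emb + torch.tanh(self.gate) * retrieved
```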
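Claims 5 and 9 describe a gradient-based correction applied inside the ODE sampling loop. The sketch below shows one plausible form under stated assumptions: `velocity_model` (standing in for the diffusion transformer) and `consistency_score` (a latent-versus-description scorer) are hypothetical, and a plain Euler solver is assumed.

```python
# A minimal sketch of conditional-flow-matching sampling with dynamic
# trajectory calibration (claims 1, 5, 9); module names are hypothetical.
import torch

@torch.no_grad()
def sample_with_calibration(velocity_model, consistency_score, cond, portrait,
                            length: int, dim: int, steps: int = 32,
                            guide_scale: float = 0.1):
    """Euler ODE sampling over a duration-defined canvas with path correction."""
    # S5: the generation-duration parameter defines the noise canvas length.
    x = torch.randn(1, length, dim)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt)
        # Velocity field conditioned on the joint condition vector and the
        # acoustic portrait (distribution-anchor constraint).
        v = velocity_model(x, t, cond, portrait)
        # Claim 9: score the intermediate latent against the description
        # condition and use its gradient as a displacement compensation.
        with torch.enable_grad():
            x_req = x.detach().requires_grad_(True)
            score = consistency_score(x_req, cond)
            grad = torch.autograd.grad(score.sum(), x_req)[0]
        x = x + dt * v + guide_scale * dt * grad  # corrected Euler step
    return x  # final latent; a vocoder reconstructs the waveform (S7)
```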
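The rotary position encoding named in claim 6 is a known technique; the generic formulation below illustrates why it adapts to variable-length canvases, and is not specific to the patent's architecture.

```python
# Generic rotary position embedding (RoPE) sketch for claim 6's
# long-sequence architecture; a standard formulation, not the patent's.
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (B, T, D) with even D; rotate feature pairs by position-dependent angles.
    B, T, D = x.shape
    half = D // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(T, dtype=torch.float32)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Positions enter only through relative rotations, so the same weights
    # extrapolate to canvases longer than those seen in training.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```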

Description

Description-guided zero-shot speech synthesis method and system

Technical Field

The invention relates to the technical field of speech synthesis, and in particular to a description-guided zero-shot speech synthesis method and system.

Background

In the zero-shot speech synthesis (Zero-Shot TTS) task, the system aims to capture timbre details from an extremely short audio prompt and generate target speech that also satisfies the macroscopic attributes (such as gender, intonation, and speaking rate) set by a description prompt. The prior art, however, has the following bottlenecks:

(1) First, multi-modal prompt fusion logic mostly adopts a symmetric design. The audio prompt supplies "timbre features" in the microscopic dimension, while the description prompt supplies "acoustic portrait instructions" in the macroscopic dimension. A symmetric fusion mechanism fails to prioritize between the two, so when the prompts conflict (for example, the reference audio is deep and muffled but the description demands a clear, crisp voice), the generated result often deviates from the user's intent, and instruction-following performance is poor.

(2) Second, the generation and sampling processes based on flow matching or diffusion models are mostly open-loop and lack a real-time calibration mechanism. When the speaking rate is adjusted according to an externally input duration parameter, excessive compression or stretching of the time axis readily causes nonlinear drift of the sampling trajectory in the latent space. Such drift often manifests as a collapse of physical attributes (for example, a voice originally anchored as female drifting toward male acoustics at fast speaking rates), making it difficult to maintain stable acoustic-feature consistency under variable-length generation.

There is therefore a need for a description-guided zero-shot speech synthesis method and system that solve the problems of intent-following failure and sampling-trajectory drift.

Disclosure of Invention

The invention aims to provide a description-guided zero-shot speech synthesis method and system that keep the generated result aligned with the user's intent when prompt information conflicts, and that achieve robust timbre consistency for generation tasks of arbitrary duration.
To solve the problems in the prior art, the invention provides a description-guided zero-shot speech synthesis method comprising the following steps: S1, receiving an externally input text sequence to be synthesized, an audio prompt, a description prompt, and a generation-duration parameter; S2, extracting a timbre feature vector from the audio prompt and a description-hint embedding vector from the description prompt; S3, decoding the description-hint embedding vector into an acoustic portrait vector by means of an acoustic portrait mapping module, the acoustic portrait vector defining an explicit acoustic parameter space for the generated speech; S4, using the description-hint embedding vector as a query matrix in a multi-modal fusion module to retrieve matching acoustic details from the timbre feature vector and generate an intent-first joint condition vector; S5, initializing a sampling canvas according to the generation-duration parameter, and performing feature rendering with a diffusion transformer during conditional-flow-matching sampling, combining the joint condition vector and the acoustic portrait vector under a distribution-anchor constraint; S6, during the sampling iterations, correcting in real time, through a sampling-trajectory dynamic calibration mechanism, the deviation of the intermediate latent variable's attribute distribution from the description hint; and S7, completing full-path sampling according to the generation-duration parameter, and reconstructing and outputting the target speech waveform to complete speech synthesis.

Optionally, in the description-guided zero-shot speech synthesis method, the description-hint embedding vector learns an acoustic prior in advance through internalization distillation: in the training stage, an external acoustic teacher model extracts a global acoustic representation of the target speech online as a teacher signal, and the encoding branch corresponding to the description prompt is forced to learn the mapping from the text description to the acoustic representation space (a training sketch follows below).

Optionally, in the description-guided zero-shot speech synthesis method, the processing logic of the multi-modal fusion module includes: using a gated cross-attention mechanism with the description-hint embedding vector as the query matrix to retrieve matching acoustic details from the timbre feature vector, adjusting the fusion weight between the description prompt and the audio prompt through a learnable gating coefficient, and establishing the execution priority of the description prompt over the regulation of global speech attributes.
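
The internalization distillation described above (and in claim 2) reduces, in essence, to regressing a text-side embedding onto a frozen acoustic teacher's embedding. The sketch below is a minimal training step under stated assumptions: `desc_encoder` and `acoustic_teacher` are hypothetical stand-ins for the patent's description encoding branch and external acoustic teacher model, and the MSE objective is an assumption.

```python
# A minimal sketch of one internalization-distillation step (claim 2);
# module names are hypothetical, and MSE is an assumed objective.
import torch
import torch.nn.functional as F

def distillation_step(desc_encoder, acoustic_teacher, optimizer,
                      desc_tokens, target_wave):
    optimizer.zero_grad()
    with torch.no_grad():
        # Teacher signal: global acoustic representation of the target voice,
        # extracted online by the frozen external teacher.
        teacher_emb = acoustic_teacher(target_wave)
    # Force the description branch to map text into the acoustic space.
    student_emb = desc_encoder(desc_tokens)
    loss = F.mse_loss(student_emb, teacher_emb)
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time the teacher is discarded: the description encoder alone supplies the acoustic prior, which is what lets the system build an acoustic portrait from text with no reference audio for the described attributes.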