CN-121983069-A - Sound cloning method, system, electronic equipment and medium

Abstract

The invention relates to the technical field of audio data processing, and in particular to a sound cloning method, system, electronic equipment and medium. The method comprises: forming a stable tone anchor segment library from the input reference voice; performing front-end processing on the target text; weighting the anchor segments and superimposing them on the basic speaker condition vector of a window to form an enhanced speaker condition vector; synthesizing candidate acoustic results based on the enhanced speaker condition vector; extracting a speaker state characterization from the candidate acoustic results and calculating the continuity residual between it and the state-consistent anchor center vector; letting a continuity memory participate, in an attenuated manner, in the speaker condition injection of subsequent windows; and inputting the acoustic results after partial backfill into a vocoder to generate and output the target voice waveform. The invention addresses sound cloning under the combined conditions that the reference voice is of limited duration and uneven quality, the target text contains multiple pause boundaries, and near-real-time output is required.

Inventors

  • HE YUNCHANG
  • YANG GUANGWEI
  • LI WEI
  • LIU BO
  • ZHOU LI

Assignees

  • 重庆麦芽传媒有限公司

Dates

Publication Date
20260505
Application Date
20260407

Claims (10)

  1. A sound cloning method, the method comprising: dividing the input reference voice into segments, calculating a stability score for each segment, assigning the segments that meet the admission conditions to state categories according to their sounding state characteristics, and constructing an anchor center vector for each state category by weighting according to the stability scores of the segments, to form a stable tone anchor segment library; selecting a state-consistent anchor segment subset from the stable tone anchor segment library according to the target sounding state of the window, weighting each anchor segment and superimposing the result on the basic speaker condition vector of the window to form an enhanced speaker condition vector, and synthesizing candidate acoustic results based on the enhanced speaker condition vector; when the continuity residual and the boundary risk score exceed their respective preset thresholds, performing partial backfill resynthesis only on frames within a limited range before and after the boundary window, generating a correction vector, and writing the correction vector into a continuity memory, the continuity memory participating in the speaker condition injection of subsequent windows in an attenuated manner; and inputting the acoustic result after the partial backfill into a vocoder, generating a target voice waveform and outputting the target voice waveform.
  2. The sound cloning method according to claim 1, wherein segmenting the input reference voice, calculating a stability score for each segment, assigning the segments that meet the admission conditions to state categories according to their sounding state characteristics, and constructing the anchor center vector of each state category by weighting according to the stability scores of the segments comprises: for each segment obtained by segmenting the reference voice, extracting a harmonicity index, a formant continuity index, a voicing confidence, a noise ratio and an adjacent-frame spectral jump quantity, wherein the harmonicity index is calculated by a short-time autocorrelation function, the formant continuity index is calculated as the normalized reciprocal of the Euclidean distance between the formant trajectories of adjacent frames, the voicing confidence is obtained by joint judgment of the frame-level short-time energy and the zero-crossing rate, the noise ratio is calculated from the spectral flatness, and the adjacent-frame spectral jump quantity is calculated as the mean cosine distance between the mel spectra of adjacent frames; calculating the stability score of each segment from these quantities; dividing the sounding state categories into a continuous voiced segment, a post-pause onset segment and a weak-energy voiced segment; for continuous voiced segments and weak-energy voiced segments, taking the stability score exceeding a preset general stability threshold as the admission condition; for post-pause onset segments, taking the adjacent-frame spectral jump quantity exceeding a preset spectral jump threshold and the voicing confidence lying within a preset voicing confidence interval as the admission condition, with a preset onset stability lower limit on the stability score as an auxiliary stability constraint; the total number of admitted anchor segments being no greater than the total number of segments, and each admitted anchor segment having a speaker embedding vector, a stability score and an integer state category to which it belongs; and, for each state category, constructing the anchor center vector of that state category from the speaker embedding vectors of all admitted anchor segments belonging to it, weighted by their stability scores.
  3. The sound cloning method according to claim 2, wherein selecting the state-consistent anchor segment subset from the stable tone anchor segment library according to the target sounding state of the window comprises: for each admitted anchor segment, extracting a three-dimensional state vector from its audio signal, the three components being an onset energy slope, a steady-state energy mean and a silence ratio, wherein the onset energy slope is obtained by normalizing the linear regression slope of the frame-level root-mean-square energy sequence of the initial portion of the segment; the steady-state energy mean is obtained by normalizing the mean frame-level root-mean-square energy of the middle portion of the segment relative to the maximum frame-level root-mean-square energy within the segment; and the silence ratio is the ratio of the number of frames whose voicing confidence is below a preset voicing decision threshold to the total number of frames of the segment; for each boundary window in the target text, outputting a three-dimensional target state vector of the window by a prosody prediction model, its three components corresponding respectively to the three components of the three-dimensional state vector; weighting the three-dimensional state vectors of the anchor segments in each state category by their stability scores to construct a three-dimensional state anchor center for that state category; calculating the state distance between the three-dimensional state vector of each admitted anchor segment and the three-dimensional target state vector of each boundary window; calculating the Euclidean distance between the three-dimensional target state vector of the window and the three-dimensional state anchor center of each state category, and taking the state category with the minimum Euclidean distance as the sounding state category to which the window belongs; and taking the admitted anchor segments of that state category as the state-consistent anchor segment subset.
  4. The sound cloning method according to claim 3, wherein calculating a boundary risk score for each boundary window comprises: setting an initial value of the preceding drift trend for the first boundary window; for each boundary window, outputting a punctuation prosodic boundary trigger quantity by a text front-end analysis module, outputting a predicted energy drop and a predicted fundamental frequency reset amplitude by the prosody prediction model, and outputting a silence-to-voiced switching strength by a pause detection module; taking the maximum of these four quantities as the instantaneous risk dominant quantity of the current window, and multiplying it by a preceding-drift amplification term, which uses a preset preceding-drift amplification factor, to calculate the boundary risk score of the window; when the boundary risk score exceeds a preset risk threshold, calculating, for each anchor segment in the state-consistent anchor segment subset, a quality penalty term and a weighting coefficient; and superimposing the weighted result on the basic speaker condition vector output by the conventional acoustic model to form the enhanced speaker condition vector.
  5. The sound cloning method according to claim 4, wherein extracting the speaker state characterization from the candidate acoustic results, calculating the continuity residual between the speaker state characterization and the state-consistent anchor center vector, and performing partial backfill resynthesis only on frames within a limited range before and after the boundary window when the continuity residual and the boundary risk score simultaneously exceed their respective preset thresholds comprises: synthesizing the candidate acoustic result of the boundary window from the enhanced speaker condition vector, the acoustic decoder synchronously outputting the acoustic log probability of each frame of the window during synthesis, and averaging these log probabilities to obtain the confidence of the candidate window; re-extracting the speaker state characterization from the candidate acoustic result by a speaker encoder; calculating the continuity residual between the speaker state characterization and the anchor center vector corresponding to the state category of the window; when the continuity residual exceeds a preset residual threshold and the boundary risk score exceeds the preset risk threshold, calculating a gating strength, in which a very small positive number is used, and generating a correction vector; within the range from a preset number of frames before the boundary window to a preset number of frames after it, performing partial backfill resynthesis after adjusting the speaker condition with the correction vector, the remaining frames not being recalculated; and when the continuity residual or the boundary risk score does not exceed its preset threshold, letting the correction vector be a zero vector.
  6. The sound cloning method according to claim 5, wherein writing the correction vector into the continuity memory, the continuity memory participating in the speaker condition injection of subsequent windows in an attenuated manner, comprises: setting the initial value of the continuity memory to a zero vector; after the processing of each boundary window is finished, updating the continuity memory using a preset attenuation factor, wherein, when a window does not trigger backfill, its correction vector is a zero vector and the recursion degenerates into natural memory decay; and, in constructing the speaker condition of the subsequent boundary window, superimposing the continuity memory on the enhanced speaker condition vector.
  7. The sound cloning method according to claim 5, further comprising adaptively updating the preset residual threshold, comprising: recording the continuity residual values of the most recent boundary windows that triggered partial backfill and calculating their sliding average; and calculating the updated residual threshold from the residual threshold before updating and the sliding average using a preset smoothing coefficient, such that the threshold increases when the sliding average rises and decreases when the sliding average falls.
  8. A sound cloning system, the system comprising: a preprocessing module for segmenting the input reference voice, calculating a stability score for each segment, assigning the segments that meet the admission conditions to state categories according to their sounding state characteristics, and constructing an anchor center vector for each state category by weighting according to the stability scores of the segments, to form a stable tone anchor segment library; an evaluation module for performing front-end processing on the target text and calculating a boundary risk score for each boundary window, wherein, for a boundary window whose boundary risk score exceeds a preset risk threshold, a state-consistent anchor segment subset is selected from the stable tone anchor segment library according to the target sounding state of the window, each anchor segment is weighted and the result is superimposed on the basic speaker condition vector of the window to form an enhanced speaker condition vector, and candidate acoustic results are synthesized based on the enhanced speaker condition vector; a feature extraction module for extracting a speaker state characterization from the candidate acoustic results and calculating the continuity residual between the speaker state characterization and the state-consistent anchor center vector, wherein, when the continuity residual and the boundary risk score exceed their respective preset thresholds, partial backfill resynthesis is performed only on frames within a limited range before and after the boundary window, a correction vector is generated and written into a continuity memory, and the continuity memory participates in the speaker condition injection of subsequent windows in an attenuated manner; and an output module for inputting the acoustic result after the partial backfill into a vocoder, generating a target voice waveform and outputting the target voice waveform.
  9. An electronic device, comprising: a memory for storing a computer program; and a processor for implementing the sound cloning method of any one of claims 1-7 when executing the program stored in the memory.
  10. A medium having stored thereon a computer program which, when executed by a processor, implements the sound cloning method of any one of claims 1-7.
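
Three of the operations recited in the claims can be illustrated with a minimal sketch: the stability-weighted anchor center (claim 2), the choice of a window's state category by minimum Euclidean distance (claim 3), and the decaying continuity memory (claim 6). The formulas of the claims are not reproduced in this text, so the normalized weighted mean and the recursion `decay * memory + correction` below are assumptions, and all function names and numeric values are hypothetical.

```python
import math

def weighted_center(vectors, scores):
    """Stability-weighted mean of vectors.
    Dividing by the sum of scores is an assumption, not the claimed formula."""
    total = sum(scores)
    dim = len(vectors[0])
    return [sum(s * v[d] for v, s in zip(vectors, scores)) / total
            for d in range(dim)]

def nearest_state(target_state, state_centers):
    """Return the state category whose 3-D state anchor center has the
    smallest Euclidean distance to the window's target state vector."""
    return min(state_centers,
               key=lambda k: math.dist(target_state, state_centers[k]))

def update_memory(memory, correction, decay):
    """One step of a decaying memory (assumed form): decay * memory + correction.
    With a zero correction vector this reduces to pure decay."""
    return [decay * m + c for m, c in zip(memory, correction)]

# Hypothetical data: per state category, 3-D state vectors and stability scores.
states = {
    0: ([[0.1, 0.8, 0.05], [0.2, 0.9, 0.10]], [0.9, 0.6]),
    1: ([[0.7, 0.5, 0.30], [0.8, 0.4, 0.35]], [0.8, 0.7]),
}
centers = {k: weighted_center(v, s) for k, (v, s) in states.items()}
window_state = nearest_state([0.75, 0.45, 0.3], centers)

memory = [0.0, 0.0]                                     # initial value: zero vector
memory = update_memory(memory, [0.2, -0.1], decay=0.5)  # window that triggered backfill
memory = update_memory(memory, [0.0, 0.0], decay=0.5)   # no backfill: memory only decays
```

In this sketch the memory after the second window is simply half of the written correction, which mirrors the "natural memory decay" behavior described for windows that do not trigger backfill.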

Description

Sound cloning method, system, electronic equipment and medium

Technical Field

The invention relates to the technical field of audio data processing, and in particular to a sound cloning method, system, electronic equipment and medium.

Background

Voice cloning technology extracts the voice characteristics of a speaker from reference voice so that a text-to-speech system can synthesize arbitrary text in the voice of the target speaker. The technology is widely applied in scenarios such as digital human broadcasting, intelligent customer service, audiobook production and long-text reading.

In some application scenarios, the reference voice is recorded ad hoc by the user or clipped from an existing recording. Its duration generally does not exceed fifteen seconds, the recording environment is uncontrolled, and it contains unstable segments such as breath intake sounds, background noise, emotional fluctuation or a rising interrogative tail. The target text to be converted, however, is long: it contains multiple commas, periods, interrogative pauses, parallel structures or consecutive unvoiced consonants, and the synthesized duration is typically tens of seconds or even minutes. Applications such as digital human broadcasting and online customer service also place strict limits on output latency, so the whole sentence or whole passage cannot be resynthesized each time a timbre problem occurs.

Existing voice cloning systems typically compress the whole reference voice into a global static speaker vector and inject it as a fixed condition into the acoustic model throughout the synthesis process. This approach works well in short-text, single-sentence synthesis scenarios.

However, during long-text synthesis, the internal hidden state of the acoustic decoder may be briefly reset when it processes pause positions, unvoiced consonant clusters and re-onset positions. During this period the decoder relies more heavily on text prosody and responds less to the constraint of the global speaker vector. Because unstable frames of the reference voice (such as breath intake sounds and noise segments) are mixed into the global vector, its representation of the timbre characteristics is impure, so at the moment of the internal state reset the timbre of the candidate synthesis result deviates locally. Subjectively this is heard as an abrupt change of resonance position, a change in nasality or an age drift, that is, the speaker seems to have "slightly changed into someone else". Once the deviation occurs in a boundary window, the synthesis of subsequent windows continues from the drifted acoustic state, so the deviation accumulates and spreads over time, and the latter half of a long-text synthesis exhibits persistent instability of timbre identity.

Existing engineering remedies such as whole-sentence regeneration have a high computational cost and introduce unacceptable latency; multi-reference averaging schemes can improve the overall similarity but lack a targeted closed-loop correction capability for an individual high-risk boundary window.
Therefore, under the combined constraints that the reference voice is of limited duration and uneven quality, the target text contains multiple pause boundaries, and near-real-time output is required, how to identify which boundary windows carry a higher risk of timbre drift, how to perform local correction rather than whole-sentence recalculation on only these high-risk windows, and how to prevent the deviation at a boundary from propagating to subsequent windows are technical problems that remain unsolved in the prior art.

Disclosure of Invention

(1) Technical problem to be solved

The invention aims to provide a sound cloning method, system, electronic equipment and medium that solve the following technical problems under the combined constraints that the reference voice is of limited duration and uneven quality, the target text contains multiple pause boundaries, and near-real-time output is required: how to identify which boundary windows carry a higher risk of timbre drift, how to perform local correction rather than whole-sentence recalculation on the high-risk windows, and how to prevent the deviation at a boundary from propagating to subsequent windows.

(2) Technical solution

To achieve the above object, in one aspect, the present invention provides a sound cloning method, the method comprising: dividing the input reference voice into segments, calculating a stability score for each segment, assigning the segments that meet the admission conditions to state categories according to their sounding state characteristics, and constructing an anchor center vector for each state category by weighting according to the stability