CN-122024702-A - Voice conversion method, device, equipment and storage medium based on flow matching model

CN122024702A

Abstract

The invention discloses a voice conversion method, device, equipment, and storage medium based on a flow matching model, relating to the technical field of voice conversion and applicable to the finance and medical fields. The method comprises: acquiring a source speech and a target speech; respectively extracting a first speech feature of the source speech and a second speech feature of the target speech, wherein the first speech feature comprises a content feature and a prosody feature of the source speech, and the second speech feature comprises a timbre feature of the target speech; constructing condition information based on the content feature, the prosody feature, and the timbre feature; iteratively generating, through a pre-trained flow matching model conditioned on the condition information, a target mel-spectrogram feature; and synthesizing the target mel-spectrogram feature into a time-domain speech waveform. The scheme effectively preserves the clear content and natural prosody of the source speech during timbre conversion, solving the technical problem of blurred pronunciation and prosody distortion caused by feature coupling in conventional schemes.

Inventors

  • ZHENG ZHE
  • CHEN MINCHUAN
  • WANG SHAOJUN

Assignees

  • Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)

Dates

Publication Date
2026-05-12
Application Date
2026-01-15

Claims (10)

  1. A speech conversion method based on a flow matching model, comprising: acquiring a source speech to be converted and a target speech serving as a timbre reference; respectively extracting a first speech feature of the source speech and a second speech feature of the target speech, wherein the first speech feature comprises a content feature and a prosody feature, and the second speech feature comprises a timbre feature; constructing condition information based on the content feature, the prosody feature, and the timbre feature; iteratively generating, through a pre-trained flow matching model conditioned on the condition information, a target mel-spectrogram feature, wherein the target mel-spectrogram feature has the timbre of the target speech while preserving the content and prosody of the source speech; and synthesizing the target mel-spectrogram feature into a time-domain speech waveform.
  2. The method of claim 1, wherein respectively extracting the first speech feature of the source speech and the second speech feature of the target speech comprises: inputting the source speech into a pre-trained self-supervised speech model, and extracting the hidden-layer features output by its encoder as the content feature; extracting fundamental-frequency and energy features of the source speech, smoothing and discretizing the fundamental-frequency feature, mapping it into a fixed-dimension vector representation through an embedding layer, and taking the vector representation together with the energy feature as the prosody feature; and extracting filter-bank features of the target speech, inputting them into a pre-trained speaker recognition model, and extracting the speaker embedding vector output by the recognition model as the timbre feature.
  3. The method of claim 1, wherein the first speech feature further comprises a first log-mel spectrum and the second speech feature further comprises a second log-mel spectrum; and constructing the condition information based on the content feature, the prosody feature, and the timbre feature comprises: fusing the prosody feature with the first log-mel spectrum to obtain a source fusion feature; fusing the prosody feature with the second log-mel spectrum to obtain a target fusion feature; and replacing an initial preset-length portion of the source fusion feature with the corresponding portion of the target fusion feature to form a condition fusion feature; wherein the condition information comprises at least the condition fusion feature and the timbre feature.
  4. The method of claim 1, wherein iteratively generating the target mel-spectrogram feature through the pre-trained flow matching model based on the condition information comprises: presetting a plurality of time steps increasing sequentially from an initial time to a termination time; for the first time step, inputting the condition information, the time variable corresponding to the first time step, and initial state data sampled from a standard Gaussian distribution into the flow matching model to obtain a corresponding predicted vector field; evolving the initial state data into the state data of the next time step by solving an ordinary differential equation based on the predicted vector field; taking the state data obtained at the previous step as the current state data, and iterating the following until the termination time is reached: inputting the current state data, the time variable of the current time step, and the condition information into the flow matching model to obtain the current predicted vector field; and outputting the state data corresponding to the termination time as the target mel-spectrogram feature.
  5. The method of claim 1, wherein synthesizing the target mel-spectrogram feature into a time-domain speech waveform comprises: inputting the target mel-spectrogram feature into a neural vocoder for decoding to obtain the time-domain speech waveform.
  6. The method according to any one of claims 1-5, wherein the flow matching model is obtained by the following training method: acquiring a plurality of training data pairs, each comprising a source training speech and a target training speech; for each training data pair, constructing training condition information according to the speech conversion method based on the flow matching model, and taking the mel-spectrogram feature of the target training speech as the training target; and during training, iteratively performing the following optimization steps: a. randomly sampling a training data pair to obtain training condition information and a training target; b. randomly sampling a time variable and sampling random noise from a standard Gaussian distribution; c. computing the interpolated state corresponding to the time variable from the time variable, the random noise, and the training target; d. inputting the time variable, the interpolated state, and the training condition information into the flow matching model, and predicting a training vector field with the neural network of the flow matching model; e. computing the error between the predicted training vector field and a reference vector field, the reference vector field characterizing the true transport direction from the random noise to the training target; f. adjusting the parameters of the neural network according to the error; wherein the neural network converges by minimizing the error, yielding the trained flow matching model.
  7. The method of claim 6, wherein the target training speech in each training data pair is obtained by timbre-converting the source training speech with an external timbre conversion model.
  8. A speech conversion apparatus based on a flow matching model, comprising: an acquisition module for acquiring a source speech to be converted and a target speech serving as a timbre reference; a feature extraction module for respectively extracting a first speech feature of the source speech and a second speech feature of the target speech, wherein the first speech feature comprises a content feature and a prosody feature, and the second speech feature comprises a timbre feature; a condition construction module for constructing condition information based on the content feature, the prosody feature, and the timbre feature; a model processing module for iteratively generating, through a pre-trained flow matching model conditioned on the condition information, a target mel-spectrogram feature, wherein the target mel-spectrogram feature has the timbre of the target speech while preserving the content and prosody of the source speech; and a speech synthesis module for synthesizing the target mel-spectrogram feature into a time-domain speech waveform.
  9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the speech conversion method based on a flow matching model according to any one of claims 1-7.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the speech conversion method based on a flow matching model according to any one of claims 1-7.
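The training procedure of claim 6 (steps a-f) can be sketched as follows. This is a minimal illustration, not the patented implementation: the single linear layer standing in for the flow matching network, the feature dimensionality, the straight-line interpolation x_t = (1-t)·x0 + t·x1, and the reference vector field v* = x1 - x0 are all assumptions chosen to make the optimization loop concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 80  # assumed mel-spectrogram dimensionality

# Toy stand-in for the flow matching network: a single linear layer
# mapping (interpolated state, time variable, condition) -> vector field.
W = rng.normal(scale=0.01, size=(DIM, 2 * DIM + 1))

def predict_vector_field(x_t, t, cond):
    inp = np.concatenate([x_t, [t], cond])
    return W @ inp

def training_step(x1, cond, lr=1e-2):
    """One flow-matching optimization step (steps a-f of claim 6)."""
    global W
    t = rng.uniform()                            # b. sample a time variable
    x0 = rng.standard_normal(DIM)                # b. sample Gaussian noise
    x_t = (1.0 - t) * x0 + t * x1                # c. interpolated state (assumed linear path)
    v_pred = predict_vector_field(x_t, t, cond)  # d. predict the training vector field
    v_ref = x1 - x0                              # e. reference field: noise -> target direction
    err = v_pred - v_ref
    loss = float(np.mean(err ** 2))              # e. squared error against the reference field
    inp = np.concatenate([x_t, [t], cond])
    W -= lr * (2.0 / DIM) * np.outer(err, inp)   # f. gradient step on the linear layer
    return loss

x1 = rng.standard_normal(DIM)    # training target (mel feature of the target training speech)
cond = rng.standard_normal(DIM)  # training condition information
losses = [training_step(x1, cond) for _ in range(200)]
```

With a real model, the linear layer would be replaced by the neural network of the flow matching model and the gradient step by a standard optimizer; the loop structure, however, follows steps a-f directly.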

Description

Voice conversion method, device, equipment and storage medium based on flow matching model

Technical Field

The invention relates to the technical field of voice conversion, and can be applied to the fields of finance and medical treatment; in particular, it relates to a voice conversion method, device, equipment and storage medium based on a flow matching model.

Background

Voice Conversion (VC) technology aims to convert the timbre of a source speaker to that of a target speaker while keeping the semantic content and prosodic features of the speech unchanged. With the development of deep learning, methods based on autoencoders or generative adversarial networks have become mainstream in the field. Despite significant advances, these approaches typically encode the content, prosody, and timbre information of speech jointly, resulting in coupling at the feature representation level. This coupling makes it difficult for the model to distinguish and independently control, during conversion, the source speech attributes to be preserved and the target timbre to be modified; the original content clarity and prosodic naturalness of the source speech are therefore easily disturbed when the timbre is changed, causing the converted speech to exhibit blurred pronunciation or prosody distortion. For example, in financial customer service scenarios, slight pronunciation ambiguity or intonation distortion may affect the customer's accurate understanding of critical instructions (e.g., a transfer amount or business code), and may even weaken the customer's trust in the professionalism and security of the service. How to overcome the pronunciation ambiguity and prosody distortion caused by feature coupling has therefore become a technical problem to be solved.
Disclosure of the Invention

The embodiments of the invention provide a voice conversion method, device, equipment, and storage medium based on a flow matching model, which solve the technical problem of pronunciation ambiguity or prosody distortion caused by feature coupling in conventional schemes.

In a first aspect, a speech conversion method based on a flow matching model is provided, including: acquiring a source speech to be converted and a target speech serving as a timbre reference; respectively extracting a first speech feature of the source speech and a second speech feature of the target speech, wherein the first speech feature comprises a content feature and a prosody feature, and the second speech feature comprises a timbre feature; constructing condition information based on the content feature, the prosody feature, and the timbre feature; iteratively generating, through a pre-trained flow matching model conditioned on the condition information, a target mel-spectrogram feature, wherein the target mel-spectrogram feature has the timbre of the target speech while preserving the content and prosody of the source speech; and synthesizing the target mel-spectrogram feature into a time-domain speech waveform.
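The iterative generation described in the first aspect amounts to integrating an ordinary differential equation from Gaussian noise to the target mel-spectrogram feature. A minimal sketch with an Euler solver is shown below; the closed-form conditional vector field v(x, t) = (target - x) / (1 - t), used here as a stand-in for the trained network, and the 80-dimensional feature size are illustrative assumptions.

```python
import numpy as np

def euler_generate(vector_field, cond, dim, n_steps=10, seed=0):
    """Evolve Gaussian noise to the target feature over preset time steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)      # initial state sampled from N(0, I)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt                    # time steps increasing from 0 toward 1
        v = vector_field(x, t, cond)  # predicted vector field at this step
        x = x + dt * v                # one Euler step of the ODE dx/dt = v
    return x                          # state data at the termination time

# Stand-in for the trained flow matching model: the exact conditional field
# that transports any state toward the conditioning target along a line.
target_mel = np.linspace(-1.0, 1.0, 80)  # hypothetical target mel feature
field = lambda x, t, cond: (cond - x) / (1.0 - t)

mel = euler_generate(field, target_mel, dim=80)
```

With this particular field, the Euler iteration reaches the target exactly at the termination time; a trained network approximates such a field from the condition information alone.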
In a second aspect, a speech conversion apparatus based on a flow matching model is provided, including: an acquisition module for acquiring a source speech to be converted and a target speech serving as a timbre reference; a feature extraction module for respectively extracting a first speech feature of the source speech and a second speech feature of the target speech, wherein the first speech feature comprises a content feature and a prosody feature, and the second speech feature comprises a timbre feature; a condition construction module for constructing condition information based on the content feature, the prosody feature, and the timbre feature; a model processing module for iteratively generating, through a pre-trained flow matching model conditioned on the condition information, a target mel-spectrogram feature, wherein the target mel-spectrogram feature has the timbre of the target speech while preserving the content and prosody of the source speech; and a speech synthesis module for synthesizing the target mel-spectrogram feature into a time-domain speech waveform.

In a third aspect, a computer device is provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the above speech conversion method based on a flow matching model when executing the computer program.

In a fourth aspect, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the above speech conversion method based on a flow matching model.

The technical scheme provided by the invention
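The condition construction performed by the condition construction module (detailed in claim 3) can be sketched as follows. The fusion operator is not fixed by the text, so frame-wise concatenation is assumed here, as are the feature sizes and the preset prefix length.

```python
import numpy as np

def build_condition(prosody, logmel_src, logmel_tgt, prefix_len):
    """Form the condition fusion feature of claim 3.

    Fusion is assumed to be frame-wise concatenation of the prosody
    feature with each log-mel spectrum; the initial `prefix_len` frames
    of the source fusion are replaced by the target fusion's frames.
    """
    src_fused = np.concatenate([prosody, logmel_src], axis=-1)  # [T, P + M]
    tgt_fused = np.concatenate([prosody, logmel_tgt], axis=-1)  # [T, P + M]
    cond = src_fused.copy()
    cond[:prefix_len] = tgt_fused[:prefix_len]  # splice in the target prefix
    return cond

T, P, M = 100, 4, 80  # assumed frame count and feature widths
rng = np.random.default_rng(1)
prosody = rng.standard_normal((T, P))
logmel_src = rng.standard_normal((T, M))
logmel_tgt = rng.standard_normal((T, M))
cond = build_condition(prosody, logmel_src, logmel_tgt, prefix_len=20)
```

The spliced prefix exposes the model to target-side acoustics during generation, while the remaining frames carry the source-side content and prosody cues.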