
CN-121983025-A - Zero-sample singing voice synthesizing and editing method and system

CN121983025A

Abstract

The invention relates to the technical field of artificial intelligence and speech processing, and in particular to a method and system for zero-sample singing voice synthesis and editing. The method comprises the following steps: constructing a model architecture; performing online melody learning and joint optimization; applying melody-content alignment constraints; modeling weakly annotated durations; performing reinforcement-learning post-training; and training and inference. In this way, singing voice can be synthesized for any lyrics and any reference melody while maintaining high-quality audio output.
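The steps summarized in the abstract can be sketched as a minimal data-flow skeleton. Everything below is a hypothetical illustration: the names `OnlineMelodyExtractor`, `DiTSinger`, and `zero_sample_svs` are invented for this sketch, and both the melody extractor and the Diffusion Transformer decoder are stubbed out with placeholder arrays rather than real models.

```python
import numpy as np

# Hypothetical sketch of the claimed zero-sample SVS data flow (steps S1/S6).
# The melody extractor and the Diffusion Transformer are stubbed out.

class OnlineMelodyExtractor:
    """Stub: extracts one frame-level melody embedding per audio frame."""

    def __init__(self, dim: int = 64):
        self.dim = dim

    def extract(self, audio: np.ndarray, frame_hop: int = 256) -> np.ndarray:
        n_frames = max(1, len(audio) // frame_hop)
        return np.zeros((n_frames, self.dim))  # placeholder embeddings


class DiTSinger:
    """Stub for the Diffusion-Transformer-based synthesis module."""

    def synthesize(self, timbre_prompt: np.ndarray, lyrics: str,
                   melody_repr: np.ndarray, frame_hop: int = 256) -> np.ndarray:
        # At inference the timestamp only separates the timbre prompt from the
        # generated content; here we simply emit silence of matching length.
        return np.zeros(melody_repr.shape[0] * frame_hop)


def zero_sample_svs(prompt_audio: np.ndarray, lyrics: str,
                    melody_ref_audio: np.ndarray) -> np.ndarray:
    melody = OnlineMelodyExtractor().extract(melody_ref_audio)
    return DiTSinger().synthesize(prompt_audio, lyrics, melody)


wave = zero_sample_svs(np.zeros(16000), "any lyrics", np.zeros(32000))
print(wave.shape)  # (32000,): one hop of audio per melody frame
```

The point of the sketch is only the claimed interface: a timbre prompt, lyrics, and a melody reference audio are the sole inputs, with no phoneme-level alignment required.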

Inventors

  • ZHENG JUNJIE
  • CHEN ZIHAO
  • DING CHAOFAN

Assignees

  • Giant Mobile Technology Co., Ltd. (巨人移动技术有限公司)

Dates

Publication Date
2026-05-05
Application Date
2026-02-06

Claims (10)

  1. A method for zero-sample singing voice synthesis and editing, comprising the steps of: S1, constructing a model architecture: adopting a Diffusion Transformer as the core architecture and combining a melody extraction module with a multi-objective reinforcement learning strategy; constructing an online melody extractor comprising two tightly coupled components; and forming a singing voice synthesis model from the Diffusion Transformer and the online melody extractor, in which the online melody extractor directly extracts a frame-level melody representation from the input audio, and the Diffusion-Transformer-based synthesis module synthesizes singing voice from an audio prompt, the lyrics, and the extracted melody representation; S2, performing online melody learning and joint optimization: introducing a distillation constraint based on KL divergence, with a pre-trained teacher melody model providing stable supervision; S3, applying melody-content alignment constraints: introducing a loss function based on centered kernel alignment that explicitly constrains the correlation between the internal representation and the input melody representation during synthesis; S4, modeling weakly annotated durations: optimizing duration modeling using weakly annotated song data containing only sentence-level timestamps, so that the model can automatically infer a reasonable duration distribution without phoneme alignment; S5, performing reinforcement-learning post-training: designing a multi-objective reward function comprising a content-accuracy reward and a melody-similarity reward, and adopting the Flow-GRPO policy optimization method so that the rewards guide and improve model performance; S6, training and inference: initializing the Diffusion Transformer decoder parameters from a pre-trained singing voice synthesis model and training on weakly labeled song data without phoneme-level alignment; in the inference stage, the timestamp is used only to separate the timbre prompt from the generated content, and lyrics and melody reference audio are input to generate the corresponding singing voice.
  2. The method for zero-sample singing voice synthesis and editing of claim 1, wherein the KL divergence is the Kullback-Leibler divergence, also known as relative entropy, an important measure in information theory of the difference between two probability distributions.
  3. The method for zero-sample singing voice synthesis and editing of claim 1, wherein the alignment of step S3 is Centered Kernel Alignment, abbreviated CKA.
  4. The method for zero-sample singing voice synthesis and editing of claim 3, wherein the melody-content alignment constraints are implemented as follows: calculating a similarity distribution matrix between the MIDI of the reference song and the acoustic stream representation of the model, wherein MIDI is the Musical Instrument Digital Interface; progressively minimizing the difference between the similarity distributions of the reference-song MIDI and the model's acoustic stream representation; and using the CKA-based loss function to maximize the correlation between the melody representation and the model's internal features, thereby ensuring high consistency between the generated singing voice and the input melody.
  5. The method for zero-sample singing voice synthesis and editing of claim 1, wherein the content-accuracy reward is calculated from the word error rate of an Automatic Speech Recognition (ASR) model, and the melody-similarity reward is calculated from the Pearson correlation coefficient between the generated pitch track and the reference pitch track.
  6. A system for zero-sample singing voice synthesis and editing using the method of any one of claims 1-5, the system comprising: a model construction module configured to adopt a Diffusion Transformer as the core architecture and combine a melody extraction module with a multi-objective reinforcement learning strategy, construct an online melody extractor comprising two tightly coupled components, form a singing voice synthesis model from the Diffusion Transformer and the online melody extractor, directly extract a frame-level melody representation from the input audio, and synthesize singing voice with the Diffusion-Transformer-based synthesis module from an audio prompt, the lyrics, and the extracted melody representation; an online melody learning and joint optimization module configured to introduce a distillation constraint based on KL divergence, with a pre-trained teacher melody model providing stable supervision; a melody-content alignment constraint module configured to introduce a loss function based on centered kernel alignment, explicitly constrain the correlation between the internal representation and the input melody representation during synthesis, and ensure high consistency between the generated singing voice and the input melody by maximizing the CKA (centered kernel alignment) values of the internal features of the flow model; a weakly annotated duration modeling module configured to optimize duration modeling using weakly annotated song data containing only sentence-level timestamps, so that the model can automatically infer a reasonable duration allocation without phoneme alignment; a reinforcement-learning post-training module configured to design a multi-objective reward function comprising a content-accuracy reward and a melody-similarity reward, and adopt the Flow-GRPO policy optimization method so that the rewards guide and improve model performance; and a training and inference module configured to initialize the Diffusion Transformer decoder parameters from a pre-trained singing voice synthesis model and train on weakly labeled song data without phoneme-level alignment, wherein in the inference stage the timestamp is used only to separate the timbre prompt from the generated content, and lyrics and melody reference audio are input to generate the corresponding singing voice.
  7. The system for zero-sample singing voice synthesis and editing of claim 6, wherein the KL divergence is the Kullback-Leibler divergence, also known as relative entropy, an important measure in information theory of the difference between two probability distributions.
  8. The system for zero-sample singing voice synthesis and editing of claim 6, wherein the alignment loss is based on Centered Kernel Alignment, abbreviated CKA.
  9. The system for zero-sample singing voice synthesis and editing of claim 8, wherein the melody-content alignment constraint module is implemented as follows: calculating a similarity distribution matrix between the MIDI of the reference song and the acoustic stream representation of the model, wherein MIDI is the Musical Instrument Digital Interface; progressively minimizing the difference between the similarity distributions of the reference-song MIDI and the model's acoustic stream representation; and using the CKA-based loss function to maximize the correlation between the melody representation and the model's internal features, thereby ensuring high consistency between the generated singing voice and the input melody.
  10. The system for zero-sample singing voice synthesis and editing of claim 6, wherein the content-accuracy reward is calculated from the word error rate of an Automatic Speech Recognition (ASR) model, and the melody-similarity reward is calculated from the Pearson correlation coefficient between the generated pitch track and the reference pitch track.
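Claims 3-4 and 8-9 rely on Centered Kernel Alignment (CKA) to measure how well internal features track the input melody representation. As background, the standard linear CKA between two frame-aligned feature matrices can be computed as below; this is the generic formulation from the literature with random stand-in matrices, not the patent's actual loss.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear Centered Kernel Alignment between feature matrices whose
    rows are the same frames (column dimensionality may differ)."""
    X = X - X.mean(axis=0)  # center each feature dimension
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2  # HSIC with linear kernels
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return float(hsic / (norm_x * norm_y))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))     # e.g. a melody representation per frame
Y = X @ rng.normal(size=(8, 16))  # features linearly related to X
Z = rng.normal(size=(100, 16))    # unrelated features

print(round(linear_cka(X, X), 3))           # 1.0: perfect alignment
print(linear_cka(X, Y) > linear_cka(X, Z))  # True: related features align more
```

Maximizing such a CKA value between the flow model's internal features and the melody representation is what the alignment constraint module is described as doing.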

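Claims 5 and 10 build the reinforcement-learning reward from an ASR word error rate (content accuracy) and the Pearson correlation of pitch tracks (melody similarity). A minimal sketch of such a reward follows; the equal weights `w_content = w_melody = 0.5` and the helper names are hypothetical, and a real system would obtain the transcript from an ASR model and the pitch tracks from a pitch extractor.

```python
import numpy as np

def word_error_rate(ref: list[str], hyp: list[str]) -> float:
    """Word error rate via Levenshtein distance over word sequences."""
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[-1, -1] / max(1, len(ref))

def melody_similarity(f0_gen: np.ndarray, f0_ref: np.ndarray) -> float:
    """Pearson correlation between generated and reference pitch tracks."""
    return float(np.corrcoef(f0_gen, f0_ref)[0, 1])

def multi_objective_reward(lyrics: str, asr_transcript: str,
                           f0_gen: np.ndarray, f0_ref: np.ndarray,
                           w_content: float = 0.5,
                           w_melody: float = 0.5) -> float:
    content = 1.0 - word_error_rate(lyrics.split(), asr_transcript.split())
    return w_content * content + w_melody * melody_similarity(f0_gen, f0_ref)

f0 = np.array([220.0, 247.0, 262.0, 294.0])  # a rising pitch contour (Hz)
r = multi_objective_reward("la la la", "la la la", f0, f0 * 1.01)
print(round(r, 3))  # 1.0: exact transcript and proportional pitch
```

In the claimed Flow-GRPO post-training, a scalar reward of this shape would guide policy optimization of the diffusion flow; the combination weights here are purely illustrative.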
Description

Zero-sample singing voice synthesizing and editing method and system

Technical Field

The invention relates to the technical field of artificial intelligence and speech processing, and in particular to a method and system for zero-sample singing voice synthesis and editing.

Background

Singing Voice Synthesis (SVS) technology has broad application prospects in music production, virtual singers, personal creation, interactive media, and other fields. However, existing singing voice synthesis technology has the following main problems: (1) Dependence on precise annotations: existing systems require accurate phoneme-level duration and pitch annotations at both training and inference time, which demands a dedicated data-production pipeline, hinders the acquisition of large-scale training data, and severely limits the wide application and industrial deployment of singing voice synthesis technology. (2) Support for fixed lyric-melody pairs only: existing methods generally support only the fixed lyric-melody pairs seen during training; when a user tries to replace lyrics, mix languages, or change the musical phrase structure, the mismatch between the number of phonemes and the beat of the melody often produces robotic pronunciation, rhythmic misalignment, and unnatural phrasing, greatly degrading the listening experience. (3) Lack of zero-sample capability: most existing methods degrade noticeably on unseen text or prosodic structures and cannot meet practical application requirements. (4) High annotation cost: obtaining accurate MIDI rhythm and phoneme-level duration annotations requires substantial manual labor and is difficult to extend to songs of arbitrary style or language.

It is therefore desirable to provide a method and system for zero-sample singing voice synthesis and editing that can synthesize singing voice for arbitrary lyrics and any reference melody while maintaining high-quality audio output.

Disclosure of the Invention

The invention aims to provide a method and system for zero-sample singing voice synthesis and editing that can synthesize singing voice for any lyrics and any reference melody while maintaining high-quality audio output. To solve the problems in the prior art, the invention provides a method for zero-sample singing voice synthesis and editing, comprising the following steps: S1, constructing a model architecture: adopting a Diffusion Transformer as the core architecture and combining a melody extraction module with a multi-objective reinforcement learning strategy; constructing an online melody extractor comprising two tightly coupled components; and forming a singing voice synthesis model from the Diffusion Transformer and the online melody extractor, in which the online melody extractor directly extracts a frame-level melody representation from the input audio, and the Diffusion-Transformer-based synthesis module synthesizes singing voice from an audio prompt, the lyrics, and the extracted melody representation. S2, performing online melody learning and joint optimization: introducing a distillation constraint based on KL divergence, with a pre-trained teacher melody model providing stable supervision. S3, applying melody-content alignment constraints: introducing a loss function based on centered kernel alignment that explicitly constrains the correlation between the internal representation and the input melody representation during synthesis. S4, modeling weakly annotated durations: optimizing duration modeling using weakly annotated song data containing only sentence-level timestamps, so that the model can automatically infer a reasonable duration distribution without phoneme alignment. S5, performing reinforcement-learning post-training: designing a multi-objective reward function comprising a content-accuracy reward and a melody-similarity reward, and adopting the Flow-GRPO policy optimization method so that the rewards guide and improve model performance. S6, training and inference: initializing the Diffusion Transformer decoder parameters from a pre-trained singing voice synthesis model and training on weakly labeled song data without phoneme-level alignment; in the inference stage, the timestamp is used only to separate the timbre prompt from the generated content, and lyrics and melody reference audio are input to generate the corresponding singing voice. Optionally, in the method for zero-sample singing voice synthesis and editing, the KL divergence is the Kullback-Leibler divergence, also known as relative entropy, an important measure in information theory of the difference between two probability distributions. Optionally, in the method for zero-sample singing voice synthesis and editing, the centered kernel alignment is Centered Kernel Alignment, abbreviated CKA.