
CN-121983027-A - Optimization method and system for reinforcement learning of singing voice generation based on flow matching

CN 121983027 A

Abstract

The invention relates to the technical field of singing voice synthesis and conversion, and in particular to an optimization method and system for reinforcement learning of flow-matching-based singing voice generation. By introducing a reinforcement learning mechanism after the model training stage and exploiting relative evaluation information among generated results, the method performs policy-level optimization of the music generation model, so that the generated music improves in lyric accuracy, melody consistency, and overall listening quality without additional manual annotation cost.

Inventors

  • CHEN GONGYU
  • CHEN ZIHAO
  • DING CHAOFAN

Assignees

  • 巨人移动技术有限公司

Dates

Publication Date
2026-05-05
Application Date
2026-02-06

Claims (8)

  1. An optimization method for reinforcement learning of flow-matching-based singing voice generation, characterized by comprising the following steps: S1, data construction and generation; S11, selecting Chinese and English singing voice audio samples with durations between 5 and 28 seconds from an original singing voice audio set as candidate audios; S12, introducing an automated aesthetic evaluation scoring model to perform automatic quality assessment of the candidate audios, jointly screening them according to the content enjoyment and content usefulness scores output by the model, and retaining only audio samples whose content enjoyment and content usefulness scores are not less than 10 points, so as to construct high-quality singing voice audio samples; S2, designing the model optimization method: S21, the model takes random Gaussian noise as the initial state, progressively predicts the corresponding velocity field over a continuous time domain, and generates the target singing voice audio by stepwise integration of the velocity field; S22, under the same input condition, the model is guided to generate multiple candidate singing voice results, and the relative comparison relations among the candidates are used as reinforcement learning optimization signals to update the flow matching generation policy, thereby optimizing the reinforcement learning of flow-matching-based singing voice generation.
  2. The optimization method for reinforcement learning of flow-matching-based singing voice generation according to claim 1, wherein the automated aesthetic evaluation scoring model is Meta Audiobox Aesthetics.
  3. The optimization method for reinforcement learning of flow-matching-based singing voice generation according to claim 1, wherein, in cumulatively generating the target singing voice audio in S21, the generation process is solved based on a deterministic ordinary differential equation, and/or a stochastic differential equation is introduced to enhance the exploration and diversity of the sampling process.
  4. The optimization method for reinforcement learning of flow-matching-based singing voice generation according to claim 1, wherein the input conditions in S22 include lyric text, target timbre characteristics, and a reference melody.
  5. The optimization method for reinforcement learning of flow-matching-based singing voice generation according to claim 1, wherein the intra-group relative comparison is performed cooperatively by a plurality of automated reward models, including an automatic speech recognition reward model, a speaker similarity reward model, an audio aesthetics reward model, and a fundamental frequency consistency reward model.
  6. The optimization method for reinforcement learning of flow-matching-based singing voice generation according to claim 5, wherein the intra-group relative comparison is performed as follows: computing corresponding advantage values based on each of the automated reward models within the group, and fusing the individual advantages by weighted combination with preset weights to obtain a comprehensive advantage evaluation result; and updating and optimizing the model parameters based on the comprehensive advantage evaluation result, so as to guide the flow-matching singing voice generation model to maintain consistency of timbre, lyrics, and melody and to improve overall generation quality and listening performance.
  7. The optimization method for reinforcement learning of flow-matching-based singing voice generation according to claim 6, wherein the model training and optimization procedure includes a pre-training phase and a reinforcement learning optimization phase; in the pre-training phase, supervised learning is performed on a flow-matching singing voice generation model based on a Diffusion Transformer structure, so that the model learns the basic capability of generating singing voice audio from the input conditions and attains stable generation performance; after pre-training is completed, the reinforcement learning optimization phase is entered, in which, with the Diffusion Transformer structure kept unchanged, the model generates multiple candidate singing voice results for the same input condition, the candidates are subjected to intra-group relative evaluation by the automated reward models, and the generation policy is updated with the intra-group relative evaluation results, thereby optimizing the reinforcement learning of flow-matching-based singing voice generation.
  8. An optimization system for reinforcement learning of flow-matching-based singing voice generation, characterized in that the optimization system is established by adopting the optimization method according to any one of claims 1 to 7.

Description

Optimization method and system for reinforcement learning of singing voice generation based on flow matching

Technical Field

The invention relates to the technical field of singing voice synthesis and conversion, and in particular to an optimization method and an optimization system for reinforcement learning of flow-matching-based singing voice generation.

Background

Existing training objectives for singing voice generation models focus mainly on matching low-level acoustic features. They are difficult to relate directly to perceptual quality criteria such as lyric intelligibility, melody consistency, and overall listening quality, and they easily lead to blurred pronunciation, melody deviation, or unstable generation results under zero-shot, cross-style, or complex-melody conditions. Existing flow-matching- or diffusion-based generation models rely mainly on supervised learning in music generation tasks such as singing voice conversion (SVC) and singing voice synthesis (SVS), making it difficult to directly optimize subjective listening quality and high-level semantic consistency. It is therefore necessary to provide an optimization method and system for reinforcement learning of flow-matching-based singing voice generation, so as to effectively improve the generated music in terms of lyric accuracy, melody consistency, and overall listening quality.

Disclosure of the Invention

The invention aims to provide an optimization method and system for reinforcement learning of flow-matching-based singing voice generation, so as to effectively improve the performance of the generated music in terms of lyric accuracy, melody consistency, and overall listening quality. To solve the problems in the prior art, the invention provides an optimization method for reinforcement learning of flow-matching-based singing voice generation, comprising the following steps: S1, data construction and generation; S11, selecting Chinese and English singing voice audio samples with durations between 5 and 28 seconds from an original singing voice audio set as candidate audios; S12, introducing an automated aesthetic evaluation scoring model to perform automatic quality assessment of the candidate audios, jointly screening them according to the content enjoyment and content usefulness scores output by the model, and retaining only audio samples whose content enjoyment and content usefulness scores are not less than 10 points, so as to construct high-quality singing voice audio samples; S2, designing the model optimization method: S21, the model takes random Gaussian noise as the initial state, progressively predicts the corresponding velocity field over a continuous time domain, and generates the target singing voice audio by stepwise integration of the velocity field; S22, under the same input condition, the model is guided to generate multiple candidate singing voice results, and the relative comparison relations among the candidates are used as reinforcement learning optimization signals to update the flow matching generation policy, thereby optimizing the reinforcement learning of flow-matching-based singing voice generation.
Optionally, in the optimization method for reinforcement learning of flow-matching-based singing voice generation, the automated aesthetic evaluation scoring model is Meta Audiobox Aesthetics.

Optionally, in the optimization method for reinforcement learning of flow-matching-based singing voice generation, in cumulatively generating the target singing voice audio in S21, the generation process is solved based on a deterministic ordinary differential equation, and/or a stochastic differential equation is introduced to enhance the exploration and diversity of the sampling process.

Optionally, in the optimization method for reinforcement learning of flow-matching-based singing voice generation, the input conditions in S22 include lyric text, target timbre characteristics, and a reference melody.

Optionally, in the optimization method for reinforcement learning of flow-matching-based singing voice generation, the intra-group relative comparison is performed cooperatively by a plurality of automated reward models, including an automatic speech recognition reward model, a speaker similarity reward model, an audio aesthetics reward model, and a fundamental frequency consistency reward model.

Optionally, in the optimization method for reinforcement learning of flow-matching-based singing voice generation, the intra-group relative comparison is performed as follows: computing corresponding advantage values based on each of the automated reward models within the group, and fusing the individual advantages by weighted combination with preset weights to obtain a comprehensive advantage evaluation result.
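To make the intra-group relative comparison concrete, here is a minimal sketch under stated assumptions: each reward is normalised against its group mean and standard deviation to obtain a per-candidate advantage, and the advantages are combined by weighted fusion into a comprehensive advantage. The reward-model names, score scales, and weights below are illustrative assumptions, not values fixed by the patent.

```python
import torch

def group_relative_advantages(rewards: dict[str, torch.Tensor],
                              weights: dict[str, float],
                              eps: float = 1e-6) -> torch.Tensor:
    """Fuse per-reward-model advantages for one group of candidates.

    `rewards` maps a reward-model name to a tensor of shape [num_candidates]
    holding that model's score for each candidate generated from the same
    input condition. Each reward is turned into a within-group advantage by
    normalising against the group mean and standard deviation, then the
    advantages are combined with preset weights.
    """
    fused = None
    for name, r in rewards.items():
        adv = (r - r.mean()) / (r.std() + eps)   # relative standing inside the group
        term = weights[name] * adv
        fused = term if fused is None else fused + term
    return fused                                  # comprehensive advantage per candidate

# Illustrative usage with four hypothetical reward models and preset weights.
if __name__ == "__main__":
    rewards = {
        "asr": torch.tensor([0.91, 0.85, 0.88, 0.95]),          # lyric accuracy proxy
        "speaker_sim": torch.tensor([0.80, 0.83, 0.78, 0.81]),  # timbre consistency
        "aesthetics": torch.tensor([7.2, 6.9, 7.5, 7.1]),       # overall listening quality
        "f0": torch.tensor([0.88, 0.90, 0.86, 0.92]),           # melody / pitch consistency
    }
    weights = {"asr": 0.3, "speaker_sim": 0.2, "aesthetics": 0.3, "f0": 0.2}
    print(group_relative_advantages(rewards, weights))
```

In such a scheme, the fused advantage would weight the policy-gradient update applied to the flow-matching generation model for each candidate, which is the role the comprehensive advantage evaluation result plays in the method described above.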