CN-121983071-A - Flow matching singing voice beautification method based on feature decoupling and masking reconstruction
Abstract
A flow matching singing voice beautification method based on feature decoupling and masking reconstruction includes the following steps: extracting the Mel spectrogram of the singing audio and preprocessing it; decoupling the multi-dimensional acoustic features of timbre, pitch and content from the original audio waveform and the Mel spectrogram; fusing the extracted conditional features and projecting them through encoding; randomly masking the original Mel spectrogram to obtain masking features; building a flow matching generative model and training it on the conditional features, masking features and original Mel spectrogram to obtain a trained flow matching model; constructing mixed inference conditions and beautifying amateur singing with the trained flow matching model; and feeding the beautified Mel spectrogram to a vocoder to obtain the beautified singing voice. The invention is used in the field of singing voice beautification.
Inventors
- LI WENHUI
Assignees
- Harbin Institute of Technology (哈尔滨工业大学)
Dates
- Publication Date
- 20260505
- Application Date
- 20260129
Claims (7)
- 1. A flow matching singing beautifying method based on feature decoupling and masking reconstruction, characterized by comprising the following specific process: S1, extracting a Mel spectrogram from the singing audio and preprocessing it; S2, decoupling and extracting multi-dimensional acoustic features from the original audio waveform and the Mel spectrogram, the multi-dimensional acoustic features comprising timbre, pitch and content features; S3, fusing the extracted conditional features and projecting them through encoding; S4, randomly masking the original Mel spectrogram to obtain masking features; S5, constructing a flow matching generative model, inputting the conditional features, masking features and original Mel spectrogram into the model for training, and obtaining a trained flow matching model; and S6, constructing mixed inference conditions, beautifying amateur singing on the basis of the trained flow matching model, and inputting the beautified Mel spectrogram into a vocoder to obtain the beautified singing.
- 2. The flow matching singing beautifying method based on feature decoupling and masking reconstruction of claim 1, characterized in that the step S1 of extracting and preprocessing a Mel spectrogram of the singing audio comprises the following specific steps: S11, Mel spectrum extraction: (1) parameter configuration and audio loading: configuring the parameters required for Mel spectrum extraction according to predefined acoustic hyperparameters, including the sampling rate, number of FFT points, number of Mel bands, frame shift, window length, and the minimum and maximum frequencies of the Mel filter bank; (2) audio waveform normalization: performing dynamic range analysis on the resampled audio waveform and normalizing its amplitude, strictly constraining the waveform amplitude to a preset numerical interval to prevent numerical overflow or unstable computation caused by excessive amplitude in subsequent spectrum calculation; (3) based on the above acoustic hyperparameter configuration, generating the corresponding Mel filter bank matrix and window function, or retrieving them from a cache; the caching mechanism ensures that each identical combination of parameter configuration and compute device is generated only once, improving efficiency for repeated computation; (4) short-time Fourier transform (STFT): framing, windowing and zero-padding the normalized audio waveform with the cached window function, then applying the STFT to obtain a complex-valued frequency-domain representation, whose linear magnitude is taken as the STFT magnitude spectrum; (5) multiplying the STFT magnitude spectrum by the Mel filter matrix to map the linear frequency domain to the Mel frequency domain, obtaining the magnitude spectrum on the Mel scale;
S12, feature normalization: (1) training-set extremum statistics: traversing the logarithmic Mel spectra of all samples in the current training set and recording the global minimum and maximum as normalization reference boundaries; (2) linear mapping and clipping: linearly mapping each Mel spectrum feature value from its original interval to the target interval [-1, 1], and clipping any feature points that fall outside the interval due to outliers, thereby ensuring that all feature values lie strictly within the interval and guaranteeing the consistency and robustness of the model input.
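The extraction and normalization steps of claim 2 can be sketched in plain NumPy. This is a minimal illustrative stand-in, not the patented implementation: the parameter values (22050 Hz sampling rate, 1024-point FFT, hop 256, 80 Mel bands) are assumed defaults, and `mel_filterbank`, `mel_spectrogram` and `normalize` are hypothetical helper names.

```python
import numpy as np

def mel_filterbank(sr, n_fft, n_mels, fmin, fmax):
    # hypothetical helper: triangular filters spaced evenly on the mel scale
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            if c > l: fb[i - 1, k] = (k - l) / (c - l)   # rising slope
        for k in range(c, r):
            if r > c: fb[i - 1, k] = (r - k) / (r - c)   # falling slope
    return fb

def mel_spectrogram(wav, sr=22050, n_fft=1024, hop=256, n_mels=80,
                    fmin=0.0, fmax=8000.0):
    wav = wav / max(1e-8, np.abs(wav).max())             # amplitude normalization (S11-2)
    window = np.hanning(n_fft)                           # cached window in the real system
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack([wav[i*hop : i*hop + n_fft] * window for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))   # STFT magnitude spectrum (S11-4)
    mel = mag @ mel_filterbank(sr, n_fft, n_mels, fmin, fmax).T  # mel mapping (S11-5)
    return np.log(np.maximum(mel, 1e-5)).T               # log-mel, shape (n_mels, T)

def normalize(mel, lo, hi):
    # S12: linearly map the training-set extrema [lo, hi] to [-1, 1] and clip outliers
    return np.clip(2.0 * (mel - lo) / (hi - lo) - 1.0, -1.0, 1.0)
```

In practice `lo` and `hi` would be the global minimum and maximum gathered over the whole training set, not per utterance.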
- 3. The flow matching singing beautifying method based on feature decoupling and masking reconstruction of claim 2, characterized in that in step S2, multi-dimensional acoustic features including timbre, pitch and content features are extracted in a decoupled manner from the original audio waveform and the Mel spectrogram, with the following specific process: S21, timbre feature extraction: a CAM++ model is adopted as the timbre encoder to extract, from the input Mel spectrogram, a timbre embedding vector strongly correlated with the singer's identity. The extracted timbre embedding is a 192-dimensional fixed-length vector that effectively summarizes the singer's voice quality, vocal tract resonance characteristics and individual vocal habits; by design it is independent of the specific lyric content, melodic contour and singing duration. In the subsequent beautification pipeline it is injected into the generative model as a key conditioning signal, ensuring that the synthesized singing remains consistent with the original singer in the timbre dimension and avoiding identity confusion or timbre drift; S22, pitch feature extraction: the RMVPE model is selected as the core component for pitch extraction, and the raw fundamental frequency sequence undergoes two post-processing steps to meet the temporal alignment requirements of subsequent modules: (1) the time resolution is accurately aligned to the frame rate of the target Mel spectrum by linear or log-domain interpolation; (2) the continuous fundamental frequency values are mapped to normalized discrete pitch codes to form the pitch feature sequence. This sequence accurately describes the time-varying contour of the melody; it contains not only the absolute pitch of the notes but also preserves micro-expressive details such as portamento and vibrato, and is the core basis for beautification
operations such as pitch correction, tone adjustment and rhythm alignment; S23, content feature extraction: a content encoder built around a Conformer architecture extracts content representations related to the speech semantics, yielding the content feature sequence. It encodes linguistic information such as what is sung (the lyric content) and how it is pronounced (articulation, syllable duration and stress distribution), and is independent of variables such as who sings (timbre) and at what pitch (melody). During beautification the content features serve as a content-fidelity constraint, so that even when the pitch or timbre is substantially adjusted, the generated singing retains clear articulation, a natural speech rhythm and accurate semantics, effectively preventing lyric distortion or pronunciation blurring caused by acoustic parameter modification.
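The two pitch post-processing steps in S22 (frame-rate alignment, then discretization) might look like the sketch below. RMVPE itself is treated as an external black box that supplies the raw f0 contour; the hop sizes, frequency range, bin count and the convention that code 0 means unvoiced are all assumptions for illustration.

```python
import numpy as np

def align_f0(f0, src_hop_s, n_frames, frame_hop_s):
    # resample an f0 contour to the mel frame rate by linear interpolation (S22-1)
    src_t = np.arange(len(f0)) * src_hop_s
    tgt_t = np.arange(n_frames) * frame_hop_s
    return np.interp(tgt_t, src_t, f0)

def quantize_f0(f0, f_min=65.0, f_max=1047.0, n_bins=256):
    # map voiced frames to discrete log-spaced pitch codes (S22-2);
    # code 0 is reserved for unvoiced frames (assumed convention)
    codes = np.zeros(len(f0), dtype=np.int64)
    voiced = f0 > 0
    lf = np.log(np.clip(f0[voiced], f_min, f_max))
    codes[voiced] = 1 + np.round(
        (lf - np.log(f_min)) / (np.log(f_max) - np.log(f_min)) * (n_bins - 2)
    ).astype(np.int64)
    return codes
```

Log-domain quantization keeps the bin spacing perceptually uniform, which is why octave-related pitches land a fixed number of codes apart.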
- 4. The flow matching singing beautifying method based on feature decoupling and masking reconstruction of claim 3, characterized in that in step S3 the extracted conditional features are fused and projected through encoding, with the following specific process: S31, independent encoding and temporal alignment of the multi-dimensional features: the raw features from the different sources are first encoded and aligned along the time dimension so that all conditions are strictly synchronized at the frame level, specifically: (1) timbre feature expansion: the 192-dimensional global voiceprint embedding vector pre-extracted by the CAM++ encoder is mapped to the target hidden dimension through a learnable linear projection layer, and the projection result is replicated along the time dimension to produce a timbre feature sequence whose length matches the number of target Mel spectrum frames; (2) pitch feature encoding: the discretized frame-level pitch sequence is first mapped to dense vectors through a learnable embedding lookup table, then context-modeled by a multi-layer convolution stack to extract a pitch representation with local smoothness and dynamic continuity, ensuring smooth and locally correlated pitch trajectories; (3) content feature upsampling: the frame-level content features output by the pre-trained Conformer content encoder are received; since their original time resolution is usually lower than that of the target Mel spectrum, an upsampling network formed by a cascade of transposed convolution modules precisely upsamples the low-rate content features to the target frame length, yielding an aligned content feature sequence; (4) effective-speech-region mask generation: a non-silence mask is incorporated as an auxiliary condition; the mask is generated from the pitch features, with frames having
valid (non-zero) fundamental frequency values marked as voiced and the remaining frames marked as silent; the binary mask is expanded into a time-aligned signal with the same dimensionality as the other features and is used to guide the generative model to suppress artifacts in silent regions and focus reconstruction detail in voiced regions; S32, concatenation and fusion of multi-modal features: (1) feature concatenation and pre-normalization: the aligned timbre, content and pitch features and the effective-region mask are concatenated along the feature channel dimension to form a joint conditional feature; pre-layer normalization is then applied to the concatenated features to remove scale differences and distribution shifts among the different feature sources, improving training stability and accelerating convergence; (2) unified-space projection and post-normalization: a linear projection maps the high-dimensional joint conditional features to the preset hidden dimension, realizing deep fusion of the heterogeneous features so that information from the different modalities interacts and the dimensionality matches the input requirement of the subsequent generative model; the projected features undergo a second layer normalization to further stabilize the feature distribution, and the final fused condition sequence is used directly as the conditional input of the flow matching model.
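The frame-level alignment and the normalize-project-normalize fusion of claim 4 can be illustrated with a NumPy sketch. The random matrix stands in for the learnable projection, nearest-neighbour indexing stands in for the transposed-convolution upsampler, and every dimension (192-d timbre, 64-d pitch embedding, 256-d content, hidden size 256) is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    # normalize each frame across the feature channels
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def fuse_conditions(timbre, pitch_emb, content, mask, hidden=256):
    T = pitch_emb.shape[0]
    # (1) broadcast the global 192-d voiceprint over all T frames
    timbre_seq = np.repeat(timbre[None, :], T, axis=0)
    # (3) nearest-neighbour upsampling stands in for the transposed-conv cascade
    idx = np.linspace(0, content.shape[0] - 1, T).round().astype(int)
    content_seq = content[idx]
    # S32-1: channel-wise concatenation, then pre-layer normalization
    joint = np.concatenate([timbre_seq, pitch_emb, content_seq, mask[:, None]], axis=-1)
    joint = layer_norm(joint)
    # S32-2: unified-space projection (random matrix = stand-in for learned weights),
    # followed by post-layer normalization
    W = rng.standard_normal((joint.shape[-1], hidden)) * 0.02
    return layer_norm(joint @ W)
```

Every conditioning stream ends up with one vector per Mel frame, so the generator can consume a single `(T, hidden)` sequence.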
- 5. The flow matching singing beautifying method based on feature decoupling and masking reconstruction of claim 4, characterized in that the step S4 of randomly masking the original Mel spectrogram to obtain masking features comprises designing a dynamic masking training strategy: (1) contiguous-segment masking strategy: for an input complete Mel spectrogram of length T, a time start is randomly selected and a contiguous span of frames is masked, the length of the masked span scaling with the total length, i.e. L = αT, where the proportionality coefficient α is sampled from a preset interval (e.g. [0.3, 0.5]), thereby obtaining the masked Mel spectrum; (2) mask flag construction: a binary mask flag vector consistent with the time dimension of the Mel spectrum is generated synchronously, where a frame's flag is 1 if the frame is not masked and 0 if it is masked; the flags explicitly indicate the regions the model must reconstruct and enhance its awareness of the missing positions; the masked Mel spectrum and the mask flag vector are then concatenated to build the masking features of the model.
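One possible reading of the contiguous-segment masking strategy in claim 5, using the 1 = kept / 0 = masked convention that also appears at inference time in claim 7. The function name, the [0.3, 0.5] default interval and zeroing out the masked frames are illustrative assumptions.

```python
import numpy as np

def mask_mel(mel, rng, ratio_lo=0.3, ratio_hi=0.5):
    # mask one contiguous span of time frames; returns masked mel and binary flags
    n_mels, T = mel.shape
    ratio = rng.uniform(ratio_lo, ratio_hi)          # proportionality coefficient
    span = max(1, int(round(ratio * T)))             # masked-span length, scaled by T
    start = rng.integers(0, T - span + 1)            # random time start
    flags = np.ones(T)                               # 1 = kept, 0 = to be reconstructed
    flags[start:start + span] = 0.0
    masked = mel * flags[None, :]                    # zero out the masked region
    return masked, flags
```

The masking feature fed to the model would be the channel-wise concatenation of `masked` with `flags` broadcast along the frequency axis.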
- 6. The flow matching singing beautifying method based on feature decoupling and masking reconstruction of claim 5, characterized in that in step S5 a flow matching generative model is constructed, the conditional features, masking features and original Mel spectrogram are input into the model for training, and a trained flow matching model is obtained, with the following specific process: compared with a traditional diffusion model, this framework does not require a long multi-step denoising process; it can map noise to data with only a small number of solver steps, and has the advantages of more stable training, faster sampling and more efficient gradient propagation: (1) probability path definition: a probability path is defined from the noise distribution to the data distribution; for a time step t sampled from the uniform distribution U[0, 1], random noise x0 with the same shape as the target spectrum is sampled from a standard normal distribution, and the intermediate noisy sample is obtained by linear interpolation: x_t = (1 - t)·x0 + t·x1, where x0 is Gaussian noise and x1 is the true data, so that x_t is the true data when t = 1 and pure noise when t = 0; (2) velocity field prediction: the core of the model is to train a parameterized velocity field v_θ(x_t, t, c) describing the instantaneous velocity at which samples evolve along the probability path, guiding samples along a straight-line path from the noise distribution to the data distribution; during training the model takes the masking features and conditional features c of each sample as input and predicts the velocity from the current state x_t toward the complete data x1, satisfying the ordinary differential equation (ODE) dx_t/dt = v_θ(x_t, t, c); to achieve this, a U-Net structure is used as the velocity field predictor, comprising a downsampling encoder, an intermediate bottleneck layer and an upsampling decoder, with the time embedding
broadcast to all network layers as a global scalar condition; (3) the training objective is to minimize the mean squared error (MSE) between the predicted and target velocity fields, with the loss defined as L = E_{t, x0, x1} ||v_θ(x_t, t, c) − (x1 − x0)||²; the loss is computed element-wise over the entire Mel spectrogram, forcing the model, at any time step and based on the known context, masking features and multi-dimensional control conditions, to accurately infer the evolution direction required to recover the complete spectrum from the current noisy state; through this training strategy the model not only learns to generate high-quality Mel spectra from pure noise but also acquires a context-aware inpainting capability under partial observation.
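The linear probability path and MSE objective used in flow matching reduce to a few lines of array code. This sketch deliberately omits the U-Net and the conditioning: the velocity predictor is left abstract, and the symbol convention (t = 0 is pure noise, t = 1 is data, target velocity x1 − x0) is taken from the standard flow matching formulation.

```python
import numpy as np

def fm_training_target(data, noise, t):
    # linear probability path: x_t = (1 - t) * x0 + t * x1,
    # so x_t is pure noise at t = 0 and the true data at t = 1
    x_t = (1.0 - t) * noise + t * data
    v_target = data - noise        # constant velocity along the straight-line path
    return x_t, v_target

def fm_loss(v_pred, v_target):
    # element-wise MSE over the whole spectrogram
    return float(np.mean((v_pred - v_target) ** 2))
```

A training step would sample t ~ U[0, 1] and noise ~ N(0, I), build `x_t`, run the (here omitted) U-Net to get `v_pred`, and minimize `fm_loss(v_pred, v_target)`.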
- 7. The flow matching singing beautifying method based on feature decoupling and masking reconstruction of claim 6, characterized in that in step S6 mixed inference conditions are constructed, the amateur singing is beautified based on the trained flow matching model, and the beautified Mel spectrogram is input into a vocoder to obtain the beautified singing, with the following specific process: S61, mixed condition construction and temporal concatenation: (1) masked Mel spectrum construction: a Mel spectrum is constructed by concatenation along the time axis, whose first half is the real amateur singer's Mel spectrum, serving as the known acoustic context reference, and whose second half is the beautified region to be generated; (2) mask flag construction: a corresponding mask flag sequence is constructed, set to 1 for the first half to indicate that the original is preserved and to 0 for the second half to indicate that the model must reconstruct that region;
the concatenated spectrum and the mask flags are stitched together to obtain the masking features; (3) multi-dimensional mixed control condition construction: to preserve the amateur singer's personal singing characteristics while achieving the accurate pitch and rhythm of a professional singer, the conditional features of each dimension are assembled as follows: for the timbre condition, the same singer's voiceprint embedding is used in both the front and rear segments to guarantee strict timbre identity consistency before and after beautification; for the pitch condition, the amateur singer's original pitch is used in the first half and the professional reference pitch in the second half, guiding the model to output an accurate and stable melodic contour in the generated region; for the content condition, the content features of the amateur audio are mapped onto the professional audio's time axis by a time warping algorithm, ensuring that the generated lyric content remains consistent with the original input while the rhythm aligns with the professional singing; for the effective mask, the voiced regions of the amateur and professional segments are marked respectively, helping the model suppress artifacts in silent segments and focus detail in voiced segments; the four kinds of mixed conditional features are integrated according to the fusion flow described in S32 to produce the final control condition, which serves as the core guiding signal of the flow matching model; S62, flow matching sampling and generation: the flow matching ordinary differential equation is solved with the Euler method, realizing deterministic generation from random noise to the target Mel spectrum: (1) initialization: initial noise is sampled from a standard normal distribution, and the continuous time interval [0, 1] is discretized, with the number of steps typically set to 20-50; (2) Euler iterative update: iterating from t = 0 to t = 1, the state evolves from noise toward data and is continuously updated by the Euler method; (3) interception and completion: after
the final iteration step, the complete reconstructed Mel spectrum is obtained, and only its second half is intercepted as the final beautified Mel spectrum output; S63, high-quality beautified audio synthesis: (1) denormalization: the inverse of the normalization in step S12 is applied to restore the beautified Mel spectrum generated in step S62 to the input domain expected by the HiFiGAN vocoder; (2) vocoder synthesis: the denormalized Mel spectrum is input into the pre-trained HiFiGAN generator to synthesize the final beautified singing audio with natural timbre, rich detail and high fidelity.
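The Euler sampler of S62 and the second-half interception of S62(3) can be sketched as follows. `v_field` is any callable standing in for the trained, conditioned velocity predictor; the function names and the 30-step default are assumptions within the 20-50 range the claim states.

```python
import numpy as np

def euler_sample(v_field, x0, n_steps=30):
    # integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data)
    # with fixed-step Euler updates: x <- x + dt * v(x, t)
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * v_field(x, i * dt)
    return x

def beautified_half(mel_full):
    # S62(3): keep only the generated second half along the time axis
    T = mel_full.shape[1]
    return mel_full[:, T // 2:]
```

At inference, `x0` would be standard normal noise with the shape of the concatenated spectrum, and `v_field` would also receive the mixed control conditions of S61; they are omitted here to keep the sketch self-contained.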
Description
Flow Matching Singing Beautifying Method Based on Feature Decoupling and Masking Reconstruction
Technical Field
The invention relates to a flow matching singing beautifying method based on feature decoupling and masking reconstruction.
Background
Singing voice is an important carrier of human emotional expression and artistic creation, rich in culture and meaning. However, non-professional singers, lacking systematic vocal training, often show obvious deficiencies in breath control and vocal stability, so that pitch is difficult to control accurately while singing and the resulting sound quality falls short of the ideal. Thus, despite the large number of singing enthusiasts, many people still struggle with unsatisfactory results. In this context, singing beautification technology shows growing market potential in fields such as music production and multimedia entertainment. Whether for fine-tuning vocals in professional music production, dubbing songs for film and television, real-time touch-up in online karaoke applications, or demo generation in intelligent composition assistance, singing beautification plays a critical role and has become a key bridge between original creation and final artistic presentation. The traditional beautification workflow depends heavily on professional audio engineers and requires tedious manual editing, pitch correction and effects tuning with expensive software and hardware; it is time-consuming, labor-intensive and has a very high technical threshold. Therefore, how to achieve automated, intelligent singing beautification without sacrificing artistic expression has become an important topic of common concern in academia and industry.
The key goal of singing beautification is to improve singing quality on two levels without altering the semantic content of the original performance or the singer's vocal identity: correcting basic singing metrics, including pitch and rhythm accuracy, and enhancing the artistic expressiveness of the voice, such as improving sound quality, breath control and singing technique. Existing singing beautification methods focus mainly on the subtask of automatic pitch correction, aiming to correct off-pitch notes in amateur singing so that they better match an expected pitch curve, such as a target MIDI melody or the reference pitch of a professional performance. However, such methods generally address only the single dimension of pitch accuracy and do not further consider or optimize other key artistic attributes of the singing voice, such as the fullness and purity of timbre, singing techniques such as vibrato and breath control, rhythmic stability, and the continuity and expressiveness of emotion. A performance that is on pitch but lacks expressiveness often sounds mechanical and rigid, and cannot meet the demands of professional artistic creation or a high-quality listening experience. With the rapid progress of generative artificial intelligence, singing beautification research is shifting from single-dimensional correction toward a multi-dimensional, generative, comprehensive beautification paradigm. Existing methods are mainly based on variational autoencoders or diffusion models and address the two aspects of pitch correction and sound quality enhancement. However, existing generative beautification methods still have several limitations. First, they depend on paired data: supervised training requires strictly parallel amateur-professional data pairs.
Such data is extremely expensive to acquire, since professional singers must separately record amateur-style and professional versions of the same song content, and such pairs are difficult to obtain at scale in the real world, which limits the generalization ability of the models. Second is the problem of singing style loss: in pursuing professionalism or idealization, existing methods easily over-modify or erase the original personality characteristics of the singer, so that the beautified voice loses its recognizability and emotional authenticity. How to effectively preserve the original singer's personal singing characteristics and style during beautification remains a major challenge.
Disclosure of Invention
The invention aims to overcome the shortcomings of existing singing beautification methods, namely poor overall expressiveness, dependence on paired data for model training, and unnatural beautification results caused by the loss of personalized singing style, and provides a flow matching singing beautifying method based on feature decoupling and masking reconstruction. A flow matching singing beautif