CN-121983079-A - Musical accompaniment separation method and system based on vocal part structure perception

CN 121983079 A

Abstract

The invention relates to the technical field of music source separation, in particular to a method and a system for separating musical accompaniment based on vocal part structure perception. The invention provides a musical accompaniment separation method and a system based on vocal structure perception, which are used for separating musical accompaniment, introducing a vocal structure perception mechanism and modeling the structural relation among different vocal parts in music, so as to obtain a cleaner and structurally clear main vocal part, and improving the usability and stability of a separation result in a downstream generated audio task.

Inventors

  • Weng Zhenqiang
  • Chen Gongyu
  • Chen Zihao
  • Ding Chaofan

Assignees

  • 巨人移动技术有限公司 (Giant Mobile Technology Co., Ltd.)

Dates

Publication Date
2026-05-05
Application Date
2026-02-06

Claims (8)

  1. A musical accompaniment separation method based on vocal part structure perception, characterized by comprising the following steps: S1, from the data construction angle, taking multi-track music data as the model training data source, wherein the multi-track music comprises a lead vocal track, a harmony vocal track, and an accompaniment track; S2, from the model structure design angle: S21, taking a mixed music audio signal as the input of the musical accompaniment separation model, first obtaining a time-frequency feature representation by time-frequency transformation of the input audio and feeding it to a band-independent encoder, wherein the band-independent encoder splits and encodes the time-frequency features along the frequency-band dimension to obtain a feature representation for each band; S22, the feature representation encoded by the band-independent encoder is input to a vocal separation backbone network for modeling; S23, the features output by the vocal separation backbone network are connected to a plurality of mutually independent vocal part prediction branches, which respectively generate the separation results corresponding to the lead vocal part, the harmony vocal part, and the accompaniment part; S3, training the model: S31, in the model training stage, inputting the constructed multi-track music training data into the model for training; S32, in the training process, the mixed music audio is encoded by the time-frequency transformation and the band-independent encoder and then fed into the vocal separation backbone network for forward computation; after the forward pass, the prediction results for the lead vocal part, the harmony vocal part, and the accompaniment part are output respectively, the corresponding multi-track part stems are used as supervision signals, and a separation reconstruction loss is computed on the prediction results to update the parameters; and S34, separating the musical accompaniment with the trained model.
  2. The musical accompaniment separation method based on vocal part structure perception according to claim 1, wherein the data construction is performed as follows: S11, in the data construction stage, only multi-track music with a single lead vocalist is screened for training, and samples with multiple lead singers taking turns, antiphonal singing, or unstable lead-vocal identity are eliminated; S12, when extracting the speaker embedding, since the lead vocal part exists in the multi-track data as an independent audio track, the lead vocal track is directly taken as the object of speaker feature extraction, and a context-aware masked speaker embedding model is used to extract the embedding from the lead vocal track.
  3. The musical accompaniment separation method based on vocal part structure perception according to claim 1, wherein the vocal separation backbone network structurally fuses the global speaker embedding as conditioning information participating in the feature computation.
  4. The musical accompaniment separation method based on vocal part structure perception according to claim 1, further comprising: S24, establishing an auxiliary structure module connected only in the training stage, wherein the auxiliary structure module performs structural modeling on the outputs of the vocal part prediction branches and consists of a plurality of discriminator sub-networks, specifically a multi-period discriminator, a multi-scale discriminator, and a multi-resolution spectrogram discriminator, which model and score the outputs of the vocal part prediction branches from different periodic characteristics, different time scales, and different spectral resolutions, respectively.
  5. The musical accompaniment separation method based on vocal part structure perception according to claim 4, wherein in the training process of S32 the speaker embedding extracted from the lead vocal track is injected into the vocal separation backbone network as global condition information; specifically, it acts on the central-layer features of the vocal separation backbone network through conditional modulation, and the global condition information is dropped with a preset probability.
  6. The musical accompaniment separation method based on vocal part structure perception according to claim 5, wherein in the training process of S32 the auxiliary structure module constructs, according to preset rules, a plurality of vocal part combination forms from the prediction branch outputs obtained by the model and the corresponding real part tracks, and the combination audio composed from the prediction branch outputs together with the corresponding reference combination audio is input into the discriminator sub-networks for modeling.
  7. The musical accompaniment separation method based on vocal part structure perception according to claim 6, wherein the vocal part combination forms include: combining the lead vocal part from the prediction branch outputs with the real accompaniment part, with the combination of the real lead vocal part and the real accompaniment part as the corresponding reference combination; and combining the lead vocal part from the prediction branch outputs with the real harmony part, or combining the real lead vocal part with the harmony part from the prediction branch outputs, with the combination of the real lead vocal part and the real harmony part as the corresponding reference combination; meanwhile, the vocal separation backbone network updates its parameters according to the outputs of the discriminator sub-networks, using an optimization objective different from that of the discriminator sub-networks, thereby realizing joint training of the vocal separation backbone network and the auxiliary structure module.
  8. A musical accompaniment separation system based on vocal part structure perception, characterized in that the system is built with the musical accompaniment separation method according to any one of claims 1-7 and thereby realizes the musical accompaniment separation.
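The encoder-and-branch pipeline recited in steps S21-S23 of claim 1 can be sketched as follows. This is a minimal illustrative sketch in Python, not the claimed architecture: the even band split, the toy per-band encoder, and the uniform three-part masks are all assumptions made for clarity.

```python
# Sketch of S21-S23: band-split encoding of a time-frequency
# representation, a shared feature stage, and three independent
# part-prediction heads. All shapes and operations are illustrative.

def split_bands(spec, n_bands):
    """Split each spectral frame into n_bands contiguous frequency bands."""
    n_bins = len(spec[0])
    edges = [round(i * n_bins / n_bands) for i in range(n_bands + 1)]
    return [[frame[edges[b]:edges[b + 1]] for frame in spec]
            for b in range(n_bands)]

def encode_band(band, weight):
    """Toy band-independent encoder: scale every bin by a band weight."""
    return [[x * weight for x in frame] for frame in band]

def predict_masks(features, n_parts=3):
    """Toy independent heads: one soft mask per part, per band, per frame."""
    return [[[[1.0 / n_parts for _ in frame] for frame in band]
             for band in features]
            for _ in range(n_parts)]

# Mixture magnitude spectrogram: 4 frames x 8 frequency bins.
mix = [[float(f + b) for b in range(8)] for f in range(4)]
bands = split_bands(mix, n_bands=2)
feats = [encode_band(band, weight=0.5) for band in bands]
masks = predict_masks(feats)  # order: lead vocal, harmony, accompaniment
```

The structural point mirrored here is that each frequency band is encoded independently, and the three prediction heads share the encoded features but produce mutually independent outputs for the lead vocal, harmony, and accompaniment parts.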
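The supervision described in step S32 of claim 1 can be illustrated as a per-part reconstruction loss summed over the three predicted parts against their reference multi-track stems. Mean absolute error is an assumed loss form; the claim does not fix the exact reconstruction loss.

```python
# Sketch of the S32 supervision: each predicted part is compared to its
# reference stem and the per-part reconstruction errors are summed.
# Mean absolute error is an illustrative choice, not the claimed loss.

def mae(pred, ref):
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)

def separation_loss(predictions, references):
    """Sum of per-part errors (lead vocal, harmony, accompaniment)."""
    return sum(mae(p, r) for p, r in zip(predictions, references))

preds = [[0.9, -1.0], [0.1, 0.2], [0.5, 0.4]]   # predicted part stems
refs  = [[1.0, -1.0], [0.0, 0.2], [0.5, 0.5]]   # reference part stems
loss = separation_loss(preds, refs)  # approximately 0.15
```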
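The conditioning scheme of claim 5 — injecting a global speaker embedding into the backbone through conditional modulation and dropping it with a preset probability — can be sketched as a scale-and-shift modulation. The scale-and-shift form and the two-component embedding are assumptions; the claim specifies only "conditional modulation".

```python
import random

# Sketch of claim 5: a global speaker embedding modulates mid-layer
# features (scale-and-shift here, an assumption) and is randomly
# dropped with a preset probability during training, so the model
# also learns to separate without the condition.

def modulate(features, embedding, drop_prob, rng):
    """Scale-and-shift features with the embedding, or drop it entirely."""
    if rng.random() < drop_prob:
        return list(features)          # condition dropped: identity pass
    scale, shift = embedding
    return [scale * x + shift for x in features]

rng = random.Random(0)
feats = [1.0, 2.0, 3.0]
out = modulate(feats, embedding=(2.0, 0.5), drop_prob=0.1, rng=rng)
```

With this seed the condition is kept, so every feature is scaled by 2.0 and shifted by 0.5; with `drop_prob=1.0` the features pass through unchanged.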
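The combination-based discriminator training of claims 6 and 7 can be illustrated as follows: predicted and reference parts are mixed into combination signals and scored pairwise. The simple energy-based "discriminator" below is a placeholder assumption standing in for the multi-period, multi-scale, and multi-resolution spectrogram discriminators of claim 4.

```python
# Sketch of claims 6-7: combination audio built from prediction branch
# outputs is scored against the corresponding reference combinations.
# The energy-based score is a placeholder, not a trained discriminator.

def mix(a, b):
    return [x + y for x, y in zip(a, b)]

def discriminator_score(signal):
    """Placeholder realism score: negative mean absolute amplitude."""
    return -sum(abs(x) for x in signal) / len(signal)

def combination_losses(pred_lead, real_lead, real_harmony, real_accomp):
    # Claim 7 combination forms: (predicted lead + real accompaniment)
    # vs (real lead + real accompaniment), and (predicted lead + real
    # harmony) vs (real lead + real harmony).
    pairs = [
        (mix(pred_lead, real_accomp), mix(real_lead, real_accomp)),
        (mix(pred_lead, real_harmony), mix(real_lead, real_harmony)),
    ]
    # Score gap per pair: reference combination vs predicted combination.
    return [discriminator_score(ref) - discriminator_score(fake)
            for fake, ref in pairs]

pred = [0.8, -1.2, 0.6]   # predicted lead vocal (toy waveform)
lead = [1.0, -1.0, 0.5]   # real lead vocal stem
harm = [0.2, 0.1, -0.1]   # real harmony stem
acc  = [0.5, 0.5, 0.5]    # real accompaniment stem
losses = combination_losses(pred, lead, harm, acc)
```

In joint training, per claim 7, the separation backbone would be updated from these discriminator outputs with an objective different from the discriminators' own.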

Description

Musical accompaniment separation method and system based on vocal part structure perception

Technical Field

The invention relates to the technical field of music source separation, in particular to a method and a system for separating musical accompaniment based on vocal part structure perception.

Background

Most current mainstream musical accompaniment separation methods are based on deep learning models and realize the separation of vocals and accompaniment through end-to-end modeling of the mixed audio. Although such methods achieve a certain effect on signal reconstruction metrics, the following drawbacks remain. Existing musical accompaniment separation techniques generally divide the music signal into only two source types, "vocals" and "accompaniment", and ignore the structural difference between the lead vocal and the harmony contained in the vocals. In actual music production and application scenarios, the lead vocal usually carries the main semantic and melodic information, while the harmony is used to enhance the overall listening experience and the thickness of the music. In the prior art, the lead vocal and the harmony are often merged into a single vocal track during separation, so the separation result suffers from obvious part aliasing (it still contains a large amount of harmony residue). This in turn compromises the completeness and consistency of the lead vocal sequence required by subsequent applications, and easily introduces interference into tasks such as singing voice conversion, pitch extraction, and semantic alignment.
Moreover, the separation results struggle to meet the requirements of downstream tasks: applications such as singing voice conversion, pitch estimation, and lip-sync driving place high demands on the temporal consistency and semantic purity of the lead vocal sequence, and the vocal results produced by existing separation methods easily introduce interference that degrades the overall generation quality. There is also a lack of constraint mechanisms for vocal part structural consistency: most prior art optimizes only with waveform- or spectrum-level reconstruction errors and does not constrain the relations between different parts from the perspective of musical structure, so confusion of part roles easily occurs. It is therefore necessary to provide a method and system for musical accompaniment separation based on vocal part structure perception that can constrain and optimize the separation process by incorporating vocal part structure information.

Disclosure of the Invention

The invention aims to provide a musical accompaniment separation method and system based on vocal part structure perception that can constrain and optimize the musical accompaniment separation process by incorporating vocal part structure information.
In order to solve the problems in the prior art, the invention provides a musical accompaniment separation method based on vocal part structure perception, which comprises the following steps: S1, from the data construction angle, taking multi-track music data as the model training data source, wherein the multi-track music comprises a lead vocal track, a harmony vocal track, and an accompaniment track; S2, from the model structure design angle: S21, taking a mixed music audio signal as the input of the musical accompaniment separation model, first obtaining a time-frequency feature representation by time-frequency transformation of the input audio and feeding it to a band-independent encoder, wherein the band-independent encoder splits and encodes the time-frequency features along the frequency-band dimension to obtain a feature representation for each band; S22, the feature representation encoded by the band-independent encoder is input to the vocal separation backbone network for modeling; S23, the features output by the vocal separation backbone network are connected to a plurality of mutually independent vocal part prediction branches, which respectively generate the separation results corresponding to the lead vocal part, the harmony vocal part, and the accompaniment part; S3, training the model: S31, in the model training stage, inputting the constructed multi-track music training data into the model for training; S32, in the training process, the mixed music audio is encoded by the time-frequency transformation and the band-independent encoder and then fed into the vocal separation backbone network for forward computation; after the forward pass, the prediction results for the lead vocal part, the harmony vocal part, and the accompaniment part