
CN-121999796-A - Sound source separation method and device


Abstract

Embodiments of the present application provide a sound source separation method and device in the technical field of data processing. The method comprises: transforming an audio signal to be separated from a time-domain signal into a time-frequency-domain signal; performing frequency band division on the time-frequency-domain signal to divide it into a plurality of sub-band signals whose frequency bands do not overlap; respectively acquiring spectral features of the sub-band signals; acquiring a spectrum mask of at least one sound source of the audio signal to be separated according to the spectral features of the sub-band signals; and acquiring the audio signal of the at least one sound source according to its spectrum mask and the time-frequency-domain signal. The embodiments of the present application serve to improve the robustness of sound source separation algorithms.
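The final step described above recovers each source as the elementwise product of the mixture's time-frequency signal and that source's spectral mask. A minimal sketch, assuming two sources with complementary real-valued masks (all array shapes and names here are illustrative, not from the patent):

```python
import numpy as np

# Hypothetical illustration: a complex mixture spectrogram and two
# complementary spectral masks; each source's time-frequency signal
# is recovered as mask * mixture, the product step described above.
rng = np.random.default_rng(0)
mixture = rng.standard_normal((129, 50)) + 1j * rng.standard_normal((129, 50))

mask_a = rng.uniform(0.0, 1.0, size=mixture.shape)  # mask for source A
mask_b = 1.0 - mask_a                               # complementary mask for source B

source_a = mask_a * mixture  # time-frequency signal of source A
source_b = mask_b * mixture  # time-frequency signal of source B

# With complementary masks, the separated spectrograms sum back to the mixture.
assert np.allclose(source_a + source_b, mixture)
```

Complementary masks are only one possibility; in practice the masks are estimated per source by the model and need not sum to one.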

Inventors

  • XIA XIANJUN
  • ZHANG ZIHAN
  • HUANG CHUANZENG

Assignees

  • 北京字跳网络技术有限公司 (Beijing Zitiao Network Technology Co., Ltd.)

Dates

Publication Date
2026-05-08
Application Date
2024-11-08

Claims (17)

  1. A sound source separation method, comprising: transforming an audio signal to be separated from a time-domain signal into a time-frequency-domain signal; performing frequency band division on the time-frequency-domain signal to divide it into a plurality of sub-band signals, wherein the frequency bands of the plurality of sub-band signals do not overlap; respectively acquiring spectral features of the plurality of sub-band signals; acquiring a spectrum mask of at least one sound source of the audio signal to be separated according to the spectral features of the plurality of sub-band signals; and acquiring an audio signal of the at least one sound source according to the spectrum mask of the at least one sound source and the time-frequency-domain signal.
  2. The method according to claim 1, wherein transforming the audio signal to be separated from a time-domain signal into a time-frequency-domain signal comprises: performing a short-time Fourier transform (STFT) on the audio signal to be separated so as to transform it from a time-domain signal into a time-frequency-domain signal.
  3. The method according to claim 1, wherein performing frequency band division on the time-frequency-domain signal to divide it into a plurality of sub-band signals comprises: performing frequency band division on a first frequency band of the time-frequency-domain signal based on a first band interval, and performing frequency band division on a second frequency band of the time-frequency-domain signal based on a second band interval, wherein the maximum frequency of the first frequency band is smaller than the minimum frequency of the second frequency band, and the first band interval is smaller than the second band interval.
  4. The method according to claim 1, wherein respectively acquiring spectral features of the plurality of sub-band signals comprises: respectively performing feature extraction on the plurality of sub-band signals to obtain sub-band features of each sub-band signal; stacking the sub-band features of each sub-band signal to obtain stacked features; and acquiring the spectral features of each sub-band signal according to the stacked features.
  5. The method according to claim 4, wherein performing feature extraction on the plurality of sub-band signals to obtain sub-band features of each sub-band signal comprises: extracting features of each sub-band signal through a multi-layer perceptron (MLP) corresponding to that sub-band signal so as to obtain its sub-band features, wherein the MLP corresponding to each sub-band signal consists of a root mean square normalization layer and a linear layer connected in series.
  6. The method according to claim 4, wherein acquiring the spectral features of each sub-band signal according to the stacked features comprises: modeling each sub-band feature of the stacked features in the time dimension through a local feature extraction module to obtain local features composed of the time-sequence features of each sub-band signal; transposing the local features to obtain first transposed features; modeling each time-sequence feature of the first transposed features in the feature-stacking dimension through a global feature extraction module to obtain global features composed of the band features of each sub-band signal; transposing the global features to obtain second transposed features; fusing the second transposed features through a multi-head self-attention mechanism to obtain fused features; and splitting the fused features to obtain the spectral features of each sub-band signal.
  7. The method according to claim 6, wherein the local feature extraction module and/or the global feature extraction module is a dual-path Mamba module, the dual-path Mamba module comprising: a first path including a first Mamba block, a first root mean square normalization layer, and a first adder, wherein the input of the first Mamba block is the input of the dual-path Mamba module, the input of the first root mean square normalization layer is the output of the first Mamba block, the inputs of the first adder are the output of the first Mamba block and the output of the first root mean square normalization layer, and the output of the first adder is the output of the first path; a second path including a flip layer, a second Mamba block, a second root mean square normalization layer, and a second adder, wherein the flip layer is configured to flip the input of the dual-path Mamba module, the input of the second Mamba block is the output of the flip layer, the input of the second root mean square normalization layer is the output of the second Mamba block, the inputs of the second adder are the output of the second Mamba block and the output of the second root mean square normalization layer, and the output of the second adder is the output of the second path; a splicing layer for splicing the output of the first path and the output of the second path; and a linear layer whose input is the output of the splicing layer and whose output is the output of the dual-path Mamba module.
  8. The method according to claim 1, wherein acquiring a spectrum mask of at least one sound source of the audio signal to be separated according to the spectral features of the respective sub-band signals comprises: for each sound source, processing the spectral features of each sub-band signal through a corresponding mask estimation module to obtain the spectrum masks of the sub-band signals for that sound source, and splicing those spectrum masks to obtain the spectrum mask of the sound source, wherein the mask estimation module corresponding to each sub-band consists of an MLP and a gated linear unit (GLU) connected in series, the MLP consisting of a root mean square normalization layer and a linear layer connected in series.
  9. The method according to claim 1, wherein acquiring the audio signal of the at least one sound source according to the spectrum mask of the at least one sound source and the time-frequency-domain signal comprises: respectively calculating the product of the time-frequency-domain signal and the spectrum mask of each sound source to obtain the time-frequency-domain signal of each sound source; and respectively transforming the time-frequency-domain signal of each sound source into a time-domain signal to acquire the audio signal of each sound source.
  10. The method according to claim 9, wherein respectively transforming the time-frequency-domain signal of each sound source into a time-domain signal comprises: respectively performing an inverse short-time Fourier transform (ISTFT) on the time-frequency-domain signal of each sound source.
  11. The method according to any one of claims 1-10, wherein the sound source separation method is implemented based on a sound source separation model, the sound source separation model comprising: a transformation module for performing the step of transforming the audio signal to be separated from a time-domain signal into a time-frequency-domain signal; a division module for performing the step of dividing the time-frequency-domain signal into a plurality of sub-band signals; an acquisition module for performing the step of respectively acquiring spectral features of the plurality of sub-band signals; a mask estimation module for performing the step of acquiring a spectrum mask of at least one sound source of the audio signal to be separated according to the spectral features of the plurality of sub-band signals; and an output module for performing the step of acquiring the audio signal of the at least one sound source according to the spectrum mask of the at least one sound source and the time-frequency-domain signal.
  12. The method according to claim 11, further comprising, before implementing the sound source separation method based on the sound source separation model: acquiring a training data set comprising a plurality of groups of training data, wherein any group of training data comprises a sample audio signal and a reference audio signal of at least one sound source corresponding to the sample audio signal; and training the sound source separation model based on the training data set.
  13. The method according to claim 12, wherein training the sound source separation model based on the training data set comprises: inputting the sample audio signal into the sound source separation model, and acquiring the predicted audio signal of at least one sound source corresponding to the sample audio signal output by the model, together with the time-frequency-domain signal of that predicted audio signal; acquiring the time-frequency-domain signal of the reference audio signal of the at least one sound source; calculating a loss value corresponding to the sample audio signal according to the reference audio signal, the predicted audio signal, the time-frequency-domain signal of the predicted audio signal, and the time-frequency-domain signal of the reference audio signal; and updating the model parameters of the sound source separation model according to the loss value.
  14. A sound source separation apparatus, comprising: a transformation unit for transforming an audio signal to be separated from a time-domain signal into a time-frequency-domain signal; a division unit for performing frequency band division on the time-frequency-domain signal to divide it into a plurality of sub-band signals, wherein the frequency bands of the plurality of sub-band signals do not overlap; an acquisition unit for respectively acquiring spectral features of the plurality of sub-band signals; a processing unit for acquiring a spectrum mask of at least one sound source of the audio signal to be separated according to the spectral features of the plurality of sub-band signals; and an output unit for acquiring the audio signal of the at least one sound source according to the spectrum mask of the at least one sound source and the time-frequency-domain signal.
  15. An electronic device comprising a memory and a processor, the memory being configured to store a computer program, and the processor being configured to cause the electronic device, when the computer program is executed, to implement the sound source separation method of any one of claims 1-13.
  16. A computer-readable storage medium having stored thereon a computer program which, when executed by a computing device, causes the computing device to implement the sound source separation method of any one of claims 1-13.
  17. A computer program product which, when run on a computer, causes the computer to implement the sound source separation method of any one of claims 1-13.
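Claim 3 describes a non-uniform band division: a finer interval below a split frequency and a coarser interval above it, with no overlap between sub-bands. A minimal NumPy sketch of such a division (the function name, bin counts, and step sizes are illustrative assumptions, not values from the patent):

```python
import numpy as np

def split_bands(spec, split_bin, low_step, high_step):
    """Divide a (freq, time) spectrogram into non-overlapping sub-bands:
    a finer interval (low_step bins) below split_bin and a coarser
    interval (high_step bins) from split_bin upward, as in claim 3."""
    bands = []
    for start in range(0, split_bin, low_step):
        bands.append(spec[start:min(start + low_step, split_bin)])
    for start in range(split_bin, spec.shape[0], high_step):
        bands.append(spec[start:start + high_step])
    return bands

spec = np.arange(20 * 3).reshape(20, 3)  # toy spectrogram: 20 bins, 3 frames
bands = split_bands(spec, split_bin=8, low_step=2, high_step=4)
assert len(bands) == 4 + 3                          # 4 narrow + 3 wide bands
assert np.array_equal(np.concatenate(bands), spec)  # non-overlapping cover
```

The finer low-frequency resolution reflects the fact that most perceptually important audio structure (pitch, harmonics) is concentrated at low frequencies.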

Description

Sound source separation method and device

Technical Field

The present application relates to the field of audio processing technologies, and in particular to a method and an apparatus for separating sound sources.

Background

Sound source separation, also known as audio signal source separation or audio source separation, is an audio processing technique that aims to separate individual sound source components from a mixed audio signal. Traditional separation schemes rely primarily on signal processing techniques such as independent component analysis. These schemes extract the characteristics of one or more sound sources by performing frequency-domain or time-frequency-domain analysis on the audio signal, thereby separating the sources. However, such schemes often struggle to achieve satisfactory separation for complex audio signals. With the development of machine learning and deep learning, data-driven schemes have made remarkable progress in the field of sound source separation: by training a deep neural network (DNN), features and patterns for sound source separation can be learned from large amounts of audio data, enabling more accurate and efficient separation. However, these algorithms show insufficient robustness across different types of audio signals. For example, if a certain type of audio signal is not represented in the training data set, the algorithm's separation performance on that type of signal may degrade significantly.

Disclosure of Invention

In view of this, embodiments of the present application provide a method and a device for separating sound sources, which serve to improve the robustness of sound source separation algorithms.
In order to achieve the above object, embodiments of the present application provide the following technical solutions. In a first aspect, an embodiment of the present application provides a sound source separation method, including: transforming an audio signal to be separated from a time-domain signal into a time-frequency-domain signal; performing frequency band division on the time-frequency-domain signal to divide it into a plurality of sub-band signals, wherein the frequency bands of the plurality of sub-band signals do not overlap; respectively acquiring spectral features of the plurality of sub-band signals; acquiring a spectrum mask of at least one sound source of the audio signal to be separated according to the spectral features of the plurality of sub-band signals; and acquiring the audio signal of the at least one sound source according to the spectrum mask of the at least one sound source and the time-frequency-domain signal. As an optional implementation, transforming the audio signal to be separated from a time-domain signal into a time-frequency-domain signal includes performing a short-time Fourier transform on the audio signal to be separated.
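The STFT/ISTFT pair used for the time-domain/time-frequency-domain conversions can be illustrated with SciPy. The sample rate, window, and frame length below are assumptions for the sketch; the patent does not fix them:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)  # 1 s, 440 Hz test tone (time domain)

# Time domain -> time-frequency domain (the STFT step)
f, frames, Z = stft(x, fs=fs, nperseg=512)

# Time-frequency domain -> time domain (the inverse STFT step)
_, x_rec = istft(Z, fs=fs, nperseg=512)

# The default Hann window with 50% overlap satisfies the COLA condition,
# so the round trip reconstructs the signal up to numerical precision.
assert np.allclose(x, x_rec[: len(x)], atol=1e-8)
```

In the separation pipeline, the masked per-source spectrograms take the place of `Z` before the inverse transform.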
As an optional implementation, performing frequency band division on the time-frequency-domain signal to divide it into a plurality of sub-band signals includes: performing frequency band division on a first frequency band of the time-frequency-domain signal based on a first band interval, and performing frequency band division on a second frequency band of the time-frequency-domain signal based on a second band interval, wherein the maximum frequency of the first frequency band is smaller than the minimum frequency of the second frequency band, and the first band interval is smaller than the second band interval. As an optional implementation, respectively acquiring spectral features of the plurality of sub-band signals includes: respectively performing feature extraction on the plurality of sub-band signals to obtain sub-band features of each sub-band signal; stacking the sub-band features of each sub-band signal to obtain stacked features; and acquiring the spectral features of each sub-band signal according to the stacked features. As an optional implementation, performing feature extraction on the plurality of sub-band signals includes extracting features of each sub-band signal through a multi-layer perceptron corresponding to that sub-band signal, wherein the multi-layer perceptron corresponding to each sub-band signal consists of a root mean square normalization layer and a linear layer connected in series.
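The per-sub-band perceptron described above, a root mean square normalization layer followed in series by a linear layer, can be sketched in NumPy. The weight shapes, epsilon, and function names are illustrative assumptions:

```python
import numpy as np

def rms_norm(x, eps=1e-8):
    # Root mean square normalization over the feature (last) axis
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def subband_mlp(x, weight, bias):
    # RMSNorm and a linear layer connected in series, as described above
    return rms_norm(x) @ weight + bias

rng = np.random.default_rng(1)
frames = rng.standard_normal((100, 8))  # 100 frames of an 8-dim sub-band
weight = rng.standard_normal((8, 16))   # linear layer: 8 -> 16 features
bias = np.zeros(16)

features = subband_mlp(frames, weight, bias)
assert features.shape == (100, 16)
# After RMSNorm, each frame has unit root-mean-square magnitude.
assert np.allclose(np.sqrt(np.mean(rms_norm(frames) ** 2, axis=-1)), 1.0, atol=1e-3)
```

One such perceptron per sub-band lets each frequency region learn its own projection before the sub-band features are stacked.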