CN-121983077-A - Audio separation method, system and related device

Abstract

The application discloses an audio separation method, an audio separation system, and a related device. The method comprises: obtaining mixed audio and determining a mixed frequency domain signal corresponding to the mixed audio, wherein the mixed audio comprises target audio of different target objects; obtaining frequency distribution features corresponding to the mixed frequency domain signal based on frequency distribution information corresponding to the mixed frequency domain signal; performing feature extraction on the frequency distribution features in both the time dimension and the frequency dimension to obtain target extraction features matched with the frequency distribution features; obtaining fundamental frequency difference information and formant difference information from the target extraction features, and determining an audio mask matched with at least one target object based on the fundamental frequency difference information and the formant difference information; and obtaining the target audio matched with each target object based on the audio mask and the mixed frequency domain signal. In this way, the application can improve the accuracy and quality of audio separation.

Inventors

LIAO YU
MA FENG
GAO JIANQING

Assignees

iFLYTEK Co., Ltd. (科大讯飞股份有限公司)

Dates

Publication Date: 2026-05-05
Application Date: 2025-12-17

Claims (10)

1. An audio separation method, comprising: acquiring mixed audio, and determining a mixed frequency domain signal corresponding to the mixed audio, wherein the mixed audio comprises target audio of different target objects; obtaining frequency distribution features corresponding to the mixed frequency domain signal based on frequency distribution information corresponding to the mixed frequency domain signal; performing feature extraction on the frequency distribution features by combining a time dimension and a frequency dimension to obtain target extraction features matched with the frequency distribution features; acquiring fundamental frequency difference information and formant difference information in the target extraction features, and determining an audio mask matched with at least one target object based on the fundamental frequency difference information and the formant difference information; and acquiring target audio matched with each target object based on the audio mask and the mixed frequency domain signal.
2. The audio separation method according to claim 1, wherein obtaining the frequency distribution features corresponding to the mixed frequency domain signal based on the frequency distribution information corresponding to the mixed frequency domain signal comprises: acquiring the frequency distribution information corresponding to the mixed frequency domain signal, and dividing the mixed frequency domain signal into a plurality of frequency domain sub-signals based on the frequency distribution information; and acquiring a frequency domain sub-feature corresponding to each frequency domain sub-signal, and fusing the frequency domain sub-features corresponding to all the frequency domain sub-signals to obtain the frequency distribution features.
3. The audio separation method according to claim 1, wherein performing feature extraction on the frequency distribution features by combining the time dimension and the frequency dimension to obtain the target extraction features matched with the frequency distribution features comprises: performing feature extraction in the time dimension on the frequency distribution features to obtain time domain extraction features; and performing feature extraction in the frequency dimension on the time domain extraction features to obtain the target extraction features.
4. The audio separation method according to claim 3, wherein performing feature extraction in the time dimension on the frequency distribution features to obtain the time domain extraction features comprises: acquiring the frequency distribution features after normalization in the time dimension; performing feature extraction in the time dimension on the normalized frequency distribution features to obtain first reference features; and fusing the current frequency distribution features and the first reference features to obtain the time domain extraction features; and wherein performing feature extraction in the frequency dimension on the time domain extraction features to obtain the target extraction features comprises: acquiring the time domain extraction features after normalization in the frequency dimension; performing feature extraction in the frequency dimension on the normalized time domain extraction features to obtain second reference features; and fusing the current time domain extraction features and the second reference features to obtain the target extraction features.
5. The audio separation method according to claim 1, wherein acquiring the fundamental frequency difference information and the formant difference information in the target extraction features, and determining the audio mask matched with at least one of the target objects based on the fundamental frequency difference information and the formant difference information, comprises: inputting the target extraction features to a constructed feature perception module, and outputting, by the feature perception module and based on an attention mechanism, target difference features related to the fundamental frequency difference information and the formant difference information; and inputting the target difference features to a classification module coupled to the feature perception module, and outputting, by the classification module, the audio mask matched with at least one of the target objects.
6. The audio separation method according to claim 5, wherein the classification module comprises a splitting network and a classification network coupled to each other, and wherein inputting the target difference features to the classification module coupled to the feature perception module and outputting, by the classification module, the audio mask matched with at least one of the target objects comprises: dividing, by the splitting network, the target difference features into a plurality of difference sub-features based on a preset dividing strategy, wherein the preset dividing strategy is related to the frequency distribution information; inputting each difference sub-feature into a corresponding classification network to obtain a classification result output by the classification network; and obtaining, based on all the classification results, the audio mask matched with at least one of the target objects.
7. The audio separation method according to claim 5, wherein inputting the target extraction features to the constructed feature perception module and outputting, by the feature perception module and based on the attention mechanism, the target difference features related to the fundamental frequency difference information and the formant difference information further comprises: judging whether the target difference features output by the feature perception module meet a preset condition, wherein the preset condition is related to the number of times the target difference features have been output; and in response to the currently output target difference features not meeting the preset condition, taking the currently output target difference features as updated frequency distribution features, and returning to the step of performing feature extraction on the frequency distribution features by combining the time dimension and the frequency dimension, so that at least one further round of feature extraction is performed to obtain target extraction features matched with the frequency distribution features.
8. An audio separation system, comprising: an acquisition module, used for acquiring mixed audio and determining a mixed frequency domain signal corresponding to the mixed audio, wherein the mixed audio comprises target audio of different target objects; a first extraction module, used for obtaining frequency distribution features corresponding to the mixed frequency domain signal based on frequency distribution information corresponding to the mixed frequency domain signal; a second extraction module, used for performing feature extraction on the frequency distribution features by combining a time dimension and a frequency dimension to obtain target extraction features matched with the frequency distribution features; a processing module, used for acquiring fundamental frequency difference information and formant difference information in the target extraction features, and determining an audio mask matched with at least one target object based on the fundamental frequency difference information and the formant difference information; and a separation module, used for acquiring target audio matched with each target object based on the audio mask and the mixed frequency domain signal.
9. An electronic device comprising a memory and a processor coupled to each other, the memory having program instructions stored therein, the processor for executing the program instructions to implement the method of any of claims 1-7.
10. A computer readable storage medium having stored thereon program instructions, which when executed by a processor, implement the method of any of claims 1-7.
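
The two-stage extraction recited in claims 3 and 4 (normalize, extract a reference feature, fuse it with the current feature, first along time and then along frequency) can be illustrated with a minimal sketch. The claims do not name the extractor or the fusion operation, so the 3-point moving average `smooth` and the additive fusion below are hypothetical stand-ins, as are all function names.

```python
import math


def normalize(values, eps=1e-8):
    """Zero-mean, unit-variance normalization of one sequence."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def smooth(values):
    """Hypothetical stand-in for a learned extractor: 3-point moving average."""
    n = len(values)
    out = []
    for i in range(n):
        window = values[max(0, i - 1):min(n, i + 2)]
        out.append(sum(window) / len(window))
    return out


def transpose(matrix):
    """Swap the two axes of a list-of-lists matrix."""
    return [list(row) for row in zip(*matrix)]


def stage(rows):
    """Normalize each row, extract a reference feature, and fuse by addition.

    Additive (residual) fusion is an assumption; the claims only say "fusing".
    """
    return [[x + r for x, r in zip(row, smooth(normalize(row)))] for row in rows]


def time_then_frequency(features):
    """features[t][f]: T time frames by F frequency bands."""
    # Time stage: after transposing, each row is one band's sequence over time.
    time_features = transpose(stage(transpose(features)))
    # Frequency stage: each row is one frame's sequence over frequency.
    return stage(time_features)


features = [
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 3.0, 4.0, 5.0],
]
target_features = time_then_frequency(features)
```

The output keeps the input's shape (time frames by frequency bands), so under the scheme of claim 7 it could be fed back in as the frequency distribution features for another round.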

Description

Audio separation method, system and related device

Technical Field

The present application relates to the field of audio processing technologies, and in particular to an audio separation method, system, and related device.

Background

In the task of separating voices from mixed audio, conventional techniques distinguish the voices of different target objects by their inherent physical differences, relying on a preset cut-off frequency in the frequency domain. A fixed frequency threshold separates the high-frequency and low-frequency components of the mixed signal, and these components are taken as the sounds of the different target objects. Because this approach relies on a frequency threshold alone, its separation accuracy is low. How to improve the accuracy and quality of audio separation is therefore a problem to be solved.

Disclosure of Invention

The technical problem mainly solved by the application is to provide an audio separation method, an audio separation system, and a related device that can improve the accuracy and quality of audio separation.
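
The fixed-threshold approach described in the Background can be sketched as follows, to make its limitation concrete. This is an illustrative baseline only, not part of the application: the naive DFT helpers, the function `split_by_cutoff`, the sample rate, and the two test tones are all assumptions chosen for the example.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft(); returns the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def split_by_cutoff(signal, sample_rate, cutoff_hz):
    """Split a signal into low and high parts at a fixed cut-off frequency."""
    n = len(signal)
    spectrum = dft(signal)
    low = [0j] * n
    high = [0j] * n
    for k, value in enumerate(spectrum):
        # Bins k and n - k are mirror images, so both map to the same frequency.
        freq_hz = min(k, n - k) * sample_rate / n
        if freq_hz <= cutoff_hz:
            low[k] = value
        else:
            high[k] = value
    return idft(low), idft(high)


sample_rate = 8000
n = 64  # bin spacing is 8000 / 64 = 125 Hz
tone_a = [math.sin(2 * math.pi * 250 * t / sample_rate) for t in range(n)]
tone_b = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(n)]
mixed = [a + b for a, b in zip(tone_a, tone_b)]

low_part, high_part = split_by_cutoff(mixed, sample_rate, cutoff_hz=500)
```

Here the two sources sit entirely on opposite sides of the cut-off, so the split recovers them. When both sources have energy on both sides of the threshold, as overlapping voices generally do, no single cut-off can assign each component to the right source.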
To solve the above technical problem, one technical solution adopted by the application is to provide an audio separation method, comprising: obtaining mixed audio and determining a mixed frequency domain signal corresponding to the mixed audio, wherein the mixed audio comprises target audio of different target objects; obtaining frequency distribution features corresponding to the mixed frequency domain signal based on frequency distribution information corresponding to the mixed frequency domain signal; performing feature extraction on the frequency distribution features by combining a time dimension and a frequency dimension to obtain target extraction features matched with the frequency distribution features; obtaining fundamental frequency difference information and formant difference information in the target extraction features, and determining an audio mask matched with at least one target object based on the fundamental frequency difference information and the formant difference information; and obtaining target audio matched with each target object based on the audio mask and the mixed frequency domain signal.
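
The final step of the method, obtaining target audio from the audio mask and the mixed frequency domain signal, can be sketched as an element-wise product over time-frequency bins. The text does not specify how the mask is applied, so the multiplication, the complementary mask `1 - mask` for a second target, and all names and values below are assumptions for illustration; the inverse transform back to a waveform is omitted.

```python
def apply_mask(mixed_spec, mask):
    """Element-wise product of a time-frequency mask and the mixed spectrum.

    mixed_spec[t][f] is a complex bin; mask[t][f] is a weight in [0, 1].
    """
    return [[m * x for m, x in zip(mask_frame, spec_frame)]
            for mask_frame, spec_frame in zip(mask, mixed_spec)]


def complement(mask):
    """Assumed mask for a second target: whatever the first mask leaves out."""
    return [[1.0 - m for m in frame] for frame in mask]


# Two frames by three frequency bins of a made-up mixed spectrum.
mixed_spec = [
    [1 + 1j, 2 + 0j, 0.5j],
    [0.5 + 0j, 1 - 1j, 2 + 2j],
]
# Made-up mask for target A; in the method it would come from the
# fundamental frequency and formant difference information.
mask_a = [
    [1.0, 0.25, 0.0],
    [0.5, 0.0, 1.0],
]

target_a = apply_mask(mixed_spec, mask_a)
target_b = apply_mask(mixed_spec, complement(mask_a))
```

With complementary masks the two target spectra sum back to the mixed spectrum, so no energy is lost or duplicated in the split.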
To solve the above technical problem, another technical solution adopted by the application is to provide an audio separation system comprising an acquisition module, a first extraction module, a second extraction module, a processing module, and a separation module. The acquisition module is used for acquiring mixed audio and determining a mixed frequency domain signal corresponding to the mixed audio, wherein the mixed audio comprises target audio of different target objects. The first extraction module is used for obtaining frequency distribution features corresponding to the mixed frequency domain signal based on frequency distribution information corresponding to the mixed frequency domain signal. The second extraction module is used for performing feature extraction on the frequency distribution features by combining a time dimension and a frequency dimension to obtain target extraction features matched with the frequency distribution features. The processing module is used for acquiring fundamental frequency difference information and formant difference information in the target extraction features, and determining an audio mask matched with at least one target object based on the fundamental frequency difference information and the formant difference information. The separation module is used for acquiring target audio matched with each target object based on the audio mask and the mixed frequency domain signal.
To solve the above technical problem, a further technical solution adopted by the application is to provide an electronic device comprising a memory and a processor coupled to each other, wherein the memory stores program instructions and the processor is used for executing the program instructions to implement the method of the above technical solution.
The beneficial effects of the application are as follows. Unlike the prior art, the audio separation method of the application, after obtaining mixed audio comprising different target audio, converts the mixed audio into a mixed frequency domain signal and extracts the corresponding frequency distribution features from that signal. Feature extraction is then performed on the frequency distribution features in both the time dimension and the frequency dimension, so that the state of the mixed audio is analyzed in the time-frequency domain and target extraction features with stronger feature expression capability are obtained. Fundamental frequency difference information and formant difference information are extracted from the target extraction features. And mean