
CN-122024754-A - Audio separation method, device, computer device and storage medium

CN122024754A

Abstract

The present disclosure relates to an audio separation method, an audio separation device, a computer device, and a storage medium, in the field of deep learning, and addresses the difficulty of processing complex, rapidly varying music signals. The method comprises: extracting time-domain features and frequency-domain features of original audio data; generating, from the time-domain and frequency-domain features, a high-dimensional feature vector adapted to the application scene, wherein the high-dimensional feature vector indicates the contribution of each feature to audio separation; and performing sound-source splitting on the high-dimensional feature vector to obtain vocal audio data and accompaniment audio data. The technical scheme of the disclosure is suitable for splitting multiple sound sources in audio, and realizes a lightweight, accurate, and fast mechanism for splitting complex in-car music signals.

Inventors

  • YANG YI
  • YAO LIN
  • ZHONG XU
  • HUANG XIAOFAN
  • YANG JIN

Assignees

  • Chery Automobile Co., Ltd. (奇瑞汽车股份有限公司)

Dates

Publication Date
2026-05-12
Application Date
2026-01-15

Claims (10)

  1. An audio separation method, comprising: extracting time-domain features and frequency-domain features of original audio data; generating a high-dimensional feature vector adapted to an application scene from the time-domain features and the frequency-domain features, wherein the high-dimensional feature vector indicates the contribution of each feature to audio separation; and performing sound-source splitting on the high-dimensional feature vector to obtain vocal audio data and accompaniment audio data.
  2. The audio separation method according to claim 1, wherein the frequency-domain features are extracted by: dividing the original audio data into a plurality of short-time frames, with overlapping windows between adjacent short-time frames; performing a Fourier transform on each short-time frame according to the following expression to obtain the frequency-domain signal of each short-time frame: X(k) = Σ_{n=0}^{N−1} x(n)·w(n)·e^(−j2πkn/N), wherein X(k) represents the k-th frequency component in the frequency domain, x(n) represents the n-th sample of the short-time frame signal in the time domain, w(n) represents the window function, and N represents the frame length of the short-time frame; extracting amplitude information and/or phase information of the frequency-domain signal of each short-time frame, frame by frame; and splicing the amplitude information and/or phase information of the short-time frames in time order to obtain the frequency-domain features.
  3. The audio separation method according to claim 1, wherein the step of generating a high-dimensional feature vector adapted to an application scene from the time-domain features and the frequency-domain features comprises: acquiring at least one time-sequence vector of the time-domain features; acquiring at least one frequency-domain vector of the frequency-domain features; splicing the time-domain vectors and the frequency-domain vectors to obtain a high-dimensional feature matrix; generating weights for the high-dimensional feature matrix through an attention network, the weights reflecting the contribution of each feature in the high-dimensional feature matrix to audio separation; and applying the weights to the high-dimensional feature matrix and fusing the weighted high-dimensional feature matrix according to the following expression to obtain the high-dimensional feature vector: V = Σ_{i=1}^{M} F′[i], wherein V is the high-dimensional feature vector obtained after fusion, F′ is the weighted high-dimensional feature matrix, F′[i] represents the weighted feature vector of the i-th short-time frame, and M represents the total number of short-time frames.
  4. The audio separation method of claim 3, wherein the weights are attention weights, and the step of generating the weights of the high-dimensional feature matrix through an attention network comprises: acquiring a weight matrix; and calculating the attention weights according to the following expression: Attention(F) = softmax(WF + b), wherein Attention(F) represents the attention weights, W represents the weight matrix, F represents the high-dimensional feature matrix, b represents a bias vector, and softmax represents a normalization function.
  5. The audio separation method according to claim 1, wherein the step of performing sound-source splitting on the high-dimensional feature vector to obtain the vocal audio data and the accompaniment audio data comprises: performing deep feature encoding on the high-dimensional feature vector to obtain a deep feature vector; performing sound-source feature splitting on the deep feature vector to obtain a vocal feature matrix and/or an accompaniment feature matrix; and decoding the vocal feature matrix to obtain the vocal audio data, and decoding the accompaniment feature matrix to obtain the accompaniment audio data.
  6. The audio separation method according to claim 5, wherein the step of performing sound-source feature splitting on the deep feature vector to obtain a vocal feature matrix and/or an accompaniment feature matrix comprises: splitting the deep feature vector, which contains the deep features of the mixed sound sources, through a semantic segmentation model to obtain a vocal feature matrix containing vocal information and an accompaniment feature matrix containing accompaniment information.
  7. The audio separation method of claim 1, further comprising: updating a parameter configuration comprising any one or more of the following parameters: parameters of the attention network that provides the weights when generating the high-dimensional feature vector; parameters of the encoder that performs deep feature encoding during sound-source separation; parameters of the semantic segmentation model that performs sound-source feature splitting during sound-source separation; and parameters of the decoder that performs decoding during sound-source separation.
  8. An audio separation device, comprising: a multi-domain feature extraction module, configured to extract time-domain features and frequency-domain features of original audio data; a feature fusion module, configured to generate a high-dimensional feature vector adapted to an application scene from the time-domain features and the frequency-domain features, wherein the high-dimensional feature vector indicates the contribution of each feature to audio separation; and an audio separation module, configured to perform sound-source splitting on the high-dimensional feature vector to obtain vocal audio data and accompaniment audio data.
  9. A computer apparatus, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to perform the audio separation method of any one of claims 1 to 7.
  10. A non-transitory computer-readable storage medium, wherein the instructions in the storage medium, when executed by a processor of a computer, enable the computer to perform the audio separation method of any one of claims 1 to 7.
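The splitting step of claims 5 and 6 can be illustrated with a minimal sketch. The encoder, the semantic segmentation model, and the decoders are not specified in the claims, so the sketch below assumes a common mask-based formulation: a segmentation head emits per-element logits for the two sources, a softmax turns them into complementary masks, and masking the deep features yields one feature matrix per source. All shapes and the random stand-in inputs are illustrative, not taken from the disclosure.

```python
import numpy as np

def split_sources(deep_feats, mask_logits):
    """Split deep features of the mixed sources into vocal and
    accompaniment feature matrices via complementary softmax masks."""
    # Softmax across the two sources so the masks sum to 1 per element:
    # every element of the deep features is shared between the sources.
    exps = np.exp(mask_logits - mask_logits.max(axis=0, keepdims=True))
    masks = exps / exps.sum(axis=0, keepdims=True)   # shape (2, D, M)
    vocal_feats = masks[0] * deep_feats              # vocal feature matrix
    accomp_feats = masks[1] * deep_feats             # accompaniment feature matrix
    return vocal_feats, accomp_feats

rng = np.random.default_rng(1)
deep_feats = rng.standard_normal((16, 10))           # encoder output (stand-in)
mask_logits = rng.standard_normal((2, 16, 10))       # segmentation-head output (stand-in)
vocal_feats, accomp_feats = split_sources(deep_feats, mask_logits)
```

Because the masks are complementary, the two split feature matrices always recombine to the original deep features; each matrix would then be passed to its own decoder to reconstruct the vocal and accompaniment audio.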

Description

Audio separation method, device, computer device and storage medium

Technical Field

The present disclosure relates to the field of deep learning, and in particular to an audio separation method, an audio separation device, a computer device, and a storage medium.

Background

With users' growing demand for in-car entertainment functions, vehicle-mounted microphone-free karaoke is becoming popular as an emerging entertainment mode. Traditional vehicle entertainment systems can only play music and cannot provide a karaoke function. Microphone-free karaoke separates the accompaniment from the vocals in a song through an audio processing algorithm, so that a user can sing along in the vehicle directly through the vehicle-mounted sound system without holding a microphone, providing a brand-new entertainment experience. Commonly used accompaniment/vocal separation algorithms fall mainly into time-domain algorithms and frequency-domain algorithms. Time-domain algorithms focus on the temporal characteristics of the signal and process the audio signal directly on the time axis, but have difficulty effectively separating accompaniment and vocals in complex music signals. Frequency-domain algorithms, by converting the audio signal into the frequency domain for processing, can analyze the frequency components of the signal more finely, but have difficulty accurately separating accompaniment and vocals in rapidly varying music signals. Both the time-domain and the frequency-domain approaches therefore struggle with complex, rapidly varying music signals, which degrades the accompaniment separation result.

Disclosure of Invention

To overcome the problems in the related art, the present disclosure provides an audio separation method, apparatus, computer apparatus, and storage medium.
The method combines the time-domain features and frequency-domain features of the audio data to generate a high-dimensional feature vector adapted to the application scene, and splits the sound sources according to the high-dimensional feature vector. It realizes a lightweight, accurate, and fast mechanism for splitting complex in-car music signals, and solves the problem of processing complex, rapidly varying music signals. According to a first aspect of embodiments of the present disclosure, there is provided an audio separation method, including: extracting time-domain features and frequency-domain features of original audio data; generating a high-dimensional feature vector adapted to an application scene from the time-domain features and the frequency-domain features, wherein the high-dimensional feature vector indicates the contribution of each feature to audio separation; and performing sound-source splitting on the high-dimensional feature vector to obtain vocal audio data and accompaniment audio data. Further, the frequency-domain features are extracted by: dividing the original audio data into a plurality of short-time frames, with overlapping windows between adjacent short-time frames; performing a Fourier transform on each short-time frame according to the following expression: X(k) = Σ_{n=0}^{N−1} x(n)·w(n)·e^(−j2πkn/N), wherein X(k) represents the k-th frequency component in the frequency domain, x(n) represents the n-th sample of the short-time frame signal in the time domain, w(n) represents the window function, and N represents the frame length of the short-time frame; extracting amplitude information and/or phase information of the frequency-domain signal of each short-time frame, frame by frame; and splicing the amplitude information and/or phase information of the short-time frames in time order to obtain the frequency-domain features.
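The frequency-domain extraction described above (overlapping short-time frames, a windowed Fourier transform per frame, amplitudes spliced in time order) can be sketched in a few lines of NumPy. The frame length, hop size, and Hann window are illustrative assumptions, not values specified in the disclosure.

```python
import numpy as np

def stft_magnitude_features(audio, frame_len=1024, hop=512):
    """Split audio into overlapping short-time frames, window each frame,
    apply the Fourier transform, and stack the per-frame amplitudes
    in time order as the frequency-domain features."""
    window = np.hanning(frame_len)           # window function w(n)
    n_frames = 1 + (len(audio) - frame_len) // hop
    frames = np.stack([audio[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # X(k) = sum_{n=0}^{N-1} x(n) w(n) e^{-j 2 pi k n / N}, one row per frame
    spectra = np.fft.rfft(frames * window, axis=1)
    # keep the amplitude information, spliced in time order
    return np.abs(spectra)                   # shape (n_frames, frame_len // 2 + 1)

feats = stft_magnitude_features(np.random.randn(16000))  # 1 s stand-in signal
```

With a hop of half the frame length, each frame overlaps its neighbor by 50 %, matching the overlapping windows between adjacent short-time frames required by the method; phase information could be kept alongside by returning `np.angle(spectra)` as well.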
Further, the step of generating a high-dimensional feature vector adapted to an application scene from the time-domain features and the frequency-domain features includes: acquiring at least one time-sequence vector of the time-domain features; acquiring at least one frequency-domain vector of the frequency-domain features; splicing the time-domain vectors and the frequency-domain vectors to obtain a high-dimensional feature matrix; acquiring the weights of the high-dimensional feature matrix through an attention network, the weights reflecting the contribution of each feature in the high-dimensional feature matrix to audio separation; and applying the weights to the high-dimensional feature matrix and fusing the weighted high-dimensional feature matrix according to the following expression to obtain the high-dimensional feature vector: V = Σ_{i=1}^{M} F′[i], wherein V is the high-dimensional feature vector obtained after fusion, F′ is the weighted high-dimensional feature matrix, F′[i] represents the weighted feature vector of the i-th short-time frame, and M represents the total number of short-time frames. Further, the weight is an attention weight, and the step of generating the weights of the high-dimensional feature matrix through an attention network includes: acquiring a weight matrix; The
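The attention weighting and fusion described above can be sketched as follows. The shapes (a D × M feature matrix with one column per short-time frame, a 1 × D weight matrix producing one weight per frame) are assumptions for illustration; the disclosure does not fix the dimensions, and the random inputs stand in for learned parameters.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(F, W, b):
    """Attention(F) = softmax(W F + b): one normalized weight per frame;
    the weighted frame vectors are then summed into a single fused vector,
    V = sum_{i=1}^{M} F'[i]."""
    A = softmax(W @ F + b, axis=-1)          # attention weights, shape (1, M)
    F_weighted = F * A                       # weight each frame's column of F
    return F_weighted.sum(axis=1)            # fused high-dimensional feature vector

rng = np.random.default_rng(0)
D, M = 8, 5                                  # feature dim and frame count (arbitrary)
F = rng.standard_normal((D, M))              # high-dimensional feature matrix
W = rng.standard_normal((1, D))              # weight matrix (random stand-in)
b = np.zeros((1, 1))                         # bias vector
V = attention_fuse(F, W, b)
```

Because the softmax normalizes the weights to sum to 1 across the M frames, the fused vector is a convex combination of the per-frame feature vectors, with frames that contribute more to separation weighted more heavily.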