CN-122024755-A - Target sound extraction method and device based on state space model and related equipment
Abstract
The invention discloses a method, an apparatus and related equipment for extracting a target sound based on a state space model. The method comprises: acquiring a mixed audio signal containing a target sound and prior condition information of the target sound, and encoding the mixed audio signal to obtain audio coding features; performing long-range temporal dependency modeling on the audio coding features with a state space model to obtain temporal modeling features; generating a condition embedding vector corresponding to the target sound based on the prior condition information; fusing the temporal modeling features with the condition embedding vector to generate a mask for distinguishing the target sound; screening the audio coding features with the mask to obtain enhanced target sound features; and decoding the target sound features to reconstruct a time-domain signal of the target sound. By introducing a state space model for long-range temporal modeling, the method improves the ability to accurately separate the target sound in real time in complex acoustic environments while reducing model complexity and computational cost.
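To make the abstract's pipeline concrete, the following is a minimal NumPy sketch of the encode → model → mask → decode flow. It is an illustrative toy, not the patented implementation: all shapes, parameter values, and function names (`causal_conv_encode`, `ssm_scan`, `extract_target`) are assumptions introduced here for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def causal_conv_encode(x, w, stride=2):
    """Causal 1-D convolution: left-pad so each output depends only on
    current and past samples, then downsample by `stride`."""
    k = len(w)
    x = np.concatenate([np.zeros(k - 1), x])      # left (causal) padding
    y = np.array([x[i:i + k] @ w for i in range(len(x) - k + 1)])
    return y[::stride]

def ssm_scan(u, A, B, C):
    """Discretized linear state space recurrence, O(T) in sequence length:
    h_t = A h_{t-1} + B u_t,  y_t = C h_t."""
    h = np.zeros(A.shape[0])
    out = np.empty_like(u)
    for t, u_t in enumerate(u):
        h = A @ h + B * u_t
        out[t] = C @ h
    return out

def extract_target(mix, class_id, embed_table, w_enc, A, B, C):
    feats = causal_conv_encode(mix, w_enc)        # audio coding features
    modeled = ssm_scan(feats, A, B, C)            # long-range temporal modeling
    cond = embed_table[class_id]                  # condition embedding (scalar toy)
    fused = modeled * cond                        # element-wise modulation
    mask = 1.0 / (1.0 + np.exp(-fused))           # sigmoid mask in (0, 1)
    target_feats = mask * feats                   # screen the encoder features
    return np.repeat(target_feats, 2)             # toy upsampling "decoder"

mix = rng.standard_normal(64)                     # mixed audio signal (toy length)
embed_table = np.array([0.5, 1.5])                # two hypothetical sound classes
A = np.diag([0.9, 0.8]); B = np.ones(2); C = np.array([0.6, 0.4])
w_enc = np.array([0.5, 0.3, 0.2])
y = extract_target(mix, class_id=1, embed_table=embed_table,
                   w_enc=w_enc, A=A, B=B, C=C)
print(y.shape)  # (64,) — time-domain length restored by the toy decoder
```

In a real system each stage would be a learned multi-channel network; the sketch only shows how the stages chain together and why the causal padding keeps each output sample independent of future input.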
Inventors
- LIU YONG
- WANG SHUAIWEI
- ZHENG LEI
Assignees
- South China Normal University (华南师范大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-01-30
Claims (10)
- 1. A method for extracting a target sound based on a state space model, the method comprising: acquiring a mixed audio signal containing a target sound and prior condition information of the target sound, and encoding the mixed audio signal to obtain audio coding features; performing long-range temporal dependency modeling on the audio coding features with a state space model to obtain temporal modeling features, and generating a condition embedding vector corresponding to the target sound based on the prior condition information; fusing the temporal modeling features with the condition embedding vector to generate a mask for distinguishing the target sound; screening the audio coding features with the mask to obtain enhanced target sound features; and decoding the target sound features to reconstruct a time-domain signal of the target sound, wherein the encoding network and the decoding network adopt a causal structure to ensure that processing at any moment depends only on current and historical information.
- 2. The method for extracting a target sound based on a state space model according to claim 1, wherein encoding the mixed audio signal to extract the audio coding features comprises: performing layer-by-layer feature extraction and downsampling on the mixed audio signal through a multi-layer causal convolutional network to obtain the audio coding features.
- 3. The method of claim 2, wherein the state space model is a parameterized state space model, the temporal evolution of the audio features is modeled by a state transition equation, and the temporal modeling features are fused with the audio coding features through residual connections.
- 4. The method for extracting a target sound based on a state space model according to claim 1, wherein generating the condition embedding vector corresponding to the target sound based on the prior condition information comprises: converting a class label of the target sound, through an embedding mapping layer, into a condition embedding vector whose dimension matches that of the audio coding features.
- 5. The method for extracting a target sound based on a state space model according to claim 1, wherein fusing the temporal modeling features with the condition embedding vector to generate a mask for distinguishing the target sound comprises: injecting the condition embedding vector into the temporal modeling features by element-wise modulation or feature mapping to generate condition-guided features; and inputting the condition-guided features into a mask prediction network to generate the mask for distinguishing the target sound.
- 6. The method for extracting a target sound based on a state space model according to claim 1, wherein screening the audio coding features with the mask to obtain the enhanced target sound features comprises: multiplying the mask with the audio coding features element by element to suppress non-target sound feature components, obtaining the enhanced target sound features.
- 7. The method for extracting a target sound based on a state space model according to claim 1, wherein decoding the target sound features to reconstruct the time-domain signal of the target sound comprises: performing feature reconstruction with an upsampling network symmetrical to the encoding network structure; and fusing feature information of the corresponding encoding layers through skip connections to recover the temporal resolution and reconstruct the time-domain signal.
- 8. An apparatus for extracting a target sound based on a state space model, the apparatus comprising: an acquisition module for acquiring a mixed audio signal containing a target sound and prior condition information of the target sound, and encoding the mixed audio signal to obtain audio coding features; a first generation module for performing long-range temporal dependency modeling on the audio coding features with a state space model to obtain temporal modeling features, and generating a condition embedding vector corresponding to the target sound based on the prior condition information; a second generation module for fusing the temporal modeling features with the condition embedding vector to generate a mask for distinguishing the target sound; an obtaining module for screening the audio coding features with the mask to obtain enhanced target sound features; and a reconstruction module for decoding the target sound features to reconstruct a time-domain signal of the target sound, wherein the encoding network and the decoding network adopt a causal structure to ensure that processing at any moment depends only on current and historical information.
- 9. A computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the steps of the method according to any one of claims 1 to 7.
- 10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method according to any one of claims 1 to 7 when the program is executed.
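Claims 5 and 6 describe two fusion options (element-wise modulation or feature mapping) followed by mask prediction and element-wise screening. The NumPy sketch below illustrates both options under assumed toy shapes; the function names (`film_modulate`, `concat_project`, `predict_mask`) and the linear-plus-sigmoid mask head are hypothetical stand-ins, not the claimed network.

```python
import numpy as np

def film_modulate(feats, cond_vec):
    """Element-wise modulation (FiLM-style): the condition embedding scales
    each feature channel to steer the mask toward the target sound class."""
    return feats * cond_vec[None, :]              # (T, D) * (1, D) -> (T, D)

def concat_project(feats, cond_vec, w_proj):
    """Feature-mapping alternative: concatenate the condition embedding to
    every frame, then project back to the feature dimension."""
    tiled = np.tile(cond_vec, (feats.shape[0], 1))
    return np.concatenate([feats, tiled], axis=1) @ w_proj   # (T, 2D) @ (2D, D)

def predict_mask(guided, w_mask):
    """Toy mask prediction head: one linear layer plus a sigmoid, so every
    mask value lies in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(guided @ w_mask)))

rng = np.random.default_rng(1)
T, D = 10, 4
feats = rng.standard_normal((T, D))               # temporal modeling features
cond = rng.standard_normal(D)                     # condition embedding vector
guided = film_modulate(feats, cond)               # claim 5, option 1
guided_alt = concat_project(feats, cond, rng.standard_normal((2 * D, D)))  # option 2
mask = predict_mask(guided, rng.standard_normal((D, D)))
target_feats = mask * feats                       # claim 6: element-wise screening
```

Because the mask is bounded in (0, 1), the element-wise product can only attenuate feature components, which is how non-target components are suppressed rather than amplified.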
Description
Target sound extraction method and device based on state space model and related equipment

Technical Field

The present invention relates to the field of audio signal processing technologies, and in particular to a method and apparatus for extracting a target sound based on a state space model, and a related device.

Background

Target sound extraction is a key research direction in machine hearing and audio signal processing. Its goal is to separate and enhance a specific target sound signal in complex acoustic environments where multiple sound sources are intermixed, mimicking the "cocktail party effect" of the human auditory system. The technology has broad application prospects in voice communication, intelligent hearing aids, environmental sound event detection, interactive audio applications, and the like. Existing target sound extraction methods rely mainly on deep learning, especially encoder-decoder architectures. In these methods, the model typically takes a mixed audio waveform or time-frequency representation as input and introduces prior information about the target sound source (e.g., an example audio clip or a class label) as a conditional input to guide the model to focus on the target source. To model the temporal dependencies in the audio signal, the prior art generally adopts a recurrent neural network, a long short-term memory network, or a Transformer as the core sequence modeling module. However, these mainstream schemes have inherent limitations: models based on recurrent structures struggle to effectively capture long-term dependencies when processing long audio, while Transformer models, although powerful, have a self-attention mechanism whose computational complexity grows quadratically with sequence length, incurring huge computational overhead and memory pressure when processing long audio.
To pursue higher separation precision, existing advanced models keep growing deeper and wider, so their parameter counts become huge and inference latency increases significantly, severely restricting practical deployment on embedded devices, mobile terminals, or in scenarios requiring real-time processing. Although state space models have recently received attention in time-series signal processing tasks and have begun to be explored for speech enhancement thanks to their computational efficiency in long-sequence modeling, for the specific task of target sound extraction there is not yet a systematic design for a lightweight, low-latency overall solution that combines the efficient long-range modeling capability of state space models with a target condition guidance mechanism and a mask generation strategy. Therefore, the field still lacks a target sound extraction method that significantly reduces model complexity and inference latency while preserving high extraction precision, so as to meet real-time and resource-limited application requirements.

Disclosure of Invention

Based on the above problems, embodiments of the invention provide a method, an apparatus, and related equipment for extracting a target sound based on a state space model, which perform long-range temporal modeling by introducing a state space model and, combined with a condition-guided mask generation mechanism, improve the ability to accurately separate the target sound in real time in complex acoustic environments while reducing model complexity and computational cost.
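The efficiency argument above rests on a standard property of linear state space models: the same output can be computed either by an O(T) recurrence or by a convolution with a kernel derived from the model parameters, with no pairwise attention over time steps. The sketch below verifies this equivalence numerically; the scalar-input, two-state setup is an illustrative assumption, not the model used in the invention.

```python
import numpy as np

def ssm_recurrence(u, A, B, C):
    """O(T) sequential scan of the discretized state space model:
    h_t = A h_{t-1} + B u_t,  y_t = C h_t."""
    h = np.zeros(A.shape[0])
    y = np.empty(len(u))
    for t, u_t in enumerate(u):
        h = A @ h + B * u_t
        y[t] = C @ h
    return y

def ssm_convolution(u, A, B, C):
    """Equivalent convolutional view: y_t = sum_{k=0}^{t} (C A^k B) u_{t-k}.
    The kernel depends only on (A, B, C), so no quadratic all-pairs
    interaction over time steps is ever formed."""
    T = len(u)
    kernel = np.array([C @ np.linalg.matrix_power(A, k) @ B for k in range(T)])
    return np.array([kernel[:t + 1][::-1] @ u[:t + 1] for t in range(T)])

rng = np.random.default_rng(0)
u = rng.standard_normal(50)                  # toy input sequence
A = np.diag([0.9, 0.5])                      # stable transition (|eigenvalues| < 1)
B = np.array([1.0, 1.0])
C = np.array([0.7, 0.3])
print(np.allclose(ssm_recurrence(u, A, B, C), ssm_convolution(u, A, B, C)))  # True
```

The recurrent form is what enables the claimed causal, low-latency streaming inference: each step only reads the current input and the fixed-size state.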
An embodiment of the invention provides a target sound extraction method based on a state space model, comprising: acquiring a mixed audio signal containing a target sound and prior condition information of the target sound; encoding the mixed audio signal to obtain audio coding features; performing long-range temporal dependency modeling on the audio coding features with the state space model to obtain temporal modeling features; generating a condition embedding vector corresponding to the target sound based on the prior condition information; fusing the temporal modeling features with the condition embedding vector to generate a mask for distinguishing the target sound; screening the audio coding features with the mask to obtain enhanced target sound features; and decoding the target sound features to reconstruct a time-domain signal of the target sound, wherein the encoding network and the decoding network adopt a causal structure so that processing at any moment depends only on current and historical information. In one possible embodiment, encoding the mixed audio signal and extracting the audio coding features comprises: performing layer-by-layer feature extraction and downsampling on t