
CN-121983080-A - Brain-controlled speech enhancement method and system based on EEG signal neural decoding

CN 121983080 A

Abstract

The invention discloses a brain-controlled speech enhancement method and system based on EEG signal neural decoding. The method comprises: synchronously acquiring multichannel speech signals and EEG signals; using an EEG encoding network to extract compact auditory-attention-related features from the EEG signals as an auditory cognitive state representation; constructing from that representation an auditory consensus field that evolves continuously on a slow time scale; performing deep-learning-based time-frequency modeling of the multichannel speech signals to construct a deep speech time-frequency analysis network; using the auditory consensus field to apply speech time-frequency spectral gain modulation and neural beamforming weight gain modulation to the signals; and converting the modulated signals back to the time domain to output the final enhanced target speech signal. By constructing an EEG-driven auditory consensus field that neurally modulates both the spectral feature flow of the deep speech time-frequency analysis network and the spatial beamforming weights of the neural network beamforming module, the invention improves the robustness and stability of speech enhancement in complex acoustic environments.

Inventors

  • LU YUN
  • TAN SONG
  • LIN MINYU

Assignees

  • Huizhou University (惠州学院)

Dates

Publication Date
2026-05-05
Application Date
2025-12-31

Claims (10)

  1. A brain-controlled speech enhancement method based on EEG signal neural decoding, characterized in that the method comprises the following steps: collecting multichannel speech signals in the environment through a multichannel microphone array; acquiring EEG signals that are time-synchronized with the multichannel speech signals; extracting compact auditory-attention-related features from the EEG signals using an EEG encoding network as an auditory cognitive state representation; constructing, according to the auditory cognitive state representation, an auditory consensus field that evolves continuously on a slow time scale; performing deep-learning-based time-frequency modeling of the multichannel speech signals to construct a deep speech time-frequency analysis network comprising an encoder, a decoder and multi-stage skip connections; using the auditory consensus field to perform speech time-frequency spectral gain modulation on the skip connections in the deep speech time-frequency analysis network; performing spatial filtering on the high-level semantic features output by the deep speech time-frequency analysis network, with base spatial filtering weights predicted by a neural network beamformer; mapping the auditory consensus field to a spatial attention subspace through an affine matrix and performing neural beamforming weight gain modulation on the predicted base spatial filtering weights; and converting the signals subjected to spectral gain modulation and neural beamforming weight gain modulation back to the time domain, and outputting the final enhanced target speech signal.
  2. The brain-controlled speech enhancement method based on EEG signal neural decoding according to claim 1, wherein the compact auditory-attention-related features extracted from the EEG signals as the auditory cognitive state representation are specifically expressed as z_EEG(t) = φ_EEG(E(c, t)), where z_EEG(t) denotes the auditory cognitive state vector at time t, φ_EEG(·) denotes the nonlinear mapping of the EEG encoding network, E(c, t) denotes the EEG signals acquired at time t, and c denotes the number of EEG channels.
  3. The brain-controlled speech enhancement method based on EEG signal neural decoding according to claim 1, wherein the auditory consensus field is specifically expressed as S_f(t) = (1 − λ_f)·S_f(t−1) + λ_f·φ_f(z_EEG(t)), where S_f(t) denotes the auditory consensus field state vector at time t, S_f(t−1) denotes the state vector at the previous time t−1, z_EEG(t) denotes the auditory cognitive state vector at time t, φ_f(·) denotes the nonlinear mapping of the auditory consensus field that converts the cognitive state vector z_EEG(t) into an instantaneous contribution to the field, and λ_f denotes the temporal update coefficient.
  4. The brain-controlled speech enhancement method based on EEG signal neural decoding according to claim 1, wherein the multichannel speech signals are time-frequency modeled based on deep learning to construct a deep speech time-frequency analysis network comprising an encoder, a decoder and multi-stage skip connections, wherein the deep speech time-frequency analysis network comprises a multichannel speech input and time-frequency representation module, a sub-band feature extraction module, an encoder, a bottleneck module, a decoder and a skip connection module, wherein: the multichannel speech input and time-frequency representation module performs a short-time Fourier transform on the input multichannel speech signals, decomposes the complex spectrum into its real and imaginary parts to form a tensor of dimensions (C, 2, T, F), where C denotes the number of speech channels acquired by the different microphones, T denotes the number of time frames and F denotes the number of frequency bins, and rearranges and reshapes this tensor into a tensor of new dimensions; the sub-band feature extraction module enhances local frequency correlation by extracting sub-band features from the input spectrogram; the encoder downsamples the high-dimensional spectral representation obtained from the sub-band feature extraction module through convolution operations, gradually compressing it into a compact latent feature vector; the bottleneck module refines and enhances the latent feature vectors; the decoder reconstructs the refined and enhanced latent feature vectors to the original input dimensions through upsampling; the skip connection module pre-processes the low-level features extracted by the encoder, concatenates them channel-wise with the deep features of the decoder, aligns the channels of the concatenated feature maps through 1x1 convolution, and finally raises the spatial resolution step by step through transposed convolutions so that the output feature maps are fully aligned in dimension with the original input.
  5. The brain-controlled speech enhancement method based on EEG signal neural decoding according to claim 1, wherein the speech time-frequency spectral gain modulation is specifically expressed as SK̃_i(c, t, f) = γ_i(c) ⊙ SK_i(c, t, f), where γ_i(c) = σ(w_i·S_f(t)) denotes the channel-level modulation vector of the auditory consensus field over the speech time-frequency spectrum, SK_i(c, t, f) denotes the skip connection feature output by the i-th encoder layer, SK̃_i(c, t, f) denotes the feature map after gain modulation, t and f denote the f-th frequency bin at time t, ⊙ denotes element-wise multiplication, w_i denotes a learnable linear mapping matrix, σ(·) is a nonlinear activation function, and c denotes the number of EEG channels.
  6. The brain-controlled speech enhancement method based on EEG signal neural decoding according to claim 1, wherein the neural beamforming weight gain modulation is specifically expressed as ω(c, t, f) = ω_0(c, t, f) ⊙ (1 + β·γ_b), where γ_b = σ(w_b·S_f(t)) denotes the channel-level neural modulation of the auditory consensus field S_f(t) on the base spatial filtering weights ω_0(c, t, f), t and f denote the f-th frequency bin at time t, ⊙ denotes element-wise multiplication, β is the modulation intensity coefficient, w_b is a learnable linear mapping matrix, σ(·) is a nonlinear activation function, and c denotes the number of EEG channels.
  7. The brain-controlled speech enhancement method based on EEG signal neural decoding according to claim 1, further comprising training the model, comprising the steps of: inputting synchronized multichannel speech signals and EEG signals; constructing an auditory cognitive state representation via the EEG encoding network and updating the auditory consensus field; completing, under modulation by the auditory consensus field, the speech time-frequency spectral modulation and the neural beamforming weight modulation; computing a loss function based on the complex-valued mean square error between the enhanced speech and the target speech; and jointly updating all network parameters using the error back-propagation algorithm.
  8. A brain-controlled speech enhancement system based on EEG signal neural decoding, the system comprising: a speech acquisition module for collecting multichannel speech signals in the environment through a multichannel microphone array; an EEG acquisition module for acquiring EEG signals that are time-synchronized with the multichannel speech signals; an EEG encoding module for extracting compact auditory-attention-related features from the EEG signals as an auditory cognitive state representation; an auditory consensus field construction module for constructing, according to the auditory cognitive state representation, an auditory consensus field that evolves continuously on a slow time scale; a deep speech time-frequency analysis network construction module for performing deep-learning-based time-frequency modeling of the multichannel speech signals to construct a deep speech time-frequency analysis network comprising an encoder, a decoder and multi-stage skip connections; an EEG spectral feature modulation module for using the auditory consensus field to perform speech time-frequency spectral gain modulation on the skip connections in the deep speech time-frequency analysis network; a neural network beamforming module for performing spatial filtering on the high-level semantic features output by the deep speech time-frequency analysis network and predicting base spatial filtering weights; an EEG beamforming weight modulation module for mapping the auditory consensus field to a spatial attention subspace through an affine matrix and performing neural beamforming weight gain modulation on the predicted base spatial filtering weights; and a speech reconstruction module for converting the signals subjected to spectral gain modulation and neural beamforming weight gain modulation back to the time domain and outputting the final enhanced target speech signal.
  9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, performs the steps of the brain-controlled speech enhancement method based on EEG signal neural decoding according to any one of claims 1 to 7.
  10. A non-transitory computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the brain-controlled speech enhancement method based on EEG signal neural decoding according to any one of claims 1 to 7.
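
The (C, 2, T, F) time-frequency tensor of claim 4 can be sketched in NumPy. The STFT parameters (256-point FFT, Hann window, hop of 128) and the 4-microphone, 16 kHz example are illustrative assumptions; the patent does not specify them:

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Single-channel STFT returning a (T, F) complex spectrogram (Hann window)."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)  # F = n_fft // 2 + 1 frequency bins

def speech_to_tensor(multichannel_x, n_fft=256, hop=128):
    """Map C-channel audio to the (C, 2, T, F) real/imaginary tensor of claim 4."""
    specs = np.stack([stft(ch, n_fft, hop) for ch in multichannel_x])  # (C, T, F)
    return np.stack([specs.real, specs.imag], axis=1)                  # (C, 2, T, F)

# Example: 4-microphone array, 1 second of audio at 16 kHz
x = np.random.randn(4, 16000)
tensor = speech_to_tensor(x)
print(tensor.shape)  # (4, 2, 124, 129)
```

The subsequent rearrangement "into a tensor of new dimensions" is left out, since the claim does not state the target layout.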
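A minimal NumPy sketch of the consensus-field update of claim 3 and the skip-connection gain modulation of claim 5. The dimensions, the tanh choice for φ_f, λ_f = 0.1, and the random weight matrices are all illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_consensus_field(S_prev, z_eeg, W_phi, lam=0.1):
    """Claim 3: S_f(t) = (1 - lam)*S_f(t-1) + lam*phi_f(z_EEG(t)).
    phi_f is taken here as a tanh-activated linear map (an assumed form)."""
    return (1.0 - lam) * S_prev + lam * np.tanh(W_phi @ z_eeg)

def modulate_skip(SK_i, S_f, W_i):
    """Claim 5: gamma_i(c) = sigma(w_i * S_f(t)); SK'_i = gamma_i(c) ⊙ SK_i(c,t,f)."""
    gamma = sigmoid(W_i @ S_f)             # one gain per channel c
    return gamma[:, None, None] * SK_i     # broadcast over (t, f)

rng = np.random.default_rng(0)
d_eeg, d_field, C, T, F = 8, 16, 4, 10, 33
W_phi = 0.1 * rng.standard_normal((d_field, d_eeg))
W_i = 0.1 * rng.standard_normal((C, d_field))

S_f = np.zeros(d_field)
for _ in range(5):                         # slow evolution: small lambda_f
    S_f = update_consensus_field(S_f, rng.standard_normal(d_eeg), W_phi)

SK = rng.standard_normal((C, T, F))        # skip-connection feature map
SK_mod = modulate_skip(SK, S_f, W_i)
print(SK_mod.shape)                        # (4, 10, 33)
```

Since σ maps into (0, 1), this gain attenuates per-channel feature energy rather than amplifying it, which matches the claim's reading of γ_i(c) as a channel-level gate.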
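Claims 6 and 7 can be sketched similarly. The (1 + β·γ_b) form of the weight update is one plausible reading of claim 6, and all dimensions, weights, and the delay-and-sum-style combination are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def modulate_beam_weights(w0, S_f, W_b, beta=0.5):
    """Claim 6 (assumed form): gamma_b = sigma(w_b * S_f(t));
    omega(c,t,f) = omega_0(c,t,f) ⊙ (1 + beta*gamma_b), broadcast over (t, f)."""
    gamma_b = sigmoid(W_b @ S_f)                        # per-channel gain
    return w0 * (1.0 + beta * gamma_b[:, None, None])

def complex_mse(est, target):
    """Claim 7: complex-valued mean square error loss."""
    return float(np.mean(np.abs(est - target) ** 2))

rng = np.random.default_rng(1)
C, T, F, d_field = 4, 10, 33, 16
w0 = rng.standard_normal((C, T, F)) + 1j * rng.standard_normal((C, T, F))
S_f = rng.standard_normal(d_field)
W_b = 0.1 * rng.standard_normal((C, d_field))

w = modulate_beam_weights(w0, S_f, W_b)
X = rng.standard_normal((C, T, F)) + 1j * rng.standard_normal((C, T, F))
est = np.sum(np.conj(w) * X, axis=0)                    # beamformed output, (T, F)
loss = complex_mse(est, np.zeros((T, F), dtype=complex))
print(w.shape, est.shape)
```

In training (claim 7), this scalar loss would drive joint back-propagation through the EEG encoder, the consensus field mappings, and the acoustic network.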

Description

Brain-controlled speech enhancement method and system based on EEG signal neural decoding

Technical Field

The invention relates to the technical field of brain-computer interfaces, and in particular to a brain-controlled speech enhancement method and system based on EEG signal neural decoding.

Background

In multi-speaker, strong-noise and reverberant environments, traditional speech enhancement and beamforming methods rely primarily on the acoustic signal itself, typically using deep neural networks to learn time-frequency masks or spatial filtering weights from multichannel speech. However, at low signal-to-noise ratios or with insufficient acoustic cues, it is difficult to stably distinguish the target speech from interfering speech on acoustic information alone. In complex acoustic environments, the human auditory system exhibits the "cocktail party effect": it can focus on target speech and ignore interfering speech. Research in cognitive neuroscience in recent years shows that the auditory system, driven by the brain's auditory attention mechanism, can actively and selectively enhance a target sound source at both the spectral and spatial levels. This process is not realized by directly reconstructing acoustic signals; rather, a top-down neural modulation mechanism of the brain continuously regulates the neural gain along the auditory processing pathway. With the rapid development of Brain-Computer Interface (BCI) technology, cognitive-controlled speech enhancement based on Electroencephalogram (EEG) neural decoding has gradually become a research hotspot in the field of intelligent hearing assistance.
Such methods aim to extract and enhance the attended speech signal from a complex acoustic environment by decoding the listener's auditory attention, achieving attention-driven speech enhancement and markedly improving the listener's speech understanding in cocktail-party scenarios. Current cognitive-controlled speech enhancement methods built on EEG neural decoding can be broadly divided into two categories: two-stage approaches and end-to-end approaches. Two-stage approaches generally first apply a speech separation algorithm (such as SepFormer or beamforming) to separate the sources in the mixed speech, then perform Auditory Attention Decoding (AAD) from the EEG signals, and finally select the target speech output according to the decoding result. Because speech enhancement and attention decoding are decoupled, these methods are more robust and can output intelligible speech even when the EEG signal quality is poor, making them better suited to real application scenarios. However, most existing two-stage methods adopt an envelope-reconstruction-based auditory attention decoding strategy that uses only the EEG dynamics related to speech energy fluctuations, ignoring the potential role of speaker-specific features (such as voiceprint) in decoding, which limits further improvement of decoding accuracy. In contrast, end-to-end approaches fuse the EEG signal directly with the mixed speech through a neural network, jointly modeling them and outputting the enhanced target speech. These methods achieve good enhancement performance under controlled experimental conditions and have stronger feature fusion capability.
However, most existing EEG-assisted speech enhancement methods either concatenate the EEG signals as additional features directly onto the acoustic network input or use them for explicit attention selection, which easily introduces modality inconsistency, lacks explicit support from neuro-inspired theory, and limits system stability and interpretability.

Disclosure of Invention

The invention provides a brain-controlled speech enhancement method and system based on EEG signal neural decoding, aiming to provide, starting from brain-inspired mechanisms, an active auditory speech enhancement method in which EEG guides the speech enhancement and spatial filtering processes as a cognitive modulation signal rather than as an acoustic feature. According to a first aspect of embodiments of the present disclosure, there is provided a brain-controlled speech enhancement method based on EEG signal neural decoding, the method comprising the steps of: collecting multichannel speech signals in the environment through a multichannel microphone array; acquiring EEG signals that are time-synchronized with the multichannel speech signals; extracting compact auditory-attention-related features from the EEG signals using an EEG encoding network as an auditory cognitive state representation; constructing an auditory consensus field continuously evolving on a slow time scale