CN-122024697-A - Speech motor imagery decoding system and method
Abstract
The speech motor imagery decoding system comprises a syllable classifier, a speech synthesizer and a pronunciation motion synthesizer. The syllable classifier is used for acquiring a target brain wave signal to be processed and predicting on the target brain wave signal to obtain a target syllable label; the speech synthesizer is used for acquiring a mel spectrogram corresponding to the target brain wave signal, the mel spectrogram being used for speech synthesis; and the pronunciation motion synthesizer is used for acquiring a 13-dimensional pronunciation motion track corresponding to the target brain wave signal. The target syllable label, the mel spectrogram and the 13-dimensional pronunciation motion track are together used for determining a speech motor imagery decoding result corresponding to the target brain wave signal. With this scheme, an accurate speech motor imagery decoding result can be obtained based on brain wave signals.
Inventors
- ZHAO ZEHAO
- WANG ZHENJIE
- LIU YAN
- LI YUANNING
- LU JUNFENG
- WU JINSONG
Assignees
- 复旦大学附属华山医院
- 上海科技大学
Dates
- Publication Date
- 20260512
- Application Date
- 20260415
Claims (10)
- 1. A speech motor imagery decoding system, comprising: a syllable classifier for acquiring a target brain wave signal to be processed and predicting on the target brain wave signal to obtain a target syllable label; a speech synthesizer for acquiring a mel spectrogram corresponding to the target brain wave signal, wherein the mel spectrogram is used for speech synthesis; and a pronunciation motion synthesizer for acquiring a 13-dimensional pronunciation motion track corresponding to the target brain wave signal; wherein the target syllable label, the mel spectrogram and the 13-dimensional pronunciation motion track are used for determining a speech motor imagery decoding result corresponding to the target brain wave signal.
- 2. The speech motor imagery decoding system of claim 1, wherein the syllable classifier employs a residual network architecture comprising, arranged in sequence, a two-dimensional convolution layer, a batch normalization layer, a rectified linear unit (ReLU) layer, a two-dimensional max pooling layer, four consecutive residual block groups, a global average pooling layer, a flattening layer, a fully-connected layer, a dropout layer and a softmax layer, wherein: the target brain wave signal is input to the two-dimensional convolution layer, which extracts initial features of the target brain wave signal; the batch normalization layer performs batch normalization on the output of the two-dimensional convolution layer; the ReLU layer applies a nonlinear transformation to the output of the batch normalization layer; the two-dimensional max pooling layer downsamples the output of the ReLU layer; the four consecutive residual block groups learn from the output of the two-dimensional max pooling layer to obtain target features corresponding to the target brain wave signal; the global average pooling layer performs global average pooling on the target features, compressing the spatial dimensions to 1; the flattening layer flattens the output of the global average pooling layer; the fully-connected layer maps the output of the flattening layer to the class space according to the number of classes; the dropout layer randomly drops a preset number of neurons; and the softmax layer outputs a probability distribution over the syllable label classes, the probabilities of all classes summing to 1, with the target syllable label being the syllable label of highest probability.
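The output stage of claim 2 can be illustrated with a minimal sketch: the fully-connected layer yields one logit per syllable class, the softmax layer converts the logits into probabilities summing to 1, and the target syllable label is the most probable class. The 4-class logit values below are purely illustrative, not taken from the patent.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 0.5, -1.0, 0.1])   # hypothetical fully-connected output
probs = softmax(logits)                    # per-syllable probability distribution
target_label = int(np.argmax(probs))       # syllable label with highest probability
```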
- 3. The speech motor imagery decoding system of claim 2, wherein the four consecutive residual block groups include a first residual block group, a second residual block group, a third residual block group and a fourth residual block group, wherein: the first residual block group comprises three residual blocks and learns fine-grained spatio-temporal features from the output of the two-dimensional max pooling layer; the second residual block group comprises four residual blocks for learning abstract features; the third residual block group comprises six residual blocks for extracting high-level semantic features; and the fourth residual block group comprises three residual blocks for obtaining the target features.
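The 3-4-6-3 grouping in claim 3 is the layout of a ResNet-34-style backbone. A hedged sketch, with a toy linear map standing in for the real two-dimensional convolutions (all sizes are assumptions), shows how the identity skip connection lets 3 + 4 + 6 + 3 = 16 blocks stack while preserving the feature shape:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    # y = ReLU(W2 @ ReLU(W1 @ x) + x): the skip path (+ x) is what makes
    # a deep stack of such blocks trainable.
    return relu(w2 @ relu(w1 @ x) + x)

dim = 8                                    # assumed feature dimension
w1 = rng.standard_normal((dim, dim)) * 0.1
w2 = rng.standard_normal((dim, dim)) * 0.1
y = rng.standard_normal(dim)
for group_size in (3, 4, 6, 3):            # the four consecutive residual groups
    for _ in range(group_size):
        y = residual_block(y, w1, w2)      # shape is preserved block to block
```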
- 4. The speech motor imagery decoding system of claim 1, wherein the speech synthesizer employs a deep convolutional neural network architecture comprising four consecutive modules and a fully-connected stage arranged in sequence, wherein the target brain wave signal is input to the first of the four consecutive modules and, after passing through the four consecutive modules, the mel spectrogram is output through the fully-connected stage.
- 5. The speech motor imagery decoding system of claim 4, wherein each of the consecutive modules includes a two-dimensional convolution layer, a residual block and a two-dimensional max pooling layer arranged in sequence, wherein: the two-dimensional convolution layer of the first consecutive module extracts initial features of the target brain wave signal, the residual block of the first consecutive module performs residual learning, and the two-dimensional max pooling layer of the first consecutive module performs a first spatial downsampling; the two-dimensional convolution layer of the second consecutive module continues feature extraction on the output of the first consecutive module, the residual block of the second consecutive module performs residual learning, and the two-dimensional max pooling layer of the second consecutive module performs a second spatial downsampling; the third consecutive module likewise applies its two-dimensional convolution layer and residual block to the output of the second consecutive module, its two-dimensional max pooling layer performing a third spatial downsampling; the fourth consecutive module likewise applies its two-dimensional convolution layer and residual block to the output of the third consecutive module, its two-dimensional max pooling layer performing a fourth spatial downsampling; and the fully-connected stage comprises a flattening layer, a fully-connected layer and a reshaping layer arranged in sequence, wherein the flattening layer flattens the result of the fourth spatial downsampling, the fully-connected layer maps the output of the flattening layer to the mel spectrum, and the reshaping layer reshapes the output of the fully-connected layer to obtain the mel spectrogram.
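The shape flow through the speech synthesizer of claims 4-5 can be sketched as simple arithmetic. Each module ends in a 2-D max pooling that halves both spatial dimensions, and the flattening, fully-connected and reshaping layers then map the result onto the mel spectrogram. Every size below (input grid, mel bin count, frame count) is an assumption for illustration; the patent specifies none of them.

```python
# Assumed feature-map size entering the first consecutive module.
h, w = 64, 64
for module in range(1, 5):         # first..fourth spatial downsampling
    h, w = h // 2, w // 2          # 2-D max pooling halves each dimension
flat_features = h * w              # length after the flattening layer
n_mels, n_frames = 80, 100         # assumed mel-spectrogram shape
fc_outputs = n_mels * n_frames     # fully-connected layer maps flat -> this
mel_shape = (n_mels, n_frames)     # reshaping layer folds the vector back
```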
- 6. The speech motor imagery decoding system of claim 1, wherein the pronunciation motion synthesizer includes an encoder, a speech codebook and a decoder arranged in sequence, wherein: the encoder comprises, arranged in sequence, a two-dimensional convolution layer, 5 consecutive residual blocks, a non-local block, a residual block, a group normalization layer, a Swish layer and two two-dimensional convolution layers; and the decoder comprises, arranged in sequence, a two-dimensional convolution layer, a residual block, a non-local block, 5 consecutive residual blocks, a group normalization layer, a Swish layer and two two-dimensional convolution layers.
- 7. The speech motor imagery decoding system of claim 6, wherein the two-dimensional convolution layer in the encoder performs feature extraction on the target brain wave signal; the input of each of the 5 consecutive residual blocks in the encoder is the output of the downsampling layer following the previous residual block; the non-local block captures dependency relationships in the output of the 5 consecutive residual blocks; the residual block performs post-attention residual processing on the output of the non-local block; the group normalization layer performs group normalization on the output of the residual block; the Swish layer applies self-gated activation to the output of the group normalization layer; the first of the two two-dimensional convolution layers maps the output of the Swish layer to the latent space dimension; and the second two-dimensional convolution layer compresses the output of the first two-dimensional convolution layer.
- 8. The speech motor imagery decoding system of claim 7, wherein the two-dimensional convolution layer in the decoder recovers the codebook vectors output by the speech codebook from the latent space to high-dimensional features; the residual block performs residual learning on the output of that two-dimensional convolution layer; the non-local block captures dependency relationships in the output of the residual block; the 5 consecutive residual blocks perform deep feature processing and upsampling on the output of the non-local block; the group normalization layer performs group normalization on the output of the 5 consecutive residual blocks; the Swish layer applies self-gated activation to the output of the group normalization layer; the first of the two two-dimensional convolution layers maps the output of the Swish layer to the AKT dimension; and the second two-dimensional convolution layer adjusts the time dimension of the output of the first two-dimensional convolution layer.
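The hand-off between the encoder and decoder in claims 6-8 can be sketched under a vector-quantization reading of the claims (an assumption; the patent itself only names a "speech codebook"): the encoder's compressed latent vector is replaced by its nearest codebook entry, which the decoder then expands. Codebook size and latent dimension below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
codebook = rng.standard_normal((16, 4))        # 16 code vectors, 4-dim latent (assumed)
z = rng.standard_normal(4)                     # compressed encoder output
dists = np.linalg.norm(codebook - z, axis=1)   # distance to every code vector
index = int(np.argmin(dists))                  # index of the nearest entry
quantized = codebook[index]                    # vector handed to the decoder
```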
- 9. The speech motor imagery decoding system of any one of claims 1 to 8, wherein the target brain wave signal is obtained by: acquiring an original brain wave signal collected by a cortical electrode sheet; performing downsampling on the original brain wave signal; and taking, as the target brain wave signal, the component of the downsampled signal whose frequency lies in the beta1 frequency band.
- 10. A speech motor imagery decoding method, comprising: predicting on a target brain wave signal to be processed to obtain a target syllable label; acquiring a mel spectrogram corresponding to the target brain wave signal, wherein the mel spectrogram is used for speech synthesis; acquiring a 13-dimensional pronunciation motion track corresponding to the target brain wave signal; and determining a speech motor imagery decoding result corresponding to the target brain wave signal based on the target syllable label, the mel spectrogram and the 13-dimensional pronunciation motion track.
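The preprocessing of claim 9 (downsample the raw signal, then keep only the beta1 band) can be sketched as follows. The raw rate (1 kHz), target rate (250 Hz) and beta1 range (taken here as 12-20 Hz) are assumptions the patent does not spell out, and the naive decimation below omits the anti-aliasing filter a real pipeline would apply before downsampling.

```python
import numpy as np

fs_raw, fs_ds = 1000, 250                             # assumed sampling rates
t = np.arange(fs_raw) / fs_raw                        # 1 s of synthetic signal
raw = np.sin(2 * np.pi * 15 * t) + np.sin(2 * np.pi * 60 * t)  # in-band + out-of-band tone

ds = raw[:: fs_raw // fs_ds]                          # naive decimation to 250 Hz
spectrum = np.fft.rfft(ds)
freqs = np.fft.rfftfreq(ds.size, d=1 / fs_ds)
band = (freqs >= 12) & (freqs <= 20)                  # assumed beta1 band mask
beta1 = np.fft.irfft(spectrum * band, n=ds.size)      # band-limited target signal
```

After the mask, only the 15 Hz component survives; the 60 Hz tone is removed along with everything else outside the assumed band.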
Description
Speech motor imagery decoding system and method

Technical Field

The invention relates to the technical field of brain-computer interfaces, and in particular to a speech motor imagery decoding system and method.

Background

Language is the most basic form of human communication and the way people naturally and intuitively express their thoughts. Clinically, however, severe neurological diseases such as stroke, amyotrophic lateral sclerosis (ALS), brain tumors and locked-in syndrome can cause serious dysarthria or aphasia: patients may retain full consciousness and thinking ability, yet lose speech output, their most important means of daily communication. Speech motor imagery decoding aims to restore communication for such patients by decoding the intended speech articulation process directly from brain activity; how to obtain an accurate speech motor imagery decoding result based on brain wave signals is a technical problem that urgently needs to be solved.

Disclosure of Invention

The invention aims to provide a speech motor imagery decoding system and method that can obtain accurate speech motor imagery decoding results based on brain wave signals.
The invention provides a speech motor imagery decoding system comprising a syllable classifier, a speech synthesizer and a pronunciation motion synthesizer, wherein the syllable classifier is used for acquiring a target brain wave signal to be processed and predicting on the target brain wave signal to obtain a target syllable label, the speech synthesizer is used for acquiring a mel spectrogram corresponding to the target brain wave signal, the mel spectrogram being used for speech synthesis, the pronunciation motion synthesizer is used for acquiring a 13-dimensional pronunciation motion track corresponding to the target brain wave signal, and the target syllable label, the mel spectrogram and the 13-dimensional pronunciation motion track are used for determining a speech motor imagery decoding result corresponding to the target brain wave signal.

Optionally, the syllable classifier adopts a residual network architecture comprising, arranged in sequence, a two-dimensional convolution layer, a batch normalization layer, a rectified linear unit (ReLU) layer, a two-dimensional max pooling layer, four consecutive residual block groups, a global average pooling layer, a flattening layer, a fully-connected layer, a dropout layer and a softmax layer, wherein the target brain wave signal is input to the two-dimensional convolution layer, which extracts initial features of the target brain wave signal; the batch normalization layer performs batch normalization on the output of the two-dimensional convolution layer; the ReLU layer applies a nonlinear transformation to the output of the batch normalization layer; the two-dimensional max pooling layer downsamples the output of the ReLU layer; the four consecutive residual block groups learn from the output of the two-dimensional max pooling layer to obtain target features corresponding to the target brain wave signal; the global average pooling layer performs global average pooling on the target features, compressing the spatial dimensions to 1; the flattening layer flattens the output of the global average pooling layer; the fully-connected layer maps the output of the flattening layer to the class space according to the number of classes; the dropout layer randomly drops a preset number of neurons; and the softmax layer outputs a probability distribution over the syllable label classes, the probabilities summing to 1, with the target syllable label being the syllable label of highest probability.

Optionally, the four consecutive residual block groups comprise a first residual block group, a second residual block group, a third residual block group and a fourth residual block group, wherein the first residual block group comprises three residual blocks and learns fine-grained spatio-temporal features from the output of the two-dimensional max pooling layer, the second residual block group comprises four residual blocks for learning abstract features, the third residual block group comprises six residual blocks for extracting high-level semantic features, and the fourth residual block group comprises