CN-121987206-A - Depression recognition method and device based on state space model and multi-mode fusion

CN 121987206 A

Abstract

The application relates to the intersection of artificial intelligence and medical health, and discloses a depression recognition method and device based on a state space model and multi-modal fusion. The method comprises: acquiring multichannel electroencephalogram data, and preprocessing and segmenting it to obtain electroencephalogram segments of fixed duration; extracting spatio-temporal features and frequency domain features through multi-modal feature extraction; performing temporal context modeling on the spatio-temporal features with a state space model to obtain context-aware features; deeply fusing the context-aware features with the frequency domain features through a cross-modal fusion mechanism to obtain a fused feature sequence; performing a multi-task decision based on the fused feature sequence to obtain a depression classification result, a total depression severity score, and a depression-related brain functional network activation intensity; and generating a clinical interpretability map corresponding to the depression classification result with an interpretability visualization technique. The application fully exploits the complementary information of the temporal, spatial, and frequency dimensions and improves the accuracy of computer-aided depression recognition.

Inventors

  • YAO LI
  • ZHANG YADAN
  • CUI YANGYANG
  • LIU DONGYU
  • AN NAN

Assignees

  • 杭州极弱磁场国家重大科技基础设施研究院

Dates

Publication Date
2026-05-08
Application Date
2026-04-09

Claims (10)

  1. A depression recognition method based on the fusion of a state space model and multiple modalities, comprising the steps of: acquiring multichannel electroencephalogram data, and preprocessing and segmenting the electroencephalogram data to obtain electroencephalogram segments of fixed duration for each channel; performing multi-modal feature extraction on the electroencephalogram segments of each channel to obtain spatio-temporal features and frequency domain features respectively; performing temporal context modeling on the spatio-temporal features with a state space model to obtain context-aware features; performing deep fusion of the context-aware features and the frequency domain features through a cross-modal fusion mechanism to obtain a fused feature sequence; performing a multi-task decision based on the fused feature sequence to obtain a depression classification result, a total depression severity score, and a depression-related brain functional network activation intensity; and generating a clinical interpretability map corresponding to the depression classification result based on an interpretability visualization technique.
  2. The depression recognition method based on the state space model and multi-modal fusion according to claim 1, wherein acquiring the multichannel electroencephalogram data and preprocessing and segmenting the electroencephalogram data to obtain electroencephalogram segments of fixed duration for each channel comprises: acquiring electroencephalogram data with 64 channels and a 256 Hz sampling rate; performing band-pass filtering on the electroencephalogram data of each channel with an infinite impulse response filter to obtain filtered data for each channel, wherein the high-pass cutoff frequency is 0.5 Hz and the low-pass cutoff frequency is 45 Hz; denoising the filtered data of each channel with an independent component analysis algorithm to obtain denoised data for each channel; performing 64-lead re-referencing on the denoised data of each channel with a whole-brain average reference to generate corrected data for each channel; and cutting the corrected data of each channel into segments of 10 seconds duration, outputting electroencephalogram segments with dimensions [64, 2560], where 64 is the number of channels and 2560 is the number of sampling points.
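The preprocessing pipeline of claim 2 can be sketched in NumPy/SciPy; this is an illustrative sketch, not the patented implementation. The function name `preprocess_and_segment` is hypothetical, a 4th-order Butterworth design stands in for the unspecified IIR filter, and the independent component analysis denoising step is omitted for brevity (a library such as scikit-learn's `FastICA` would be needed for it).

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def preprocess_and_segment(eeg, fs=256, n_channels=64, seg_seconds=10):
    """Band-pass filter (0.5-45 Hz), average re-reference, and segmentation
    per claim 2. eeg: [64, n_samples]. Returns [n_segments, 64, 2560].
    NOTE: the ICA denoising step of the claim is omitted in this sketch."""
    # 4th-order Butterworth band-pass, applied zero-phase (assumed IIR design)
    sos = butter(4, [0.5, 45.0], btype="bandpass", fs=fs, output="sos")
    filtered = sosfiltfilt(sos, eeg, axis=1)
    # whole-brain average re-reference: subtract the mean across channels
    rereferenced = filtered - filtered.mean(axis=0, keepdims=True)
    # cut into non-overlapping 10 s segments (2560 samples at 256 Hz)
    seg_len = seg_seconds * fs
    n_segs = rereferenced.shape[1] // seg_len
    segments = rereferenced[:, : n_segs * seg_len].reshape(n_channels, n_segs, seg_len)
    return segments.transpose(1, 0, 2)  # [n_segments, 64, 2560]
```

For a 30-second recording at 256 Hz, this yields three segments of shape [64, 2560], matching the dimensions stated in the claim.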
  3. The depression recognition method based on the state space model and multi-modal fusion according to claim 2, wherein performing multi-modal feature extraction on the electroencephalogram segments of each channel to obtain spatio-temporal features and frequency domain features respectively comprises: taking an electroencephalogram segment with dimensions [batch size, 64, 2560] as input, and capturing multi-scale temporal patterns through parallel depthwise separable convolutions with three kernel sizes of 3×1, 5×1, and 7×1 to obtain three parallel output feature maps, each convolution layer being followed by batch normalization and a ReLU activation function; concatenating the three output feature maps along the channel dimension and performing global average pooling along the time dimension and the spatial dimension respectively to generate a temporal feature descriptor and a spatial feature descriptor; concatenating the temporal and spatial feature descriptors along the channel dimension, applying a 1×1 convolution and a ReLU activation function, re-splitting, and merging after respective 1×1 convolutions to generate a channel weight vector with dimensions [batch size, 192, 1]; generating a two-dimensional spatial attention map with dimensions [batch size, 1, 2560] from the temporal and spatial feature descriptors by element-wise addition, 1×1 convolution, and a Sigmoid function; multiplying the channel weight vector element by element with the spatial attention map to obtain attention weights, and multiplying the attention weights element by element with the concatenated output feature maps to obtain a weighted feature map; and applying a 1×1 convolution to the weighted feature map to output a spatio-temporal feature sequence with dimensions [batch size, 128, 2560].
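The channel- and spatial-attention weighting in claim 3 can be illustrated with a heavily simplified NumPy sketch. The learned 1×1 convolutions of the claim are replaced here by identity mappings, so only the pooling-descriptor and sigmoid-gating structure survives; `attention_fuse` is a hypothetical name and this is not the patented network.

```python
import numpy as np

def attention_fuse(feats):
    """Simplified channel + temporal-position attention per claim 3.
    feats: [batch, C, T] concatenated multi-scale feature maps
    (e.g. C = 192 = 3 branches x 64 channels, T = 2560).
    The 1x1 convolutions of the claim are omitted (identity assumed)."""
    # descriptors via global average pooling over time and over channels
    chan_desc = feats.mean(axis=2, keepdims=True)   # [batch, C, 1]
    time_desc = feats.mean(axis=1, keepdims=True)   # [batch, 1, T]
    # sigmoid gates: channel weight vector and temporal attention map
    chan_w = 1.0 / (1.0 + np.exp(-chan_desc))       # in (0, 1)
    attn_map = 1.0 / (1.0 + np.exp(-time_desc))     # in (0, 1)
    # combined attention weights, broadcast and applied element-wise
    return feats * chan_w * attn_map
```

Because both gates lie in (0, 1), the weighted features are always attenuated versions of the inputs, which is the intended behavior of the attention re-weighting.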
  4. The depression recognition method based on the state space model and multi-modal fusion according to claim 2, wherein performing multi-modal feature extraction on the electroencephalogram segments of each channel to obtain spatio-temporal features and frequency domain features respectively comprises: performing a short-time Fourier transform with a Hann window on the electroencephalogram segment of each channel to obtain a time-frequency spectrogram per channel, with a window length of 256 samples, an overlap of 128 samples, and 512 Fourier transform points; integrating the power of the time-frequency spectrogram over five frequency bands to obtain the integrated power of each band, and normalizing each band's integrated power by the sum over the five bands to obtain the relative power of each band, the five bands being the [1, 4) Hz Delta band, [4, 8) Hz Theta band, [8, 13) Hz Alpha band, [13, 30) Hz Beta band, and [30, 45] Hz Gamma band; averaging the relative power of each band of each channel over all time windows to obtain a frequency vector with dimensions [batch size, 64, 5]; and flattening the frequency vector over the channel and band dimensions and projecting it into a frequency domain feature map with dimensions [batch size, 320, 1], where 320 = 64 channels × 5 bands.
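The band-power computation of claim 4 maps directly onto `scipy.signal.stft`. The sketch below follows the claimed parameters (Hann window, 256-sample window, 128-sample overlap, 512 FFT points), but the function name and the exact half-open band edges are assumptions for illustration.

```python
import numpy as np
from scipy.signal import stft

# the five classical EEG bands named in claim 4 (edges assumed half-open)
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 45)}

def relative_band_power(segment, fs=256):
    """Relative power of the five EEG bands per channel, following claim 4.
    segment: [n_channels, n_samples]. Returns [n_channels, 5]."""
    f, _, Z = stft(segment, fs=fs, window="hann", nperseg=256,
                   noverlap=128, nfft=512, axis=-1)
    power = np.abs(Z) ** 2                      # [channels, freqs, windows]
    # integrate power inside each band, keeping the time-window axis
    band_power = np.stack(
        [power[:, (f >= lo) & (f < hi), :].sum(axis=1)
         for lo, hi in BANDS.values()],
        axis=1,
    )                                           # [channels, 5, windows]
    mean_power = band_power.mean(axis=2)        # average over time windows
    # normalize each band by the sum over the five bands (relative power)
    return mean_power / mean_power.sum(axis=1, keepdims=True)
```

For a [64, 2560] segment this yields the [64, 5] frequency vector of the claim; flattening it gives the [320, 1] frequency domain feature map.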
  5. The depression recognition method based on the fusion of a state space model and multiple modalities according to claim 3, wherein performing temporal context modeling on the spatio-temporal features with the state space model to obtain context-aware features comprises: stacking 4 Mamba blocks to form a layered encoder, wherein the hidden dimension of the layered encoder is 512, the expansion factor of the gated MLP is 2, the state dimension of the first two Mamba blocks is 8, and the state dimension of the last two Mamba blocks is 16; and taking the spatio-temporal feature sequence with dimensions [batch size, 128, 2560] as input, each Mamba block of the layered encoder sequentially performing input projection, selective state space computation, gated MLP processing, normalization, and residual connection operations, and outputting context-aware features with dimensions [batch size, 512, 2560].
  6. The depression recognition method based on the state space model and multi-modal fusion according to claim 5, wherein each Mamba block of the layered encoder sequentially performs the input projection, selective state space computation, gated MLP processing, normalization, and residual connection operations, comprising: converting the input through a prepended layer normalization and a linear layer into an expanded input with dimensions [batch size, 512, 2560]; projecting the expanded input to a projection input with dimensions [batch size, 1024, 2560], and splitting the projection input into two branch inputs each with dimensions [batch size, 512, 2560]; generating, from the single-step input x_t of one branch, three input-dependent parameters through three independent linear layers, namely the step-size parameter Δ, the input matrix B, and the output matrix C; initializing the state transition matrix A as a HiPPO-structured matrix that is kept fixed during training as a prior basis; discretizing the continuous system parameters (A, B) into (Ā, B̄) with the zero-order hold method, the discretization formulas being Ā = exp(Δ·A) and B̄ = (Δ·A)⁻¹·(exp(Δ·A) − I)·Δ·B, where I represents the identity matrix; based on the discretized system parameters, sequentially performing the state space recursion over the sampling points to obtain a selective state space output with dimensions [batch size, 512, 2560], with the formulas h_t = Ā·h_{t−1} + B̄·x_t and y_t = C·h_t, where h_{t−1} represents the hidden state at the (t−1)-th sampling point, h_t represents the hidden state at the t-th sampling point, and y_t represents the single-step output at the t-th sampling point; and generating a filtering feature from the other branch input through a branch linear layer and a SiLU activation function, multiplying the selective state space output element by element with the filtering feature to obtain an enhanced feature, and residually connecting the enhanced feature with the expanded input.
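The zero-order-hold discretization and state-space recursion described in claim 6 can be written out for a single scalar input sequence. The full Mamba block additionally makes Δ, B, and C input-dependent at each time step and batches the computation across 512 feature channels; this NumPy sketch deliberately omits that, and `ssm_scan` is a hypothetical name.

```python
import numpy as np
from scipy.linalg import expm

def ssm_scan(x, A, B, C, delta):
    """Zero-order-hold discretization plus state-space recursion (claim 6)
    for one scalar input sequence.
    x: [T] inputs; A: [N, N] state matrix; B, C: [N]; delta: step size."""
    N = A.shape[0]
    dA = delta * A
    # ZOH: A_bar = exp(dA);  B_bar = (dA)^-1 (exp(dA) - I) * delta * B
    A_bar = expm(dA)
    B_bar = np.linalg.solve(dA, A_bar - np.eye(N)) @ (delta * B)
    # recursion: h_t = A_bar h_{t-1} + B_bar x_t ;  y_t = C h_t
    h = np.zeros(N)
    y = np.empty(len(x))
    for t, xt in enumerate(x):
        h = A_bar @ h + B_bar * xt
        y[t] = C @ h
    return y
```

With a stable (negative-definite) state matrix, the output for a constant input settles to a steady state, which is the behavior the HiPPO-style prior is designed to give.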
  7. The depression recognition method based on the state space model and multi-modal fusion according to claim 5, wherein performing deep fusion of the context-aware features and the frequency domain features through the cross-modal fusion mechanism to obtain a fused feature sequence comprises: reducing the dimension of the context-aware features with dimensions [batch size, 512, 2560] through a linear layer to obtain a Query vector with dimensions [batch size, 128, 2560]; broadcasting the frequency domain feature map with dimensions [batch size, 320, 1] along the time dimension to obtain aligned frequency domain features with dimensions [batch size, 320, 2560]; reducing the dimension of the aligned frequency domain features through two independent linear layers to generate a Key vector and a Value vector each with dimensions [batch size, 128, 2560]; splitting the Query, Key, and Value vectors along the feature dimension into 4 groups of sub-vectors with 4-head attention to form multi-head vector sets with dimensions [batch size, 4, 2560, 32]; performing scaled dot-product attention computation and Softmax normalization on each group of sub-vectors to obtain the attention weight matrices of the 4 heads; weighting and summing the attention weight matrices with the corresponding Value sub-vectors to obtain 4 groups of attention features; concatenating and rearranging the 4 groups of attention features to obtain preliminary fusion features with dimensions [batch size, 128, 2560]; and applying residual connection and layer normalization to the preliminary fusion features and the Query vector, outputting a fused feature sequence with dimensions [batch size, 128, 2560].
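The 4-head scaled dot-product cross-attention of claim 7 can be sketched in NumPy (using a [batch, time, feature] layout for convenience, whereas the claim uses [batch, feature, time]). The linear Query/Key/Value projections and layer normalization are omitted, so only the attention and residual structure is shown; `cross_modal_attention` is a hypothetical name.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query, key, value, n_heads=4):
    """4-head scaled dot-product attention with residual, per claim 7.
    query/key/value: [batch, T, D] with D = 128 split into 4 heads of 32.
    Layer normalization of the claim is omitted in this sketch."""
    b, t, d = query.shape
    hd = d // n_heads
    # split into heads: [batch, heads, T, head_dim]
    q = query.reshape(b, t, n_heads, hd).transpose(0, 2, 1, 3)
    k = key.reshape(b, t, n_heads, hd).transpose(0, 2, 1, 3)
    v = value.reshape(b, t, n_heads, hd).transpose(0, 2, 1, 3)
    # scaled dot-product attention per head, softmax-normalized
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(hd)   # [b, h, T, T]
    weights = softmax(scores, axis=-1)
    attended = weights @ v                               # [b, h, T, hd]
    # concatenate heads and add the residual connection to the Query
    fused = attended.transpose(0, 2, 1, 3).reshape(b, t, d)
    return fused + query
```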
  8. The depression recognition method based on the fusion of a state space model and multiple modalities according to claim 7, wherein performing the multi-task decision based on the fused feature sequence to obtain a depression classification result, a total depression severity score, and a depression-related brain functional network activation intensity comprises: performing global average pooling on the fused feature sequence to obtain a global feature vector with dimensions [batch size, 128]; inputting the global feature vector into a depression classification task head, which outputs the depression classification result through a fully connected layer, ReLU, Dropout, fully connected layer, and Softmax architecture and adopts a cross entropy loss function; inputting the global feature vector into a severity regression task head, which outputs the total depression severity score through a fully connected layer, ReLU, and fully connected layer architecture and adopts a Huber loss function; inputting the global feature vector into a semantic auxiliary task head, which outputs the depression-related brain functional network activation intensity through a fully connected layer, Tanh, and fully connected layer architecture and adopts a mean square error loss function; and jointly training the depression classification task head, the severity regression task head, and the semantic auxiliary task head with a weighted total loss of the cross entropy, Huber, and mean square error loss functions.
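The Huber loss of the severity head and the weighted total loss of claim 8 are straightforward to write down. The sketch below uses illustrative weights, since the claims do not specify the weighting coefficients; both function names are hypothetical.

```python
import numpy as np

def huber(pred, target, delta=1.0):
    """Huber loss (severity regression head in claim 8):
    quadratic for |error| <= delta, linear beyond it."""
    err = np.abs(pred - target)
    quad = 0.5 * err ** 2
    lin = delta * (err - 0.5 * delta)
    return np.where(err <= delta, quad, lin).mean()

def total_loss(ce, hub, mse, w=(1.0, 0.5, 0.5)):
    """Weighted total loss of claim 8, combining cross entropy
    (classification head), Huber (severity head), and MSE (semantic
    auxiliary head). The weights w are illustrative placeholders."""
    return w[0] * ce + w[1] * hub + w[2] * mse
```

For an error of 0.5 the Huber loss is 0.5·0.5² = 0.125 (quadratic regime); for an error of 2 with delta = 1 it is 1·(2 − 0.5) = 1.5 (linear regime).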
  9. The depression recognition method based on the state space model and multi-modal fusion according to claim 8, wherein generating a clinical interpretability map corresponding to the depression classification result based on the interpretability visualization technique comprises: extracting the attention weight matrices, averaging them over the head dimension to obtain an aggregate attention matrix, performing a recursive computation initialized with the identity matrix based on the aggregate attention matrix, performing gradient back-propagation to obtain the contribution of each sampling point to the depression classification result, and generating a sequence importance heat map from the contributions; computing, for one sample, the gradients of the channels of the spatio-temporal feature sequence output before the Softmax of the depression classification task head, globally average-pooling the gradients over the time dimension to obtain a weight vector, computing the global average of the sample's spatio-temporal feature sequence over the time dimension to obtain a feature importance vector, combining the weight vector and the feature importance vector by weighting, generating scores with dimensions [1, 64] through a fully connected layer, normalizing the scores, and interpolating them onto a two-dimensional scalp plane to generate a brain region contribution heat map; extracting the weight parameters of the first fully connected layer in the depression classification task head, computing the Euclidean norm of each row of the weight parameters along the 64-dimensional output dimension and taking the average, computing the gradient of the average with respect to the depression class output probability to obtain 128-dimensional feature importance scores, tracing back the projection connection weights between the fused feature sequence and the frequency domain feature map, back-projecting the feature importance scores onto each frequency band according to the projection connection weights to obtain per-channel contribution scores for each band, summing and averaging the contribution scores of all channels within each band to obtain the relative contribution of each band, and generating a band importance histogram from the relative contributions; and integrating the sequence importance heat map, the brain region contribution heat map, and the band importance histogram into a clinical interpretability map.
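The recursive attention aggregation in claim 9's sequence-importance step resembles the attention-rollout technique. Below is a minimal NumPy sketch under that assumption, taking row-stochastic per-head attention matrices and omitting the gradient back-propagation the claim also uses; `attention_rollout` is a hypothetical name.

```python
import numpy as np

def attention_rollout(attn_layers):
    """Aggregate per-head attention into per-sampling-point contributions
    (claim 9, sequence importance step, attention-rollout style).
    attn_layers: list of [heads, T, T] attention weight matrices."""
    T = attn_layers[0].shape[-1]
    rollout = np.eye(T)  # recursive computation initialized with identity
    for attn in attn_layers:
        a = attn.mean(axis=0)                  # average over head dimension
        a = a + np.eye(T)                      # account for residual paths
        a = a / a.sum(axis=-1, keepdims=True)  # row-normalize
        rollout = a @ rollout                  # recursive accumulation
    return rollout
```

Because every factor is row-stochastic, each row of the result is a probability distribution over sampling points, which can be rendered directly as the sequence importance heat map.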
  10. A depression recognition device based on a state space model and multi-modal fusion, comprising: a data preprocessing module for acquiring multichannel electroencephalogram data and preprocessing and segmenting the electroencephalogram data to obtain electroencephalogram segments of fixed duration for each channel; a multi-modal feature extraction module for performing multi-modal feature extraction on the electroencephalogram segments of each channel to obtain spatio-temporal features and frequency domain features respectively; a temporal context modeling module for performing temporal context modeling on the spatio-temporal features with the state space model to obtain context-aware features; a cross-modal fusion module for performing deep fusion of the context-aware features and the frequency domain features through a cross-modal fusion mechanism to obtain a fused feature sequence; a multi-task decision module for performing a multi-task decision based on the fused feature sequence to obtain a depression classification result, a total depression severity score, and a depression-related brain functional network activation intensity; and an interpretability visualization module for generating a clinical interpretability map corresponding to the depression classification result based on an interpretability visualization technique.

Description

Depression recognition method and device based on state space model and multi-modal fusion

Technical Field

The application relates to the intersection of artificial intelligence and medical health, and in particular to a depression recognition method and device based on a state space model and multi-modal fusion.

Background

At present, electroencephalogram signal analysis methods based on deep learning have found some application in depression recognition. Mainstream methods extract local features with convolutional neural networks or capture long-range dependencies with a Transformer model. For example, in the prior art a classification model is often built by combining a CNN (convolutional neural network) with an RNN (recurrent neural network) or a self-attention mechanism to perform temporal modeling and classification of multichannel EEG (electroencephalogram) signals. However, the prior art has the following significant drawbacks. First, in long-sequence processing, when an existing self-attention-based model processes an EEG signal with more than 1000 sampling points, the computational complexity reaches O(L²) (where L is the sequence length), so GPU memory usage exceeds 16 GB and long clinical EEG recordings cannot be processed. Second, in feature extraction, existing methods use frequency domain features only through simple power spectrum computation, cannot achieve deep fusion of the temporal, spatial, and frequency dimensions, and their accuracy on public datasets rarely exceeds 90%. Third, in model interpretation, existing visualization methods are limited to displaying attention weights and cannot provide clinically meaningful brain region localization and frequency band analysis, so diagnostic requirements are difficult to meet. Fourth, in model generalization, the accuracy of existing single-task learning models drops by more than 15% in cross-center validation, severely restricting clinical adoption. Therefore, a depression recognition method that can efficiently process long-sequence electroencephalogram signals, deeply fuse multidimensional features, and offer clinical interpretability and strong generalization is needed to overcome the defects of the prior art and advance intelligent auxiliary diagnosis of depression.

Disclosure of Invention

To solve the above problems, the application provides a depression recognition method and device based on a state space model and multi-modal fusion, which processes long-sequence electroencephalogram signals with linear complexity, fully exploits the complementary information of the temporal, spatial, and frequency dimensions, improves the accuracy of computer-aided depression recognition, provides interpretable evidence, and offers reliable support for clinical auxiliary diagnosis.
The application adopts the following technical scheme. In a first aspect, the application provides a depression recognition method based on the fusion of a state space model and multiple modalities, comprising: acquiring multichannel electroencephalogram data, and preprocessing and segmenting the electroencephalogram data to obtain electroencephalogram segments of fixed duration for each channel; performing multi-modal feature extraction on the electroencephalogram segments of each channel to obtain spatio-temporal features and frequency domain features respectively; performing temporal context modeling on the spatio-temporal features with a state space model to obtain context-aware features; performing deep fusion of the context-aware features and the frequency domain features through a cross-modal fusion mechanism to obtain a fused feature sequence; performing a multi-task decision based on the fused feature sequence to obtain a depression classification result, a total depression severity score, and a depression-related brain functional network activation intensity; and generating a clinical interpretability map corresponding to the depression classification result based on an interpretability visualization technique. In a second aspect, the application also provides a depression recognition device based on the fusion of a state space model and multiple modalities, comprising: a data preprocessing module for acquiring multichannel electroencephalogram data and preprocessing and segmenting the electroencephalogram data to obtain electroencephalogram segments of fixed duration for each channel; a multi-modal feature extraction module for performing multi-modal feature extraction on the electroencephalogram segments of each channel to obtain spatio-temporal features and frequency domain features respectively; a temporal context modeling module for performing temporal context modeling o