CN-121971093-A - Depression detection method based on time-frequency space multidimensional feature mining and cross-modal attention fusion

Abstract

The invention belongs to the field of intelligent emotion monitoring, and particularly relates to a depression detection method based on time-frequency space multidimensional feature mining and cross-modal attention fusion. A Transformer is introduced to capture the temporal and spatial information in the video, and a ResNet model augmented with a cross-attention module mines the audio feature information. A network model with three branches is constructed, with the aim of fully extracting and fusing video, speech and text features. The method can accurately and rapidly quantify psychological abnormality in a subject, and is suitable for campus mental-health monitoring and as an aid to clinical diagnosis.

Inventors

  • JIANG NAN
  • TIAN ZHONGYU
  • LI SUYUAN
  • QIN JIA
  • WANG YIMING
  • WANG HUAPENG
  • YANG HONGCHEN
  • DING KEN
  • LIU ZHUO

Assignees

  • Criminal Investigation Police University of China (中国刑事警察学院)

Dates

Publication Date
2026-05-05
Application Date
2026-04-03

Claims (7)

  1. A depression detection method based on time-frequency space multidimensional feature mining and cross-modal attention fusion, characterized by comprising the following steps: S1, preprocessing a depression data set, including video preprocessing, audio preprocessing and text preprocessing, completing data enhancement, and improving the robustness and effectiveness of subsequent feature extraction; S2, after the video, audio and text preprocessing features are obtained, constructing a network model and designing video, audio and text feature extraction branches to further optimize the feature characterization capability; S3, modeling dynamic changes among key frames based on the video preprocessing features, mining the time-dimension features of the video, and capturing micro-expressions and action patterns related to depression; S4, capturing the spectral, frequency, energy and temporal characteristics of the audio signal from different angles based on the audio preprocessing features; S5, based on the text preprocessing features, adopting a pre-trained model to extract textual features from multiple angles and capturing the context information in the sequence so as to understand the semantics of the text more comprehensively; S6, constructing a cross-modal self-attention fusion mechanism and fusing the features of the video, audio and text modalities in pairs to realize multi-stage feature interaction; S7, training the network model with an Adam optimizer, ending training when the model stabilizes, and otherwise returning to S1; the method can thereby be trained to obtain a depression detection model with high accuracy and stability.
  2. The depression detection method based on time-frequency space multidimensional feature mining and cross-modal attention fusion according to claim 1, wherein the preprocessing of the depression dataset in step S1 comprises video preprocessing, audio preprocessing and text preprocessing, specifically as follows: video key frames are first extracted and the video is then recombined: when the number of key frames is 100 or 200, 50 frames are extracted in temporal order each time to form new video samples, and when the number of key frames is 300 or more, 50 frames are randomly drawn in temporal order to form one video sample while the remaining key frames are sampled 50 at a time at equal intervals to form further videos; the face region is then extracted from the new video, and the detected face is rotated and cropped so that it is more regular: the positions of the two eyes are computed to determine the rotation angle, face alignment is performed by affine transformation, the aligned face image in each key frame is detected with an MTCNN model, and the detected face image is cropped; the audio is preprocessed in both the time domain and the frequency domain: white noise, pink noise and Gaussian noise are added to the original audio to obtain time-domain noisy speech samples, speech segments are randomly clipped and the remaining speech is spliced into time-domain clipped samples, the frequency-domain enhancement simulates spectral loss through random frequency-domain masking to obtain frequency-domain masked samples, and pitch shifting yields frequency-domain pitch samples (a code sketch of these augmentations follows the claims); the text preprocessing obtains text data through audio-to-text conversion and proofreading: the original audio and the augmented audio are transcribed into text, and missing or erroneous content in the transcription is then supplemented and corrected.
  3. The depression detection method based on time-frequency space multidimensional feature mining and cross-modal attention fusion according to claim 1, wherein in step S2 a network model is constructed based on the preprocessing result of step S1, feature extraction branches are designed, and the feature characterization capability is further optimized, specifically: in the video branch, a Transformer is introduced on top of a ResNet50 model, and positional encoding is added to model the video feature sequence and capture the temporal and spatial information in the video; in the audio branch, six audio features including Mel-frequency cepstral coefficients are extracted, giving a richer feature representation than extracting the Mel spectrogram alone, and after NetVLAD clustering and dimension reduction an audio feature map containing both time-domain and frequency-domain features is reconstructed; with ResNet as the base model, a cross-attention module is introduced to extract the local details and global structure of the feature map; in the text branch, text features are extracted with a Bert-BiGRU network to integrate global context information; a feature fusion module combining cross-modal attention and self-attention mechanisms is built on a residual structure, fusing the video, speech and text features processed by the three branches in a cross-modal manner, and finally a predicted BDI-II score is output and classified according to its BDI-II value.
  4. The depression detection method based on time-frequency space multidimensional feature mining and cross-modal attention fusion according to claim 1, wherein in step S3 dynamic changes among key frames are modeled, the time-dimension features of the video are mined, and micro-expressions and action patterns related to depression are captured based on the visual preprocessing features of step S2, specifically (a code sketch follows the claims): in the video feature extraction part, a ResNet model pre-trained on a large-scale image dataset first captures the deep semantic information of each image, yielding the local and global visual features of the video frame; the input features are flattened so that the spatial dimensions of the image become a one-dimensional vector, normalized and standardized, and mapped by an input projection onto the Transformer input dimension $d_{model}$, with dropout added to prevent overfitting; positional encoding is added to introduce sequence-order information, solving the problem that the Transformer cannot otherwise capture the order of the sequence; the positional encoding uses a learnable parameter matrix, and its addition to the features is expressed as $X' = X + P$, where $X$ is the input feature and $P$ is the positional encoding matrix; the Transformer encoder consists of several encoder layers, each comprising a multi-head self-attention mechanism and a feed-forward network; the multi-head self-attention mechanism computes correlation weights between different positions by mapping the input features into several attention subspaces: $\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(QK^{\top}/\sqrt{d_k}\right)V$, where $Q$, $K$, $V$ are the query, key and value matrices obtained by projection; $\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\ldots,\mathrm{head}_h)W^{O}$ with $\mathrm{head}_i=\mathrm{Attention}(QW_i^{Q},KW_i^{K},VW_i^{V})$, where $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$ are the parameter matrices of the $i$-th attention head and $W^{O}$ is the output projection matrix; the attention heads are computed in parallel to capture different representations of the features, and the feed-forward network applies a nonlinear transformation to the features of each position: $\mathrm{FFN}(x)=\max(0,\,xW_1+b_1)W_2+b_2$, where $W_1$, $b_1$, $W_2$, $b_2$ are learnable parameters that enhance the expressive capacity of the model; finally, a linear transformation and dropout are applied to the Transformer output, and by exploiting the strong sequence-modeling capability of the Transformer the temporal correlation among key frames is fully utilized and the time-dimension features of the video are mined to obtain the final feature representation.
  5. The method for detecting depression based on time-frequency space multi-dimensional feature mining and cross-modal attention fusion according to claim 1, wherein step S4 captures the spectral, frequency, energy and temporal features of the audio signal from different angles, specifically: in view of the marked differences of the speech of depressed patients compared with healthy people in pitch, speaking rate, rhythm, loudness and energy, six features are selected and extracted: the Mel spectrum, MFCC, GFCC, LPCC, and wideband and narrowband spectrograms; the Mel spectrum simulates human auditory perception and reflects changes in the pitch and loudness of depressed patients' speech; MFCC characterizes the short-time energy and spectral properties of speech, reflecting pitch and timbre information and quantifying the speech differences of depressed patients, and when the number of MFCC coefficients is tuned to 40 the emotion recognition rate improves markedly; GFCC, based on a Gammatone filter bank, better matches auditory characteristics, while LPCC characterizes the spectral envelope and the vocal tract, capturing differences in the speech production mechanism and vocal-tract state of depressed patients; the short-time Fourier transform converts the speech signal from the time domain to the frequency domain, where the wideband spectrogram offers finer time resolution and the narrowband spectrogram captures frequency details more finely, together revealing the time-frequency characteristics of depressed speech and providing rich feature information (a feature extraction sketch follows the claims); the feature maps are input to a ResNet network augmented with CAFM attention modules, whose cross-attention fully fuses multi-scale audio features so that the model can exploit local details and global structure simultaneously, improving the network's holistic understanding of the audio signal; finally the audio feature vector is obtained by normalization and used for cross-modal feature fusion.
  6. The depression detection method based on time-frequency space multidimensional feature mining and cross-modal attention fusion according to claim 1, wherein in step S5, based on the text preprocessing features, a pre-trained model extracts textual features from multiple angles and captures the context information in the sequence to understand the semantics of the text more comprehensively, specifically (a code sketch follows the claims): BERT captures the context of words and sentences by encoding the text sequence; a bidirectional GRU then processes the sequence data and captures the contextual information within it; the GRU output is attention-weighted by computing an attention weight for each time step and taking the weighted sum of the features fusing forward and backward information, yielding the final attention output, which is then feature-normalized.
  7. The depression detection method based on time-frequency space multidimensional feature mining and cross-modal attention fusion according to claim 1, wherein step S6 constructs a cross-modal self-attention fusion mechanism, fuses the features of the video, audio and text modalities in pairs, and realizes multi-stage feature interaction, gradually strengthening the semantic expressiveness of the features, effectively fusing the features of different modalities, and improving feature expressiveness and hence model performance, specifically (a code sketch follows the claims): the cross-modal attention mechanism fuses the features of the video, audio and text modalities in pairs, computing six groups of cross-modal attention weights through linear transformations of query, key and value and obtaining the cross-modally enhanced features by weighted integration: $\mathrm{CrossAttn}(q,k,v)=\mathrm{softmax}\!\left(qk^{\top}/\sqrt{d}\right)v$, where $q$ is the query-modality feature, $k$ and $v$ are the key and value of the key-value modality, and $d$ is the embedding dimension; after the cross-modal attention features are obtained, the features of each modality are concatenated with their corresponding cross-modal attention features and mapped through a linear layer to obtain the pairwise fused features: the video features concatenated with the cross-modal attention features from text and audio pass through a linear layer to give the video fusion feature $F_v$, the text features concatenated with the cross-modal attention features from audio and video give the text fusion feature $F_t$, and the audio features concatenated with the cross-modal attention features from video and text give the audio fusion feature $F_a$; the fusion features of the modalities are concatenated into a comprehensive feature vector, which is mapped through a linear layer to obtain the final fusion feature $F=\mathrm{Linear}([F_v;F_a;F_t])$; the self-attention module internally computes attention weights over the input feature $X$ through the linear transformations $Q=XW^{Q}$, $K=XW^{K}$, $V=XW^{V}$, where $W^{Q}$, $W^{K}$ and $W^{V}$ are learnable weight matrices, calculates the attention weight matrix $A=\mathrm{softmax}\!\left(QK^{\top}/\sqrt{d}\right)$, and obtains the self-attention output by the weighted sum $AV$; the final self-attention output and the original features are mapped to the category space by a linear layer through a residual connection to obtain the final output $y=(X+AV)\,W_o$, where $W_o$ is the weight matrix of the linear layer; further, in step S7, an Adam optimizer is designed to obtain an optimal depression detection model.
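
The following is a minimal sketch of the time- and frequency-domain audio augmentations referenced in claim 2, using NumPy and librosa; the parameter values (SNR, crop length, mask width, pitch steps) are illustrative assumptions, not values from the patent.

```python
import numpy as np
import librosa

def time_domain_noise(y: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Add white noise at a given SNR (pink/Gaussian variants are analogous)."""
    noise = np.random.randn(len(y))
    signal_power = np.mean(y ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return y + noise * np.sqrt(noise_power / np.mean(noise ** 2))

def time_domain_crop(y: np.ndarray, sr: int, crop_s: float = 0.5) -> np.ndarray:
    """Randomly cut out a segment and splice the remainder together."""
    n = int(crop_s * sr)
    start = np.random.randint(0, max(1, len(y) - n))
    return np.concatenate([y[:start], y[start + n:]])

def freq_domain_mask(y: np.ndarray, n_fft: int = 1024, width: int = 30) -> np.ndarray:
    """Zero a random band of STFT bins to simulate spectral loss."""
    spec = librosa.stft(y, n_fft=n_fft)
    f0 = np.random.randint(0, spec.shape[0] - width)
    spec[f0:f0 + width, :] = 0
    return librosa.istft(spec, length=len(y))

def pitch_shift(y: np.ndarray, sr: int, steps: float = 2.0) -> np.ndarray:
    """Frequency-domain pitch conversion."""
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=steps)
```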
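For the video branch of claim 4, a minimal PyTorch sketch: pre-extracted ResNet frame features are projected to the model dimension, a learnable positional encoding is added ($X' = X + P$), and a Transformer encoder models the key-frame sequence. All layer sizes, the dropout rate, and the mean-pooling choice are assumptions for illustration.

```python
import torch
import torch.nn as nn

class VideoTemporalBranch(nn.Module):
    def __init__(self, feat_dim=2048, d_model=512, n_heads=8,
                 n_layers=4, max_frames=50, out_dim=256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)                      # input projection
        self.pos = nn.Parameter(torch.zeros(1, max_frames, d_model))  # learnable P
        self.drop = nn.Dropout(0.1)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=2048,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Sequential(nn.Linear(d_model, out_dim), nn.Dropout(0.1))

    def forward(self, frame_feats):                 # (B, T, feat_dim) ResNet features
        x = self.proj(frame_feats)                  # map to d_model
        x = self.drop(x + self.pos[:, :x.size(1)])  # X' = X + P, plus dropout
        x = self.encoder(x)                         # multi-head self-attention + FFN
        return self.head(x.mean(dim=1))             # pooled video feature
```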
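For the audio features of claim 5, a sketch of the feature maps that librosa provides directly: 40-coefficient MFCCs, a Mel spectrogram, and wideband/narrowband spectrograms realized here with short versus long STFT windows. GFCC (Gammatone-based) and LPCC are not in librosa and would need a separate implementation; the window lengths are illustrative assumptions.

```python
import numpy as np
import librosa

def audio_feature_maps(y: np.ndarray, sr: int) -> dict:
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)    # 40 MFCC coefficients
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr))       # Mel spectrogram (dB)
    # Wideband spectrogram: short window -> finer time resolution.
    wide = librosa.amplitude_to_db(np.abs(librosa.stft(y, n_fft=256)))
    # Narrowband spectrogram: long window -> finer frequency resolution.
    narrow = librosa.amplitude_to_db(np.abs(librosa.stft(y, n_fft=2048)))
    return {"mfcc": mfcc, "mel": mel, "wideband": wide, "narrowband": narrow}
```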
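For the Bert-BiGRU text branch of claim 6, a minimal sketch: BERT token embeddings feed a bidirectional GRU, and a learned per-step attention weighting produces the pooled, normalized text feature. The model name and hidden sizes are assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertBiGRUBranch(nn.Module):
    def __init__(self, bert_name="bert-base-chinese", hidden=256, out_dim=256):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)
        self.gru = nn.GRU(self.bert.config.hidden_size, hidden,
                          batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)        # per-time-step attention score
        self.out = nn.Linear(2 * hidden, out_dim)

    def forward(self, input_ids, attention_mask):
        emb = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state
        h, _ = self.gru(emb)                         # (B, T, 2*hidden), both directions
        scores = self.attn(h).squeeze(-1).masked_fill(attention_mask == 0, -1e9)
        w = torch.softmax(scores, dim=-1)            # attention weight per step
        pooled = torch.bmm(w.unsqueeze(1), h).squeeze(1)   # weighted sum over time
        return nn.functional.normalize(self.out(pooled), dim=-1)

# Usage example on a single sentence:
tok = AutoTokenizer.from_pretrained("bert-base-chinese")
batch = tok(["今天感觉很疲惫"], return_tensors="pt", padding=True)
feat = BertBiGRUBranch()(batch["input_ids"], batch["attention_mask"])
```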
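For the pairwise cross-modal attention of claim 7, a sketch of $\mathrm{softmax}(qk^{\top}/\sqrt{d})\,v$ with one modality as query and another as key/value, followed by concatenation and linear fusion. The shared pairwise linear layer is a simplification for brevity (the claim implies one per modality), and the mean pooling is an assumption.

```python
import math
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.pair = nn.Linear(3 * dim, dim)    # modality + its two cross features
        self.final = nn.Linear(3 * dim, dim)   # fuse the three modality fusions

    def cross(self, q_feat, kv_feat):
        # softmax(Q K^T / sqrt(d)) V: one modality queries another
        q, k, v = self.q(q_feat), self.k(kv_feat), self.v(kv_feat)
        w = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1)
        return w @ v

    def forward(self, vid, aud, txt):          # each (B, T_m, dim)
        fv = self.pair(torch.cat([vid, self.cross(vid, aud), self.cross(vid, txt)], -1))
        fa = self.pair(torch.cat([aud, self.cross(aud, vid), self.cross(aud, txt)], -1))
        ft = self.pair(torch.cat([txt, self.cross(txt, vid), self.cross(txt, aud)], -1))
        pooled = torch.cat([fv.mean(1), fa.mean(1), ft.mean(1)], dim=-1)
        return self.final(pooled)              # joint multimodal feature
```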

Description

Depression detection method based on time-frequency space multidimensional feature mining and cross-modal attention fusion

Technical Field

The invention belongs to the field of intelligent emotion monitoring, and particularly relates to a depression detection method based on time-frequency space multidimensional feature mining and cross-modal attention fusion.

Background

To improve the objectivity and universality of mental-health monitoring, depression detection based on time-frequency space multidimensional feature mining and cross-modal attention fusion has significant potential in remote diagnosis and treatment, smart wearables, psychological assessment and related fields. The technology realizes contactless assessment of psychological state by analyzing an individual's multi-modal behavioral data (such as facial expressions, speech signals and body movements) and, combined with artificial intelligence techniques, capturing depression-related fine-grained patterns in complex time-domain, frequency-domain and spatial features. Traditional depression diagnosis relies mainly on scale-based interviews and clinical observation, which not only suffer from subjective bias but also make continuous monitoring in daily settings difficult. Researching an intelligent depression detection method with high sensitivity and high specificity, realizing deep analysis and fusion of multidimensional behavioral features, therefore has notable clinical significance and social value. Depression detection methods based on time-frequency space multidimensional feature mining can be broadly divided into traditional machine-learning methods and deep-learning methods. Traditional methods usually extract statistical features manually from multi-modal data (such as speech MFCC coefficients, intensity changes of facial action units, and frequency features of limb movements) and identify depressive states with classifiers such as support vector machines or random forests. However, such methods struggle to model the high-order nonlinear features associated with depression and capture the interactions between cross-modal features insufficiently, limiting generalization in complex real-world scenes. In recent years, deep-learning-based depression detection has become a research hotspot owing to its strong feature-learning capability: deep neural networks can jointly learn depression-related time-frequency-space features directly from the raw multi-modal data, avoiding the limitations of hand-crafted feature design. Still, the robustness of existing methods needs improvement in the face of data heterogeneity (e.g., sample-rate differences between modalities), individual behavioral variability (e.g., cultural background and expressive habits) and environmental interference (e.g., illumination and noise). The present method, based on time-frequency space multidimensional feature mining and cross-modal attention fusion, markedly improves the accuracy and interpretability of depression detection by modeling the dynamic evolution of behavior with a hierarchical spatio-temporal graph network and by adaptively weighting key feature channels with a cross-modal attention mechanism.
Disclosure of Invention

The invention provides a depression detection method based on time-frequency space multidimensional feature mining and cross-modal attention fusion, which introduces a Transformer to model the video feature sequence and capture the temporal and spatial information in the video, combines a cross-attention module to extract local and global information from features such as the reconstructed Mel spectrogram, and further improves detection performance by fusing the video, speech and text features through a cross-modal attention mechanism. The technical scheme adopted by the invention is as follows: a depression detection method based on time-frequency space multidimensional feature mining and cross-modal attention fusion, comprising the following steps: S1, preprocessing a depression data set, including video preprocessing, audio preprocessing and text preprocessing, completing data enhancement, and improving the robustness and effectiveness of subsequent feature extraction; S2, after the video, audio and text preprocessing features are obtained, constructing a network model and designing video, audio and text feature extraction branches to further optimize the feature characterization capability; S3, modeling dynamic changes among key frames based on the video preprocessing features, mining the time-dimension features of the video, and capturing micro-expressions and action modes related to depression (an end-to-end skeleton of the three-branch design is sketched below)
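
To make the three-branch design and the Adam training of step S7 concrete, the following is a self-contained skeleton with simple stand-in encoders in place of the full branches sketched after the claims; all dimensions and the learning rate are assumptions, and the loss regresses the BDI-II score (which lies in 0 to 63).

```python
import torch
import torch.nn as nn

class ThreeBranchModel(nn.Module):
    """Stand-in branch encoders, concatenation fusion, BDI-II regression head."""
    def __init__(self, dim=256):
        super().__init__()
        self.video_branch = nn.Sequential(nn.Linear(2048, dim), nn.ReLU())
        self.audio_branch = nn.Sequential(nn.Linear(1024, dim), nn.ReLU())
        self.text_branch = nn.Sequential(nn.Linear(768, dim), nn.ReLU())
        self.fusion = nn.Linear(3 * dim, dim)
        self.head = nn.Linear(dim, 1)           # predicted BDI-II score

    def forward(self, v, a, t):
        f = torch.cat([self.video_branch(v), self.audio_branch(a),
                       self.text_branch(t)], dim=-1)
        return self.head(torch.relu(self.fusion(f))).squeeze(-1)

model = ThreeBranchModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One training step on a dummy batch (step S7): MSE on BDI-II scores,
# repeated over epochs until the loss stabilizes.
v, a, t = torch.randn(8, 2048), torch.randn(8, 1024), torch.randn(8, 768)
y = torch.rand(8) * 63                          # BDI-II scores lie in [0, 63]
optimizer.zero_grad()
loss = nn.functional.mse_loss(model(v, a, t), y)
loss.backward()
optimizer.step()
```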