CN-121980486-A - Multi-modal depression assessment system and method based on gradient embedding and modal complementation
Abstract
The invention relates to a multi-modal depression assessment system and method based on gradient embedding and modal complementation. The system comprises a long-time video face extraction and encoding module, a long-time voice Mel spectrum extraction module, a sequence normalization and short-time sequence generation module, a first-order gradient extraction module, a second-order gradient extraction module, a mixed-order gradient embedding module, a multi-head attention module with multi-stage vectorization, a dimension normalization module, a modal complementary representation generation module, a modal complementation module and a depression degree prediction module. By introducing first-order and second-order gradient embedding mechanisms, the system effectively enhances the capture of dynamic characteristics in audio-video sequences; by employing a multi-stage attention mechanism, it fully extracts multi-level representations of the temporal evolution process; and by constructing a cross-modal complementary fusion mechanism, it achieves accurate complementation between audio and video information. The system significantly improves the recognition capability and evaluation accuracy for depression-related behavioral cues.
Inventors
- NIU MINGYUE
- WANG XIAOYANG
- SONG HONGYI
- XIAO DIAOYI
- SHAO ZHUHONG
Assignees
- Yanshan University (燕山大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2025-12-03
Claims (10)
- 1. A multi-modal depression assessment system based on gradient embedding and modal complementation, comprising: a long-time video face extraction and encoding module, for extracting a face image sequence from an input long-time video and encoding it into a video time sequence; a long-time voice Mel spectrum extraction module, for extracting a Mel spectrum from an input long-time voice to obtain an audio time sequence; a sequence normalization and short-time sequence generation module, connected to the long-time video face extraction and encoding module and the long-time voice Mel spectrum extraction module, for cutting the video time sequence and the audio time sequence into fixed-length short-time video sequences and short-time audio sequences and normalizing them respectively; a first-order gradient extraction module, connected to the sequence normalization and short-time sequence generation module, for extracting first-order gradients of the short-time video sequence and the short-time audio sequence respectively; a second-order gradient extraction module, connected to the sequence normalization and short-time sequence generation module, for extracting second-order gradients of the short-time video sequence and the short-time audio sequence respectively; a mixed-order gradient embedding module, connected to the sequence normalization and short-time sequence generation module, the first-order gradient extraction module and the second-order gradient extraction module, for fusing the first-order and second-order gradients into mixed-order gradient information and embedding it into the corresponding short-time video sequence or short-time audio sequence to obtain a video gradient-enhanced sequence and an audio gradient-enhanced sequence; a multi-head attention module with multi-stage vectorization, connected to the mixed-order gradient embedding module, for performing multi-stage processing and aggregation on the video and audio gradient-enhanced sequences respectively to obtain a video vectorized representation and an audio vectorized representation; a dimension normalization module, connected to the multi-head attention module with multi-stage vectorization, for mapping the video and audio vectorized representations into the same dimensional space to obtain a video normalized representation and an audio normalized representation respectively; a modal complementary representation generation module, connected to the dimension normalization module, for computing the difference between the video normalized representation and the audio normalized representation to generate a modal complementary representation; a modal complementation module, connected to the modal complementary representation generation module, for performing weighted fusion of the modal complementary representation with the corresponding video or audio normalized representation to obtain a complemented fusion representation; and a depression degree prediction module, connected to the modal complementation module, for outputting a depression degree prediction score based on the fusion representation.
- 2. The system of claim 1, wherein the mixed-order gradient embedding module comprises: a gradient combination module, connected to the first-order gradient extraction module and the second-order gradient extraction module, configured to splice the first-order gradient and the second-order gradient and to generate mixed-order gradient weights through a convolution layer and a Softmax function; and a gradient adjustment module, connected to the gradient combination module and the sequence normalization and short-time sequence generation module, configured to embed the mixed-order gradient weights into the corresponding short-time video sequence or short-time audio sequence through a Hadamard product operation (see the illustrative sketch following the claims).
- 3. The system of claim 1, wherein the multi-head attention module with multi-stage vectorization comprises: a multi-head sequence vectorization module, connected to the mixed-order gradient embedding module, configured to process the video gradient-enhanced sequence and the audio gradient-enhanced sequence through N layers of multi-head self-attention and to apply a one-dimensional convolution to the output of each layer to obtain staged sequence vectors; and a sequence aggregation module, connected to the multi-head sequence vectorization module, configured to splice the N staged sequence vectors and aggregate them into the video vectorized representation or the audio vectorized representation through an attention mechanism (see the illustrative sketch following the claims).
- 4. The system of claim 1, wherein the modal complementary representation generation module is configured to: subtract the audio normalized representation from the video normalized representation to obtain a complementary representation of the video to the audio; and/or subtract the video normalized representation from the audio normalized representation to obtain a complementary representation of the audio to the video.
- 5. The system of claim 1 or 4, wherein the modal complementation module is configured to: splice the video normalized representation or the audio normalized representation with the corresponding modal complementary representation; input the spliced result into a fully connected layer and a Softmax function to generate corresponding complementary weights; and perform weighted fusion of the video or audio normalized representation with the corresponding modal complementary representation using the complementary weights to obtain the complemented fusion representation (see the illustrative sketch following the claims).
- 6. The system of claim 1, wherein the first-order gradient extraction module is configured to extract the first-order gradient of the short-time video sequence or short-time audio sequence by a two-dimensional convolution operation with a predefined first-order differential convolution kernel (see the illustrative sketch following the claims).
- 7. The system of claim 1 or 6, wherein the second-order gradient extraction module is configured to extract the second-order gradient of the short-time video sequence or short-time audio sequence by a two-dimensional convolution operation with a predefined second-order differential convolution kernel.
- 8. The system of claim 1, wherein the sequence normalization and short-time sequence generation module is configured to: cut the short-time video sequence and the short-time audio sequence from the long-time video and long-time voice with a fixed time window and overlap rate; and perform mean-variance normalization on the short-time video sequence and the short-time audio sequence respectively, based on the mean and standard deviation of the video sequence and the audio sequence (see the illustrative sketch following the claims).
- 9. A multi-modal depression assessment method based on gradient embedding and modal complementation, applied to the multi-modal depression assessment system based on gradient embedding and modal complementation according to any one of claims 1 to 8, comprising: acquiring an input long-time video and the corresponding long-time voice; extracting a face image sequence from the input long-time video and encoding it into a video time sequence; extracting a Mel spectrum from the input long-time voice to obtain an audio time sequence; cutting the video time sequence and the audio time sequence into fixed-length short-time video sequences and short-time audio sequences and normalizing them respectively; extracting first-order gradients of the short-time video sequence and the short-time audio sequence respectively; extracting second-order gradients of the short-time video sequence and the short-time audio sequence respectively; fusing the first-order and second-order gradients into mixed-order gradient information and embedding it into the corresponding short-time video sequence or short-time audio sequence to obtain a video gradient-enhanced sequence and an audio gradient-enhanced sequence; performing multi-stage processing and aggregation on the video and audio gradient-enhanced sequences respectively to obtain a video vectorized representation and an audio vectorized representation; mapping the video and audio vectorized representations into the same dimensional space to obtain a video normalized representation and an audio normalized representation respectively; computing the difference between the video and audio normalized representations to generate a modal complementary representation; performing weighted fusion of the modal complementary representation with the corresponding video or audio normalized representation to obtain a complemented fusion representation; and outputting a depression degree prediction score based on the fusion representation (the sketches following the claims illustrate one possible wiring of these steps).
- 10. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus; the memory is configured to store a computer program; and the processor is configured to implement the functions of the modules in the multi-modal depression assessment system based on gradient embedding and modal complementation according to any one of claims 1 to 8 when executing the program stored in the memory.
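The following sketches are illustrative only and are not part of the claims. As an illustration of the first- and second-order gradient extraction of claims 6 and 7, the snippet below applies predefined differential convolution kernels along the time axis of a (batch, 1, time, feature) tensor. The patent does not disclose the kernel values; the central-difference and second-difference stencils, the tensor layout and the use of PyTorch are all assumptions.

```python
import torch
import torch.nn.functional as F

# Predefined differential kernels applied along the time axis of a
# (batch, 1, time, feature) tensor. The actual kernel values are not
# disclosed in the patent; standard difference stencils are assumed.
FIRST_ORDER = torch.tensor([[-0.5], [0.0], [0.5]]).view(1, 1, 3, 1)
SECOND_ORDER = torch.tensor([[1.0], [-2.0], [1.0]]).view(1, 1, 3, 1)

def differential_gradients(x: torch.Tensor):
    """Return first- and second-order temporal gradients (claims 6 and 7)."""
    g1 = F.conv2d(x, FIRST_ORDER, padding=(1, 0))   # first-order gradient
    g2 = F.conv2d(x, SECOND_ORDER, padding=(1, 0))  # second-order gradient
    return g1, g2
```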
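A minimal sketch of the mixed-order gradient embedding module of claim 2: the two gradients are spliced, passed through a convolution layer and a Softmax to produce mixed-order gradient weights, and embedded into the sequence by a Hadamard product. The kernel size and the choice of Softmax axis (over all time-feature positions) are assumptions.

```python
import torch
import torch.nn as nn

class MixedOrderGradientEmbedding(nn.Module):
    """Sketch of claim 2 for a (batch, 1, time, feature) input; the
    convolution kernel size and the Softmax axis are assumptions."""
    def __init__(self):
        super().__init__()
        # Gradient combination: splice the two gradients, reduce to one weight map.
        self.combine = nn.Conv2d(2, 1, kernel_size=3, padding=1)

    def forward(self, x, g1, g2):
        spliced = torch.cat([g1, g2], dim=1)   # splice first/second-order gradients
        w = torch.softmax(self.combine(spliced).flatten(1), dim=-1).view_as(x)
        return x * w                           # Hadamard-product gradient adjustment
```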
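A sketch of the multi-head attention module with multi-stage vectorization of claim 3. The patent specifies N multi-head self-attention layers, a one-dimensional convolution on each layer's output, and attention-based aggregation of the N staged vectors; the temporal mean pooling used to reduce each convolved sequence to a vector, the linear scoring layer, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class MultiStageVectorization(nn.Module):
    """Sketch of claim 3; pooling, scoring and dimensions are assumptions."""
    def __init__(self, dim: int = 128, n_stages: int = 3, n_heads: int = 4):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.MultiheadAttention(dim, n_heads, batch_first=True)
            for _ in range(n_stages))
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1)
            for _ in range(n_stages))
        self.score = nn.Linear(dim, 1)  # attention over the N staged vectors

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time, dim)
        staged = []
        for attn, conv in zip(self.stages, self.convs):
            x, _ = attn(x, x, x)                      # one multi-head self-attention layer
            v = conv(x.transpose(1, 2)).mean(dim=-1)  # 1-D conv, pool to a stage vector
            staged.append(v)
        s = torch.stack(staged, dim=1)                # splice the N staged vectors
        w = torch.softmax(self.score(s), dim=1)       # attention weights over stages
        return (w * s).sum(dim=1)                     # aggregated vectorized representation
```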
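A sketch of the modal complementary representation and complementation modules of claims 4 and 5: the complementary representation is a difference of normalized representations, and fusion weights come from a fully connected layer and a Softmax over the spliced pair. Using two scalar weights per sample (one for the modality, one for its complement) is an assumption.

```python
import torch
import torch.nn as nn

class ModalComplementation(nn.Module):
    """Sketch of claims 4 and 5; the per-sample scalar weighting is an assumption."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.fc = nn.Linear(2 * dim, 2)  # complementary weights from the spliced pair

    def fuse(self, own: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        comp = own - other                           # complementary representation (claim 4)
        w = torch.softmax(self.fc(torch.cat([own, comp], dim=-1)), dim=-1)
        return w[..., :1] * own + w[..., 1:] * comp  # weighted fusion (claim 5)

    def forward(self, video: torch.Tensor, audio: torch.Tensor):
        return self.fuse(video, audio), self.fuse(audio, video)
```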
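A sketch of the windowing and normalization of claim 8: fixed-length clips are cut with a given overlap rate and mean-variance normalized with the statistics of the whole long sequence. The window length and 50% overlap are illustrative values, not disclosed in the patent.

```python
import numpy as np

def short_time_sequences(seq: np.ndarray, window: int, overlap: float = 0.5):
    """Sketch of claim 8; window and overlap values are illustrative."""
    mu, sigma = seq.mean(), seq.std() + 1e-8   # statistics of the long sequence
    step = max(1, int(window * (1.0 - overlap)))
    clips = [seq[i:i + window] for i in range(0, len(seq) - window + 1, step)]
    return [(c - mu) / sigma for c in clips]   # mean-variance normalized clips
```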
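Finally, the sketches above can be wired together to mirror the method of claim 9 on one pair of short-time clips. The tensor shapes, the shared feature dimension of 128 (which lets the dimension normalization mapping be omitted here), and the final linear regression head are all assumptions.

```python
import torch
import torch.nn as nn

# Illustrative wiring of the sketches above: two short-time clips of
# 90 frames with 128 features per frame (shapes assumed).
video = torch.randn(2, 1, 90, 128)
audio = torch.randn(2, 1, 90, 128)

embed = MixedOrderGradientEmbedding()
vectorize = MultiStageVectorization(dim=128)
complement = ModalComplementation(dim=128)
head = nn.Linear(256, 1)  # depression degree prediction score (assumed head)

reps = []
for clip in (video, audio):
    g1, g2 = differential_gradients(clip)        # first/second-order gradients
    enhanced = embed(clip, g1, g2)               # gradient-enhanced sequence
    reps.append(vectorize(enhanced.squeeze(1)))  # (batch, 128) per modality
fused_video, fused_audio = complement(*reps)
score = head(torch.cat([fused_video, fused_audio], dim=-1))  # (batch, 1)
```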
Description
Multi-modal depression assessment system and method based on gradient embedding and modal complementation

Technical Field

The invention relates to the technical field of artificial intelligence and human-computer interaction, in particular to affective computing and multi-modal depression level prediction, and specifically to a multi-modal depression assessment system and method based on gradient embedding and modal complementation.

Background

Early identification and assessment of depression is critical to intervention and therapy. However, traditional diagnostic methods rely heavily on specialized psychiatrists, are limited by imbalances in the distribution of medical resources, and make it difficult for patients to obtain effective screening and diagnosis in a timely manner. In recent years, artificial-intelligence-based auxiliary diagnosis has offered a new solution to these problems. Research shows that depression patients exhibit regular changes in behavioral patterns such as facial expression dynamics and speech prosody. Therefore, automatically extracting behavioral features from an individual's long-time video and speech signals with computer vision and speech analysis techniques, and assessing the degree of depression from them, has become a research direction of significant application value.

Currently, existing automated depression assessment schemes generally follow a multi-modal processing framework: high-dimensional feature sequences are first extracted from the video and audio modalities respectively, and these features are then fused and fed into a prediction model. However, such methods still have significant limitations in performance and reliability, particularly in the following respects.

First, at the feature extraction level, existing models have limited ability to capture the subtle dynamic changes in audio-video sequences that are closely related to depressive states. These local dynamic cues are critical to accurately distinguishing healthy individuals from depressed patients, yet existing architectures struggle to model them adequately and efficiently.

Second, regarding the use of sequence information, existing methods are insufficient at integrating the multi-level, multi-stage evolution information of temporal features. As a result, the model's vectorized representation of a sequence may fail to fully reflect its complete state changes throughout the evolution process, losing important discriminative information.

Finally, at the multi-modal fusion level, prior approaches mine and exploit the deep complementary relations between features of different modalities (such as video and audio) only superficially. Simple feature concatenation or blending strategies fail to explicitly model and exploit the intrinsic discriminative complementarity between modalities, which limits the upper bound of the model's representational capability and leaves model optimization without explicit guidance.

Therefore, there is an urgent need in the art for a novel depression degree assessment model that can more effectively capture dynamic sequence details, make full use of multi-stage temporal information, and achieve more accurate modal complementation, so as to improve the accuracy and reliability of automatic assessment.
Disclosure of Invention

The invention provides a multi-modal depression assessment system and method based on gradient embedding and modal complementation, which address the technical problems that existing automatic depression assessment schemes capture local dynamic details of audio-video sequences insufficiently, under-utilize multi-stage information in the temporal evolution process, and lack an explicit mechanism for mining and fusing deep complementary information between modalities.

In a first aspect, the invention provides a multi-modal depression assessment system based on gradient embedding and modal complementation, comprising: a long-time video face extraction and encoding module, for extracting a face image sequence from an input long-time video and encoding it into a video time sequence; a long-time voice Mel spectrum extraction module, for extracting a Mel spectrum from an input long-time voice to obtain an audio time sequence; and a sequence normalization and short-time sequence generation module, connected to the long-time video face extraction and encoding module and the long-time voice Mel spectrum extraction module, for cutting the video time sequence and the audio time sequence into fixed-length short-time video sequences and short-time audio sequences and normalizing them respectively.
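As an illustration of the long-time voice Mel spectrum extraction module described above, the following is a minimal sketch assuming librosa; the sampling rate, number of Mel bands and dB conversion are illustrative choices, not values disclosed by the patent.

```python
import librosa
import numpy as np

def mel_time_sequence(wav_path: str, n_mels: int = 64) -> np.ndarray:
    """Extract a Mel-spectrum time sequence from long-time voice (sketch)."""
    y, sr = librosa.load(wav_path, sr=16000)              # assumed sampling rate
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel).T                     # (time, n_mels) sequence
```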