CN-122000017-A - Multi-modal behavior data processing system
Abstract
The invention discloses a multi-modal behavior data processing system comprising a data acquisition module, a feature extraction layer, a dynamic attention weight layer, a multi-modal fusion layer, and a downstream task decision layer. The data acquisition module guides human-computer interaction through an international neuropsychiatric interview tool aligned with the DSM-5 standard to acquire audio and video stream data. The feature extraction layer extracts video feature vectors (including facial action unit activation intensities and the like), audio feature vectors (including Mel-frequency cepstral coefficients and the like), and text feature vectors (generated by a deep language model after automatic speech recognition transcription). The dynamic attention weight layer combines data quality, symptomatological prior knowledge, and cross-modal relevance to generate dynamic fusion weights; the multi-modal fusion layer weights, concatenates, and reduces the dimension of the feature vectors to obtain a fused feature vector; and the downstream task decision layer completes the evaluation and generates a multi-modal behavioral index evaluation report. The system is deployed non-invasively, its risk is controllable, and it improves the robustness and accuracy of the evaluation.
Inventors
- KUANG LI
- DIAO LIGUO
- HU JINHUI
- HAN YUSHUANG
- LIU JIE
- SUN JINGSI
- CHEN JIANMEI
- ZHANG QI
- WANG WO
- AI MING
- CAO JUN
- YANG JIAN
- XU XIAOMING
Assignees
- 新梅奥健康管理研究院(重庆)有限责任公司
- 重庆医科大学附属第一医院
Dates
- Publication Date: 2026-05-08
- Application Date: 2025-11-22
Claims (10)
- 1. A multi-modal behavioral data processing system, characterized in that the system comprises a data acquisition module, a feature extraction layer, a dynamic attention weight layer, a multi-modal fusion layer, and a downstream task decision layer; the data acquisition module is deployed on general computing equipment and is used for acquiring audio and video stream data of a user in a human-computer question-answering mode guided by an international neuropsychiatric interview tool aligned with the DSM-5 standard; the feature extraction layer is connected with the data acquisition module and is used for extracting video feature vectors, audio feature vectors, and text feature vectors from the audio-video stream data in parallel; the dynamic attention weight layer is connected with the feature extraction layer and is used for integrating data quality, symptomatological prior knowledge, and cross-modal feature association degree to generate, for the video, audio, and text feature vectors respectively, dynamic fusion weights aimed at specific psychological symptoms; the multi-modal fusion layer is connected with the feature extraction layer and the dynamic attention weight layer respectively and is used for weighting, concatenating, and reducing the dimension of the video, audio, and text feature vectors based on the fusion weights to obtain a multi-modal fusion feature vector; and the downstream task decision layer is connected with the multi-modal fusion layer and is used for completing the mental-state evaluation task based on the multi-modal fusion feature vector, generating a structured multi-modal behavioral index evaluation report containing quantitative multi-modal behavioral indexes, and providing objective physiological index references for the diagnosis of mental and psychological diseases such as clinical depression and anxiety.
- 2. The multi-modal behavioral data processing system as claimed in claim 1, wherein: when the feature extraction layer extracts the video feature vector, it specifically extracts features related to the user's facial expression, head posture, eye movement, and limb movement, and the video feature vector comprises quantitative indexes of facial action unit (FACS AU) activation intensity, head posture angles (pitch, yaw, and roll), gaze direction, blink frequency, and limb movement amplitude.
- 3. The multi-modal behavioral data processing system as claimed in claim 1, wherein: when the feature extraction layer extracts the audio feature vector, digital signal processing techniques based on the LibROSA and Praat algorithm libraries are adopted to extract acoustic and prosodic features, and the audio feature vector comprises quantitative indexes of Mel-frequency cepstral coefficients (MFCCs), fundamental frequency (pitch), harmonics-to-noise ratio (HNR), jitter, shimmer, speech rate, and silence-period proportion (an illustrative extraction sketch follows the claims).
- 4. The multi-modal behavioral data processing system as claimed in claim 1, wherein: when the feature extraction layer extracts the text feature vector, the audio stream in the audio-video stream data is first transcribed into text through automatic speech recognition (ASR), and the text is then converted into a semantic representation vector by a pre-trained deep language model; the deep language model is a BERT model or a large language model (LLM) fine-tuned on the medical domain, and the text feature vector captures the deep semantics, themes, and emotional tendencies of the text (an encoding sketch follows the claims).
- 5. The multi-modal behavioral data processing system as claimed in claim 1, wherein: the dynamic attention weight layer comprises a data quality evaluation module for evaluating, in real time, the quality of the input data corresponding to the video, audio, and text feature vectors and generating a data quality score vector; video data quality is comprehensively evaluated through the confidence of face detection, the proportion of the face region in the picture, illumination uniformity (computed from the standard deviation of image brightness), and sharpness (computed from the image gradient via the Laplacian operator); audio data quality is evaluated by computing the signal-to-noise ratio (SNR), where the background noise level is estimated in non-speech segments, the signal level is estimated in speech segments, and their ratio gives the SNR; text data quality is evaluated through the confidence score of the automatic speech recognition (ASR) model; the quality score of each modality in the data quality score vector lies in the range [0, 1], where 0 indicates completely unusable data and 1 indicates high-quality data (a scoring sketch follows the claims).
- 6. The multi-modal behavioral data processing system as claimed in claim 1, wherein: the dynamic attention weight layer includes a symptomatological prior knowledge base storing a "symptom-modality" static weight table defined by clinical psychologists, which provides the base weights of the video, audio, and text feature vectors for the specific mental symptom to be evaluated (e.g. "low mood") and embodies the typical expressive strength of different mental symptoms in different modalities.
- 7. The multi-modal behavioral data processing system as claimed in claim 1, wherein: the dynamic attention weight layer comprises a cross-modal attention network based on the Transformer architecture; the cross-modal attention network receives the video, audio, and text feature vectors as input, learns the interdependence among the modal feature vectors through self-attention and cross-attention mechanisms, and outputs an attention score vector; when the attention score of a given modality is computed, the feature vector of that modality serves as the query vector (Query) and the concatenation of the feature vectors of the other two modalities serves as the key (Key) and value (Value) vectors; the attention score vector reflects the salience and information content of each modality's features in the current data segment (a sketch of this pattern follows the claims).
- 8. The multi-modal behavioral data processing system as claimed in claim 1, wherein: when the dynamic attention weight layer generates the fusion weights, a computation is first carried out based on the base weights from the symptomatological prior knowledge base, the quality score vector from the data quality evaluation module, and the attention score vector from the cross-modal attention network, and the final fusion weights are then obtained through normalization, ensuring that the fusion weights of all modalities sum to 1 (a combination sketch follows the claims).
- 9. The multi-modal behavioral data processing system as claimed in claim 1, wherein: the processing of the video, audio, and text feature vectors by the multi-modal fusion layer specifically comprises: step 1, multiplying each final fusion weight by the corresponding modal feature vector to obtain the weighted feature vector of each modality; step 2, concatenating all weighted feature vectors to form a high-dimensional pre-fusion feature vector; step 3, inputting the pre-fusion feature vector into a multi-layer perceptron (MLP) for nonlinear transformation and dimension reduction to obtain the multi-modal fusion feature vector; the MLP learns the optimal feature combination through training to adapt to the downstream evaluation task (a fusion sketch follows the claims).
- 10. The multi-modal behavioral data processing system as claimed in claim 1, wherein: the evaluation task of the downstream task decision layer comprises a classification task and a regression task; the classification task judges whether the user is at risk of depression by inputting the multi-modal fusion feature vector into a Softmax classifier, which outputs the probability of each risk category; the regression task predicts the user's depression severity score by inputting the multi-modal fusion feature vector into a linear regression model or a gradient boosted tree model (such as XGBoost), which outputs a specific score (a decision-head sketch follows the claims); the downstream task decision layer further compares the evaluation result with a normal-population baseline and integrates it into the multi-modal behavioral index evaluation report containing numerical values and charts; the report includes: speech analysis results, namely quantitative indexes of speech rate, pitch, and rhythm; expression analysis results, namely facial action unit (AU) recognition results and emotion-related dynamic characteristics; semantic analysis results, namely keywords and topic recognition results related to psychological symptom characteristics in the dialogue; and trend graphs of all quantitative indexes together with their comparison against the normal-population baseline; the system does not directly provide a disease diagnosis conclusion, the final diagnosis and treatment decision being made by a medical practitioner who integrates clinical manifestations, the evaluation report, and other relevant examination results, and the system adopts a non-invasive design that creates no physical or biological risk.
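To make the claim-3 feature set concrete, the following is a minimal extraction sketch using LibROSA. Jitter, shimmer, and HNR would normally come from the Praat side (for example via the parselmouth bindings) and are omitted here; the sample rate, pitch range, and silence threshold are illustrative assumptions, not values fixed by the patent.

```python
import librosa
import numpy as np

def extract_audio_features(wav_path: str) -> np.ndarray:
    """Sketch of part of the claim-3 acoustic/prosodic feature vector."""
    y, sr = librosa.load(wav_path, sr=16000)

    # 13 Mel-frequency cepstral coefficients, averaged over time.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

    # Fundamental frequency (pitch) via the pYIN tracker; unvoiced
    # frames come back as NaN and are excluded from the mean.
    f0, _, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)
    pitch_mean = float(np.nanmean(f0))

    # Silence-period proportion: fraction of samples falling outside
    # the intervals librosa considers non-silent at the 30 dB threshold.
    voiced = librosa.effects.split(y, top_db=30)
    voiced_samples = sum(end - start for start, end in voiced)
    silence_ratio = 1.0 - voiced_samples / len(y)

    return np.concatenate([mfcc, [pitch_mean, silence_ratio]])
```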
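Claim 4's text branch can be sketched with a Hugging Face BERT encoder over the ASR transcript. The `bert-base-chinese` checkpoint and mean-pooling readout are assumptions; the patent only requires "a BERT model or a medically fine-tuned LLM" and does not specify the pooling.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; a medical-domain fine-tune would slot in here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

def extract_text_features(transcript: str) -> torch.Tensor:
    """Encode an ASR transcript into a semantic representation vector."""
    inputs = tokenizer(transcript, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool token embeddings into one sentence-level vector.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)
```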
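The claim-5 quality cues can be scored as follows. The equal-weight average for video and the 0-30 dB SNR mapping are illustrative choices; the claim names the cues but not the combination rule.

```python
import cv2
import numpy as np

def video_quality_score(frame: np.ndarray, face_conf: float,
                        face_area_ratio: float) -> float:
    """Fuse the four claim-5 video cues into one score in [0, 1]."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Illumination uniformity: lower brightness std -> more uniform.
    illum = 1.0 - min(gray.std() / 128.0, 1.0)
    # Sharpness: variance of the Laplacian, squashed into [0, 1].
    sharp = min(cv2.Laplacian(gray, cv2.CV_64F).var() / 1000.0, 1.0)
    return float(np.mean([face_conf, face_area_ratio, illum, sharp]))

def audio_quality_score(signal_level: float, noise_level: float) -> float:
    """Map the speech-segment/noise-segment level ratio (SNR) to [0, 1]:
    0 dB or worse -> 0, 30 dB or better -> 1."""
    snr_db = 10.0 * np.log10(signal_level / max(noise_level, 1e-12))
    return float(np.clip(snr_db / 30.0, 0.0, 1.0))
```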
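Claim 7's query/key-value pattern, where each modality queries the splice of the other two, is sketched below in PyTorch. The embedding dimension, head count, and mean-based readout of the attended features into a scalar salience score are assumptions; the claim leaves them open.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Each modality is the Query; the other two, concatenated along
    the sequence axis, are the Key and Value (claim 7)."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def score(self, query_mod: torch.Tensor, other_a: torch.Tensor,
              other_b: torch.Tensor) -> torch.Tensor:
        kv = torch.cat([other_a, other_b], dim=1)   # splice the others
        attended, _ = self.attn(query_mod, kv, kv)
        # One simple scalar readout per sample; the patent does not
        # prescribe how the attended features become a score.
        return attended.mean(dim=(1, 2))

    def forward(self, video, audio, text):
        # Inputs: (batch, seq_len, dim) per modality.
        scores = torch.stack([
            self.score(video, audio, text),
            self.score(audio, video, text),
            self.score(text, video, audio),
        ], dim=-1)
        return torch.softmax(scores, dim=-1)  # attention score vector
```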
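Claim 8 combines the three signals and normalizes. The element-wise product below is one plausible combination under the claim's wording, which only requires a computation over the three vectors followed by normalization to sum 1; the base vector plays the role of a claim-6 "symptom-modality" table entry, and all numbers are illustrative.

```python
import numpy as np

def fusion_weights(base: np.ndarray, quality: np.ndarray,
                   attention: np.ndarray) -> np.ndarray:
    """base: claim-6 symptom-modality weights; quality: claim-5 scores;
    attention: claim-7 scores. Each has shape (3,) for video/audio/text."""
    raw = base * quality * attention
    return raw / raw.sum()   # normalize so the modal weights sum to 1

# e.g. for the symptom "low mood" (illustrative numbers):
w = fusion_weights(np.array([0.40, 0.35, 0.25]),   # base weights
                   np.array([0.90, 0.60, 0.95]),   # quality scores
                   np.array([0.30, 0.30, 0.40]))   # attention scores
assert abs(w.sum() - 1.0) < 1e-9
```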
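The three steps of claim 9 (weight, splice, reduce) map directly onto a small PyTorch module. The per-modality dimensions and MLP layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Claim-9 fusion: weight each modality vector, concatenate, then
    reduce dimension through an MLP. Sizes are illustrative."""

    def __init__(self, dims=(512, 128, 768), fused_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(sum(dims), 512), nn.ReLU(),
            nn.Linear(512, fused_dim),
        )

    def forward(self, video, audio, text, weights):
        # weights: (batch, 3) final fusion weights from claim 8.
        # Step 1: multiply each modality by its fusion weight.
        weighted = [weights[:, 0:1] * video,
                    weights[:, 1:2] * audio,
                    weights[:, 2:3] * text]
        # Step 2: splice into the high-dimensional pre-fusion vector.
        pre_fusion = torch.cat(weighted, dim=-1)
        # Step 3: nonlinear transform + dimension reduction via the MLP.
        return self.mlp(pre_fusion)
```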
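Finally, claim 10's two decision heads can be sketched as below. The fused dimension of 256 and the binary risk classes are assumptions, and the claim equally allows a gradient boosted tree (e.g. `xgboost.XGBRegressor`) trained on the fused vectors in place of the linear regressor.

```python
import torch.nn as nn

FUSED_DIM = 256  # assumed size of the multi-modal fusion feature vector

# Classification head: probability of each depression-risk category.
risk_classifier = nn.Sequential(nn.Linear(FUSED_DIM, 2),
                                nn.Softmax(dim=-1))

# Regression head: a scalar depression severity score; a gradient
# boosted tree model such as XGBoost is the claim's stated alternative.
severity_regressor = nn.Linear(FUSED_DIM, 1)
```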
Description
Multi-modal behavior data processing system

Technical Field

The invention relates to a data processing system, and in particular to a multi-modal behavior data processing system.

Background

At present, the clinical diagnosis of affective mental diseases such as depression and anxiety mostly relies on traditional subjective scale assessment. This method depends on the clinical experience of the medical practitioner, is strongly influenced by factors such as the practitioner's subjective judgment and the patient's expressive accuracy, easily leads to low assessment accuracy and frequent misdiagnosis, and hardly meets the clinical need for an objective diagnostic basis. To improve on this situation, some technologies attempt auxiliary evaluation by extracting behavioral features from single-modality data (such as text, audio, or video), but a single modality has obvious limitations: text data is limited by the patient's language ability and can hardly capture non-verbal emotional signals; audio data is easily disturbed by environmental noise and its acoustic features lack stability; and video data is affected by illumination and shooting angle, which limits the accuracy of extracting features such as facial action units and limb movements, so a single modality cannot fully reflect the patient's mental and psychological state. In addition, a small number of multi-modal fusion technologies attempt to integrate multi-dimensional data, but many adopt static weight distribution, so the weights cannot be adjusted dynamically according to the quality of the real-time input data (such as video sharpness, audio signal-to-noise ratio, and text recognition confidence) or the differing expression of different mental symptoms across modalities (for example, "low mood" manifests with different strength in facial expression and in intonation); the fusion effect is therefore poor, the robustness and accuracy of the evaluation remain to be improved, and the clinical need for accurate auxiliary diagnosis is hard to support effectively.

Disclosure of Invention

The technical problem to be solved by the invention is to overcome the defects of the prior art and provide a multi-modal behavior data processing system.
In order to solve the above technical problems, the technical scheme provided by the invention is a multi-modal behavior data processing system comprising a data acquisition module, a feature extraction layer, a dynamic attention weight layer, a multi-modal fusion layer, and a downstream task decision layer. The data acquisition module is deployed on general computing equipment and acquires audio and video stream data of a user in a human-computer question-answering mode guided by an international neuropsychiatric interview tool aligned with the DSM-5 standard. The feature extraction layer is connected with the data acquisition module and extracts video, audio, and text feature vectors from the audio-video stream data in parallel. The dynamic attention weight layer is connected with the feature extraction layer and integrates data quality, symptomatological prior knowledge, and cross-modal feature association degree to generate, for the video, audio, and text feature vectors respectively, dynamic fusion weights aimed at specific psychological symptoms. The multi-modal fusion layer is connected with the feature extraction layer and the dynamic attention weight layer respectively and weights, concatenates, and reduces the dimension of the three feature vectors based on the fusion weights to obtain a multi-modal fusion feature vector. The downstream task decision layer is connected with the multi-modal fusion layer, completes the mental-state evaluation task based on the multi-modal fusion feature vector, generates a structured multi-modal behavioral index evaluation report containing quantitative multi-modal behavioral indexes, and provides objective physiological index references for the diagnosis of mental and psychological diseases such as clinical depression and anxiety.

As an improvement, when the feature extraction layer extracts the video feature vector, it specifically extracts features related to the user's facial expression, head posture, eye movement, and limb movement, and the video feature vector comprises quantitative indexes of facial action unit (FACS AU) activation intensity, head posture angles (pitch, yaw, and roll), gaze direction, blink frequency, and limb movement amplitude (one possible layout is sketched below).

As an improvement, when the feature extraction layer extracts the audio feature vector, a digital signal processing technique based on the LibROSA and Praat algorithm libraries is adopted to extract acoustic and prosodic features.
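A minimal sketch of how the video feature vector described above might be laid out, assuming per-window estimates from an upstream face/pose tracker; the field names, units, and the 17-AU count are illustrative assumptions, since the patent names the quantities but not their dimensionality or ordering.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VideoFeatures:
    """One video feature vector per analysis window (claim 2).

    All shapes and units are assumptions about layout, not values
    specified by the patent.
    """
    au_intensity: np.ndarray      # FACS AU activation intensities, e.g. shape (17,)
    head_pose: np.ndarray         # pitch, yaw, roll in degrees, shape (3,)
    gaze_direction: np.ndarray    # unit gaze vector, shape (3,)
    blink_rate: float             # blinks per minute
    limb_motion_amplitude: float  # normalized movement magnitude

    def to_vector(self) -> np.ndarray:
        """Flatten into the vector consumed by the fusion layers."""
        return np.concatenate([
            self.au_intensity, self.head_pose, self.gaze_direction,
            [self.blink_rate, self.limb_motion_amplitude],
        ])
```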