CN-121983284-A - Model training method, multi-modal depression screening method, and electronic device
Abstract
The disclosure provides a model training method, a multi-modal depression screening method, and an electronic device, and belongs to the technical field of computers. The model training method comprises: dividing a video sample into a plurality of segments; performing feature extraction on each segment to obtain a plurality of features to be processed; fusing the m-th visual feature and the n-th audio feature among the features to be processed to obtain a first fused feature; fusing the k-th text feature and the n-th audio feature among the features to be processed to obtain a second fused feature; processing the first fused feature and the second fused feature with a cross-modal model to obtain a cross-modal fused feature; processing the cross-modal fused feature with a behavior pattern analysis model to obtain a depression screening result; and training the cross-modal model and the behavior pattern analysis model according to the depression screening result and labeling information. By training the cross-modal model on cross-modal features and using it for depression screening, the method can effectively improve the accuracy of depression screening results.
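The abstract describes a staged pipeline: segment the video, extract per-modality features, form a visual-audio fusion and a text-audio fusion, pass both through a cross-modal model, and classify with a behavior pattern analysis model. The following shape-level sketch shows one way such a pipeline could be wired together; the module choices, feature dimensions, equal fusion weights, and class count are illustrative assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn

class CrossModalScreeningPipeline(nn.Module):
    """Shape-level illustration of the claimed pipeline; not the patented model."""

    def __init__(self, dim: int = 256, num_classes: int = 2):
        super().__init__()
        # Per-modality input sizes below are placeholders.
        self.visual_proj = nn.Linear(512, dim)
        self.audio_proj = nn.Linear(128, dim)
        self.text_proj = nn.Linear(768, dim)
        # Stand-in for the "cross-modal model with a mutual attention mechanism".
        self.cross_modal = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        # Stand-in for the "behavior pattern analysis model".
        self.behavior_head = nn.Linear(dim, num_classes)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # visual: (B, M, 512), audio: (B, N, 128), text: (B, K, 768) segment-level features
        v = self.visual_proj(visual).mean(dim=1)
        a = self.audio_proj(audio).mean(dim=1)
        t = self.text_proj(text).mean(dim=1)
        first_fused = 0.5 * v + 0.5 * a    # visual-audio fusion (equal weights assumed)
        second_fused = 0.5 * t + 0.5 * a   # text-audio fusion (equal weights assumed)
        pair = torch.stack([first_fused, second_fused], dim=1)   # (B, 2, dim)
        cross = self.cross_modal(pair).mean(dim=1)               # cross-modal fused feature
        return self.behavior_head(cross)                         # screening logits
```

At training time, these logits would be compared with the labeling information to obtain the loss value mentioned in the abstract.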
Inventors
- GENG WUJUN
- TANG YIHAN
Assignees
- 瓯江实验室 (Oujiang Laboratory)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-23
Claims (19)
- 1. A model training method, comprising: detecting each frame in a video sample, so that frames that reflect depression-specific features are taken as valid frames and frames that do not reflect the depression-specific features are taken as invalid frames; dividing the video sample into a plurality of segments of a predetermined duration, wherein the proportion of valid frames in each segment is greater than a proportion threshold; performing feature extraction on each segment to obtain a plurality of features to be processed, wherein each feature to be processed is a visual modality feature, an audio modality feature, or a text modality feature; fusing the m-th visual modality feature and the n-th audio modality feature among the features to be processed to obtain a first fused feature, where M is the total number of visual modality features and N is the total number of audio modality features; fusing the k-th text modality feature and the n-th audio modality feature among the features to be processed to obtain a second fused feature, where K is the total number of text modality features; processing the first fused feature and the second fused feature by using a cross-modal model having a mutual attention mechanism to obtain a cross-modal fused feature; processing the cross-modal fused feature by using a behavior pattern analysis model to obtain a depression screening result; determining a loss value according to the depression screening result and labeling information of the video sample; and training the cross-modal model and the behavior pattern analysis model by using the loss value.
- 2. The model training method according to claim 1, wherein the mutual attention mechanism comprises a first sub-mechanism for querying the corresponding visual modality features and text modality features by utilizing the audio modality features, a second sub-mechanism for querying the corresponding audio modality features by utilizing the visual modality features and the text modality features, a third sub-mechanism for querying the corresponding text modality features by utilizing the visual modality features, and a fourth sub-mechanism for querying the corresponding visual modality features by utilizing the text modality features (an illustrative sketch of these four query directions follows the claims).
- 3. The model training method of claim 1, wherein fusing the m-th visual modality feature and the n-th audio modality feature among the features to be processed to obtain the first fused feature comprises: encoding the m-th visual modality feature by using a visual encoder to obtain a visual encoding feature; encoding the n-th audio modality feature by using an audio encoder to obtain an audio encoding feature; performing linear projection on the visual encoding feature to obtain a projection feature; pooling the audio encoding feature to obtain a pooled feature; and obtaining the first fused feature according to the projection feature and the pooled feature (a fusion sketch follows the claims).
- 4. The model training method according to claim 3, wherein obtaining the first fused feature according to the projection feature and the pooled feature comprises: calculating a weighted sum of the projection feature and the pooled feature to obtain the first fused feature.
- 5. The model training method of claim 1, wherein fusing the k-th text modality feature and the n-th audio modality feature among the features to be processed to obtain the second fused feature comprises: encoding the k-th text modality feature by using a text encoder to obtain a text encoding feature; encoding the n-th audio modality feature by using an audio encoder to obtain an audio encoding feature; and fusing the text encoding feature and the audio encoding feature to obtain the second fused feature.
- 6. The model training method of claim 1, wherein detecting each frame in the video sample comprises: detecting, for the i-th frame, the gaze feature, the neuromotor feature, and the behavior synchronization feature of the target subject and the fundamental-frequency abrupt-change rate of the audio stream corresponding to the i-th frame, where I is the total number of frames of the video sample; calculating a score for the gaze feature, a score for the neuromotor feature, a score for the behavior synchronization feature, and a score for the fundamental-frequency abrupt-change rate, respectively, to obtain a total clinical information score; if the total clinical information score is greater than or equal to a score threshold, determining the i-th frame to be a valid frame; and if the total clinical information score is less than the score threshold, determining the i-th frame to be an invalid frame (a scoring sketch follows the claims).
- 7. The model training method according to claim 6, wherein the score of the gaze feature is positively correlated with the gaze-direction deviation of the target subject; the score of the neuromotor feature is positively correlated with the activity intensity of designated muscles; the score of the behavior synchronization feature is positively correlated with the degree of synchronization of the target subject's mouth movements or facial expressions when the target subject's speech rhythm is disturbed; and the score of the fundamental-frequency abrupt-change rate is determined by how that rate compares with a predetermined abrupt-change-rate threshold.
- 8. The model training method according to claim 7, wherein the designated muscles include at least one of the orbicularis, zygomaticus, and mandibular muscles of the target subject.
- 9. The model training method of claim 6, further comprising: if at least one of the gaze feature, the neuromotor feature, and the behavior synchronization feature of the target subject cannot be detected in the i-th frame, determining the i-th frame to be an invalid frame.
- 10. The model training method of claim 6, wherein determining the i-th frame to be an invalid frame if the total clinical information score is less than the score threshold comprises: if the total clinical information score is less than the score threshold, detecting whether a speech rhythm disturbance exists in the audio stream corresponding to the i-th frame; and if no speech rhythm disturbance exists in the audio stream corresponding to the i-th frame, determining the i-th frame to be an invalid frame.
- 11. The model training method of claim 10, further comprising: if a speech rhythm disturbance exists in the audio stream corresponding to the i-th frame, determining the i-th frame to be a valid frame.
- 12. The model training method of claim 1, wherein the feature extraction of each segment comprises: extracting a plurality of semantic features of different modalities from each segment; mapping each semantic feature to a specified semantic space to obtain a plurality of mapped features; aligning the plurality of mapped features to obtain a position embedding feature for each mapped feature; and correcting each mapped feature according to a modality condition embedding feature and the position embedding feature of that mapped feature to obtain the plurality of features to be processed.
- 13. The model training method of claim 12, wherein aligning the plurality of mapped features comprises: detecting the frame count corresponding to each mapped feature in each segment to obtain a plurality of frame counts; taking the maximum of the plurality of frame counts as a reference frame count; determining a mapping ratio for each mapped feature according to the ratio of the reference frame count to the frame count corresponding to that mapped feature; and generating the position embedding feature of each mapped feature according to the mapping ratio and the frame indices of that mapped feature (an alignment sketch follows the claims).
- 14. The model training method of claim 12, wherein correcting each mapped feature according to the modality condition embedding feature and the position embedding feature of that mapped feature comprises: adding the j-th mapped feature, the modality condition embedding feature of the j-th mapped feature, and the position embedding feature of the j-th mapped feature to obtain the j-th feature to be processed, where J is the total number of mapped features.
- 15. The model training method of any one of claims 1-14, wherein processing the cross-modal fused feature using the behavior pattern analysis model to obtain the depression screening result comprises: masking the cross-modal fused feature by using a preset masking matrix to obtain a masked fused feature; and processing the masked fused feature by using the behavior pattern analysis model to obtain the depression screening result.
- 16. A multi-modal depression screening method, comprising: detecting each frame in a video to be screened so that frames that reflect depression-specific features are taken as valid frames and frames that do not reflect the depression-specific features are taken as invalid frames; dividing the video to be screened into a plurality of segments of a predetermined duration, wherein the proportion of valid frames in each segment is greater than a proportion threshold; performing feature extraction on each segment to obtain a plurality of features to be processed, wherein each feature to be processed is a visual modality feature, an audio modality feature, or a text modality feature; fusing the m-th visual modality feature and the n-th audio modality feature among the features to be processed to obtain a first fused feature, where M is the total number of visual modality features and N is the total number of audio modality features; fusing the k-th text modality feature and the n-th audio modality feature among the features to be processed to obtain a second fused feature, where K is the total number of text modality features; processing the first fused feature and the second fused feature by using a cross-modal model having a mutual attention mechanism to obtain a cross-modal fused feature, wherein the cross-modal model is trained by using the model training method of any one of claims 1-15; and processing the cross-modal fused feature by using a behavior pattern analysis model to obtain a depression screening result, wherein the behavior pattern analysis model is trained by using the model training method of any one of claims 1-15.
- 17. An electronic device, comprising: a memory; and a processor coupled to the memory, the processor being configured to perform the method of any one of claims 1-16 based on instructions stored in the memory.
- 18. A computer readable storage medium storing computer instructions which, when executed by a processor, implement the method of any one of claims 1-16.
- 19. A computer program product comprising computer instructions which, when executed by a processor, implement the method of any one of claims 1-16.
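To make the four query directions of claim 2 concrete, the following is a minimal, hypothetical sketch built on standard multi-head attention; the per-direction layers, dimensions, and the concatenation used to combine the four attended streams are assumptions rather than the patented design.

```python
import torch
import torch.nn as nn

class MutualAttention(nn.Module):
    """Hypothetical reading of the four sub-mechanisms in claim 2."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.ModuleDict({
            "audio_queries_vis_text": nn.MultiheadAttention(dim, heads, batch_first=True),
            "vis_text_query_audio":   nn.MultiheadAttention(dim, heads, batch_first=True),
            "visual_queries_text":    nn.MultiheadAttention(dim, heads, batch_first=True),
            "text_queries_visual":    nn.MultiheadAttention(dim, heads, batch_first=True),
        })

    def forward(self, visual: torch.Tensor, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # visual: (B, M, dim), audio: (B, N, dim), text: (B, K, dim)
        vis_text = torch.cat([visual, text], dim=1)               # (B, M+K, dim)
        # First sub-mechanism: audio features query visual and text features.
        a_out, _ = self.attn["audio_queries_vis_text"](audio, vis_text, vis_text)
        # Second sub-mechanism: visual and text features query audio features.
        vt_out, _ = self.attn["vis_text_query_audio"](vis_text, audio, audio)
        # Third sub-mechanism: visual features query text features.
        v_out, _ = self.attn["visual_queries_text"](visual, text, text)
        # Fourth sub-mechanism: text features query visual features.
        t_out, _ = self.attn["text_queries_visual"](text, visual, visual)
        # Combining the four attended streams by concatenation is an assumption.
        return torch.cat([a_out, vt_out, v_out, t_out], dim=1)
```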
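Claims 3-5 describe the pairwise fusion: encode, linearly project the visual encoding, pool the audio encoding, and (per claim 4) take a weighted sum. The sketch below illustrates that flow under stated assumptions: GRU stand-in encoders, mean pooling, time-averaging of the visual encoding so the shapes match, and an assumed weight alpha.

```python
import torch
import torch.nn as nn

class VisualAudioFusion(nn.Module):
    """Illustrative sketch of the first fusion of claims 3-4; not the patented encoders."""

    def __init__(self, vis_dim: int = 512, aud_dim: int = 128, dim: int = 256, alpha: float = 0.6):
        super().__init__()
        # Stand-in encoders; the claims do not fix a particular architecture.
        self.visual_encoder = nn.GRU(vis_dim, dim, batch_first=True)
        self.audio_encoder = nn.GRU(aud_dim, dim, batch_first=True)
        self.projection = nn.Linear(dim, dim)   # linear projection of the visual encoding
        self.alpha = alpha                      # weight of the projected visual feature

    def forward(self, visual_m: torch.Tensor, audio_n: torch.Tensor) -> torch.Tensor:
        # visual_m: (B, T_v, vis_dim) frames of the m-th visual modality feature
        # audio_n:  (B, T_a, aud_dim) frames of the n-th audio modality feature
        vis_enc, _ = self.visual_encoder(visual_m)          # visual encoding feature
        aud_enc, _ = self.audio_encoder(audio_n)            # audio encoding feature
        # Time-averaging before projection is an assumption made so shapes match.
        projected = self.projection(vis_enc.mean(dim=1))    # projection feature
        pooled = aud_enc.mean(dim=1)                        # pooled feature (mean pooling assumed)
        # Claim 4: the first fused feature is a weighted sum of the two.
        return self.alpha * projected + (1.0 - self.alpha) * pooled
```

The text-audio fusion of claim 5 could follow the same pattern with a text encoder in place of the visual one.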
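Claims 6-11 specify a per-frame validity test based on a total clinical information score, with a speech-rhythm fallback. The sketch below is a hedged reading of that logic; the identity scoring of the three behavioral measurements, the unit score for the fundamental-frequency abrupt-change rate, and both thresholds are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FrameMeasurements:
    gaze_deviation: Optional[float]        # gaze-direction deviation of the target subject
    neuromotor_activity: Optional[float]   # activity intensity of the designated muscles
    behavior_synchrony: Optional[float]    # mouth/face synchrony during rhythm disturbance
    f0_change_rate: float                  # fundamental-frequency abrupt-change rate
    rhythm_disturbed: bool                 # speech rhythm disturbance in this frame's audio

def is_valid_frame(m: FrameMeasurements,
                   score_threshold: float = 2.0,
                   f0_rate_threshold: float = 0.3) -> bool:
    # Claim 9: the frame is invalid if any of the three behavioral features is undetectable.
    if None in (m.gaze_deviation, m.neuromotor_activity, m.behavior_synchrony):
        return False
    # Claim 7: the three behavioral scores are positively correlated with their
    # measurements; here they are simply the measurements themselves (an assumption).
    total = m.gaze_deviation + m.neuromotor_activity + m.behavior_synchrony
    # Claim 7: the F0 score depends on how the abrupt-change rate compares with a threshold.
    total += 1.0 if m.f0_change_rate >= f0_rate_threshold else 0.0
    if total >= score_threshold:
        return True        # claim 6: total clinical information score reaches the threshold
    # Claims 10-11: below the threshold, the speech-rhythm check decides.
    return m.rhythm_disturbed
```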
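Claims 12-14 align differently sampled modalities to a common frame count and then correct each mapped feature by adding modality-condition and position embeddings. The following sketch assumes a sinusoidal position encoding evaluated at the ratio-scaled frame indices, an embedding table for the modality condition, and an even feature dimension; none of these choices is fixed by the claims.

```python
import torch
import torch.nn as nn

def position_embedding(scaled_idx: torch.Tensor, dim: int) -> torch.Tensor:
    # Standard sinusoidal encoding evaluated at (possibly non-integer) scaled indices.
    half = torch.arange(dim // 2, dtype=torch.float32)
    freqs = torch.exp(-torch.log(torch.tensor(10000.0)) * half / (dim // 2))
    angles = scaled_idx[:, None] * freqs[None, :]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (T, dim)

def align_and_correct(mapped: list, modality_embed: nn.Embedding) -> list:
    # mapped[j]: (T_j, dim) j-th semantic feature already mapped into the shared space
    frame_counts = [feat.shape[0] for feat in mapped]
    reference = max(frame_counts)                         # claim 13: maximum frame count
    corrected = []
    for j, feat in enumerate(mapped):
        ratio = reference / feat.shape[0]                 # mapping ratio of the j-th feature
        scaled_idx = torch.arange(feat.shape[0], dtype=torch.float32) * ratio
        pos = position_embedding(scaled_idx, feat.shape[1])
        cond = modality_embed(torch.tensor(j))            # modality-condition embedding
        corrected.append(feat + cond + pos)               # claim 14: element-wise sum
    return corrected
```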
Description
Model training method, multi-modal depression screening method and electronic device

Technical Field

The disclosure relates to the technical field of computers, and in particular to a model training method, a multi-modal depression screening method, and an electronic device.

Background

Post-stroke depression (PSD) is a common affective-disorder complication of stroke and can significantly interfere with the rehabilitation of stroke patients. Timely identification of and intervention in post-stroke depression therefore has important clinical significance for improving patients' rehabilitation outcomes and long-term prognosis. Traditional scale-based assessment suffers from strong subjectivity and low efficiency. With the rapid development of artificial intelligence technology, machine learning methods have been applied to depression screening tasks.

Disclosure of Invention

The inventors have noted that, in the related art, multi-modal depression screening studies have focused mainly on the visual and text modalities, assessing an individual's mental state through facial expressions, eye movements, interview text, and the like. However, the acoustic features contained in speech are highly sensitive to depressed mood; current research pays insufficient attention to the audio modality, and there is a need to build a multi-modal collaborative architecture that fuses visual, text, and audio information to improve the effectiveness of depression screening. Accordingly, the model training method of the disclosure trains a cross-modal model by fusing the visual modality features, text modality features, and audio modality features of a sample, and performs depression screening with the cross-modal model, so that the accuracy of depression screening results can be effectively improved.

In a first aspect of the disclosure, a model training method is provided, comprising: detecting each frame in a video sample so that frames that reflect depression-specific features are taken as valid frames and frames that do not reflect the depression-specific features are taken as invalid frames; dividing the video sample into a plurality of segments of a predetermined duration, wherein the proportion of valid frames in each segment is greater than a proportion threshold; performing feature extraction on each segment to obtain a plurality of features to be processed, wherein each feature to be processed is a visual modality feature, an audio modality feature, or a text modality feature; fusing the m-th visual modality feature and the n-th audio modality feature among the features to be processed to obtain a first fused feature, where M is the total number of visual modality features and N is the total number of audio modality features; fusing the k-th text modality feature and the n-th audio modality feature among the features to be processed to obtain a second fused feature, where K is the total number of text modality features; processing the first fused feature and the second fused feature by using a cross-modal model having a mutual attention mechanism to obtain a cross-modal fused feature; processing the cross-modal fused feature by using a behavior pattern analysis model to obtain a depression screening result; determining a loss value according to the depression screening result and labeling information of the video sample; and training the cross-modal model and the behavior pattern analysis model by using the loss value.
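A minimal, hypothetical training step corresponding to the end of the first aspect might look as follows. Cross-entropy, elementwise masking (standing in for the preset masking matrix of claim 15), and a single optimizer over both models are assumptions; the disclosure only specifies that a loss value is determined from the screening result and the labeling information and used to train both models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def training_step(cross_modal_model: nn.Module,
                  behavior_model: nn.Module,
                  optimizer: torch.optim.Optimizer,
                  first_fused: torch.Tensor,
                  second_fused: torch.Tensor,
                  mask_matrix: torch.Tensor,
                  labels: torch.Tensor) -> float:
    fused = cross_modal_model(first_fused, second_fused)  # cross-modal fused feature
    masked = fused * mask_matrix                          # elementwise masking assumed (claim 15)
    logits = behavior_model(masked)                       # depression screening result
    loss = F.cross_entropy(logits, labels)                # loss from result vs. labeling information
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                      # updates both models jointly
    return loss.item()

# The optimizer is assumed to cover the parameters of both models, e.g.
# torch.optim.Adam(list(cross_modal_model.parameters()) + list(behavior_model.parameters()), lr=1e-4)
```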
In some embodiments, the mutual attention mechanism includes a first sub-mechanism that queries the corresponding visual modality features and text modality features with the audio modality features, a second sub-mechanism that queries the corresponding audio modality features with the visual modality features and the text modality features, a third sub-mechanism that queries the corresponding text modality features with the visual modality features, and a fourth sub-mechanism that queries the corresponding visual modality features with the text modality features.

In some embodiments, fusing the m-th visual modality feature and the n-th audio modality feature among the features to be processed to obtain the first fused feature includes: encoding the m-th visual modality feature with a visual encoder to obtain a visual encoding feature; encoding the n-th audio modality feature with an audio encoder to obtain an audio encoding feature; performing linear projection on the visual encoding feature to obtain a projection feature; pooling the audio encoding feature to obtain a pooled feature; and obtaining the first fused feature from the projection feature and the pooled feature.

In some embodiments, obtaining the first fused feature from the projection feature and the pooled feature includes calculating a weighted sum of the projection feature and the pooled feature to obtain the first fused feature.

In some embodiments, fusing the k-th text modality feature and the n-th audio modality feature among the features to be processed to obtain a second fused