
CN-121997143-A - Senior citizen emotion recognition method based on hierarchical perception enhancement and multi-granularity constraint

CN 121997143 A

Abstract

The invention discloses a senior citizen emotion recognition method based on hierarchical perception enhancement and multi-granularity constraint, belonging to the field of emotion recognition. The method comprises: constructing an audio-video multi-modal spontaneous emotion recognition database for the elderly; preprocessing the audio data and image data in the database; extracting audio and visual multi-scale features with a dual-stream audio-visual encoder; feeding the multi-scale features into a hierarchical perception enhancement module to obtain enhanced single-modal feature representations; in the training stage, passing the enhanced single-modal representations through a bidirectional cycle consistency coordination module to learn deep semantic associations, and completing iterative optimization with a globally covering cross-modal multi-granularity constraint strategy; and in the inference stage, cascading the enhanced single-modal representations and outputting the emotion class through a fully connected layer. Compared with the prior art, the invention significantly improves the accuracy and generalization capability of emotion recognition for the elderly population.
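As a rough orientation, the inference path described in the abstract (dual-stream encoding plus hierarchical enhancement per modality, cascade of the two enhanced vectors, fully connected classifier) could be wired as in the minimal PyTorch sketch below. The branch modules, the feature dimension of 256, and the seven emotion classes are assumptions made only for illustration; the patent does not publish an implementation.

```python
import torch
import torch.nn as nn

class InferencePipeline(nn.Module):
    """Hypothetical wiring of the inference path in the abstract: each branch
    (dual-stream encoder + hierarchical perception enhancement) yields one
    enhanced single-modal vector; the two are cascaded (concatenated) and fed
    to a fully connected classifier."""

    def __init__(self, audio_branch: nn.Module, visual_branch: nn.Module,
                 feat_dim: int = 256, num_classes: int = 7):
        super().__init__()
        self.audio_branch = audio_branch    # Mel-spectrogram -> (B, feat_dim)
        self.visual_branch = visual_branch  # face-frame sequence -> (B, feat_dim)
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, mel_spec: torch.Tensor, face_frames: torch.Tensor):
        audio_feat = self.audio_branch(mel_spec)
        visual_feat = self.visual_branch(face_frames)
        fused = torch.cat([audio_feat, visual_feat], dim=-1)  # cascade step
        return self.classifier(fused)                          # emotion logits
```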

Inventors

  • SHEN LILI
  • WANG CHI
  • WANG JINQI
  • WEN JIAHAO

Assignees

  • Tianjin University (天津大学)

Dates

Publication Date
2026-05-08
Application Date
2026-02-03

Claims (7)

  1. A senior citizen emotion recognition method based on hierarchical perception enhancement and multi-granularity constraint, characterized by comprising the following steps: S1, constructing an audio-video multi-modal spontaneous emotion recognition database of the elderly, and preprocessing the multi-modal data in the database; S2, inputting the preprocessed multi-modal data in pairs into a dual-stream audio-visual encoder, extracting audio and visual multi-scale features through an improved feature pyramid network, inputting the obtained features into a hierarchical perception enhancement module, and obtaining enhanced single-modal feature representations through direction-specific hierarchical feature aggregation; S3, in the training stage, passing the enhanced single-modal feature representations through a bidirectional cycle consistency coordination module and realizing deep semantic association learning by means of a bidirectional cycle consistency constraint; S4, further in the training stage, using a globally covering cross-modal multi-granularity constraint strategy and passing the audio and visual single-modal prediction results respectively through an OWM_GE modulation module to iteratively optimize the training process; S5, in the inference stage, cascading the enhanced single-modal feature representations and outputting the emotion class through a fully connected layer.
  2. The senior citizen emotion recognition method based on hierarchical perception enhancement and multi-granularity constraint of claim 1, wherein the preprocessing of the multi-modal data in S1 proceeds as follows (an illustrative preprocessing sketch follows the claims): each audio recording is segmented on the basis of speaking-gap detection, with a gap of 30 frames or more treated as a blank pause, so that the audio segment carrying one sentence is accurately cut out of continuously repeated sentences and converted into a Mel spectrogram, which becomes the minimum unit of the audio modality in the data set; according to the audio segmentation, images of the corresponding video passage are extracted at a rate of 5 frames per second, the face region is located by a face detection technique, the images are cropped and their resolution unified, and the resulting face-only image sequence becomes the minimum unit of the visual modality in the data set.
  3. The senior citizen emotion recognition method based on hierarchical perception enhancement and multi-granularity constraint of claim 1, wherein the hierarchical perception enhancement module in S2 works as follows (see the attention sketch after the claims): the image modality adopts a top-down semantic guidance mechanism, with deeper features and shallower features fed together into a cross-scale multi-head self-attention module; the audio modality adopts a bottom-up high/low-frequency feature aggregation strategy, with shallower features and deeper features fed together into the cross-scale multi-head self-attention module; through the cooperation of multiple stages of cross-scale multi-head self-attention modules, inter-layer feature fusion is performed with residual connections, and the enhanced feature representation after layer-by-layer single-modal hierarchical perception enhancement is output.
  4. The senior citizen emotion recognition method based on hierarchical perception enhancement and multi-granularity constraint of claim 1, wherein the bidirectional cycle consistency coordination module in S3 works as follows (see the cycle-consistency sketch after the claims): the enhanced audio-modal feature representation is input into a visual-feature generation branch, where a generator and a discriminator produce pseudo visual features through adversarial training; the enhanced visual-modal feature representation is input into an audio-feature generation branch, where pseudo audio features are generated through adversarial training; the pseudo visual features and pseudo audio features are then learned against the respective enhanced single-modal feature representations through a bidirectional cycle consistency loss, establishing a bidirectional feature mapping between the modalities.
  5. The senior citizen emotion recognition method based on hierarchical perception enhancement and multi-granularity constraint according to claim 1, wherein the globally covering cross-modal multi-granularity constraint strategy in S4 specifically comprises the following (see the loss sketch after the claims): at the pixel level, imposing attention consistency constraints between the cross-modal representations output by each stage of the cross-scale multi-head self-attention module; at the feature level, applying an enhanced contrastive loss constraint before feature fusion; in the deep space, applying the bidirectional cycle consistency constraint; at the decision level, using the standard cross entropy loss; and weighting the constraints of the different levels to obtain the overall loss function of the training phase.
  6. A computer device comprising a processor and a memory, the memory storing at least one instruction, at least one program, code set or instruction set that is loaded and executed by the processor to implement the senior citizen emotion recognition method based on hierarchical perception enhancement and multi-granularity constraint of any one of claims 1-5.
  7. A computer readable storage medium storing at least one instruction, at least one program, code set or instruction set that is loaded and executed by a processor to implement the senior citizen emotion recognition method based on hierarchical perception enhancement and multi-granularity constraint of any one of claims 1-5.
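The preprocessing of claim 2 (gap-based audio segmentation, Mel spectrogram conversion, 5 fps frame extraction, face cropping) can be illustrated with the hypothetical Python sketch below using librosa and OpenCV. The decibel silence threshold, the gap length in seconds, the 112-pixel output resolution, and the Haar-cascade face detector are all assumptions; the patent itself only specifies a 30-frame gap criterion and a 5 fps sampling rate.

```python
import cv2
import librosa
import numpy as np

def segment_audio(wav_path, sr=16000, min_gap_s=1.0, top_db=30):
    """Split a recording at speaking gaps and return one Mel spectrogram per
    sentence-level segment (the minimum unit of the audio modality).
    min_gap_s is an assumed stand-in for the patent's 30-frame gap threshold."""
    y, sr = librosa.load(wav_path, sr=sr)
    # Non-silent intervals; intervals separated by less than min_gap_s are
    # merged so that only genuine speaking gaps cause a split.
    intervals = librosa.effects.split(y, top_db=top_db)
    merged = [list(intervals[0])]
    for start, end in intervals[1:]:
        if (start - merged[-1][1]) / sr < min_gap_s:
            merged[-1][1] = end
        else:
            merged.append([start, end])
    mels = []
    for start, end in merged:
        mel = librosa.feature.melspectrogram(y=y[start:end], sr=sr, n_mels=64)
        mels.append(librosa.power_to_db(mel, ref=np.max))
    return merged, mels

def extract_face_frames(video_path, segment_s, fps_out=5, size=112):
    """Sample the video passage of one audio segment at 5 frames per second,
    detect the face region, and return a face-only image sequence
    (the minimum unit of the visual modality)."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS)
    step = max(int(round(native_fps / fps_out)), 1)
    start_f, end_f = (int(t * native_fps) for t in segment_s)
    faces = []
    for idx in range(start_f, end_f, step):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(boxes) > 0:
            x, y, w, h = boxes[0]
            faces.append(cv2.resize(frame[y:y + h, x:x + w], (size, size)))
    cap.release()
    return faces
```

A segment returned by segment_audio is given in samples; dividing its endpoints by sr yields the (start, end) seconds expected by extract_face_frames.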
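Claim 3's cross-scale multi-head self-attention with residual inter-layer fusion could look roughly like the following PyTorch sketch. Only the direction of aggregation (top-down for the image modality, bottom-up for the audio modality) follows the claim; the token concatenation scheme, feature dimension, head count, and pooling are assumptions.

```python
import torch
import torch.nn as nn

class CrossScaleAttentionBlock(nn.Module):
    """One stage of cross-scale multi-head self-attention: tokens from the
    guiding scale and the current scale attend jointly, and the current scale
    is updated through a residual connection."""
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, current_tokens, guiding_tokens):
        # Both inputs are (B, N, dim) token sequences from two pyramid levels.
        joint = torch.cat([guiding_tokens, current_tokens], dim=1)
        attended, _ = self.attn(joint, joint, joint)
        # Keep only the positions of the current scale, then add the residual.
        updated = attended[:, guiding_tokens.size(1):, :]
        return self.norm(current_tokens + updated)

class HierarchicalPerceptionEnhancement(nn.Module):
    """Aggregates a feature pyramid level by level. top_down=True mimics the
    image-modality semantic guidance (deep -> shallow); top_down=False mimics
    the audio-modality high/low-frequency aggregation (shallow -> deep)."""
    def __init__(self, num_levels=3, dim=256, top_down=True):
        super().__init__()
        self.top_down = top_down
        self.blocks = nn.ModuleList(
            [CrossScaleAttentionBlock(dim) for _ in range(num_levels - 1)])

    def forward(self, pyramid):  # list of (B, N_i, dim), ordered shallow -> deep
        levels = list(reversed(pyramid)) if self.top_down else list(pyramid)
        carried = levels[0]
        for block, nxt in zip(self.blocks, levels[1:]):
            carried = block(nxt, carried)   # carried scale guides the next one
        return carried.mean(dim=1)          # pooled enhanced representation
```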
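The bidirectional cycle consistency coordination of claim 4 pairs a generator per direction (audio to pseudo-visual, visual to pseudo-audio) with a discriminator and closes the loop with a cycle term. The sketch below is a hedged reading of that claim; the MLP generators, the non-saturating GAN loss, the L1 cycle term, and the weight of 10 are assumptions rather than the patent's actual formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(in_dim, out_dim, hidden=512):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

def adv_loss(logits, real: bool):
    """Binary cross entropy on discriminator logits against a real/fake target."""
    target = torch.ones_like(logits) if real else torch.zeros_like(logits)
    return F.binary_cross_entropy_with_logits(logits, target)

class BidirectionalCycleCoordination(nn.Module):
    """Audio->pseudo-visual and visual->pseudo-audio generators trained
    adversarially, plus a bidirectional cycle consistency term that maps each
    modality back to itself."""
    def __init__(self, dim=256):
        super().__init__()
        self.gen_a2v = mlp(dim, dim)   # audio feature -> pseudo visual feature
        self.gen_v2a = mlp(dim, dim)   # visual feature -> pseudo audio feature
        self.dis_v = mlp(dim, 1)       # real/fake visual-feature discriminator
        self.dis_a = mlp(dim, 1)       # real/fake audio-feature discriminator

    def generator_loss(self, audio_feat, visual_feat, lam_cycle=10.0):
        pseudo_v = self.gen_a2v(audio_feat)
        pseudo_a = self.gen_v2a(visual_feat)
        # Generators try to make the discriminators label pseudo features "real".
        adv = adv_loss(self.dis_v(pseudo_v), True) + adv_loss(self.dis_a(pseudo_a), True)
        # Bidirectional cycle consistency: audio -> visual -> audio, and back.
        cycle = (F.l1_loss(self.gen_v2a(pseudo_v), audio_feat)
                 + F.l1_loss(self.gen_a2v(pseudo_a), visual_feat))
        return adv + lam_cycle * cycle

    def discriminator_loss(self, audio_feat, visual_feat):
        pseudo_v = self.gen_a2v(audio_feat).detach()
        pseudo_a = self.gen_v2a(visual_feat).detach()
        return (adv_loss(self.dis_v(visual_feat), True) + adv_loss(self.dis_v(pseudo_v), False)
                + adv_loss(self.dis_a(audio_feat), True) + adv_loss(self.dis_a(pseudo_a), False))
```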
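Claim 5 weights four constraints (attention consistency at the pixel level, contrastive alignment at the feature level, cycle consistency in the deep space, cross entropy at the decision level) into one training objective. A minimal sketch follows; the MSE attention term, the InfoNCE-style contrastive term, and the weights are placeholders, since the patent does not disclose the exact loss forms or coefficients.

```python
import torch
import torch.nn.functional as F

def attention_consistency(attn_audio, attn_visual):
    """Pixel-level term: cross-modal attention maps from each cross-scale
    stage are pushed toward each other (MSE is an assumed choice)."""
    return sum(F.mse_loss(a, v) for a, v in zip(attn_audio, attn_visual))

def contrastive_alignment(audio_feat, visual_feat, temperature=0.07):
    """Feature-level term before fusion: paired audio/visual features of the
    same clip should be mutually nearest (InfoNCE-style, assumed)."""
    a = F.normalize(audio_feat, dim=-1)
    v = F.normalize(visual_feat, dim=-1)
    logits = a @ v.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def total_loss(attn_a, attn_v, feat_a, feat_v, cycle_loss, logits, labels,
               w=(0.1, 0.5, 1.0, 1.0)):
    """Weighted sum over the four granularities; the weights are placeholders."""
    return (w[0] * attention_consistency(attn_a, attn_v)
            + w[1] * contrastive_alignment(feat_a, feat_v)
            + w[2] * cycle_loss                        # deep-space constraint
            + w[3] * F.cross_entropy(logits, labels))  # decision-level constraint
```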

Description

Senior citizen emotion recognition method based on hierarchical perception enhancement and multi-granularity constraint

Technical Field

The invention relates to the field of emotion recognition, and in particular to a senior citizen emotion recognition method based on hierarchical perception enhancement and multi-granularity constraint.

Background

Emotion is a core element of human mental activity and plays a key role in cognitive decision-making and social interaction. Affective health is not only an important criterion of mental health but also a basic guarantee for maintaining social function. As the concept of active aging takes hold, the affective health of the elderly has received wide attention, and the emotion recognition mechanisms of the elderly present multidimensional research value owing to the characteristics of this particular life stage. Constructing an emotion recognition method suited to the characteristics of the elderly population is therefore a core step toward healthy aging and an important means of relieving the burden of family and social care.

Existing emotion recognition methods can generally be divided into two types: methods based on non-physiological indicators (such as facial expressions and voice intonation) and methods based on physiological signals (such as electroencephalogram and electrocardiogram). Compared with the technical bottlenecks of physiological signal acquisition, such as highly invasive equipment and strict environmental constraints, non-physiological signals have become the most promising research direction in affective computing thanks to non-contact acquisition and notable cost benefits. Psychological research has further demonstrated that the audio and visual modalities dominate the contributions to emotional perception.

In the development of audio-visual multi-modal emotion recognition, early methods mostly relied on hand-crafted features, which greatly limited recognition performance. With the breakthrough of convolutional neural networks, researchers achieved automated learning of cross-modal features. Existing studies allow the visual and audio modalities to capture their modality-specific spatio-temporal dynamics and acoustic-prosodic characteristics in independent encoding and decoding, enhancing modality-specific representation. However, existing methods do not fully simulate the hierarchical process of human audiovisual cognition, limiting the model's ability to understand semantics in more complex scenarios.

On the basis of modality-specific representation, cross-modal semantic learning has become the core challenge restricting further improvement of emotion recognition performance. Because of the inherent differences between audiovisual signals, their latent representations tend to be distributed across heterogeneous feature spaces, so a key issue is how to build an efficient cross-modal interaction mechanism. Early studies attempted to build shallow associations between modalities but failed to mimic the dynamic collaboration mechanisms between audio-visual modalities in the human nervous system. Some studies have attempted to mine latent audiovisual similarity relationships by imposing multiple constraints.
However, most approaches are limited to the explicit feature space and fail to fully explore implicit higher-order semantic associations. Existing constraint-based audiovisual similarity representation learning therefore does not fully cover the multi-granularity requirement. Specifically, the prior art has at least the following disadvantages: audio-based emotion recognition methods focus on the direct mapping from acoustic features to semantic labels but neglect the hierarchical perception mechanism of human hearing, which may limit the model in complex emotional scenes; vision-based emotion recognition methods largely remain at a shallow perceptual representation level and lack a hierarchical cognitive guidance mechanism, which may cause robustness defects in complex emotional scenes; and existing methods still fall short in cross-modal fine-grained semantic alignment and multi-granularity learning objectives, in particular facing the key challenge of balancing the integrity of single-modal information against deep cross-modal semantic alignment. To solve these problems, the invention provides a method that strengthens the hierarchical single-modal representation of audio and vision and satisfies the multi-granularity cross-modal alignment requirement, so as to improve the accuracy and generalization capability of emotion recognition for the elderly.

Disclosure of Invention

The