CN-121808715-B - Text-first multi-modal emotion analysis method and system
Abstract
The invention provides a text-first multi-modal emotion analysis method and system. Private expert networks are constructed for the text, video and audio modalities to capture the emotion-expression characteristics unique to each modality, while a shared expert network models cross-modal general emotion semantics, realizing collaborative modeling of multi-modal information. The emotion analysis process then starts with the text modality as the dominant one: a preliminary judgment is made from the text information alone, and video and audio information are introduced incrementally, on demand, according to the confidence of that judgment, to supplement the emotion classification prediction. This on-demand decoding strategy, consistent with human cognition, yields the final emotion category. The invention simulates the human cognitive process of integrating multi-source information on demand, reduces inference time and resource consumption, makes the model's decision process more natural and interpretable, and better matches the imbalanced, unequal information carried by different modalities in practical applications.
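The gate that drives this on-demand decoding can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions: the three-class distributions and the threshold value TAU = 0.5 are made-up illustrations, not values from the patent; only the use of entropy as the confidence coefficient comes from the claims.

```python
import math

def entropy(probs):
    """Entropy of a discrete distribution; the claims use this as the
    confidence coefficient (lower entropy = more reliable prediction)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

TAU = 0.5  # hypothetical confidence threshold, for illustration only

# Peaked text-only prediction: entropy ~= 0.39 <= TAU, so the text verdict
# stands and the video/audio branches are never decoded.
print(entropy([0.90, 0.05, 0.05]))  # ~0.394
# Ambiguous prediction: entropy ~= 1.08 > TAU, so video (and, if still
# ambiguous, audio) features are introduced before the final decision.
print(entropy([0.40, 0.35, 0.25]))  # ~1.081
```

In the confident case the model answers from text alone; in the ambiguous case it pays the extra cost of decoding further modalities only when needed.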
Inventors
- CHEN JIAJU
- WU CHANGXING
- ZHU ZHILIANG
- YANG YALIAN
Assignees
- 华东交通大学
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-03-10
Claims (8)
- 1. A text-first multimodal emotion analysis method, the method comprising the following steps: Step 1: given an instance, extract preliminary semantic representations of the text, video and audio of the instance. Step 2: construct private expert networks for the text, video and audio modalities respectively, and through the private expert networks generate private emotion feature vectors of the corresponding modalities from the preliminary semantic representations of the text, video and audio; generate general emotion feature vectors of the corresponding modalities from the preliminary semantic representations using a shared expert network. Step 3: fuse the private emotion feature vector and the text-modality part of the general emotion feature vector to generate the final emotion feature vector of the text alone. Step 4: start the emotion analysis process with the text modality as the dominant one, perform emotion classification prediction using the final emotion feature vector of the text alone to obtain an emotion category, and compute the confidence coefficient of that category. Step 5: according to the confidence coefficient of the emotion category, incrementally introduce the private and general emotion feature vectors of video and audio on demand, perform emotion classification prediction, and realize an on-demand decoding strategy consistent with human cognition, thereby obtaining the final emotion category (a code sketch of this strategy follows the claims). Step 5 specifically comprises the following:
The emotion category predicted from the text is recorded as the first emotion category. Since the confidence coefficient is the entropy of the predicted distribution, lower values indicate higher reliability. If the confidence coefficient $c_1$ of the first emotion category is smaller than or equal to the confidence threshold $\tau$, the first emotion category is judged to be highly reliable, no information from other modalities needs to be introduced, and the first emotion category is taken as the final emotion category. If $c_1 > \tau$, the reliability of the first emotion category is judged to be low and more emotion information must be extracted from the input video for a more accurate prediction: the general emotion feature vector and the private emotion feature vector of the video are processed by a second LSTM unit to obtain the final emotion feature vector fusing the text and video information, and emotion classification prediction is then performed to obtain a second emotion category and its confidence coefficient, with the following relations:
$$h_{tv} = \mathrm{LSTM}_2\big(h_t, [s_v; p_v]\big), \qquad \hat{y}_2 = C_{tv}(h_{tv}), \qquad c_2 = E(\hat{y}_2)$$
wherein $h_t$ is the final emotion feature vector of the text alone; $E(\cdot)$ is the function computing the entropy of a probability distribution; $\mathrm{LSTM}_2$ is the second LSTM unit, which fuses the text and video information; $h_{tv}$ is the final emotion feature vector fusing the text and video information; $\tilde{x}_v$ is the effective video information filtered by the input gate of $\mathrm{LSTM}_2$; $C_{tv}$ is the classification network utilizing text and video information; $s_v$ and $p_v$ are the general and private emotion feature vectors of the video; $\hat{y}_2$ is the second emotion category; and $c_2$ is the confidence coefficient of the second emotion category.
The confidence coefficient of the second emotion category is again compared with the confidence threshold. If $c_2 \le \tau$, the second emotion category is judged to be highly reliable and is taken as the final emotion category without introducing information from other modalities. If $c_2 > \tau$, the reliability of the second emotion category is judged to be low and more emotion information must be extracted from the input audio for a more accurate prediction: the general emotion feature vector and the private emotion feature vector of the audio are processed by a third LSTM unit to obtain the final emotion feature vector fusing the three modalities, emotion classification prediction is then performed to obtain a third emotion category, and the third emotion category is taken as the final emotion category, with the following relations:
$$h_{tva} = \mathrm{LSTM}_3\big(h_{tv}, [s_a; p_a]\big), \qquad \hat{y}_3 = C_{tva}(h_{tva})$$
wherein $h_{tva}$ is the final emotion feature vector fusing the three modalities; $\mathrm{LSTM}_3$ is the third LSTM unit, which fuses the three modalities; $\tilde{x}_a$ is the effective audio information filtered by the input gate of $\mathrm{LSTM}_3$; $s_a$ is the general emotion feature vector of the audio; and $p_a$ is the private emotion feature vector of the audio.
The training process of the private expert networks and the shared expert network specifically comprises the following steps:
Three classification networks for auxiliary training are constructed; they take the general and private emotion features of each modality as input and compute emotion-category predictions, with the following relations:
$$\hat{y}_t = C'_t\big([s_t; p_t]\big), \qquad \hat{y}_v = C'_v\big([s_v; p_v]\big), \qquad \hat{y}_a = C'_a\big([s_a; p_a]\big)$$
wherein $C'_t$, $C'_v$ and $C'_a$ are the auxiliary-training classification networks for text, video and audio respectively; $\hat{y}_t$, $\hat{y}_v$ and $\hat{y}_a$ are the emotion-category predictions from text only, video only and audio only respectively; $p_t$ is the private emotion feature vector of the text; $s_t$ is the general emotion feature vector of the text; $p_v$ is the private emotion feature vector of the video; and $s_v$ is the general emotion feature vector of the video.
A first multi-task learning cost is defined from the emotion-category predictions and the true categories, with the following relation:
$$\mathcal{L}_{mt1} = \sum_{(X,\, y) \in D} \big[\, \mathrm{CE}(y, \hat{y}_t) + \mathrm{CE}(y, \hat{y}_v) + \mathrm{CE}(y, \hat{y}_a) \,\big]$$
wherein $D$ is the human-annotated training set; $X$ is an input instance; $\mathrm{CE}$ is the function computing the cross entropy of two discrete probability distributions; and $\mathcal{L}_{mt1}$ is the first multi-task learning cost.
An orthogonal regularization term is constructed from the general and private emotion features of each modality, with the following relation:
$$\mathcal{L}_{orth} = \sum_{m \in \{t,\, v,\, a\}} \big( s_m^{\top} p_m \big)^2$$
wherein $\top$ denotes vector transposition; $s_m^{\top} p_m$ is the inner product of the two vectors; and $\mathcal{L}_{orth}$ is the orthogonal regularization term.
From the first multi-task learning cost $\mathcal{L}_{mt1}$ and the orthogonal regularization term $\mathcal{L}_{orth}$, the first total cost of the shared-private expert joint training phase is defined, with the following relation:
$$\mathcal{L}_1 = \mathcal{L}_{mt1} + \alpha\, \mathcal{L}_{orth}$$
wherein $\alpha$ is the weight of the orthogonal regularization term and $\mathcal{L}_1$ is the first total cost. By minimizing the first total cost $\mathcal{L}_1$ until convergence, the parameters of the private expert networks and the shared expert network are optimized, realizing the joint training of the shared and private experts.
- 2. The method according to claim 1, wherein in step 1 the preliminary semantic representations of the text, video and audio of the instance are extracted with the following relations:
$$H_t = F_t(X_t), \qquad H_v = F_v(X_v), \qquad H_a = F_a(X_a)$$
wherein $F_t$, $F_v$ and $F_a$ are the text encoder, video encoder and audio encoder respectively; $H_t$, $H_v$ and $H_a$ are the semantic matrix representations of the input text, video and audio respectively; and $X_t$, $X_v$ and $X_a$ are the text, video and audio respectively, $X$ representing a given instance.
- 3. The text-first multimodal emotion analysis method according to claim 2, wherein in step 2 the private emotion feature vectors of the corresponding modalities are generated from the preliminary semantic representations of the text, video and audio through the private expert networks, with the following relations:
$$p_t = P_t(H_t), \qquad p_v = P_v(H_v), \qquad p_a = P_a(H_a)$$
wherein $P_t$, $P_v$ and $P_a$ are the private expert networks for text, video and audio respectively, each composed of a stack of multiple Transformer layers.
- 4. The text-first multimodal emotion analysis method according to claim 3, wherein in step 2 the general emotion feature vectors of the corresponding modalities are generated from the preliminary semantic representations of the text, video and audio using the shared expert network, with the following relations:
$$s_t = S(H_t), \qquad s_v = S(H_v), \qquad s_a = S(H_a)$$
wherein $S$ is the shared expert network, composed of a stack of multiple Transformer layers.
- 5. The text-first multimodal emotion analysis method according to claim 4, wherein in step 3 the private emotion feature vector and the text-modality part of the general emotion feature vector are fused to generate the final emotion feature vector of the text alone, with the following relation:
$$h_t = \mathrm{LSTM}_1\big(h_0, [s_t; p_t]\big)$$
wherein $\mathrm{LSTM}_1$ is the first LSTM unit, which fuses the text information; $\tilde{x}_t$ is the effective text information filtered by the input gate of $\mathrm{LSTM}_1$; $h_0$ is a learnable initial state vector; and $[\cdot\,;\cdot]$ is the vector concatenation operation.
- 6. The text-first multimodal emotion analysis method according to claim 5, wherein in step 4 the emotion analysis process starts with the text modality as the dominant one, emotion classification prediction is performed using the final emotion feature vector of the text alone to obtain the emotion category, and the confidence coefficient of the emotion category is computed, with the following relations:
$$\hat{y}_1 = C_t(h_t), \qquad c_1 = E(\hat{y}_1)$$
wherein $C_t$ is the classification network using text information only; $\hat{y}_1$ is the predicted emotion-category distribution; and $c_1$ is the confidence coefficient of the prediction.
- 7. The method according to claim 6, wherein steps 1 to 5 are performed by a text-first multimodal emotion analysis model, and the training method of the text-first multimodal emotion analysis model comprises the following steps:
A training set is given, comprising a plurality of instances and their corresponding true categories. With the training set as input, steps 1 to 5 are executed with the confidence computation and comparison disabled, forcing full-path forward propagation to obtain the first, second and third emotion categories.
A second multi-task learning cost is defined from the given true categories and the first, second and third emotion categories, with the following relation:
$$\mathcal{L}_{mt2} = \sum_{(X,\, y) \in D} \big[\, \mathrm{CE}(y, \hat{y}_1) + \mathrm{CE}(y, \hat{y}_2) + \mathrm{CE}(y, \hat{y}_3) \,\big]$$
wherein $D$ is the human-annotated training set; $X$ is an input instance; $\mathrm{CE}$ is the function computing the cross entropy of two discrete probability distributions; and $\mathcal{L}_{mt2}$ is the second multi-task learning cost.
Taking the third emotion category as the teacher and the first and second emotion categories as students, a knowledge distillation cost is defined, with the following relation:
$$\mathcal{L}_{kd} = \mathrm{KL}\big(\hat{y}_3 \,\big\|\, \hat{y}_1\big) + \mathrm{KL}\big(\hat{y}_3 \,\big\|\, \hat{y}_2\big)$$
wherein $\mathcal{L}_{kd}$ is the knowledge distillation cost and $\mathrm{KL}$ is the function computing the KL divergence of two probability distributions.
A mutual-information constraint term is defined using the information output by each LSTM unit, with the following relation:
$$\mathcal{L}_{mi} = -\big[\, \cos\!\big(h_{tv},\, G(h_t)\big) + \cos\!\big(h_{tva},\, G(h_{tv})\big) \,\big]$$
wherein $\cos(\cdot,\cdot)$ is the function computing the cosine similarity of two vectors and $G$ is a multilayer feed-forward neural network.
From the second multi-task learning cost $\mathcal{L}_{mt2}$, the knowledge distillation cost $\mathcal{L}_{kd}$ and the mutual-information constraint term $\mathcal{L}_{mi}$, the second total cost of the global training phase $\mathcal{L}_2$ is defined, with the following relation:
$$\mathcal{L}_2 = \mathcal{L}_{mt2} + \beta\, \mathcal{L}_{kd} + \gamma\, \mathcal{L}_{mi}$$
wherein $\beta$ and $\gamma$ are the weights of the knowledge distillation cost and the mutual-information constraint term respectively, and $\mathcal{L}_2$ is the second total cost. By minimizing the second total cost $\mathcal{L}_2$ until convergence, the parameters of the text-first multimodal emotion analysis model are optimized, realizing the training of the text-first multimodal emotion analysis model.
- 8. A text-first multimodal emotion analysis system, wherein the system applies the text-first multimodal emotion analysis method according to any of claims 1 to 7, the system comprising: a shared-private expert based encoding module for: given an instance, extracting preliminary semantic representations of its text, video and audio; constructing private expert networks for the text, video and audio modalities respectively and generating, through the private expert networks, the private emotion feature vectors of the corresponding modalities from the preliminary semantic representations; and generating the general emotion feature vectors of the corresponding modalities from the preliminary semantic representations using a shared expert network; and a text-first incremental decoding module for: fusing the private emotion feature vector and the text-modality part of the general emotion feature vector to generate the final emotion feature vector of the text alone; starting the emotion analysis process with the text modality as the dominant one, performing emotion classification prediction using the final emotion feature vector of the text alone to obtain an emotion category, and computing the confidence coefficient of the emotion category; and, according to that confidence coefficient, incrementally introducing the private and general emotion feature vectors of video and audio on demand and performing emotion classification prediction, realizing an on-demand decoding strategy consistent with human cognition and thereby obtaining the final emotion category.
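To make the decoding strategy of claims 1 to 8 concrete, the following is a minimal PyTorch-style sketch, not the patented implementation: the class names, dimensions, number of Transformer layers, the use of `nn.LSTMCell` as the "LSTM unit", the mean pooling, and the batch-level gate are all illustrative assumptions; only the overall structure (private and shared experts, LSTM fusion of [shared; private] features, entropy-gated text-video-audio decoding) follows the claims.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def entropy(probs: torch.Tensor) -> torch.Tensor:
    """Confidence coefficient used by the decoder: entropy of a predicted
    distribution; lower entropy means a more reliable prediction."""
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)


class Expert(nn.Module):
    """Private or shared expert: a small stack of Transformer encoder layers
    followed by mean pooling into a single feature vector."""

    def __init__(self, dim: int, layers: int = 2) -> None:
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: (B, seq, dim)
        return self.encoder(h).mean(dim=1)               # (B, dim)


class TextFirstDecoder(nn.Module):
    """On-demand decoder: text first, then video, then audio, each stage
    gated by the entropy of the previous stage's prediction."""

    def __init__(self, dim: int = 256, num_classes: int = 3, tau: float = 0.5) -> None:
        super().__init__()
        self.tau = tau                                    # confidence threshold
        self.private = nn.ModuleDict({m: Expert(dim) for m in ("t", "v", "a")})
        self.shared = Expert(dim)                         # one expert shared by all modalities
        # The "LSTM units" of the claims, realised here as LSTMCells whose
        # input is the concatenation [shared; private] of a modality's features.
        self.lstm = nn.ModuleList([nn.LSTMCell(2 * dim, dim) for _ in range(3)])
        self.cls = nn.ModuleList([nn.Linear(dim, num_classes) for _ in range(3)])

    def forward(self, H_t, H_v, H_a):
        """H_t, H_v, H_a: preliminary semantic representations (B, seq, dim)
        produced by pretrained text/video/audio encoders (not shown)."""
        B, dim = H_t.size(0), H_t.size(-1)
        # The claims use a learnable initial state h0; zeros keep the sketch short.
        h = H_t.new_zeros(B, dim)
        c = H_t.new_zeros(B, dim)
        for i, (m, H) in enumerate((("t", H_t), ("v", H_v), ("a", H_a))):
            s, p = self.shared(H), self.private[m](H)     # general / private features
            h, c = self.lstm[i](torch.cat([s, p], dim=-1), (h, c))
            y = F.softmax(self.cls[i](h), dim=-1)
            # Stop early when every instance in the batch is confident enough
            # (a per-instance gate would be used in practice), or return the
            # third prediction once all three modalities have been consumed.
            if entropy(y).max() <= self.tau or i == 2:
                return y
```

At inference many instances terminate after the text stage, which is where the claimed reduction in inference time and resource consumption comes from; the orthogonality, distillation and mutual-information training costs of claims 1 and 7 are omitted here for brevity.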
Description
Text-first multi-modal emotion analysis method and system
Technical Field
The invention relates to the field of natural language processing, in particular to a text-first multi-modal emotion analysis method and system.
Background
Multi-modal emotion analysis aims to identify emotional tendencies from multi-modal data (e.g., text, video and audio) and has been one of the hot research directions in affective computing in recent years. Compared with traditional emotion analysis, which relies only on a single text signal, multi-modal emotion analysis can obtain richer and more multifaceted emotion cues from multi-source signals such as facial expressions, gestures and vocal intonation, helping to improve the accuracy and robustness of emotion recognition. In scenarios such as short-video comments, live-stream selling and online classroom interaction, users often express emotional tendencies through language, while real emotion can also be conveyed through non-verbal signals such as intonation, expression and even pauses. For example, in a user review video, the text "this product is just okay" appears neutral, but combined with a contemptuous tone and a disappointed facial expression, its emotional tendency can be inferred to be negative. Multi-modal emotion analysis can synthesize the emotion information of different modalities, improving the accuracy of emotion recognition and providing more comprehensive support for upper-layer applications such as recommendation systems, human-computer interaction and public opinion monitoring.
With the rapid development of video-based social and online interaction scenarios, multi-modal emotion analysis has become a research hotspot in academia and industry. However, because different modalities differ significantly in how they express and carry emotional information, one of the core problems multi-modal emotion analysis must solve is how to effectively fuse the emotion information from text, video and audio so as to exploit the complementary advantages of the modalities. Existing methods generally use mechanisms such as cross-modal attention, bidirectional interaction and dynamic fusion to model fine-grained semantic relations among modalities and fully utilize complementary information.
On the other hand, multi-modal data in real scenes often exhibits significant imbalance: different modalities tend to be unequal in information quantity, quality and availability. Text often contains rich emotional cues and plays a dominant role in prediction, while the emotional information of modalities such as video and audio is often relatively limited and carries less weight in decisions. Early multi-modal emotion analysis models are therefore prone either to strong-modality over-dominance (e.g., the text modality suppressing the others) or to neglecting weak-modality contributions, resulting in unsatisfactory fusion. To alleviate modal imbalance, existing methods mainly rely on adaptive modal-weight adjustment, weak-modality enhancement training or distillation-style correction to maintain stable emotion recognition when modalities contribute unevenly.
In recent years, the application of large models in the field of emotion analysis has received a great deal of attention.
Relying on massive pre-training data and strong semantic representation capability, large models can capture fine-grained semantic and emotional cues in text, offering greater depth of understanding and generalization than traditional lightweight models. In single-modality text emotion analysis, large models not only achieve more accurate emotion classification but can also handle complex implicit emotional expressions and aspect-level emotional tendencies. In multi-modal emotion analysis, combining a large text model with visual and audio encoders enables deep fusion of cross-modal information and enhances the accuracy and robustness of emotion recognition. In addition, large models can provide interpretability support, so that a model not only predicts emotional tendencies but can also point out the sources and triggers of an emotion, providing more credible emotion analysis capability for upper-layer applications such as public opinion monitoring, human-computer interaction and customer-service question answering.
Existing multi-modal emotion analysis methods generally assume that the video, audio and text modalities are equally important, and therefore always force the fusion of all modalities during inference. However, in a considerable number of scenes, emotional tendency can be predicted accurately from the text alone, and at this point the additional introduction of video and audio