
CN-121997242-A - Multi-modal emotion analysis model based on multi-head attention perception fusion


Abstract

The invention discloses a multi-modal emotion analysis model based on multi-head attention perception fusion. First, features are extracted with a multi-modal multi-head attention mechanism and projected into a vector subspace, reducing the feature dimension while retaining the most important emotion feature information in preparation for subsequent analysis. Second, the labels of the emotion data set are used to compute the multi-modal features, and fusion of the multi-modal features is achieved in the projected vector space, effectively integrating information from the text, audio, and video perception channels. Finally, experiments on the CMU-MOSI and CMU-MOSEI multi-modal data sets verify the effectiveness and performance of the multi-head attention mechanism in the multi-modal emotion analysis task. Experimental results show that the multi-modal emotion analysis method that introduces the multi-head attention mechanism achieves remarkable improvement in recognition accuracy, emotion score, precision, and recall.

Inventors

  • Li Caimao
  • Chen Shaofan
  • Chen Baixiong
  • Zhang Haoyang

Assignees

  • Hainan University (海南大学)

Dates

Publication Date
2026-05-08
Application Date
2024-11-08

Claims (13)

  1. A multi-modal data set is collected and split into training, validation, and test partitions; the data set contains data in three different modalities, namely text, acoustic, and visual, on which feature extraction is performed.
  2. For text features, a BERT model is adopted to extract features of the text modality.
  3. For acoustic features, COVAREP-based methods are used to extract acoustic features.
  4. For visual features, Facet is used for visual feature extraction.
  5. After feature extraction is finished, a feature mean is calculated by summing the features and dividing by the total number N of input data samples, giving the average of the data set (see the worked formula after this list).
  6. After the multi-modal data set has undergone modality-specific feature extraction, a multi-head attention mechanism is combined so that each modality attends more closely to emotion-label information, improving the accuracy of the subsequent multi-modal emotion analysis.
  7. The input matrix X is linearly transformed by three learned weight matrices, yielding the query, key, and value representations respectively.
  8. From these three values the attention output is calculated as $\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(QK^{\top}/\sqrt{d_k}\right)V$, where Q, K, and V are the linear transformations of the query, key, and value, $W_i^Q$, $W_i^K$, $W_i^V$ are the linear transformation matrices of each attention head i, and $d_k$ is the dimension of the query or key.
  9. Finally, all attention heads are concatenated and multiplied by an output weight matrix (see fig. 2 and the sketch after this list).
  10. Emotion-relevant information is marked on the extracted modality features through the multi-head attention mechanism, after which multi-modal fusion analysis begins.
  11. Using the extracted feature vectors as input, a Linear layer matches the dimensions between the modalities, and the modalities are projected onto the same space by the Linear layer to form a new matrix.
  12. Fusion is then performed: two different modalities A and B are connected, and segments of a third modality are inserted between their segments for multi-modal fusion (see fig. 1 and the sketch after this list).
  13. The model was trained on the public CMU-MOSI and CMU-MOSEI data sets, and comparison with other baseline models showed the proposed method to be the best choice (see fig. 4).
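
Claim 5 describes the feature-mean computation only in words. As a worked equation, and assuming the straightforward arithmetic mean that the wording implies, it reads:

```latex
% Arithmetic mean of the N extracted feature samples x_1, ..., x_N
% (assumption: the claim's unnamed "formula" is the ordinary mean)
\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i
```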
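Claims 7-9 describe the standard multi-head attention computation. A minimal PyTorch sketch of that computation follows; the module name, head count, and dimensions are illustrative assumptions, not values given in the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Minimal multi-head self-attention as described in claims 7-9.
    Hyperparameters (d_model, num_heads) are illustrative assumptions."""

    def __init__(self, d_model: int = 128, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Claim 7: three learned weight matrices producing query, key, value.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        # Claim 9: output weight matrix applied after concatenating the heads.
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, d_model = x.shape

        # Split each projection into per-head subspaces: (batch, heads, seq, d_k).
        def split(t: torch.Tensor) -> torch.Tensor:
            return t.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        # Claim 8: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
        scores = q @ k.transpose(-2, -1) / (self.d_k ** 0.5)
        heads = F.softmax(scores, dim=-1) @ v
        # Claim 9: concatenate all heads, then apply the output projection.
        concat = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.w_o(concat)

# Usage: a batch of 2 sequences, 20 steps each, in a 128-d feature space.
attn = MultiHeadAttention()
out = attn(torch.randn(2, 20, 128))  # -> torch.Size([2, 20, 128])
```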
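Claims 11 and 12 project each modality into a shared space with a Linear layer and then combine the modalities. The exact interleaving is not fully specified in the translation; the sketch below is one plausible reading, assuming modalities A and B are connected with the third modality's segments spliced between them. The feature dimensions are illustrative (typical BERT/COVAREP/Facet sizes), not values stated in the patent.

```python
import torch
import torch.nn as nn

# Illustrative per-modality feature dimensions (assumptions, not patent values).
D_TEXT, D_AUDIO, D_VISION, D_SHARED = 768, 74, 35, 128

# Claim 11: one Linear layer per modality matches dimensions so that all
# modalities are projected onto the same space.
proj_text = nn.Linear(D_TEXT, D_SHARED)
proj_audio = nn.Linear(D_AUDIO, D_SHARED)
proj_vision = nn.Linear(D_VISION, D_SHARED)

def fuse(text: torch.Tensor, audio: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
    """Claim 12 (one plausible reading): connect modalities A and B and insert
    the third modality's segments between their segments."""
    a = proj_text(text)      # (seq_a, D_SHARED)
    b = proj_audio(audio)    # (seq_b, D_SHARED)
    c = proj_vision(vision)  # (seq_c, D_SHARED)
    # A, then the spliced-in third modality, then B, all in the shared space.
    return torch.cat([a, c, b], dim=0)

fused = fuse(torch.randn(20, D_TEXT), torch.randn(50, D_AUDIO), torch.randn(30, D_VISION))
print(fused.shape)  # torch.Size([100, 128])
```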

Description

Multi-modal emotion analysis model based on multi-head attention perception fusion

Technical Field

The invention belongs to the field of natural language processing and emotion analysis, and designs a multi-head attention perception fusion emotion analysis model.

Background

In fields such as electronic commerce, movie websites, and online advertising, improving the user experience requires accurately recommending items of interest, such as commodities and movies, to users as they browse web pages or use software. Users generate large amounts of data during use, and computing user preferences from these data with a recommendation model greatly reduces manual involvement while effectively improving the accuracy and speed of recommendation. Existing emotion analysis models suffer from insufficient analysis accuracy and from being tied to a single application field; in other fields the model must be modified to reach even basic recommendation accuracy. In summary, existing multi-modal emotion models show insufficient analysis precision on multi-modal data sets and poor effectiveness across multiple fields, so the overall analysis precision needs to be improved.

Disclosure of Invention

To remedy the deficient analysis precision of prior multi-modal research, the invention provides a multi-modal emotion analysis method based on multi-head attention perception fusion. It overcomes the single-modality limitation in the emotion analysis field and improves analysis precision. The technical scheme is as follows: (1) considering the complexity of multiple fields and the diversity of input data, features of the multi-modal mixed data set are extracted in modality-specific ways; (2) after the modality-specific feature extraction, a multi-head attention mechanism is combined so that each modality attends more closely to emotion-label information and the extracted feature information can be judged more accurately; (3) emotion-relevant information is marked on the extracted modality features through the multi-head attention mechanism, and multi-modal fusion then begins using the corresponding information; (4) the fusion process adopts forward interactive fusion, and through the multi-head attention mechanism plus the two-pass interaction of fusion, emotion information is extracted more accurately; (5) repeated comparison experiments against other baseline models under the same conditions show that the accuracy of the proposed model is improved.

Drawings

To illustrate the specific technical solution of the invention more clearly, the drawings referred to are described below. FIG. 1 is a diagram of data input and feature extraction in the multi-head attention fusion emotion analysis model; FIG. 2 is a diagram of the internal model of the multi-head attention mechanism; FIG. 3 is an overall operational schematic; FIG. 4 is a graph of comparative experimental results and analysis.

Detailed Description

The invention is described below with reference to the accompanying drawings.
The invention provides a multi-modal emotion analysis model based on multi-head attention perception fusion, which addresses the problem of improving the precision of emotion analysis across multiple modalities; repeated experiments and data comparisons with multiple baseline models show that it can effectively improve recommendation precision and efficiency. In multi-modal emotion analysis, the model acquires the data of a multi-modal data set, extracts the important modality feature information, focuses on emotion information through a multi-head attention mechanism, performs feature interaction between the modalities, and finally carries out cross forward fusion to obtain the predicted result. The specific operation flow of the invention is as follows: 1. A multi-modal data set is collected and split into training, validation, and test partitions; it contains data in three different modalities, namely text, acoustic, and visual, on which feature extraction is performed. 2. For text features, a BERT model is adopted to extract features of the text modality (see the sketch below). 3. For acoustic features, COVAREP-based methods are used to extract acoustic features. 4. For visual features, Facet is used for visual feature extraction. 5. After feature extraction is finished, a feature mean is calculated by summing the features and dividing by the total number N of input data samples, giving the average of the data set.
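
Step 2 above names BERT for text-feature extraction. The following is a minimal sketch using the Hugging Face transformers library; the checkpoint name and the mean pooling over tokens are assumptions, since the patent specifies neither (COVAREP and Facet are separate, non-Python toolkits and are not sketched here).

```python
# Minimal text-feature extraction with BERT (step 2). The checkpoint name and
# mean pooling are assumptions; the patent specifies neither.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def text_features(sentence: str) -> torch.Tensor:
    """Encode one sentence into a single fixed-size text-modality feature."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the final hidden states into one 768-d feature vector.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

vec = text_features("This movie was surprisingly moving.")
print(vec.shape)  # torch.Size([768])
```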