CN-121980314-A - Large language model emotion analysis method based on multi-modal chain-of-thought alignment
Abstract
The invention provides a large language model emotion analysis method based on multi-modal chain-of-thought alignment, and relates to the technical field of artificial intelligence and multi-modal information processing. The method first acquires and preprocesses multi-modal emotion analysis input data comprising at least a text modality, a visual modality and a speech modality; it then vectorizes and encodes each modality's data; it further uses a large language model to perform modality-specific chain-of-thought reasoning, generating, under prompt driving, emotion reasoning chains-of-thought for the text, speech and visual modalities respectively; it vectorizes, encodes and fuses each modality's chain-of-thought; and it finally performs emotion classification prediction on the multi-modal chain-of-thought fusion representation to obtain the final emotion analysis result. By introducing an alignable multi-modal chain-of-thought reasoning mechanism, the invention fully exploits the reasoning capability of large models and the complementary advantages of multi-modal information, and improves the accuracy, robustness and interpretability of emotion analysis without requiring a manually annotated reasoning process.
Inventors
- LIU CHANG
- SUN XIAO
Assignees
- 合肥工业大学 (Hefei University of Technology)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-23
Claims (10)
- 1. A large language model emotion analysis method based on multi-modal chain-of-thought alignment, characterized by comprising the following steps: acquiring and preprocessing multi-modal emotion analysis input data, wherein the multi-modal emotion analysis input data comprises at least a text modality, a visual modality and a speech modality, and the visual modality comprises video and images; taking each preprocessed modality's data as the input of a corresponding modality encoder, and respectively acquiring a text modality latent representation, a visual modality latent representation and a speech modality latent representation; performing modality-specific chain-of-thought reasoning on each modality latent representation using a large language model, to respectively generate a text chain-of-thought, a visual chain-of-thought and a speech chain-of-thought; vectorizing and encoding each modality's chain-of-thought, to respectively obtain a text chain-of-thought embedded representation, a visual chain-of-thought embedded representation and a speech chain-of-thought embedded representation; performing weighted fusion of the modality chain-of-thought embedded representations using an attention weighting mechanism, to construct a unified multi-modal chain-of-thought fusion representation; and performing emotion classification prediction on the multi-modal chain-of-thought fusion representation to obtain a final emotion analysis result (an illustrative end-to-end sketch of the fusion and classification steps is given after the claims).
- 2. The large language model emotion analysis method of claim 1, wherein the preprocessing includes: performing word segmentation on the text modality data and removing stop words and non-emotion-related symbols; and/or performing key-frame extraction and image normalization on the visual modality data; and/or performing noise reduction, speech segmentation and acoustic feature extraction on the speech modality data.
- 3. The large language model emotion analysis method of claim 1, wherein the performing modality-specific chain-of-thought reasoning on each modality latent representation using the large language model to respectively generate a text chain-of-thought, a visual chain-of-thought and a speech chain-of-thought comprises: designing modality-specific prompt templates, the prompt templates being used to guide the large language model to understand the emotion cues in the input data; and filling the corresponding modality-specific prompt template with each modality latent representation, to respectively generate a text chain-of-thought comprising emotion cue recognition, context analysis and implicit emotion inference, a visual chain-of-thought comprising facial expression, gesture and action, and scene semantic analysis, and a speech chain-of-thought comprising intonation, speech rate and emotion intensity analysis (an illustrative prompt-template sketch is given after the claims).
- 4. The large language model emotion analysis method of claim 1, wherein each modality's chain-of-thought is vectorized and encoded by a Transformer encoder.
- 5. The large language model emotion analysis method of claim 1, wherein emotion alignment loss functions are constructed during the training phase, comprising an emotion consistency loss, a chain-of-thought embedding alignment loss and a conflict resolution loss, wherein the emotion consistency loss is used to ensure the consistency of the model's emotion predictions within each modality and across modalities, the chain-of-thought embedding alignment loss is used to constrain the relative distances of the chain-of-thought embedded representations of different modalities in the feature space, and the conflict resolution loss is used to prevent the multi-modal prediction result from depending on a single modality (an illustrative training-loss sketch is given after the claims).
- 6. The large language model emotion analysis method of claim 5, wherein the chain-of-thought embedding alignment loss is: $L_{align} = \sum_{i,j \in \{T,V,A,MM\},\, i \neq j} \left\| e_i - e_j \right\|_2^2$, wherein $\|\cdot\|_2$ is the two-norm, $e_i$ and $e_j$ are respectively the input representations of the $i$-th and $j$-th emotion classification prediction tasks in the training phase, and the subscripts $T$, $V$, $A$ and $MM$ correspond respectively to the text chain-of-thought embedded representation, the visual chain-of-thought embedded representation, the speech chain-of-thought embedded representation and the multi-modal chain-of-thought fusion representation.
- 7. The large language model emotion analysis method of claim 5, wherein the conflict resolution loss is: $L_{conflict} = \sum_{m \in \{T,V,A\}} \left\| \hat{y}_{MM} - \hat{y}_m \right\|_2^2$, wherein $\|\cdot\|_2$ is the two-norm, and $\hat{y}_T$, $\hat{y}_V$, $\hat{y}_A$ and $\hat{y}_{MM}$ are respectively the emotion classification prediction results of the text modality latent representation, the visual modality latent representation, the speech modality latent representation and the multi-modal chain-of-thought fusion representation.
- 8. A large language model emotion analysis system based on multi-modal chain-of-thought alignment, comprising: a data acquisition and preprocessing module, used to acquire and preprocess multi-modal emotion analysis input data comprising at least a text modality, a visual modality and a speech modality, wherein the visual modality comprises video and images; a modality encoding module, used to take each preprocessed modality's data as the input of a corresponding modality encoder and respectively acquire a text modality latent representation, a visual modality latent representation and a speech modality latent representation; a chain-of-thought reasoning module, used to perform modality-specific chain-of-thought reasoning on each modality latent representation using a large language model, to respectively generate a text chain-of-thought, a visual chain-of-thought and a speech chain-of-thought; a chain-of-thought encoding module, used to vectorize and encode each modality's chain-of-thought, to respectively acquire a text chain-of-thought embedded representation, a visual chain-of-thought embedded representation and a speech chain-of-thought embedded representation; a chain-of-thought fusion module, used to perform weighted fusion of the modality chain-of-thought embedded representations using an attention weighting mechanism, to construct a unified multi-modal chain-of-thought fusion representation; and an emotion analysis module, used to perform emotion classification prediction on the multi-modal chain-of-thought fusion representation to obtain a final emotion analysis result.
- 9. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method of any one of claims 1 to 7.
- 10. A computer device comprising a memory and a processor, the memory having stored thereon a computer program executable on the processor, wherein the processor, when executing the computer program, implements the method of any one of claims 1 to 7.
Description
Large language model emotion analysis method based on multi-modal chain-of-thought alignment

Technical Field
The invention relates to the technical field of artificial intelligence and multi-modal information processing, and in particular to a large language model emotion analysis method based on multi-modal chain-of-thought alignment.

Background
With the rapid spread of internet technology and intelligent terminals, social media platforms (such as microblogs, Twitter and short-video platforms) have become important channels through which users express emotional attitudes, publish opinions and interact socially. The information users publish on social media usually spans multiple modalities such as text, images, video and speech, and this multi-modal data contains rich emotion cues, making it valuable for application scenarios such as public opinion analysis, public safety monitoring, human-computer interaction and mental health assessment. Accurately modeling and analyzing the emotion information in multi-modal data has therefore become an important research direction in artificial intelligence and affective computing.

In the related art, conventional multi-modal emotion analysis methods generally use separate encoders to extract text, visual and speech features, and fuse them by feature concatenation, attention mechanisms or graph-structured networks to obtain the final emotion prediction. Such methods exploit the complementary information among modalities to some extent, but they still rely mainly on feature-level association modeling, lack an explicit reasoning mechanism for how emotion is formed, and make the model's decision basis difficult to explain.

In recent years, large language models (Large Language Model, LLM) have demonstrated excellent capabilities in natural language understanding and complex reasoning tasks; in particular, they can explicitly generate intermediate reasoning steps in a chain-of-thought (Chain-of-Thought, CoT) manner, improving the modeling of implicit semantics and logical relationships. However, current chain-of-thought techniques are mainly applied to single-modality text scenarios, and extending them directly to multi-modal emotion analysis still faces many challenges; in particular, the reasoning produced by different modalities may be inconsistent or even conflicting, and without an effective alignment and constraint mechanism the model is prone to reasoning bias or irrelevant reasoning results.

Disclosure of Invention
(I) Technical problem to be solved
In view of the shortcomings of the prior art, the invention provides a large language model emotion analysis method based on multi-modal chain-of-thought alignment, which solves the problem of how to effectively handle emotion conflicts and inconsistencies between different modalities.
(II) Technical scheme
To achieve the above purpose, the invention is realized by the following technical scheme:

A large language model emotion analysis method based on multi-modal chain-of-thought alignment comprises the following steps: acquiring and preprocessing multi-modal emotion analysis input data, wherein the multi-modal emotion analysis input data comprises at least a text modality, a visual modality and a speech modality, and the visual modality comprises video and images; taking each preprocessed modality's data as the input of a corresponding modality encoder, and respectively acquiring a text modality latent representation, a visual modality latent representation and a speech modality latent representation; performing modality-specific chain-of-thought reasoning on each modality latent representation using a large language model, to respectively generate a text chain-of-thought, a visual chain-of-thought and a speech chain-of-thought; vectorizing and encoding each modality's chain-of-thought, to respectively obtain a text chain-of-thought embedded representation, a visual chain-of-thought embedded representation and a speech chain-of-thought embedded representation; performing weighted fusion of the modality chain-of-thought embedded representations using an attention weighting mechanism, to construct a unified multi-modal chain-of-thought fusion representation; and performing emotion classification prediction on the multi-modal chain-of-thought fusion representation to obtain a final emotion analysis result.

Preferably, the preprocessing includes: performing word segmentation on the text modality data and removing stop words and non-emotion-related symbols; and/or performing key-frame extraction and image normalization on the visual modality data; and/or performing noise reduction, speech segmentation and acoustic feature extraction on the speech modality data.

Preferably, the performing modality-specific chain-of-thought reasoning o