KR-20260064329-A - Multimodal based Emotion Analysis Method and Apparatus

KR 20260064329 A

Abstract

An embodiment of the present disclosure provides a multimodal-based emotion analysis device comprising: an acoustic expression extraction unit that extracts an acoustic expression representing the acoustic characteristics of a speaker from a speech signal using a pre-trained acoustic feature extraction model; a speech expression extraction unit that extracts a speech expression including semantic information of the speech from the speech signal using an encoder of a speech recognition model; a text expression extraction unit that extracts a speech-referenced text expression reflecting the conversational context and the emotion inherent in the speech expression from text data transcribed from the speech signal, using a text encoder configured to perform cross-attention between a speech modality and a text modality; and an emotion classification unit that generates a multimodal representation by combining the acoustic expression and the speech-referenced text expression, and classifies the emotion by applying the multimodal representation to a fully connected layer.
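The publication does not include reference code; the following is a minimal PyTorch-style sketch of the pipeline described in the abstract. The module names (`acoustic_encoder`, `speech_encoder`, `text_encoder`), the pooling, and the dimensions are illustrative assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

class MultimodalEmotionClassifier(nn.Module):
    """Illustrative sketch of the abstract's pipeline (not the patented implementation).

    acoustic_encoder: pre-trained acoustic feature extractor (speaker's acoustic characteristics)
    speech_encoder:   encoder of a speech recognition model (semantic speech representation)
    text_encoder:     text encoder whose cross-attention attends to the speech representation
    """
    def __init__(self, acoustic_encoder, speech_encoder, text_encoder,
                 dim=768, num_emotions=7):
        super().__init__()
        self.acoustic_encoder = acoustic_encoder
        self.speech_encoder = speech_encoder
        self.text_encoder = text_encoder
        # Fully connected layer applied to the fused multimodal representation.
        self.classifier = nn.Linear(2 * dim, num_emotions)

    def forward(self, waveform, token_ids):
        # Acoustic expression: pooled acoustic characteristics of the speaker.
        acoustic = self.acoustic_encoder(waveform).mean(dim=1)      # (B, dim)
        # Speech expression: semantic information from the ASR encoder.
        speech = self.speech_encoder(waveform)                      # (B, T_s, dim)
        # Speech-referenced text expression: text attends to the speech expression.
        text = self.text_encoder(token_ids, speech).mean(dim=1)     # (B, dim)
        # Multimodal representation: combine acoustic and speech-referenced text expressions.
        fused = torch.cat([acoustic, text], dim=-1)                 # (B, 2*dim)
        return self.classifier(fused)                               # emotion logits
```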

Inventors

  • 김창현
  • 이상율
  • 조한수
  • 김다혜
  • 성은선
  • 이하림

Assignees

  • SK Telecom Co., Ltd. (에스케이텔레콤 주식회사)

Dates

Publication Date
2026-05-07
Application Date
2024-10-31

Claims (9)

  1. A multimodal-based emotion analysis apparatus comprising: an acoustic expression extraction unit that extracts an acoustic expression representing acoustic characteristics of a speaker from a speech signal using a pre-trained acoustic feature extraction model; a speech expression extraction unit that extracts a speech expression containing semantic information of the speech from the speech signal using an encoder of a speech recognition model; a text expression extraction unit that extracts a speech-referenced text expression reflecting the conversational context and the emotion inherent in the speech expression from text data transcribed from the speech signal, using a text encoder configured to perform cross-attention between a speech modality and a text modality; and an emotion classification unit that generates a multimodal representation by combining the acoustic expression and the speech-referenced text expression, and classifies the emotion by applying the multimodal representation to a fully connected layer.
  2. The apparatus of claim 1, wherein the text encoder comprises a cross-attention layer that receives key embeddings and value embeddings extracted from the speech expression.
  3. The apparatus of claim 1, wherein the speech expression and the speech-referenced text expression have the same number of dimensions.
  4. The apparatus of claim 1, wherein the encoder of the speech recognition model and the text encoder are fine-tuned based on a training dataset for emotion analysis, and a Low-Rank Adaptation (LoRA) technique is applied that decomposes the weight matrices of the original layers of the encoder of the speech recognition model and of the text encoder into low-rank matrices and updates the parameters of the resulting adapter layer.
  5. The apparatus of claim 4, wherein the adapter layer is applied to at least one of a feed-forward layer and an attention layer of the encoder of the speech recognition model and of the text encoder.
  6. The apparatus of claim 1, wherein the encoder of the speech recognition model, the text encoder, and the fully connected layer are trained using a loss derived based on a training dataset for emotion analysis, and the loss includes a cross-entropy loss.
  7. The apparatus of claim 6, wherein the loss further includes an emotion representation loss in which each emotion is assigned coordinates on a circle of a predetermined radius and the loss is calculated based on the distances between emotions.
  8. A computer-implemented method for multimodal-based emotion analysis, comprising: extracting an acoustic expression representing acoustic characteristics of a speaker from a speech signal using a pre-trained acoustic feature extraction model; extracting a speech expression containing semantic information of the speech from the speech signal using an encoder of a speech recognition model; extracting a speech-referenced text expression reflecting the conversational context and the emotion inherent in the speech expression from text data transcribed from the speech signal, using a text encoder configured to perform cross-attention between a speech modality and a text modality; and generating a multimodal representation by combining the acoustic expression and the speech-referenced text expression, and classifying the emotion by applying the multimodal representation to a fully connected layer.
  9. A computer-readable recording medium storing instructions that, when executed by a computer, cause the computer to execute each process included in the method of claim 8.
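Claims 4 and 5 describe fine-tuning the two encoders with Low-Rank Adaptation (LoRA). Below is a minimal sketch of a LoRA adapter under the standard formulation (the frozen weight W is augmented by a trainable low-rank update BA); the rank, scaling, and choice of wrapped layer are assumptions for illustration, not values stated in the patent.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA adapter: the original linear layer stays frozen while only
    the low-rank factors A and B are trained (illustrative sketch)."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # freeze original weights
        # Low-rank decomposition: B (out x r) @ A (r x in)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # W x + (B A) x, where the low-rank term is the trainable adapter.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale
```

Per claim 5, such adapters would be attached to the feed-forward and/or attention layers of the speech-recognition encoder and the text encoder; exactly which layers are wrapped is left open by the claims.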

Description

Multimodal-based Emotion Analysis Method and Apparatus

The present disclosure relates to a multimodal-based emotion analysis method and apparatus. The following description merely provides background information related to the present embodiment and does not constitute prior art.

Multimodal-based emotion analysis technology analyzes human emotions, and one such technique is multimodal representation pre-training. This technique typically processes two modalities (e.g., audio and text) with separate encoders, and the embeddings generated from each modality are fed symmetrically into a cross-modality encoder. To reflect the differences between the two modalities effectively, an efficient cross-attention structure that accounts for the differences between audio and text information is required. This structure aims to generate an integrated representation while preserving the unique characteristics of each modality and minimizing information loss. However, because speech data conveys emotional states that are not revealed in text (e.g., an agitated voice occurring in a state of anger versus happiness), relying on this structure alone may limit how objectively emotional states can be analyzed.

  • FIG. 1 is a block diagram showing a multimodal-based emotion analysis device.
  • FIG. 2 is another example of a multimodal-based emotion analysis device according to an embodiment of the present disclosure.
  • FIG. 3 is a flowchart illustrating a multimodal-based emotion analysis method according to an embodiment of the present disclosure.
  • FIG. 4 shows the configuration of the weights of a LoRA learning model.
  • FIG. 5 is an example of a first LoRA application according to an embodiment of the present disclosure.
  • FIG. 6 is an example of a second LoRA application according to an embodiment of the present disclosure.
  • FIG. 7 is an example diagram of a weighted-sum application according to an embodiment of the present disclosure.
  • FIG. 8 is an example diagram of a weight-gating application according to an embodiment of the present disclosure.
  • FIG. 9 is an example of a general human emotion system.
  • FIG. 10 is an example diagram of an appraisal table for appraisal direction and distance according to an embodiment of the present disclosure.
  • FIG. 11 is an example diagram of a human emotion system applied to an embodiment of the present disclosure.
  • FIG. 12 is a block diagram schematically illustrating an exemplary computing device to which the present disclosure may be applied.

Some embodiments of the present disclosure are described in detail below with reference to the exemplary drawings. In assigning reference numerals to the components of each drawing, the same components are given the same reference numeral wherever possible, even if they appear in different drawings. Furthermore, if a detailed description of related known components or functions could obscure the essence of the present disclosure, that description is omitted. In describing the components of the embodiments according to the present disclosure, symbols such as first, second, i), ii), a), b), etc. may be used. These symbols are intended only to distinguish one component from another, and the nature, order, or sequence of the components is not limited by them.
When a part of the specification is described as 'comprising' or 'having' a component, this means that, unless explicitly stated otherwise, it does not exclude other components but may further include them. Each component of the device or method according to the present disclosure may be implemented in hardware, in software, or in a combination of hardware and software. The function of each component may also be implemented in software, with a microprocessor executing the software function corresponding to that component.

FIG. 1 is a block diagram showing a multimodal-based emotion analysis device. The entity performing multimodal-based emotion analysis is referred to as an emotion analysis device (100). The emotion analysis device (100) includes all or part of an acoustic expression extraction unit (102), a speech expression extraction unit (104), a text expression extraction unit (106), and an emotion classification unit (108). The acoustic expression extraction unit (102) extracts an acoustic expression representing the acoustic characteristics of the speaker from a speech signal using a pre-trained acoustic feature extraction model. The speech expression extraction unit (104) extracts a speech expression containing semantic information of the speech from the speech signal using an encoder of a speech recognition model. The text expression extraction unit (106) extracts a speech-referenced text expression that reflects the conversational context and the emotion inherent in the speech expression from text data transcribed from the speech signal, using a text encoder configured to perform cross-attention between a speech modality and a text modality.
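As a rough illustration of the speech-referenced text extraction described above (and in claim 2), the sketch below shows a single text-encoder layer whose cross-attention takes queries from the text embeddings and keys/values from the speech expression. The use of `nn.MultiheadAttention`, the layer sizes, and the pre/post-normalization scheme are assumptions for illustration, not details disclosed in the patent.

```python
import torch
import torch.nn as nn

class SpeechReferencedTextLayer(nn.Module):
    """One text-encoder layer in which cross-attention uses text queries and
    speech keys/values, so the text expression can absorb emotion cues carried
    only by the speech signal (illustrative sketch)."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm_self = nn.LayerNorm(dim)
        self.norm_cross = nn.LayerNorm(dim)
        self.norm_ffn = nn.LayerNorm(dim)

    def forward(self, text, speech):
        # Self-attention over the transcribed text (conversational context).
        h, _ = self.self_attn(text, text, text)
        text = self.norm_self(text + h)
        # Cross-attention: queries from text; keys and values from the speech expression.
        h, _ = self.cross_attn(text, speech, speech)
        text = self.norm_cross(text + h)
        # Position-wise feed-forward network with residual connection.
        return self.norm_ffn(text + self.ffn(text))
```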