KR-20260065111-A - Users' Empathy Level Prediction System and Methods in Online Environment

Abstract

The present invention relates to a system for measuring user empathy levels in an online environment, and more specifically, to a system for predicting empathy levels from multimodal data (video, voice, text, and biosignals) collected from a user in an online environment. According to the present invention, the system comprises a data collection unit (100) that collects user conversation data in real time as video, voice, text, and biosignal modalities; a feature extraction unit (200) that extracts features of each modality from the collected data; a learning unit (300) that learns inter-modality importance using cross-attention and adjusts modality-specific weights through a bi-directional MLP-Mixer and a weight-based fusion module; and a classification unit (400) that combines the learned features to finally classify the user's empathy level, thereby enabling measurement of the user's empathy level from the multimodal data collected in the online environment.

Inventors

  • 양형정
  • 임은채
  • 신지은
  • 이화령
  • 김승원
  • 김수형

Assignees

  • 전남대학교산학협력단

Dates

Publication Date
2026-05-08
Application Date
2024-10-31

Claims (5)

  1. A user empathy level measurement system in an online environment, comprising: a data collection unit (100) that collects user conversation data in real time as video, voice, text, and biosignal modalities; a feature extraction unit (200) that extracts features of each modality from the collected data; a learning unit (300) that learns inter-modality importance using cross-attention and adjusts modality-specific weights through a bi-directional MLP-Mixer and a weight-based fusion module; and a classification unit (400) that combines the learned features to finally classify the user's empathy level (a minimal sketch of this pipeline follows the claims).
  2. The system of claim 1, wherein the feature extraction unit comprises: a video modality feature extraction module (210) that generates face embeddings using FaceNet512, analyzes emotional state through a FER (Facial Emotion Recognition) model, extracts gestures using Mediapipe, and analyzes gaze data through GazeTracking; a speech modality feature extraction module (220) that resamples the speech signal and extracts high-level speech features through HuBERT and Wav2Vec2.0 models; a text modality feature extraction module (230) that converts speech into text and extracts linguistic features from the converted text using a DistilKoBERT model; and a biosignal modality feature extraction module (240) that reflects temporal changes in biosignals by embedding biosignal data such as electrodermal activity (EDA), blood volume pulse (BVP), body temperature, and metabolic equivalent of task (MET) in sequence units (a hedged extraction sketch also follows the claims).
  3. The system of claim 1, wherein the learning unit (300) comprises: a bi-directional MLP-Mixer module (310) that generates richer feature representations by simultaneously learning temporal and contextual information through bidirectional processing of each modality's input sequence; a cross-attention module (320) that learns cross-attention weights between modalities, reflecting each modality's importance and assigning higher weights to important modalities based on the association of their interactions; and a weight-based fusion module (330) that dynamically learns the importance of each modality from the cross-attention results and performs optimal empathy prediction by combining the features of all modalities according to these weights.
  4. The system of claim 1, wherein the classification unit (400) comprises: a feature fusion module (410) that combines the weight-adjusted features of each modality received from the learning unit (300) into a single comprehensive feature vector; and an empathy level classification module (420) that classifies a predefined empathy level based on the comprehensive feature vector generated by the feature fusion module, finally determining the user's empathy state as one of seven empathy levels.
  5. A method using the user empathy level measurement system in an online environment, comprising: (a) a step in which the user empathy level measurement system collects conversation data between users in real time as video, voice, text, and biosignal modalities; (b) a feature extraction step in which the system extracts features of each modality from the collected data; (c) a step in which the system learns inter-modality importance by applying cross-attention to the extracted features of each modality and generates learned features by adjusting modality-specific weights through a bi-directional MLP-Mixer and a weight-based fusion module; and (d) a step in which the system combines the learned modality features into a single comprehensive feature vector and, predicting the empathy level from this vector, classifies the user's empathy state into one of seven predefined empathy levels.
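
The following is a minimal PyTorch sketch of the four-unit pipeline recited in claims 1 and 5. The unit names, the four modalities, and the seven-way output come from the claims; the projection sizes, the linear placeholder encoders standing in for the real extractors, and the single shared attention layer are illustrative assumptions, not the patented implementation.

    # Hedged sketch: units (100)-(400) from claims 1 and 5; all dimensions and
    # the placeholder per-modality encoders are assumptions.
    import torch
    import torch.nn as nn

    class EmpathyLevelSystem(nn.Module):
        def __init__(self, d_model=256, num_levels=7):
            super().__init__()
            dims = {"video": 512, "audio": 768, "text": 768, "bio": 4}  # assumed
            # Feature extraction unit (200): linear stand-ins for the real
            # extractors (FaceNet512/FER/Mediapipe/GazeTracking, HuBERT and
            # Wav2Vec2.0, DistilKoBERT, sequence-embedded biosignals).
            self.extract = nn.ModuleDict(
                {m: nn.Linear(d, d_model) for m, d in dims.items()})
            # Learning unit (300): cross-attention over the modality axis plus
            # learnable per-modality weights (the weight-based fusion module).
            self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4,
                                                    batch_first=True)
            self.modality_logits = nn.Parameter(torch.zeros(len(dims)))
            # Classification unit (400): seven predefined empathy levels.
            self.classifier = nn.Linear(d_model, num_levels)

        def forward(self, inputs):
            # inputs: {modality: (batch, seq, dim)}, keys in the order of dims
            pooled = [self.extract[m](x).mean(dim=1) for m, x in inputs.items()]
            stack = torch.stack(pooled, dim=1)        # (batch, n_mod, d_model)
            attended, _ = self.cross_attn(stack, stack, stack)
            weights = torch.softmax(self.modality_logits, dim=0)
            fused = (attended * weights.view(1, -1, 1)).sum(dim=1)
            return self.classifier(fused)             # (batch, 7) level logits

A forward pass takes a dict of per-modality feature sequences and returns a (batch, 7) logit tensor over the empathy levels, mirroring steps (a) through (d) of claim 5.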
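For the speech and text branches of claim 2, the sketch below shows how pre-trained encoders could supply the features. The patent names HuBERT, Wav2Vec2.0, and DistilKoBERT but no specific checkpoints; the Hugging Face checkpoint names below ("facebook/wav2vec2-base-960h", "facebook/hubert-base-ls960", "monologg/distilkobert") are assumptions used only for illustration.

    # Hedged sketch of the claim-2 speech/text feature extraction; checkpoint
    # names are assumptions, not identified in the patent.
    import torch
    from transformers import AutoModel, AutoTokenizer, Wav2Vec2Model, HubertModel

    wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
    hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")
    text_tok = AutoTokenizer.from_pretrained("monologg/distilkobert",
                                             trust_remote_code=True)
    text_enc = AutoModel.from_pretrained("monologg/distilkobert")

    def speech_features(waveform_16k: torch.Tensor) -> torch.Tensor:
        # waveform_16k: (batch, samples), resampled to 16 kHz as the claim requires
        with torch.no_grad():
            h = hubert(waveform_16k).last_hidden_state    # (batch, frames, 768)
            w = wav2vec(waveform_16k).last_hidden_state   # (batch, frames, 768)
        return torch.cat([h, w], dim=-1)                  # combined speech features

    def text_features(sentences: list[str]) -> torch.Tensor:
        batch = text_tok(sentences, return_tensors="pt", padding=True)
        with torch.no_grad():
            return text_enc(**batch).last_hidden_state    # (batch, tokens, 768)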

Description

Users' Empathy Level Prediction System and Methods in Online Environment

The present invention relates to a system for measuring user empathy levels in an online environment, and more specifically, to a system for predicting empathy levels from multimodal data (video, voice, text, and biosignals) collected from a user in an online environment.

Empathy is one of the social interactions necessary for people to maintain harmonious relationships. People want to be empathized with while sharing their stories with others, and they feel a sense of intimacy with those who show empathy. Various psychological studies have found that when a person meets someone new, sharing personal stories leads to experiencing empathy from the other person.

With the advancement of technology, constraints on the space and time of human interaction are disappearing. Previously, friendships were formed primarily offline by meeting in person, talking face-to-face, and sharing empathy, but opportunities to communicate in online environments are increasing. In an online environment, one can easily form new relationships through video chatting or exchanging text messages without meeting in person. Even online, one can see the other person's facial expressions in real time and converse by voice and text through various features, and through this process empathize with others or receive empathy from them. Empathy matters online because it is a major factor in determining a user's psychological well-being.

Although non-face-to-face conversation has become commonplace in the metaverse and other online environments, systems that measure whether users are empathizing with one another in these settings are lacking. Furthermore, because empathy must be measured in real time, real-time characteristics must be reflected, yet research in this area is insufficient, and it is difficult to accurately measure empathy between users from a single modality (text) alone.

FIG. 1 is a configuration diagram of a user empathy level measurement system in an online environment according to one embodiment of the present invention. FIG. 2 shows the multimodal empathy level measurement model of the system. FIG. 3 shows the Bi-directional MLP-Mixer structure of the system. FIG. 4 is an overall flowchart of a method using the system.

In this invention, data from two people conversing with each other is collected as video, voice, text, and biosignals, reflecting real-time characteristics, and a multimodal empathy prediction model is developed from this data. To improve prediction accuracy, a new approach comprising a Bi-directional MLP-Mixer model and a weight-based fusion module is proposed. In other words, the present invention proposes a novel Bi-directional MLP-Mixer layer, and it predicts the user's empathy level through a fusion module that learns cross-attention weights between the features of each modality and combines them according to per-modality weight ratios.
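The patent does not publish layer equations for the Bi-directional MLP-Mixer of FIG. 3, so the sketch below shows one plausible reading: a standard Mixer block (token-mixing MLP over time steps, channel-mixing MLP over features) applied to the sequence in both temporal directions, with the two passes re-aligned and summed. The hidden sizes and the summation merge are assumptions.

    # Hedged sketch of a Bi-directional MLP-Mixer layer; the bidirectional
    # merge (sum of forward and reversed passes) is an assumed design.
    import torch
    import torch.nn as nn

    class MixerBlock(nn.Module):
        def __init__(self, seq_len, d_model, hidden=512):
            super().__init__()
            self.norm1 = nn.LayerNorm(d_model)
            self.token_mlp = nn.Sequential(       # mixes across time steps
                nn.Linear(seq_len, hidden), nn.GELU(), nn.Linear(hidden, seq_len))
            self.norm2 = nn.LayerNorm(d_model)
            self.channel_mlp = nn.Sequential(     # mixes across feature channels
                nn.Linear(d_model, hidden), nn.GELU(), nn.Linear(hidden, d_model))

        def forward(self, x):                     # x: (batch, seq_len, d_model)
            x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
            return x + self.channel_mlp(self.norm2(x))

    class BiDirectionalMLPMixer(nn.Module):
        def __init__(self, seq_len, d_model):
            super().__init__()
            self.forward_mixer = MixerBlock(seq_len, d_model)
            self.backward_mixer = MixerBlock(seq_len, d_model)

        def forward(self, x):
            fwd = self.forward_mixer(x)
            bwd = self.backward_mixer(torch.flip(x, dims=[1]))  # reversed in time
            return fwd + torch.flip(bwd, dims=[1])              # re-align and merge

Applied per modality, the layer preserves the (batch, seq_len, d_model) shape, so it can slot in ahead of the cross-attention and fusion stage without any reshaping.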
In this invention, facial features, emotion analysis, gestures, and gaze from the user's conversation video are used as visual features; voice signals extracted from the user's speech are used as audio features; and text and biosignals (electrodermal activity, EDA; blood volume pulse, BVP; body temperature; metabolic equivalent of task, MET) are also used for empathy measurement. The method extracts and combines these multimodal features and then trains a prediction model. To learn the features of each modality efficiently, cross-attention is applied to each modality's features, and the weights of the resulting outputs are trained according to importance within the fusion module. This design allows the model to determine dynamically, during training, which modalities matter more when predicting empathy levels; a hedged sketch of this fusion step follows below.

In one embodiment of the present invention, a model is proposed for measuring empathy levels through conversations between users in an online environment. Image features were extracted using FaceNet, FER, pose, and gaze models; voice features were extracted using the pre-trained HuBERT and Wav2Vec2.0 models; text features were extracted using DistilKoBERT; and biosignals were embedded using a sequence-based embedding technique. A Bi-directional MLP-Mixer model was used for the video and audio features, while a Transformer model was used for the text and biosignal features. The model was designed to learn, during training, which modalities play a more significant role in predicting empathy levels during the fusion process.
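
A hedged sketch of the cross-attention and weight-based fusion step described above: each modality queries the concatenation of the others, a scalar importance score is computed from the attended features, and the fused representation is the softmax-weighted sum. The scoring head (a single linear layer) and all dimensions are assumptions; the patent fixes only the overall scheme.

    # Hedged sketch: per-modality cross-attention followed by dynamically
    # learned importance weights; the scoring head is an assumed design.
    import torch
    import torch.nn as nn

    class WeightBasedFusion(nn.Module):
        def __init__(self, d_model=256, n_modalities=4, num_heads=4):
            super().__init__()
            self.attn = nn.ModuleList(
                nn.MultiheadAttention(d_model, num_heads, batch_first=True)
                for _ in range(n_modalities))
            self.score = nn.Linear(d_model, 1)    # importance score per modality

        def forward(self, feats):                 # list of (batch, seq, d_model)
            pooled = []
            for i, attn in enumerate(self.attn):
                others = torch.cat(
                    [f for j, f in enumerate(feats) if j != i], dim=1)
                out, _ = attn(feats[i], others, others)  # modality i queries rest
                pooled.append(out.mean(dim=1))           # (batch, d_model)
            stack = torch.stack(pooled, dim=1)           # (batch, n_mod, d_model)
            w = torch.softmax(self.score(stack), dim=1)  # dynamic modality weights
            return (w * stack).sum(dim=1)                # weighted fusion vector

Because the weights are recomputed from the attended features of each input, the relative contribution of video, voice, text, and biosignals can shift from sample to sample, which is the dynamic-fusion behavior the description attributes to the weight-based fusion module.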