CN-122024014-A - Emotional state dynamic monitoring system and method based on space-time characteristics of video images

CN122024014A

Abstract

The invention relates to a system and method for dynamically monitoring emotional states, in particular to one based on the spatio-temporal characteristics of video images. It solves the technical problems of existing emotional state monitoring systems and methods: a single recognition dimension, poor environmental adaptability, and a lack of time-series modeling, which result in low emotion recognition accuracy and make the long-term evolution trend of the emotional state difficult to capture effectively. The invention uses an emotional state recognition module to recognize and quantify the emotional state to obtain the emotion intensity, and combines this with a time-series modeling and trend analysis module to obtain an emotion duration score and an emotion intensity trend, so that the real emotional state can be comprehensively reflected. A spatio-temporal feature extraction module separately extracts the spatial and temporal features of the facial video data, the time-series modeling captures the temporal evolution of the emotional state, and evaluating environmental factors and adaptively adjusting the weight of each frame of the facial video data ensure the accuracy and stability of emotion recognition.

Inventors

  • WANG YUQI
  • WANG CHUO
  • SUN CHAO
  • FAN QI
  • DANG RUOCHEN
  • HU BINGLIANG

Assignees

  • Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences (中国科学院西安光学精密机械研究所)

Dates

Publication Date
2026-05-12
Application Date
2026-04-14

Claims (10)

  1. An emotional state dynamic monitoring system based on the spatio-temporal characteristics of video images, characterized by comprising a video acquisition and preprocessing module, a spatio-temporal feature extraction module, an adaptive fusion module, an emotional state recognition module, a time-series modeling and trend analysis module, and a risk assessment and early warning module, connected in sequence; the video acquisition and preprocessing module is used for acquiring and preprocessing facial video data of a monitored object, and the spatio-temporal feature extraction module is used for extracting spatial features and temporal features respectively; the output end of the adaptive fusion module is connected respectively to the input end of the emotional state recognition module and the first input end of the time-series modeling and trend analysis module, and the module is used for dynamically adjusting the fusion weights of the spatial and temporal features and fusing them to obtain spatio-temporal fusion features; the output end of the emotional state recognition module is connected respectively to the second input end of the time-series modeling and trend analysis module and the first input end of the risk assessment and early warning module, and the module is used for recognizing the emotional state from the spatio-temporal fusion features and quantifying it to obtain the emotion intensity; the output end of the time-series modeling and trend analysis module is connected to the second input end of the risk assessment and early warning module and is used for analyzing the temporal evolution of the emotional state from the emotional state and the spatio-temporal fusion features to obtain anomaly information, an emotion duration score, and an emotion intensity trend; the risk assessment and early warning module is used for assessing risk according to the emotion intensity, the anomaly information, the emotion duration score, and the emotion intensity trend, and for issuing an early warning for the monitored object in combination with the emotional state and the anomaly information (a minimal data-flow sketch in Python follows the claims).
  2. The emotional state dynamic monitoring system based on the spatio-temporal characteristics of video images according to claim 1, wherein the spatio-temporal feature extraction module comprises a spatial feature extraction unit and a temporal feature extraction unit; the spatial feature extraction unit adopts an attention-based ResNet-50 network comprising a convolution layer and Block1, Block2, Block3, and Block4 layers connected in sequence, together with a multi-scale feature fusion layer and a global pooling layer; the input end of the convolution layer is connected to the output end of the video acquisition and preprocessing module, the output ends of the Block1 to Block4 layers are each connected to the input end of the multi-scale feature fusion layer, the output end of the multi-scale feature fusion layer is connected to the input end of the global pooling layer, and the output end of the global pooling layer is connected to the first input end of the adaptive fusion module; the Block1 to Block4 layers extract facial spatial features at different spatial scales, the multi-scale feature fusion layer fuses these features according to channel attention and spatial attention, and the global pooling layer performs global pooling on the fused facial spatial features to obtain the spatial features; the temporal feature extraction unit comprises a bidirectional LSTM network, a self-attention network, a position encoding network, a layer normalization network, and a temporal fusion network connected in sequence; the input end of the bidirectional LSTM network is connected to the output end of the video acquisition and preprocessing module and extracts facial temporal features at different time scales; the self-attention network, the position encoding network, and the layer normalization network respectively perform weight assignment, position encoding, and normalization on the feature components of the facial temporal features; and the output end of the temporal fusion network is connected to the second input end of the adaptive fusion module and fuses the facial temporal features of different time scales according to the position encoding and the weights to obtain the temporal features (an illustrative PyTorch sketch of these two extractors follows the claims).
  3. The emotional state dynamic monitoring system based on the spatio-temporal characteristics of video images according to claim 2, wherein the adaptive fusion module comprises an environmental factor evaluation unit, a dynamic weight calculation unit, a quality gating unit, an adaptive spatio-temporal feature fusion unit, and a video feature fusion unit; the input end of the environmental factor evaluation unit is connected to the output end of the global pooling layer, its first output end is connected to the first input end of the dynamic weight calculation unit, and its second output end is connected to the input end of the quality gating unit; the environmental factor evaluation unit evaluates the environmental factors of the facial video data frame by frame according to the spatial features to obtain environmental factor scores, and derives the overall quality of each frame of the facial video data from those scores; the second input end of the dynamic weight calculation unit is connected to the output end of the temporal fusion network, its output end is connected to the first input end of the adaptive spatio-temporal feature fusion unit, and the dynamic weight calculation unit calculates the fusion weights of the spatial and temporal features frame by frame according to the environmental factor scores and the temporal features; the second and third input ends of the adaptive spatio-temporal feature fusion unit are connected respectively to the output ends of the global pooling layer and the temporal fusion network, its output end is connected to the first input end of the video feature fusion unit, and the unit fuses the spatial and temporal features frame by frame according to the fusion weights to obtain a fused feature for each frame of the facial video data; the output end of the quality gating unit is connected to the second input end of the video feature fusion unit and adjusts the weight of each frame of the facial video data according to its overall image quality; the output end of the video feature fusion unit is connected respectively to the input end of the emotional state recognition module and the first input end of the time-series modeling and trend analysis module, and the unit fuses the per-frame fused features of the facial video data according to the frame weights to obtain the spatio-temporal fusion features (see the adaptive fusion sketch after the claims).
  4. The emotional state dynamic monitoring system based on the spatio-temporal characteristics of video images according to claim 3, wherein the emotional state recognition module comprises a multi-label classifier and an emotion intensity regression unit, and the time-series modeling and trend analysis module comprises a time-series modeling unit and a trend prediction unit; the input ends of the multi-label classifier, the emotion intensity regression unit, and the trend prediction unit are each connected to the output end of the video feature fusion unit, and the first and second input ends of the time-series modeling unit are connected respectively to the output end of the video feature fusion unit and the output end of the multi-label classifier; the time-series modeling unit adopts a hidden Markov model, its output end is connected to the second input end of the risk assessment and early warning module, and it identifies anomaly information of the emotional state and produces the emotion duration score, the anomaly information comprising an anomaly flag, an anomaly type, and an anomaly score; the trend prediction unit adopts an ARIMA model, its output end is also connected to the second input end of the risk assessment and early warning module, and it predicts the emotion intensity trend from the spatio-temporal fusion features (an illustrative HMM/ARIMA sketch follows the claims).
  5. The emotional state dynamic monitoring system based on the spatio-temporal characteristics of video images according to claim 4, wherein the risk assessment and early warning module comprises a risk assessment unit, a risk level classification unit, and an intelligent decision early warning unit; the first, second, and third input ends of the risk assessment unit are connected respectively to the output ends of the emotion intensity regression unit, the time-series modeling unit, and the trend prediction unit, its output end is connected to the first input end of the risk level classification unit, and the risk assessment unit assesses risk according to the emotion intensity, the anomaly flag, the anomaly score, the emotion duration score, and the emotion intensity trend to obtain a risk score; the second input end of the risk level classification unit is connected to the output end of the time-series modeling unit, its output end is connected to the first input end of the intelligent decision early warning unit, and the risk level classification unit determines an early warning level according to the anomaly flag, the anomaly score, and the risk score; the second input end of the intelligent decision early warning unit is connected to the output end of the multi-label classifier and its third input end to the output end of the time-series modeling unit, and the unit provides a personalized early warning scheme for the monitored object according to the anomaly type, the early warning level, and the emotional state, and issues the early warning.
  6. A method for dynamically monitoring emotional states based on the spatio-temporal characteristics of video images, using the system according to any one of claims 1 to 5, characterized by comprising the following steps: step 1, the video acquisition and preprocessing module acquires facial video data of the monitored object, preprocesses it, and sends it to the spatio-temporal feature extraction module; step 2, the spatio-temporal feature extraction module extracts the spatial and temporal features of the facial video data respectively and sends them to the adaptive fusion module, which fuses them into spatio-temporal fusion features and sends these to the emotional state recognition module and the time-series modeling and trend analysis module; step 3, the emotional state recognition module recognizes the emotional state from the spatio-temporal fusion features, quantifies it to obtain the emotion intensity, sends the emotional state to the time-series modeling and trend analysis module, and sends the emotional state and the emotion intensity to the risk assessment and early warning module; step 4, the time-series modeling and trend analysis module recognizes anomaly information and emotion duration from the spatio-temporal fusion features and the emotional state, predicts the emotion intensity trend from the spatio-temporal fusion features, converts the emotion duration into an emotion duration score, and sends the anomaly information, the emotion duration score, and the emotion intensity trend to the risk assessment and early warning module; and step 5, the risk assessment and early warning module assesses risk according to the emotion intensity, the anomaly information, the emotion duration score, and the emotion intensity trend, issues an early warning for the monitored object in combination with the emotional state and the anomaly information, and returns to step 1 to continue collecting facial video data until the dynamic monitoring of the emotional state is complete.
  7. The method for dynamically monitoring emotional states based on the spatio-temporal characteristics of video images according to claim 6, wherein in step 4 the anomaly information comprises an anomaly flag, an anomaly type, and an anomaly score; if the anomaly flag indicates no anomaly, the risk in step 5 is assessed by the formula Risk_Score = α·I + β·D + γ·T, where Risk_Score is the risk score, I is the emotion intensity, D is the emotion duration score, T is the emotion intensity trend, and α, β, γ are the weight parameters of I, D, T, respectively; if the anomaly flag indicates an anomaly, the risk in step 5 is assessed by the formula Risk_Score = α·I + β·D + γ·T + η·K, where K is the anomaly score and η is the weight parameter of K (a small numeric example follows the claims).
  8. The method for dynamically monitoring emotional states based on the spatio-temporal characteristics of video images according to claim 6 or 7, wherein step 2 specifically comprises: step 2.1, the spatial feature extraction unit of the spatio-temporal feature extraction module extracts facial spatial features from the facial video data at different spatial scales, fuses them according to channel attention and spatial attention, performs global pooling on the fused facial spatial features to obtain the spatial features, and sends the spatial features to the environmental factor evaluation unit and the adaptive spatio-temporal feature fusion unit of the adaptive fusion module; step 2.2, the temporal feature extraction unit of the spatio-temporal feature extraction module extracts facial temporal features from the facial video data at different time scales, sequentially performs weight assignment, position encoding, and normalization on their feature components, fuses the facial temporal features of different time scales according to the position encoding and the weights to obtain the temporal features, and sends the temporal features to the dynamic weight calculation unit and the adaptive spatio-temporal feature fusion unit of the adaptive fusion module; step 2.3, the environmental factor evaluation unit evaluates the environmental factors of the facial video data frame by frame according to the spatial features to obtain environmental factor scores, derives the overall quality of each frame of the facial video data from those scores, sends the environmental factor scores to the dynamic weight calculation unit, and sends the per-frame overall quality to the quality gating unit; step 2.4, the dynamic weight calculation unit obtains the fusion weights of the spatial and temporal features for each frame of the facial video data according to the environmental factor scores and the temporal features and sends them to the adaptive spatio-temporal feature fusion unit, which fuses the spatial and temporal features frame by frame according to the fusion weights to obtain the fused feature of each frame and sends it to the video feature fusion unit; meanwhile, the quality gating unit adjusts the weight of each frame of the facial video data according to its overall quality and sends the weights to the video feature fusion unit; and step 2.5, the video feature fusion unit fuses the per-frame fused features of the facial video data according to the frame weights to obtain the spatio-temporal fusion features, and sends them to the emotional state recognition module and the time-series modeling and trend analysis module.
  9. The method for dynamically monitoring emotional states based on the spatio-temporal characteristics of video images according to claim 8, wherein in step 1 the preprocessing is specifically: A. image stabilization: a stabilization algorithm based on feature-point matching detects stable feature points via Harris corners, tracks the motion trajectories of the feature points with the Lucas-Kanade optical flow estimation algorithm, estimates the geometric transformation between frames of the facial video data with the RANSAC algorithm, applies an affine transformation according to that geometric transformation, and performs motion compensation by interpolation; B. illumination normalization: a multi-scale Retinex algorithm equalizes the illumination, a contrast-limited adaptive histogram equalization (CLAHE) algorithm enhances local contrast, and gamma correction then adaptively adjusts brightness; C. face detection and tracking: an MTCNN network performs multi-scale face detection, and a Kalman filter stably tracks the facial motion trajectory and refines its bounding boxes; D. key region extraction: the eye, mouth, and eyebrow regions are precisely located with the 68-point landmark method based on the Dlib shape predictor, the corresponding regions of interest are extracted, and scale normalization and geometric correction are applied (an OpenCV sketch of steps A and B follows the claims).
  10. The method for dynamically monitoring emotional states based on the spatio-temporal characteristics of video images according to claim 9, wherein in step 2.3 the environmental factors include illumination intensity, head pose, facial occlusion, and image sharpness; and in step 3 the emotional state includes the emotion category, emotion valence, and emotion activation.
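
A minimal sketch of the module data flow recited in claim 1, assuming Python; the function and argument names are hypothetical stand-ins for the six claimed modules, not the patented implementation.

```python
# One monitoring cycle, wiring stand-ins for the six modules of claim 1
# in the claimed order. Each argument is any callable with this shape.
def monitor_once(acquire, extract, fuse, recognize, analyze, assess):
    frames = acquire()                    # video acquisition and preprocessing
    spatial, temporal = extract(frames)   # spatio-temporal feature extraction
    fused = fuse(spatial, temporal)       # adaptive fusion -> spatio-temporal fusion features
    state, intensity = recognize(fused)   # emotional state recognition + emotion intensity
    anomaly, duration_score, trend = analyze(fused, state)  # time-series modeling and trend analysis
    return assess(intensity, anomaly, duration_score, trend, state)  # risk assessment and early warning
```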
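A sketch of the claim-2 feature extractors, assuming PyTorch and torchvision; the channel widths, attention head count, and the use of channel attention alone (the claimed spatial attention and position encoding network are omitted for brevity) are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class SpatialExtractor(nn.Module):
    """Multi-scale spatial features from ResNet-50 Block1-Block4 (claim 2)."""
    def __init__(self, out_dim=256):
        super().__init__()
        net = resnet50(weights=None)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.blocks = nn.ModuleList([net.layer1, net.layer2, net.layer3, net.layer4])
        # project each scale (256/512/1024/2048 channels) to a common width
        self.proj = nn.ModuleList(nn.Conv2d(c, out_dim, 1) for c in (256, 512, 1024, 2048))
        self.channel_att = nn.Sequential(nn.Linear(out_dim, out_dim), nn.Sigmoid())
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):                          # x: (B, 3, H, W), a batch of frames
        x = self.stem(x)
        feats = []
        for block, proj in zip(self.blocks, self.proj):
            x = block(x)
            f = self.pool(proj(x)).flatten(1)      # per-scale global descriptor
            feats.append(f * self.channel_att(f))  # channel-attention weighting
        return torch.stack(feats).sum(0)           # fused multi-scale spatial feature

class TemporalExtractor(nn.Module):
    """Bidirectional LSTM + self-attention over the frame sequence (claim 2)."""
    def __init__(self, in_dim=256, hidden=128, heads=4):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, bidirectional=True, batch_first=True)
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        self.norm = nn.LayerNorm(2 * hidden)

    def forward(self, seq):                        # seq: (B, T, in_dim) per-frame features
        h, _ = self.lstm(seq)                      # temporal features at each step
        a, _ = self.attn(h, h, h)                  # self-attention weight assignment
        return self.norm(h + a)                    # (B, T, 2*hidden) temporal features
```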
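A sketch of the claim-3 per-frame adaptive fusion, again in PyTorch; it assumes the environmental factor score has already been computed per frame and collapses the dynamic weight calculation, adaptive fusion, and quality gating units into one module for brevity.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Per-frame dynamic weighting plus quality gating (claim 3), simplified."""
    def __init__(self, dim=256):
        super().__init__()
        self.weight_net = nn.Linear(2 * dim + 1, 2)   # -> (spatial, temporal) logits
        self.quality_gate = nn.Linear(1, 1)           # env score -> frame quality logit

    def forward(self, spatial, temporal, env_score):
        # spatial, temporal: (T, dim) per-frame features; env_score: (T, 1) in [0, 1]
        logits = self.weight_net(torch.cat([spatial, temporal, env_score], dim=-1))
        w = logits.softmax(dim=-1)                        # dynamic fusion weights per frame
        fused = w[:, :1] * spatial + w[:, 1:] * temporal  # per-frame fused feature
        q = torch.sigmoid(self.quality_gate(env_score))   # quality gating per frame
        q = q / q.sum(dim=0, keepdim=True)                # normalized frame weights
        return (q * fused).sum(dim=0)                     # video-level spatio-temporal fusion feature
```

With dim=256 this composes directly with the extractor sketch above; env_score stands in for the output of the claimed environmental factor evaluation unit.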
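A sketch of the claim-4 time-series modeling, assuming the hmmlearn and statsmodels libraries; the number of hidden states, the ARIMA order, and the run-length duration heuristic are illustrative choices, not taken from the patent.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM
from statsmodels.tsa.arima.model import ARIMA

def fit_sequence_models(fused_feats, intensity_series):
    """fused_feats: (T, D) spatio-temporal features; intensity_series: (T,) emotion intensity."""
    hmm = GaussianHMM(n_components=4, covariance_type="diag", n_iter=50)
    hmm.fit(fused_feats)
    states = hmm.predict(fused_feats)        # latent emotional regimes per frame
    anomaly_score = -hmm.score(fused_feats)  # low likelihood -> high anomaly score
    # run length of the most recent state, as a crude emotion duration score
    changes = np.flatnonzero(np.diff(states[::-1]) != 0)
    run = changes[0] + 1 if changes.size else len(states)
    duration_score = run / len(states)
    trend = ARIMA(intensity_series, order=(1, 1, 1)).fit()
    forecast = trend.forecast(steps=5)       # predicted emotion intensity trend
    return states, anomaly_score, duration_score, forecast
```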
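The claim-7 risk formulas transcribe directly into code; the weight values below are illustrative assumptions, since the patent does not fix α, β, γ, η.

```python
def risk_score(I, D, T, K=None, alpha=0.4, beta=0.3, gamma=0.3, eta=0.5):
    """Risk_Score = alpha*I + beta*D + gamma*T (+ eta*K when an anomaly is flagged)."""
    score = alpha * I + beta * D + gamma * T
    if K is not None:          # anomaly flag set: include the weighted anomaly score
        score += eta * K
    return score

print(risk_score(I=0.8, D=0.6, T=0.2))         # no anomaly  -> 0.56
print(risk_score(I=0.8, D=0.6, T=0.2, K=0.9))  # anomaly     -> 1.01
```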
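A sketch of preprocessing steps A and B of claim 9, assuming OpenCV; it substitutes a single-scale Retinex approximation for the claimed multi-scale Retinex, and the corner count, CLAHE clip limit, and gamma value are illustrative.

```python
import cv2
import numpy as np

def stabilize(prev_gray, cur_gray, cur_frame):
    """Step A: Harris corners + Lucas-Kanade tracking + RANSAC affine compensation."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200, qualityLevel=0.01,
                                  minDistance=7, useHarrisDetector=True)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, pts, None)
    good_prev = pts[status.ravel() == 1]
    good_next = nxt[status.ravel() == 1]
    M, _ = cv2.estimateAffine2D(good_next, good_prev, method=cv2.RANSAC)  # geometric transform
    if M is None:                                  # too few inliers: leave the frame as-is
        return cur_frame
    h, w = cur_frame.shape[:2]
    return cv2.warpAffine(cur_frame, M, (w, h), flags=cv2.INTER_LINEAR)   # motion compensation

def normalize_illumination(frame, gamma=0.8):
    """Step B: Retinex-style equalization + CLAHE + gamma correction."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32) + 1.0
    retinex = np.log(gray) - np.log(cv2.GaussianBlur(gray, (0, 0), 15))   # reflectance estimate
    retinex = cv2.normalize(retinex, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))           # local contrast
    eq = clahe.apply(retinex)
    lut = ((np.arange(256) / 255.0) ** gamma * 255).astype(np.uint8)      # gamma correction
    return cv2.LUT(eq, lut)
```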

Description

Emotional state dynamic monitoring system and method based on space-time characteristics of video images

Technical Field

The invention relates to a system and method for dynamically monitoring emotional states, in particular to one based on the spatio-temporal characteristics of video images.

Background

With the continuously rising safety requirements in fields such as transportation, industrial production, and intelligent driving, real-time and accurate monitoring of emotional states has gradually become a key technical requirement for guaranteeing operational safety and avoiding accidents. Taking transportation as an example, with the rapid growth of vehicle ownership and the development of intelligent driving technology, driver state monitoring has become an important means of improving road traffic safety. According to statistics, more than 30% of traffic accidents are caused by abnormal driver emotions, including road rage, fatigued driving, and distraction; accurately identifying the driver's emotional state is therefore of great significance for preventing traffic accidents and improving driving safety.

Existing emotional state monitoring systems mainly have the following problems. First, the recognition dimension is single: most systems can only recognize basic discrete emotion categories, such as happiness, grief, and fear, and lack quantitative analysis of multidimensional information such as emotion intensity, duration, and trend, so the real emotional state cannot be comprehensively reflected. Second, environmental adaptability is poor: driving environments involve drastic illumination changes, vibration interference, multitasking, and other particularities, under which the recognition accuracy of existing systems drops sharply, limiting their practicality. Third, time-series modeling is lacking: existing systems mostly rely on single-frame images or simple feature fusion, ignore the temporal characteristics of the emotional state, and cannot capture the dynamic process of emotional change. Fourth, risk assessment is insufficient: existing systems lack a mechanism that associates the emotional state with behavioral risk and therefore cannot effectively support safety early warning.

In addition, conventional emotion recognition methods rely primarily on static facial expression analysis, classifying by extracting facial key-point features or using deep learning networks. These methods face many challenges in the driving environment. On the one hand, unlike the obvious expressions of daily life, a driver's facial expressions while driving tend to be subtle, and conventional methods have difficulty capturing such subtle changes accurately. On the other hand, illumination changes, head movement, facial occlusion, and other factors during vehicle operation seriously degrade recognition accuracy, so methods based on static images are insufficiently reliable.
In recent years, some researchers have begun to fuse multi-modal information to improve emotion recognition accuracy, for example by combining facial expressions with physiological signals. However, these methods usually require additional sensor devices, which increase system complexity and cost, and in practice they are invasive and poorly accepted by users. Moreover, most existing multi-modal fusion methods use simple feature concatenation or weighted averaging, lack deep modeling of the temporal relations between modalities, and achieve only limited fusion effect. As for time-series modeling, although some researchers have begun to exploit the temporal information in video sequences, most methods remain limited to short-sequence analysis, cannot effectively capture the long-term evolution trend of emotional states, and do not account for the particularities of the driving environment, so no algorithm architecture has been designed specifically for driving scenarios.

Disclosure of Invention

The invention aims to solve the technical problems that existing emotional state monitoring systems and methods have a single recognition dimension, poor environmental adaptability, and no time-series modeling, so that emotion recognition accuracy is low and the long-term evolution trend of the emotional state is difficult to capture effectively. To achieve the above purpose, the invention adopts the following technical scheme: an emotional state dynamic monitoring system based on the spatio-temporal characteristics of the video images