CN-121982682-A - Causal consistency regularization-driven driver emotional state attribution method


Abstract

The invention discloses a causal-consistency-regularization-driven driver emotional state attribution method and device, relating to the technical field of computer vision. The method comprises: obtaining a multi-source video sequence from inside and outside a vehicle and dividing it, by time-sequence slicing and region division, into an external environment sequence and an in-vehicle driver environment sequence; extracting spatio-temporal joint features with a shared three-dimensional convolutional classification neural network; and designing a causal consistency constraint module based on do-intervention, which applies a structure-neutral intervention to the external environment subsequence and the in-vehicle driver environment subsequence at the feature level, constructs counterfactual samples, and establishes a KL divergence constraint against the original prediction so as to weaken the influence of irrelevant interfering features on the classification result. The trained three-dimensional convolutional classification neural network is obtained by jointly optimizing the cross-entropy loss and the intervention consistency loss. The method and device can improve the accuracy of driver emotional state classification.
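
As a rough orientation, the joint objective summarized above (cross-entropy loss plus intervention-consistency loss) can be written as a short PyTorch sketch. This is a minimal sketch under stated assumptions: the names joint_loss, lam and tau, the direction of the KL divergence, and the use of temperature scaling are all illustrative choices, not the patent's reference implementation.

    import torch
    import torch.nn.functional as F

    def joint_loss(logits, logits_do, labels, lam=0.1, tau=2.0):
        """Cross entropy on the original prediction plus a temperature-scaled
        KL consistency term against the counterfactual (intervened) prediction."""
        ce = F.cross_entropy(logits, labels)
        log_p = F.log_softmax(logits / tau, dim=1)    # original prediction (log-probs)
        p_do = F.softmax(logits_do / tau, dim=1)      # intervened prediction
        kl = F.kl_div(log_p, p_do, reduction="batchmean")  # KL(p_do || p); direction assumed
        # tau**2 compensates the gradient scale introduced by temperature scaling
        return ce + lam * (tau ** 2) * kl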

Inventors

  • CHEN JIANSHENG
  • WANG JIAJU
  • WU JIEHUI
  • LIU SIQI
  • NI JUAN
  • LUO QIFENG
  • ZHENG LIHAO
  • MA HUIMIN

Assignees

  • University of Science and Technology Beijing (北京科技大学)

Dates

Publication Date
2026-05-05
Application Date
2025-12-31

Claims (10)

  1. A causal-consistency-regularization-driven driver emotional state attribution method, the method comprising: S1, constructing a causally inspired driver emotional state attribution classification framework, wherein the framework comprises a shared three-dimensional convolutional classification neural network, an intervention module and a causal consistency module; S2, acquiring a multi-source video sequence from inside and outside the vehicle, and dividing the multi-source video sequence into an external environment sequence and an in-vehicle driver environment sequence by a time-sequence slicing method and a region division method; S3, inputting the external environment sequence, the in-vehicle driver environment sequence and the obtained driver emotion attribution labels into the shared three-dimensional convolutional classification neural network, outputting un-normalized scores, computing a classification probability distribution from the un-normalized scores through a softmax function, and constructing a cross-entropy loss function based on the classification probability distribution; S4, inputting the in-vehicle driver environment sequence and the external environment sequence into the intervention module, and constructing intervention features corresponding to all the sequences by a neutralization operator; constructing an external intervention sample based on the in-vehicle driver environment sequence through the intervention features corresponding to the external environment sequence, and constructing an internal intervention sample based on the external environment sequence through the intervention features corresponding to the in-vehicle driver environment sequence; S5, inputting the external intervention sample and the external environment sequence into the shared three-dimensional convolutional classification neural network to output a prediction distribution corresponding to the external intervention sample, and inputting the in-vehicle driver environment sequence and the internal intervention sample into the shared three-dimensional convolutional classification neural network to output a prediction distribution corresponding to the internal intervention sample; S6, inputting the prediction distribution corresponding to the internal intervention sample and the prediction distribution corresponding to the external intervention sample into the causal consistency module, and constructing a KL divergence loss function; S7, constructing a joint optimization loss function based on the KL divergence loss function and the cross-entropy loss function, and training the shared three-dimensional convolutional classification neural network based on the joint optimization loss function to obtain a trained three-dimensional convolutional classification neural network; S8, acquiring a multi-source video stream within a time window in real time, performing time synchronization and alignment on the in-vehicle video and the out-of-vehicle video to form a multi-source video sequence corresponding to the same moments, inputting the multi-source video sequence into the intervention module to construct an external intervention sample and an internal intervention sample, inputting the external intervention sample, the internal intervention sample and the multi-source video sequence into the trained three-dimensional convolutional classification neural network to obtain un-normalized scores, and computing and outputting emotion classification probabilities through softmax.
  2. The causal-consistency-regularization-driven driver emotional state attribution method according to claim 1, wherein in S4 the in-vehicle driver environment sequence and the external environment sequence are input into the intervention module, and the intervention features corresponding to all the sequences are constructed by the neutralization operator (a minimal code sketch of this operator is given after the claims), as represented by the following formulas (1)-(2):

     $\bar{F}_s = \frac{1}{T}\sum_{t=1}^{T} F_s[:,:,t,:,:]$, with $F_s \in \mathbb{R}^{B \times C \times T \times H \times W}$  (1)

     $\tilde{F}_s[:,:,t,:,:] = \bar{F}_s$, for $t = 1, \dots, T$  (2)

     wherein $\bar{F}_s$ represents the average frame feature of subsequence $F_s$ in the time dimension; $T$ represents the number of time frames; $B$ represents the batch size; $C$ represents the number of channels; $H$ represents the height of the spatial resolution; $W$ represents the width of the spatial resolution; and $\tilde{F}_s$ represents the intervention feature of subsequence $s$.
  3. The causal-consistency-regularization-driven driver emotional state attribution method according to claim 1, wherein the external intervention sample is constructed based on the in-vehicle driver environment sequence through the intervention features corresponding to the external environment sequence, as represented by the following formula (3):

     $X^{do}_{\mathrm{ext}} = (\tilde{F}_{\mathrm{ext}},\, F_{\mathrm{int}})$, with $\tilde{F}_{\mathrm{ext}} = \mathcal{N}(F_{\mathrm{ext}})$  (3)

     and the internal intervention sample is constructed based on the external environment sequence through the intervention features corresponding to the in-vehicle driver environment sequence, as represented by the following formula (4):

     $X^{do}_{\mathrm{int}} = (F_{\mathrm{ext}},\, \tilde{F}_{\mathrm{int}})$, with $\tilde{F}_{\mathrm{int}} = \mathcal{N}(F_{\mathrm{int}})$  (4)

     wherein $X^{do}_{\mathrm{ext}}$ represents the external intervention sample; $X^{do}_{\mathrm{int}}$ represents the internal intervention sample; $F_{\mathrm{ext}}$ represents the external environment subsequence; $F_{\mathrm{int}}$ represents the in-vehicle driver environment subsequence; $\tilde{F}_{\mathrm{ext}}$ represents the intervention feature corresponding to the external environment sequence; $\tilde{F}_{\mathrm{int}}$ represents the intervention feature corresponding to the in-vehicle driver environment sequence; and $\mathcal{N}(\cdot)$ represents the neutralization operator.
  4. The causal-consistency-regularization-driven driver emotional state attribution method of claim 1, wherein the cross-entropy loss function is represented by the following formula (5):

     $\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N}\sum_{i=1}^{N} \log p_\theta(y_i \mid X_i)$  (5)

     wherein $\mathcal{L}_{\mathrm{CE}}$ represents the cross-entropy loss; $N$ represents the number of training samples; $p_\theta(\cdot \mid X_i)$ represents the output classification probability distribution; $y_i$ represents the driver emotion attribution label; $X_i$ represents the multi-source video sequence from inside and outside the vehicle; and $\theta$ represents the parameters of the shared three-dimensional convolutional classification neural network.
  5. The causal-consistency-regularization-driven driver emotional state attribution method according to claim 1, wherein in S5 the process of inputting the external intervention sample and the external environment sequence into the shared three-dimensional convolutional classification neural network to output the prediction distribution corresponding to the external intervention sample, and inputting the in-vehicle driver environment sequence and the internal intervention sample into the shared three-dimensional convolutional classification neural network to output the prediction distribution corresponding to the internal intervention sample, is represented by the following formula (6):

     $p = \mathrm{softmax}(z / \tau)$, $\quad p^{do} = \mathrm{softmax}(z^{do} / \tau)$  (6)

     wherein $p$ represents the probability distribution corresponding to the multi-source video sequence from inside and outside the vehicle; $z$ represents the un-normalized output corresponding to the multi-source video sequence from inside and outside the vehicle; $\tau$ represents the temperature parameter; $p^{do}$ represents the probability distribution corresponding to the external intervention sample or the internal intervention sample; and $z^{do}$ represents the un-normalized output corresponding to the external intervention sample or the internal intervention sample.
  6. The causal-consistency-regularization-driven driver emotional state attribution method according to claim 1, wherein the process of constructing the KL divergence loss function is represented by the following formula (7):

     $\mathcal{L}_{\mathrm{KL}} = \mathbb{E}\left[\mathrm{KL}(p^{do} \,\|\, p)\right]$  (7)

     wherein $\mathcal{L}_{\mathrm{KL}}$ represents the KL divergence loss; $p^{do}$ represents the probability distribution corresponding to the external intervention sample or the internal intervention sample; $p$ represents the probability distribution corresponding to the multi-source video sequence from inside and outside the vehicle; and $\mathbb{E}[\cdot]$ represents the expectation over the intervention samples.
  7. The causal-consistency-regularization-driven driver emotional state attribution method of claim 1, wherein the joint optimization loss function is represented by the following formula (8):

     $\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda\,\tau^{2}\,\mathcal{L}_{\mathrm{KL}}$  (8)

     wherein $\mathcal{L}$ represents the joint optimization loss; $\mathcal{L}_{\mathrm{CE}}$ represents the cross-entropy loss; $\lambda$ represents the causal regularization weight, which balances classification accuracy against the causal consistency constraint; $\tau^{2}$ represents the coefficient compensating for the effect of temperature scaling on the gradient scale; and $\mathcal{L}_{\mathrm{KL}}$ represents the KL divergence loss.
  8. A causal-consistency-regularization-driven driver emotional state attribution device for implementing the causal-consistency-regularization-driven driver emotional state attribution method of any one of claims 1-7, the device comprising: a first construction unit, configured to construct a causally inspired driver emotional state attribution classification framework comprising a shared three-dimensional convolutional classification neural network, an intervention module and a causal consistency module; a dividing unit, configured to acquire a multi-source video sequence from inside and outside a vehicle and divide it into an external environment sequence and an in-vehicle driver environment sequence by a time-sequence slicing method and a region division method; a second construction unit, configured to input the external environment sequence, the in-vehicle driver environment sequence and the obtained driver emotion attribution labels into the shared three-dimensional convolutional classification neural network, output un-normalized scores, compute a classification probability distribution through a softmax function, and construct a cross-entropy loss function; a third construction unit, configured to input the in-vehicle driver environment sequence and the external environment sequence into the intervention module, construct intervention features corresponding to all the sequences by the neutralization operator, and construct an external intervention sample and an internal intervention sample; a first output unit, configured to input the external intervention sample and the external environment sequence into the shared three-dimensional convolutional classification neural network and output a prediction distribution corresponding to the external intervention sample; a fourth construction unit, configured to input the prediction distribution corresponding to the internal intervention sample and the prediction distribution corresponding to the external intervention sample into the causal consistency module and construct a KL divergence loss function; a training unit, configured to construct a joint optimization loss function based on the KL divergence loss function and the cross-entropy loss function, and train the shared three-dimensional convolutional classification neural network based on the joint optimization loss function to obtain a trained three-dimensional convolutional classification neural network; and a second output unit, configured to acquire a multi-source video stream within a time window in real time, perform time synchronization and alignment on the in-vehicle video and the out-of-vehicle video to form a multi-source video sequence corresponding to the same moments, input the multi-source video sequence into the intervention module to construct an external intervention sample and an internal intervention sample, input the external intervention sample, the internal intervention sample and the multi-source video sequence into the trained three-dimensional convolutional classification neural network to obtain un-normalized scores, and compute and output emotion classification probabilities through softmax.
  9. A causal-consistency-regularization-driven driver emotional state attribution device, comprising: a processor; and a memory having stored thereon computer-readable instructions which, when executed by the processor, implement the method of any one of claims 1 to 7.
  10. A computer-readable storage medium having stored therein program code which is callable by a processor to perform the method of any one of claims 1 to 7.
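
The neutralization operator of claim 2, the intervention samples of claim 3 and the temperature-scaled consistency of claims 5-6 can be illustrated with a minimal PyTorch sketch. It rests on stated assumptions rather than the patent's reference implementation: sequences are laid out as (B, C, T, H, W) tensors, the two sub-sequences are fused by channel concatenation before entering the shared network, and every name (neutralize, make_interventions, consistency_loss, model) is hypothetical.

    import torch
    import torch.nn.functional as F

    def neutralize(feat: torch.Tensor) -> torch.Tensor:
        """Formulas (1)-(2): temporal mean frame, broadcast back over the T axis."""
        mean_frame = feat.mean(dim=2, keepdim=True)        # (B, C, 1, H, W)
        return mean_frame.expand_as(feat)                  # (B, C, T, H, W)

    def make_interventions(x_ext: torch.Tensor, x_int: torch.Tensor):
        """Formulas (3)-(4): neutralize one sub-sequence, keep the other intact."""
        x_do_ext = torch.cat([neutralize(x_ext), x_int], dim=1)  # external intervention sample
        x_do_int = torch.cat([x_ext, neutralize(x_int)], dim=1)  # internal intervention sample
        return x_do_ext, x_do_int

    def consistency_loss(model, x_ext, x_int, tau=2.0):
        """Formulas (6)-(7): KL divergence between the temperature-scaled
        prediction for each intervention sample and the original prediction."""
        x = torch.cat([x_ext, x_int], dim=1)               # original multi-source input
        x_do_ext, x_do_int = make_interventions(x_ext, x_int)
        log_p = F.log_softmax(model(x) / tau, dim=1)       # original prediction (log-probs)
        kl = 0.0
        for x_do in (x_do_ext, x_do_int):
            p_do = F.softmax(model(x_do) / tau, dim=1)
            kl = kl + F.kl_div(log_p, p_do, reduction="batchmean")  # KL(p_do || p)
        return kl / 2                                      # average over both interventions

Because both views pass through the same shared network, the KL term penalizes predictions that change when a sub-sequence is neutralized, steering the classifier away from the interfering source.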

Description

Causal consistency regularization-driven driver emotional state attribution method

Technical Field

The invention relates to the technical field of computer vision, and in particular to a causal-consistency-regularization-driven driver emotional state attribution method and device.

Background

Videos captured in driving scenes contain rich spatio-temporal and semantic information across different sensors and acquisition modes. The cabin camera captures the driver's facial expressions, gaze, head posture and upper-body actions in detail, while the exterior (road/front-view) camera provides external event cues consistent with human driving perception, such as a leading vehicle braking, pedestrians crossing, and changes in road geometry. Beyond visible light, automotive-grade near-infrared thermal imaging and depth sensing can stably acquire key features under weak-light and strong-backlight conditions, while the microphone and the vehicle body bus (e.g., steering angle, vehicle speed and brake CAN signals) provide complementary behavioral and dynamic evidence for judging the driving state. In the field of intelligent cockpits and active safety, driver emotional state recognition is one of the core capabilities of a Driver Monitoring System (DMS). Unlike plain emotion classification, emotional state attribution further requires the model to answer both "what state" and "what caused it", i.e., the discrimination must be tied to a specific source: for example, whether the state is triggered by the driver's own factors or induced by external traffic events. To realize intuitive and reliable attribution outputs, deep learning methods have in recent years gradually replaced traditional hand-crafted feature schemes: three-dimensional convolutional networks, spatio-temporal transformers, dual-stream/multi-stream structures and encoder-decoder modules have been widely introduced to adaptively extract discriminative elements across time, space and channels from raw video. Meanwhile, emotion attribution is often linked with fatigue or distraction detection, risk assessment and downstream active intervention strategies, so as to comprehensively assess and exploit the practical value of model outputs. Current driver emotion attribution frameworks face several challenges. (1) Loss of causal dependency in the feature extraction stage. Existing networks typically accomplish feature representation through cascaded or parallel space-time-channel modules, which amounts to mapping and fusion between different representation domains. Such architectures tend to ignore the causal dependencies between in-cabin and out-of-cabin sources and lack a mechanism for explicit intervention on a specific source, causing models to depend erroneously on lighting, texture or background-motion confounders; as a result, biases appear in attribution, and key micro-expressions and fine-grained motion cues are easily weakened during fusion. (2) Insufficient handling of source heterogeneity and insufficient robustness in training strategies. To train a generic classifier, it is common today to feed frames or modalities from different scenes and different sources into the network simultaneously and to optimize end to end with cross entropy, without explicitly distinguishing the cause-effect structural differences arising from source heterogeneity.
Such schemes easily become unstable when the camera position, time period or domain shifts; compounded with class imbalance, weak labels and environmental distribution drift in real data, the model is more prone to learning easily obtained background cues instead of genuinely causative signals, which undermines the credibility and interpretability of attribution. In recent years, the wide application of deep learning in intelligent cockpits and driving behavior analysis has greatly driven the shift of driver state recognition from "hand-crafted features + traditional classifier" to end-to-end spatio-temporal modeling. Methods represented by convolutional neural networks and spatio-temporal networks can automatically extract discriminative features under complex illumination, occlusion and dynamic backgrounds, remarkably improving classification accuracy and deployment efficiency. Meanwhile, with the popularization of automotive-grade cameras, infrared and multi-sensor data, researchers have begun to explore the joint use of in-cabin (driver face, head posture and gaze) and out-of-cabin (road conditions, traffic participants and weather) information, providing richer contextual cues for emotional state judgment. At the video level, existing methods can be broadly divided into two categories: one is the two-stage scheme of "frame-level appearance + temporal aggregation", such as extracting appearance features with a 2D CNN and then aggregating them with an LSTM or temporal pooling; the other is end-to-end spatio-temporal modeling, such as three-dimensional convolutional networks and spatio-temporal transformers. A minimal sketch of the first, two-stage category is given below.
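
For concreteness, the two-stage scheme can be sketched as follows, assuming torchvision's ResNet-18 as the 2D appearance extractor, temporal average pooling as the aggregator, and a clip layout of (B, T, C, H, W); the class count and all names are illustrative assumptions, not taken from the patent.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet18

    class TwoStageVideoClassifier(nn.Module):
        """Frame-level appearance features from a 2D CNN, aggregated over time."""

        def __init__(self, num_classes: int = 4):
            super().__init__()
            backbone = resnet18(weights=None)
            backbone.fc = nn.Identity()            # keep the 512-d frame feature
            self.backbone = backbone
            self.head = nn.Linear(512, num_classes)

        def forward(self, clip: torch.Tensor) -> torch.Tensor:
            b, t, c, h, w = clip.shape
            frames = clip.reshape(b * t, c, h, w)  # fold time into the batch
            feats = self.backbone(frames).reshape(b, t, -1)
            return self.head(feats.mean(dim=1))    # temporal average pooling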