US-20260127897-A1 - Multistage Audio-Visual Automotive Cab Monitoring
Abstract
Described is a system for an automobile interior having at least one subject that creates a video input, an audio input, and a context descriptor input. The video input relates to the at least one subject and is processed by a face detection module and a facial point registration module to produce a first output. The first output is further processed by at least one of: a facial point tracking module, a head orientation tracking module, a body tracking module, a social gaze tracking module, and an action unit intensity tracking module. The audio input relating to the at least one subject is processed by a valence and arousal affect states tracking module to produce a second output and a valence and arousal scores output. A temporal behavior primitives buffer produces a temporal behavior output. Based on the foregoing, a mental state prediction module predicts the mental state of the at least one subject in the automobile interior.
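For orientation, the two-stage dataflow described in the abstract can be sketched as follows. This is a minimal illustration, not the disclosed implementation: all module names, signatures, and the random stubs are assumptions, since the abstract names the stages but not their interfaces.

```python
"""Hypothetical sketch of the pipeline named in the abstract.

Assumptions (not in the source): every function body, the buffer
length, and the toy decision rule at the end.
"""
from collections import deque
import random

class TemporalBehaviorBuffer:
    """Fixed-length buffer of per-frame behavior primitives."""
    def __init__(self, maxlen=50):
        self.frames = deque(maxlen=maxlen)

    def push(self, primitives):
        self.frames.append(primitives)

    def snapshot(self):
        return list(self.frames)

def detect_and_register_face(video_frame):
    # First output: a detected face box plus registered facial points (stubbed).
    return {"face_box": (0, 0, 64, 64), "points": [(0.0, 0.0)] * 68}

def track_behavior_primitives(first_output):
    # At least one of: facial point, head orientation, body, social gaze,
    # and action unit intensity tracking (stubbed as random values).
    return {"head_yaw": random.uniform(-30, 30),
            "au_intensity": random.random(),
            "gaze_on_driver": random.random() > 0.5}

def track_valence_arousal(audio_frame):
    # Second output: valence and arousal scores from the audio input (stubbed).
    return {"valence": random.uniform(-1, 1), "arousal": random.uniform(-1, 1)}

def predict_mental_state(temporal_behavior, va_scores, context):
    # Toy rule: only arousal drives the label here; the disclosed module
    # would combine the buffered behavior, the V/A scores, and the context.
    return "agitated" if va_scores["arousal"] > 0.5 else "calm"

buffer = TemporalBehaviorBuffer(maxlen=50)
for _ in range(50):                      # one video frame per iteration
    first = detect_and_register_face(video_frame=None)
    buffer.push(track_behavior_primitives(first))
va = track_valence_arousal(audio_frame=None)
print(predict_mental_state(buffer.snapshot(), va, context={"scene": "motorway"}))
```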
Inventors
- Michel François VALSTAR
- Anthony Brown
- Timur ALMAEV
- Thomas James Smith
- Tze Ee Yong
- Mani Kumar Tellamekala
Assignees
- BLUESKEYE AI LTD
Dates
- Publication Date: 2026-05-07
- Application Date: 2025-12-31
Claims (11)
- 1-6. (canceled)
- 7. A system comprising: a camera for an automobile interior having at least one subject that creates a video input; an extractor for extracting facial features data relating to the at least one subject from the video input; wherein the facial features data is processed by a recurrent neural network to produce predictions related to which of the at least one subject created a sound of interest.
- 8. The system as in claim 7, wherein the facial features data comprise facial muscular actions.
- 9. The system as in claim 8, wherein the facial muscular actions comprise movement of lips.
- 10. The system as in claim 7, wherein the facial features data comprise geometric facial actions.
- 11. The system as in claim 10, wherein the facial features data comprise geometric facial actions.
- 12. The system as in claim 11, wherein the geometric facial actions comprise movements of lips and a nose.
- 13. The system as in claim 7, further comprising: a trainer to train the recurrent neural network on temporal relationships between the sound of interest and facial appearance over a specified time window via videos of facial muscular actions.
- 14. The system as in claim 13, wherein the videos of facial muscular actions have between 15 and 30 frames per second.
- 15. The system as in claim 13, wherein the recurrent neural network does not use audio input to produce the predictions.
- 16-29. (canceled)
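Claims 7-15 recite a recurrent neural network that predicts, from facial features alone, which subject created a sound of interest. A minimal sketch of one such arrangement follows; it assumes PyTorch, a GRU, a 136-dimensional facial point vector per frame (68 landmarks × 2), and a 25-frame window consistent with the 15-30 fps range of claim 14. None of these specifics appear in the claims.

```python
"""Minimal sketch of the recurrent arrangement in claims 7-15.

Assumptions (not in the source): PyTorch, a single-layer GRU, the
feature dimension, and the window length.
"""
import torch
import torch.nn as nn

class VisualVoiceActivityRNN(nn.Module):
    def __init__(self, feat_dim=136, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # per-subject speaking score

    def forward(self, facial_features):
        # facial_features: (batch, frames, feat_dim) -- geometric facial
        # actions such as lip and nose movements; no audio is used,
        # consistent with claim 15.
        _, h = self.rnn(facial_features)
        return torch.sigmoid(self.head(h[-1]))  # P(subject is speaking)

model = VisualVoiceActivityRNN()
window = torch.randn(2, 25, 136)   # two subjects, 25-frame window each
scores = model(window).squeeze(-1)
speaker = int(scores.argmax())     # which subject created the sound of interest
print(scores.tolist(), "->", speaker)
```

Per claim 13, such a network would be trained on videos of facial muscular actions so that it learns the temporal relationship between facial appearance and the sound of interest over the specified window.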
Description
PRIOR APPLICATIONS

This application claims the benefit of the following application, which is incorporated by reference in its entirety: U.S. Provisional Patent Application No. 63/370,840, filed on Aug. 9, 2022.

FIELD OF THE DISCLOSURE

The present disclosure relates generally to improved techniques for monitoring audio-visual activity in automotive cabs.

BACKGROUND

Monitoring drivers is necessary for safety and regulatory reasons. In addition, passenger behavior monitoring is becoming more important to improve user experience and to provide new features such as health- and well-being-related functions.

Automotive cabins are a unique multi-occupancy environment that poses a number of challenges when monitoring human behavior. These challenges include:
- significant visual noise caused by rapidly changing and varied lighting conditions;
- significant audio noise from the road, radios, and open windows;
- suboptimal camera angles that lead to frequent occlusion and extreme head pose; and
- multi-occupancy, which can lead to confusion about the source of audio signals or the potential focus of attention.

Current in-cab monitoring solutions, however, rely solely on visual monitoring via cameras and are focused on driver safety monitoring. As such, these systems are limited in their accuracy and capability. A more sophisticated system is needed for in-cab monitoring and reporting.

SUMMARY

This disclosure proposes a confidence-aware, stochastic-process-regression-based audio-visual fusion approach to in-cab monitoring. It assesses the occupant's mental state in two stages. First, it determines the expressed face, voice, and body behaviors as they can be readily observed. Second, it determines the most plausible cause for this expressive behavior, or provides a short list of potential causes, each with a probability that it was the root cause of the expressed behavior. The multistage audio-visual approach disclosed herein significantly improves accuracy and enables new capabilities not possible with a visual-only approach in an in-cab environment.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying figures, where like reference numerals refer to identical or functionally similar elements throughout the separate views, together with the detailed description below, are incorporated in and form part of the specification, and serve to further illustrate embodiments of concepts that include the claimed invention and explain various principles and advantages of those embodiments.

FIG. 1 shows an architecture of inputs and outputs for an in-cab temporal behavior pipeline.
FIG. 2 shows an overview of a structure of a Visual Voice Activity Detection model.
FIG. 3 shows the accuracy of a Visual Voice Activity Detection model.
FIG. 4 shows a comparison of a 1-second buffer and a 2-second buffer of a Visual Voice Activity Detection model.
FIG. 5 shows a comparison of F1, precision, recall, and accuracy for a Visual Voice Activity Detection model and an Audio Voice Activity Detection model.
FIG. 6 shows a block diagram of a confidence-aware audio-visual fusion model.
FIGS. 7A, 7B, and 7C show evidence of improved accuracy and reduced false positive rate for a noise-aware audio-visual fusion technique.

Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help improve understanding of embodiments of the present invention.
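To make the Summary's two-stage idea concrete, the sketch below substitutes simple inverse-variance (precision-weighted) averaging, a common confidence-aware fusion rule, for the stochastic process regression actually proposed; the numbers, the variance values, and the cause list are illustrative only.

```python
"""Illustrative stand-in for the confidence-aware fusion in the Summary.

Assumption: inverse-variance weighting replaces the disclosed
stochastic process regression purely for illustration.
"""

def fuse(visual_mean, visual_var, audio_mean, audio_var):
    # Weight each modality by its confidence (inverse variance), so a
    # noisy channel (occluded face, loud road noise) contributes less.
    w_v, w_a = 1.0 / visual_var, 1.0 / audio_var
    mean = (w_v * visual_mean + w_a * audio_mean) / (w_v + w_a)
    var = 1.0 / (w_v + w_a)
    return mean, var

# Stage 1: fuse per-modality estimates of the expressed behavior.
valence, uncertainty = fuse(visual_mean=0.4, visual_var=0.30,   # poor lighting
                            audio_mean=-0.2, audio_var=0.05)    # clean audio

# Stage 2: a short list of plausible causes with probabilities (toy values).
causes = {"traffic frustration": 0.6, "phone conversation": 0.3, "fatigue": 0.1}
print(f"fused valence={valence:+.2f} (var={uncertainty:.3f})")
for cause, p in sorted(causes.items(), key=lambda kv: -kv[1]):
    print(f"  {cause}: {p:.0%}")
```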
The apparatus and method components have been represented, where appropriate, by conventional symbols in the drawings, showing only those specific details that are pertinent to understanding the embodiments of the present invention so as not to obscure the disclosure with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.

DETAILED DESCRIPTION

I. DEFINITIONS AND EVALUATION METRICS

In this disclosure, the following definitions will be used:
- AU: Action Unit, the fundamental actions of individual muscles or groups of muscles, identified by FACS (the Facial Action Coding System, which was updated in 2002);
- VVAD: Visual Voice Activity Detection (processed exclusive of any audio); and
- AVAD: Audio Voice Activity Detection (processed exclusive of any video).

The evaluation metrics used to verify the models' performance are the following:
- Precision is defined as the percentage of correctly identified positive-class data points among all data points identified as the positive class by the model.
- Recall is defined as the percentage of correctly identified positive-class data points among all data points that are labelled as the positive class.
- F1 is a metric that measures the model's accuracy performance by calculating the harmonic mean of the precision and recall of the model. F1 is calculated as follows:

F1 = (2 × precision × recall) / (precision + recall)

F1 is commonly used because it reliably measu
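A worked example of the metric definitions above, as a minimal sketch in pure Python with toy labels (the data and counts are illustrative only, not results from the disclosure):

```python
"""Worked example of the precision/recall/F1 definitions above."""

def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)   # correct positives / all predicted positives
    recall = tp / (tp + fn)      # correct positives / all labelled positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 1, 0, 1]   # ground-truth voice activity labels
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]   # model output
p, r, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")  # 0.80 / 0.80 / 0.80
```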