
EP-4150521-B1 - DYNAMIC VISION SENSOR FOR VISUAL AUDIO PROCESSING


Inventors

  • YE, XIAOYONG
  • NAKAMURA, YUICHIRO

Dates

Publication Date
2026-05-06
Application Date
2021-04-21

Claims (6)

  1. A system comprising: at least one camera unit (304) configured to generate red-green-blue, RGB, images and/or infrared, IR, images of a person; at least one microphone (302); at least one event driven sensor (306), EDS, configured to output signals representative of the person; and at least one processor (400) programmed with instructions to: process the output of the microphone using a short term Fourier transform, STFT; process the output of the STFT using at least one audio processing convolutional neural network, CNN; process at least features in images from the camera unit using at least one visual processing CNN; process representations of output signals from the EDS using at least one event processing CNN; and fuse outputs of the CNNs in fully connected neural network layers to generate at least one of: a prediction of emotion of the person, tracking of at least a portion of the face of the person, at least one virtual reality, VR, image of the person, an identification of the person.
  2. The system of Claim 1, wherein the camera unit, processor, and EDS are disposed on a single chip.
  3. The system of Claim 1, wherein the tracked portion of the face of the person comprises one or more selected from the list consisting of: i. at least one eye pupil; ii. corners of the mouth; and iii. the interior of the mouth including teeth.
  4. The system of Claim 1, wherein the processor is configured with instructions to: process outputs of the CNNs using a recurrent neural network, RNN; and process output of the RNN using the fully connected neural network layers to generate mouth tracking of the person.
  5. A computer-implemented method comprising: receiving (802) signals from at least one camera unit configured to generate red-green-blue, RGB, images and/or infrared, IR, images of a person; receiving (804) signals from at least one event-driven sensor, EDS, configured to output signals representative of the person; receiving (800) signals from at least one microphone; and processing the output of the microphone using a short term Fourier transform, STFT; processing the output of the STFT using at least one audio processing convolutional neural network, CNN; processing at least features in images from the camera unit using at least one visual processing CNN; processing representations of output signals from the EDS using at least one event processing CNN; and fusing outputs of the CNNs in fully connected neural network layers to generate at least one of: a prediction of emotion of the person, tracking of at least a portion of the face of the person, at least one virtual reality, VR, image of the person, an identification of the person.
  6. The method of Claim 5, wherein the portion of the face is the corners of the mouth and interior of the mouth.

Description

FIELD

The application relates generally to technically inventive, non-routine solutions that are necessarily rooted in computer technology and that produce concrete technical improvements.

BACKGROUND

When executing facial tracking during speech to identify speakers in a noisy environment, to detect fake videos, to resolve ambiguity in speech recognition, for animation, or for other purposes, some parts of the face, such as dark areas inside the mouth and the teeth, together with the quick motion of facial structures during speech, pose challenges to precise tracking. Previously proposed arrangements are disclosed in US 2017/134694 A1.

SUMMARY

The technical challenge posed by the above is that, for better operation, high-speed cameras may be required to reduce latency and improve tracking performance, requiring increased camera data framerates; yet such higher framerates require higher bandwidth and processing and, thus, relatively large power consumption and heat generation.

To address the challenges noted herein, a camera sensor system is provided that includes not only sensor cells with light intensity photodiodes under color and, if desired, infrared filters to capture RGB and IR images, but also event driven sensor (EDS) sensing cells that detect motion by virtue of EDS principles. An EDS uses the change of light intensity as sensed by one or more camera pixels as an indication of motion. An EDS has a high dynamic range (HDR), no motion blur, and low latency compared to RGB cameras. When EDS information is fused with RGB camera information and audio information, tracking is made more robust. In conditions of fast motion (e.g., of the mouth) or HDR, EDS information may be relied on relatively more than in conditions of slow motion and fine detail (color, texture), in which conditions camera images are relied on more. Such fusion also can apply to face tracking, eye tracking, and emotion recognition.

Present principles use raw event data from an EDS, fused with RGB camera and audio data, as input to a classifier. The classifier is trained using a training set of all three inputs (audio, camera, and event data), in some implementations using a recurrent neural network with convolutional layers.

The present invention is defined by the appended claims. In non-limiting embodiments, the portion of the face being tracked may be one or more eyes, specifically one or more pupils, and may be limited to the pupils or may include other facial features. In other embodiments the portion comprises the corners of the mouth and may be limited to the corners of the mouth and/or the interior of the mouth including the teeth, or may include additional facial features as well.

In one aspect of the invention, a system includes at least one camera unit configured to generate red-green-blue (RGB) images and/or infrared (IR) images of a person. The system also includes at least one microphone and at least one event driven sensor (EDS) configured to output signals representative of the person. The system further includes at least one processor programmed with instructions to process output of the microphone using a short term Fourier transform (STFT) and to process output of the STFT using at least one audio processing convolutional neural network (CNN). The instructions are executable to process at least features in images from the camera unit using at least one visual processing CNN. Furthermore, the instructions are executable to process representations of output signals from the EDS using at least one event processing CNN. The instructions in the system can be executed to fuse outputs of the CNNs in fully connected neural network layers to generate one or more of a prediction of emotion of the person, tracking of at least a portion of the face of the person, at least one virtual reality (VR) image of the person, and an identification of the person.
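For illustration only, the following is a minimal sketch in PyTorch of the three-branch architecture just described: audio is converted to an STFT magnitude spectrogram and passed through an audio CNN, an RGB frame through a visual CNN, and a rasterized EDS frame through an event CNN, with the concatenated features fused in fully connected layers. This is not the implementation defined by the claims; the layer sizes, the STFT parameters, the single-channel event rasterization, and the seven-class output head are assumptions made for the example.

```python
# A minimal sketch of the fused audio/visual/event classifier; all names,
# layer sizes, and parameters below are illustrative assumptions, not the
# patented implementation.
import torch
import torch.nn as nn


def audio_spectrogram(waveform: torch.Tensor) -> torch.Tensor:
    """Convert mono waveforms (batch, samples) into an STFT magnitude
    'image' of shape (batch, 1, freq_bins, time_frames)."""
    spec = torch.stft(waveform, n_fft=512, hop_length=128,
                      window=torch.hann_window(512), return_complex=True)
    return spec.abs().unsqueeze(1)


def rasterize_events(xs, ys, polarities, height=224, width=224):
    """Accumulate signed EDS events (+1/-1 polarity at pixel x, y) into a
    single-channel frame -- one possible 'representation of output signals
    from the EDS'."""
    frame = torch.zeros(height, width)
    frame.index_put_((ys, xs), polarities.float(), accumulate=True)
    return frame.view(1, 1, height, width)


def branch_cnn(in_channels: int) -> nn.Module:
    """One small CNN branch; the system uses one branch per modality."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 16, kernel_size=3, stride=2, padding=1),
        nn.ReLU(),
        nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),  # -> (batch, 32, 1, 1), any input size
        nn.Flatten(),             # -> (batch, 32)
    )


class FusionClassifier(nn.Module):
    def __init__(self, num_outputs: int = 7):  # e.g., 7 emotion classes
        super().__init__()
        self.audio_cnn = branch_cnn(1)   # STFT magnitude, 1 channel
        self.visual_cnn = branch_cnn(3)  # RGB frame, 3 channels
        self.event_cnn = branch_cnn(1)   # rasterized EDS frame, 1 channel
        # Fully connected fusion layers over the concatenated features.
        self.fusion = nn.Sequential(
            nn.Linear(32 * 3, 64), nn.ReLU(), nn.Linear(64, num_outputs))

    def forward(self, waveform, rgb_frame, event_frame):
        feats = torch.cat([
            self.audio_cnn(audio_spectrogram(waveform)),
            self.visual_cnn(rgb_frame),
            self.event_cnn(event_frame),
        ], dim=1)
        return self.fusion(feats)


# Example: 1 s of 16 kHz audio, one RGB frame, and 500 synthetic events.
xs = torch.randint(0, 224, (500,))
ys = torch.randint(0, 224, (500,))
pol = torch.randint(0, 2, (500,)) * 2 - 1
model = FusionClassifier()
logits = model(torch.randn(1, 16000),
               torch.randn(1, 3, 224, 224),
               rasterize_events(xs, ys, pol))
print(logits.shape)  # torch.Size([1, 7])
```

The same fused feature vector could in principle drive any of the claimed outputs (emotion prediction, face tracking, VR image generation, or identification) by swapping the final fully connected head.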
In one example of this latter aspect, the processor can be configured with instructions to process outputs of the CNNs using a recurrent neural network (RNN), and to process output of the RNN using the fully connected neural network layers to generate mouth tracking of the person. In another aspect of the invention, a method is provided according to claim 5.

The details of the present application, both as to its structure and operation, can best be understood in reference to the accompanying drawings, in which like reference numerals refer to like parts, and in which:

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a block diagram of an example system including an example in accordance with present principles;
Figure 2 illustrates a simplified sensor data flow;
Figure 3 illustrates sensors in relation to a person's face being tracked;
Figure 4 illustrates an example system in block diagram format;
Figure 5 illustrates data flow from RGB input, audio input, and EDS input for emotion recognition or speaker recognition;
Figure 6 illustrates an alternative classifier architecture for speech recognition;
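For the recurrent variant described in the summary above (processing the per-frame CNN outputs with an RNN before the fully connected layers, e.g., for mouth tracking), the following is a minimal sketch under the same assumptions as the earlier example; the GRU standing in for the RNN, the 96-dimensional fused feature, and the two mouth-corner (x, y) outputs are all hypothetical choices.

```python
# A minimal sketch of the recurrent variant; the GRU, feature size, and
# two-corner output are illustrative assumptions.
import torch
import torch.nn as nn


class RecurrentMouthTracker(nn.Module):
    def __init__(self, feat_dim: int = 96, hidden: int = 64):
        super().__init__()
        # RNN over the per-frame fused CNN features (here a GRU).
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        # Fully connected layers -> (x, y) for each of two mouth corners.
        self.head = nn.Sequential(nn.Linear(hidden, 32), nn.ReLU(),
                                  nn.Linear(32, 4))

    def forward(self, fused_feats: torch.Tensor) -> torch.Tensor:
        """fused_feats: (batch, timesteps, feat_dim), the concatenated
        per-frame outputs of the audio/visual/event CNN branches."""
        out, _ = self.rnn(fused_feats)
        return self.head(out)  # (batch, timesteps, 4)


tracker = RecurrentMouthTracker()
coords = tracker(torch.randn(2, 30, 96))  # 2 clips, 30 frames each
print(coords.shape)  # torch.Size([2, 30, 4])
```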