US-12626534-B2 - Methods for featureless gaze tracking in ecologically valid conditions
Abstract
Systems and methods are disclosed for gaze tracking. A method includes receiving a video of a user taken by a front-facing camera of a device having a screen, receiving dimensions of the screen, parsing the video into a series of uniform-dimension video frame images, inputting the series of uniform-dimension video frame images to a pretrained artificial neural network, thereby extracting a plurality of features from the series of uniform-dimension video frame images to determine a set of internal spatial hierarchy features on each of the uniform-dimension video frame images, inputting each set of hierarchy features to a fully connected layer, the fully connected layer producing an intermediate physical estimate, in centimeters relative to the device camera, of the user's gaze location on the screen, determining a series of screen locations based on the intermediate physical estimate and the dimensions of the screen, and labeling the gaze location on the screen.
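The abstract's final steps convert the intermediate physical estimate, expressed in centimeters relative to the device camera, into locations on the screen using the received screen dimensions. The following is a minimal sketch, not taken from the patent, of one way that mapping could be done; the assumed camera placement (centered above the screen's top edge), the screen measurements, and every function and parameter name are illustrative assumptions.

```python
# Hedged sketch: map a gaze estimate given in centimeters relative to the
# front-facing camera onto screen pixel coordinates. All numeric defaults
# (screen size, resolution, camera offset) are illustrative assumptions.

def physical_to_screen(x_cm, y_cm,
                       screen_w_cm=19.7, screen_h_cm=14.8,   # assumed physical screen size
                       screen_w_px=2048, screen_h_px=1536,   # assumed pixel resolution
                       cam_offset_y_cm=0.8):                 # camera assumed centered above the screen
    """Convert a camera-relative (x, y) estimate in centimeters to pixels."""
    # Shift the origin from the camera to the screen's top-left corner;
    # x grows to the right and y grows downward from the camera.
    x_from_left_cm = x_cm + screen_w_cm / 2.0
    y_from_top_cm = y_cm - cam_offset_y_cm

    # Scale centimeters to pixels using the screen dimensions.
    px = x_from_left_cm * (screen_w_px / screen_w_cm)
    py = y_from_top_cm * (screen_h_px / screen_h_cm)

    # Clamp so the labeled gaze location stays on the visible screen.
    px = min(max(px, 0), screen_w_px - 1)
    py = min(max(py, 0), screen_h_px - 1)
    return int(px), int(py)


if __name__ == "__main__":
    # Example: an estimate 3 cm right of and 8 cm below the camera.
    print(physical_to_screen(3.0, 8.0))
```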
Inventors
- John Langton
- Sean Tobyne
- Karl Thompson
Assignees
- LINUS HEALTH, INC.
Dates
- Publication Date: May 12, 2026
- Application Date: Nov. 22, 2023
Claims (20)
- 1. A method, comprising: receiving a video of a user taken by a front-facing camera of a device having a screen; receiving dimensions of the screen; parsing the video into a series of uniform-dimension video frame images; inputting the series of uniform-dimension video frame images to an artificial neural network that is pretrained, thereby extracting a plurality of features from the series of uniform-dimension video frame images to determine a set of internal spatial hierarchy features on each of the uniform-dimension video frame images; inputting each set of internal spatial hierarchy features to a fully connected layer, the fully connected layer producing an intermediate physical estimate of the user's gaze location on the screen relative to the front-facing camera; determining a series of screen locations based on the intermediate physical estimate and the dimensions of the screen; and labeling the gaze location on the screen.
- 2. The method of claim 1, wherein the artificial neural network is pretrained by: receiving a dataset of training videos, the training videos showing a variety of user environmental conditions, facial coverings, and user demographics; receiving a ground truth gaze location for each training video in the dataset; parsing each training video into a series of video frame images, where each video frame image includes the ground truth gaze location; training the artificial neural network by processing the dataset to predict a gaze location for each video frame image; and iteratively reducing a difference between the predicted gaze location and the ground truth gaze location until a stable series of neural network weights is determined (see the illustrative training-loop sketch following the claims).
- 3. The method of claim 2, wherein the variety of user environmental conditions comprises multiple environmental illumination settings.
- 4. The method of claim 2, wherein the training videos have a range of distances between the user and the camera.
- 5. The method of claim 1, wherein the uniform-dimension video frame images have a same pixel dimension and are of a same file type.
- 6. The method of claim 1, wherein an approximate location of the user's gaze is determined by the user tracking one or more objects across the screen.
- 7. The method of claim 1, further comprising: determining at least one pattern between extracted features; and associating each extracted feature with a labeled predicted gaze location on each uniform-dimension video frame image.
- 8. The method of claim 7, wherein patterns are determined across the series of uniform-dimension video frame images.
- 9. The method of claim 1, wherein the extracted features comprise a spatial relationship of a component of the series of uniform-dimension video frame images, represented by a numerical array.
- 10. The method of claim 9, wherein the extracted features are ordered based on relevance to a gaze target prediction.
- 11. The method of claim 1, wherein image-specific features are of higher relevance.
- 12. The method of claim 1, wherein the artificial neural network is a convolutional neural network.
- 13. The method of claim 1, wherein the device is a tablet.
- 14. A device, comprising: a front-facing camera; a screen; at least one memory storing instructions; and at least one processor configured to execute the instructions to perform operations comprising: receiving a video of a user taken by the front-facing camera of the device having the screen; receiving dimensions of the screen; parsing the video into a series of uniform-dimension video frame images; inputting the series of uniform-dimension video frame images to a pretrained artificial neural network, thereby extracting a plurality of features from the series of uniform-dimension video frame images to determine a set of internal spatial hierarchy features on each of the uniform-dimension video frame images; inputting each set of internal spatial hierarchy features to a fully connected layer, the fully connected layer producing an intermediate physical estimate of the user's gaze location on the screen relative to the front-facing camera; determining a series of screen locations based on the intermediate physical estimate and the dimensions of the screen; and labeling the gaze location on the screen.
- 15. A non-transitory computer readable medium storing instructions that, when executed by a processor, cause the processor to perform a method comprising: receiving a video of a user taken by a front-facing camera of a device having a screen; receiving dimensions of the screen; parsing the video into a series of uniform-dimension video frame images; inputting the series of uniform-dimension video frame images to an artificial neural network that is pretrained, thereby extracting a plurality of features from the series of uniform-dimension video frame images to determine a set of internal spatial hierarchy features on each of the uniform-dimension video frame images; inputting each set of internal spatial hierarchy features to a fully connected layer, the fully connected layer producing an intermediate physical estimate of the user's gaze location on the screen relative to the front-facing camera; determining a series of screen locations based on the intermediate physical estimate and the dimensions of the screen; and labeling the gaze location on the screen.
- 16. The non-transitory computer readable medium of claim 15, wherein the artificial neural network is pretrained by: receiving a dataset of training videos, the training videos showing a variety of user environmental conditions, facial coverings, and user demographics; receiving a ground truth gaze location for each training video in the dataset; parsing each training video into a series of video frame images, where each video frame image includes the ground truth gaze location; training the artificial neural network by processing the dataset to predict a gaze location for each video frame image; and iteratively reducing a difference between the predicted gaze location and the ground truth gaze location until a stable series of neural network weights is determined.
- 17. The non-transitory computer readable medium of claim 15, wherein the uniform-dimension video frame images have a same pixel dimension and are of a same file type.
- 18. The non-transitory computer readable medium of claim 15, wherein an approximate location of the user's gaze is determined by the user tracking one or more objects across the screen.
- 19. The non-transitory computer readable medium of claim 15, further comprising: determining at least one pattern between extracted features; and associating each extracted feature with a labeled predicted gaze location on each uniform-dimension video frame image.
- 20. The non-transitory computer readable medium of claim 19, wherein patterns are determined across the series of uniform-dimension video frame images.
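Claims 2 and 16 above describe pretraining the network by iteratively reducing the difference between predicted and ground-truth gaze locations until a stable series of weights is determined. The sketch below is a minimal PyTorch rendering of such a loop, not the patented implementation; the dataset wrapper, mean-squared-error loss, Adam optimizer, batch size, and stopping tolerance are all illustrative assumptions.

```python
# Hedged sketch of an iterative training loop in the spirit of claims 2 and 16.
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader


class GazeFrameDataset(Dataset):
    """Yields (frame_tensor, ground_truth_xy) pairs parsed from training videos."""

    def __init__(self, frames, targets):
        # frames: float tensor (N, 3, H, W); targets: (N, 2) gaze locations.
        self.frames, self.targets = frames, targets

    def __len__(self):
        return len(self.frames)

    def __getitem__(self, idx):
        return self.frames[idx], self.targets[idx]


def pretrain(model: nn.Module, dataset: Dataset, epochs: int = 10, tol: float = 1e-4):
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()          # difference between predicted and ground-truth gaze
    prev_loss = float("inf")

    for _ in range(epochs):
        epoch_loss = 0.0
        for frames, targets in loader:
            optimizer.zero_grad()
            predicted = model(frames)            # (B, 2) gaze prediction per frame
            loss = loss_fn(predicted, targets)
            loss.backward()                      # iteratively reduce the difference
            optimizer.step()
            epoch_loss += loss.item()
        epoch_loss /= max(len(loader), 1)

        # Treat a near-constant loss as "a stable series of neural network weights".
        if abs(prev_loss - epoch_loss) < tol:
            break
        prev_loss = epoch_loss
    return model
```

The stopping test above is only one reading of "until a stable series of neural network weights is determined"; monitoring weight deltas or a held-out validation loss would be equally consistent with the claim language.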
Description
RELATED APPLICATION(S)

This application claims the benefit of priority to U.S. Provisional Application No. 63/427,300, filed on Nov. 22, 2022, which is hereby incorporated by reference in its entirety.

TECHNICAL FIELD

The present disclosure relates generally to tracking a subject's gaze and, in particular, to systems and methods for gaze tracking of tablet users through analysis of video captured from a tablet's front-facing video camera via a computer vision-based deep learning model.

BACKGROUND

Gaze tracking may be achieved through facial feature extraction from a front-facing camera. In an exemplary approach, machine learning algorithms are used to extract facial features from video frames and then estimate the user's gaze target using deep learning models that analyze the extracted facial features. One main drawback to this approach is that facial feature extraction methods are often prone to failure in the presence of complex artifacts in the captured video, such as face masks or other facial coverings. Other issues arising during facial feature extraction include low lighting conditions in the video, varying distance between the subject and the camera, background movement, and other common video failures. Such failures can lead to partial or complete loss of gaze tracking capability. Accordingly, a method of gaze tracking that does not rely on facial feature extraction is needed.

Instead of relying on facial features, disclosed embodiments provide a method that passes raw video footage to a complex deep convolutional neural network that is able to intrinsically extract important features from video frames, even in the presence of the aforementioned artifacts and issues, and propagate these important features throughout the network, culminating in an output representing a gaze target location for each video frame. Furthermore, for network training purposes, a custom ecologically valid dataset, including videos of users wearing face masks and/or recorded in varying lighting conditions and at varying distances from the camera, can be employed to enhance predictive capabilities in harsh conditions. Training on this custom dataset yields a gaze prediction for the entirety of a given captured video, which is crucial in applications requiring continuous, uninterrupted gaze tracking.

The foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. The background description provided herein is for the purpose of generally presenting the context of the disclosure. Unless otherwise indicated herein, the materials described in this section are not prior art to the claims in this application and are not admitted to be prior art, or suggestions of the prior art, by inclusion in this section.
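The Background above describes passing raw video frames to a deep convolutional neural network that builds an internal spatial feature hierarchy and outputs a gaze target for each frame. Purely as an illustration, and not the network actually disclosed, a small PyTorch model of that general shape might look like the following; every layer width, the three-stage backbone, and the 224x224 input size are assumptions.

```python
# Hedged sketch: convolutional backbone plus fully connected head that
# regresses a two-dimensional physical gaze estimate. Layer sizes are illustrative.
import torch
from torch import nn


class GazeNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Backbone: each stage halves the spatial resolution, building a
        # hierarchy of progressively more abstract spatial features.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Fully connected head: maps pooled features to an (x, y) estimate.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 2),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 3, H, W) uniform-dimension video frame images.
        return self.head(self.backbone(frames))


if __name__ == "__main__":
    model = GazeNet()
    dummy = torch.zeros(4, 3, 224, 224)   # a batch of four resized frames
    print(model(dummy).shape)             # torch.Size([4, 2])
```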
SUMMARY

According to certain aspects of the present disclosure, systems and methods are disclosed for tracking gaze location of a user on a screen.

In one embodiment, a method includes: receiving a video of a user taken by a front-facing camera of a device having a screen; receiving dimensions of the screen; parsing the video into a series of uniform-dimension video frame images; inputting the series of uniform-dimension video frame images to a pretrained artificial neural network, thereby extracting a plurality of features from the series of uniform-dimension video frame images to determine a set of internal spatial hierarchy features on each of the uniform-dimension video frame images; inputting each set of hierarchy features to a fully connected layer, the fully connected layer producing an intermediate physical estimate of the user's gaze location on the screen relative to the device camera; determining a series of screen locations based on the intermediate physical estimate and the dimensions of the screen; and labeling the gaze location on the screen.

In another embodiment, a device includes: a front-facing camera; a screen; at least one memory storing instructions; and at least one processor configured to execute the instructions to perform operations including: receiving a video of a user taken by a front-facing camera of a device having a screen; receiving dimensions of the screen; parsing the video into a series of uniform-dimension video frame images; inputting the series of uniform-dimension video frame images to a pretrained artificial neural network, thereby extracting a plurality of features from the series of uniform-dimension video frame images to determine a set of internal spatial hierarchy features on each of the uniform-dimension video frame images; inputting each set of hierarchy features to a fully connected layer, the fully connected layer producing an intermediate physical estimate of the user's gaze location on the screen relative to the device camera; determining a series of screen locations based on the intermediate physical estimate and the dimensions of the screen; and labeling the gaze location on the screen.

In an alternative embodiment, a non-transitory computer readable medium stores instructions that, when executed by a processor, cause the processor to perform a method including: receiving a video of a user taken by a front-facing camera of a device having a screen; receiving dimensions of the screen; parsing the video into a series of uniform-dimension video frame images; inputting the series of uniform-dimension video frame images to a pretrained artificial neural network, thereby extracting a plurality of features from the series of uniform-dimension video frame images to determine a set of internal spatial hierarchy features on each of the uniform-dimension video frame images; inputting each set of hierarchy features to a fully connected layer, the fully connected layer producing an intermediate physical estimate of the user's gaze location on the screen relative to the device camera; determining a series of screen locations based on the intermediate physical estimate and the dimensions of the screen; and labeling the gaze location on the screen.
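Each embodiment above begins by parsing the captured video into a series of uniform-dimension video frame images. The sketch below shows one plausible version of that preprocessing step; OpenCV, the 224x224 target size, the PNG file type, and the output directory layout are all assumptions, since the disclosure does not name a specific library, size, or format.

```python
# Hedged sketch: decode a video, resize every frame to a uniform dimension,
# and write each frame out with the same file type.
import os

import cv2  # OpenCV


def parse_video_to_frames(video_path: str, out_dir: str, size=(224, 224)):
    """Return paths of uniformly sized PNG frame images extracted from the video."""
    os.makedirs(out_dir, exist_ok=True)
    capture = cv2.VideoCapture(video_path)
    frame_paths = []
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:                                    # end of video
            break
        frame = cv2.resize(frame, size)               # same pixel dimension
        path = os.path.join(out_dir, f"frame_{index:05d}.png")  # same file type
        cv2.imwrite(path, frame)
        frame_paths.append(path)
        index += 1
    capture.release()
    return frame_paths
```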