US-12626704-B2 - Detecting visual attention during user speech


Abstract

An example process includes: concurrently receiving an audio stream and a video stream; determining, based on a first portion of the audio stream received within a predetermined duration before a current time and a first portion of the video stream received within the predetermined duration before the current time, whether a visual attention of a user is directed to an electronic device while the user is speaking; and in accordance with a determination that the visual attention of the user is directed to the electronic device while the user is speaking: identifying a second portion of the audio stream to include user speech intended for the electronic device; initiating, by a digital assistant operating on the electronic device, a task based on the second portion of the audio stream; and providing an output indicative of the initiated task.
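The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `RecentWindow` class, the `should_treat_as_device_directed` function, and the 2-second window used in the usage note are hypothetical names and values chosen for the example. The source only specifies "a predetermined duration" and does not say how attention or speech is detected.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class RecentWindow:
    """Hypothetical buffer that keeps only samples received within
    `duration_s` seconds before the current time."""
    duration_s: float
    _items: deque = field(default_factory=deque)

    def push(self, timestamp: float, sample) -> None:
        # Samples are assumed to arrive in increasing timestamp order.
        self._items.append((timestamp, sample))

    def snapshot(self, now: float) -> list:
        # Drop anything older than the predetermined duration.
        while self._items and self._items[0][0] < now - self.duration_s:
            self._items.popleft()
        return [sample for _, sample in self._items]


def should_treat_as_device_directed(
    user_visible: bool, attending: bool, speaking: bool
) -> bool:
    """Assumed gating rule: speech is identified as intended for the device
    only when the user is visible, looking at the device, and speaking."""
    return user_visible and attending and speaking
```

For example, with `RecentWindow(duration_s=2.0)` and samples pushed at times 0.0, 1.0, 2.5, and 3.0, calling `snapshot(now=3.0)` keeps the samples from 1.0 onward. The same kind of buffer would be kept for both the audio stream and the video stream, so that the attention decision always looks at matching recent portions of each.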

Inventors

Maxwell C. HORTON
Stephen A. Berardi
Yanzi JIN
Sophie Lebrecht
Richard P. MUFFOLETTO
Daniel TORMOEN

Assignees

APPLE INC.

Dates

Publication Date: 2026-05-12
Application Date: 2023-04-10

Claims (20)

1. A non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of an electronic device, cause the electronic device to: concurrently receive an audio stream and a video stream; determine, based on a first portion of the audio stream received within a predetermined duration before a current time and a first portion of the video stream received within the predetermined duration before the current time, whether a visual attention of a user is directed to the electronic device while the user is speaking, wherein the first portion of the video stream includes a plurality of video frames, and wherein determining whether the visual attention of the user is directed to the electronic device while the user is speaking includes: determining a first confidence score for a first video frame of the plurality of video frames, wherein the first video frame corresponds to a first time, wherein the first confidence score indicates, for the first video frame, whether the visual attention of the user is directed to the electronic device while the user is speaking, and wherein determining the first confidence score includes: determining an initial first confidence score based on the first video frame; and adjusting, based on processing a second video frame of the plurality of video frames, the initial first confidence score to obtain the first confidence score, wherein the second video frame corresponds to a second time after the first time; and in accordance with a determination that the visual attention of the user is directed to the electronic device while the user is speaking: identify a second portion of the audio stream to include user speech intended for the electronic device; initiate, by a digital assistant operating on the electronic device, a task based on the second portion of the audio stream; and provide an output indicative of the initiated task.
2. The non-transitory computer-readable storage medium of claim 1, wherein determining whether the visual attention of the user is directed to the electronic device while the user is speaking includes: determining whether the visual attention of the user is directed to a display of the electronic device.
3. The non-transitory computer-readable storage medium of claim 1, wherein determining whether the visual attention of the user is directed to the electronic device while the user is speaking includes: determining whether the visual attention of the user is directed to an affordance displayed by the electronic device.
4. The non-transitory computer-readable storage medium of claim 1, wherein determining the initial first confidence score includes processing a third video frame of the plurality of video frames, wherein the third video frame corresponds to a third time before the first time.
5. The non-transitory computer-readable storage medium of claim 1, wherein: determining the first confidence score for the first video frame of the plurality of video frames includes determining a respective confidence score for each video frame of the plurality of video frames to obtain a plurality of respective confidence scores; the plurality of respective confidence scores include a fourth confidence score for a fourth video frame of the plurality of video frames; and identifying the second portion of the audio stream to include user speech intended for the electronic device includes: in accordance with a determination that the fourth confidence score exceeds a threshold, determining that a fourth time corresponding to the fourth video frame is the start time of the second portion of the audio stream.
6. The non-transitory computer-readable storage medium of claim 1, wherein: the plurality of video frames include a fifth video frame and a sixth video frame consecutive to the fifth video frame; and identifying the second portion of the audio stream to include user speech intended for the electronic device includes: in accordance with a determination that a fifth confidence score for the fifth video frame is above a second threshold and that a sixth confidence score for the sixth video frame is below the second threshold: determining that a sixth time corresponding to the sixth video frame is the end time of the second portion of the audio stream.
7. The non-transitory computer-readable storage medium of claim 1, wherein determining whether the visual attention of the user is directed to the electronic device while the user is speaking includes: determining, using a machine learned model, whether the visual attention of the user is directed to the electronic device while the user is speaking, including: processing a representation of the first portion of the audio stream and the first portion of the video stream using parameters of the machine learned model representing a correlation between mouth movement of the user and speech input.
8. The non-transitory computer-readable storage medium of claim 1, wherein determining that the visual attention of the user is directed to the electronic device while the user is speaking includes: determining that the user faces the electronic device while the user is speaking.
9. The non-transitory computer-readable storage medium of claim 1, wherein the one or more programs further comprise instructions, which when executed by the one or more processors, cause the electronic device to: in accordance with a determination that the visual attention of the user is not directed to the electronic device while the user is speaking: forgo identifying the second portion of the audio stream to include user speech intended for the electronic device.
10. The non-transitory computer-readable storage medium of claim 1, wherein the one or more programs further comprise instructions, which when executed by the one or more processors, cause the electronic device to: in accordance with a determination that the visual attention of the user is directed to the electronic device while the user is not speaking: forgo identifying the second portion of the audio stream to include user speech intended for the electronic device.
11. The non-transitory computer-readable storage medium of claim 1, wherein the one or more programs further comprise instructions, which when executed by the one or more processors, cause the electronic device to: in accordance with a determination that the visual attention of the user is not directed to the electronic device while the user is not speaking: forgo identifying the second portion of the audio stream to include user speech intended for the electronic device.
12. The non-transitory computer-readable storage medium of claim 1, wherein the one or more programs further comprise instructions, which when executed by the one or more processors, cause the electronic device to: in accordance with a determination that the user is not visible in the first portion of the video stream: forgo identifying the second portion of the audio stream to include user speech intended for the electronic device.
13. An electronic device, comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for: concurrently receiving an audio stream and a video stream; determining, based on a first portion of the audio stream received within a predetermined duration before a current time and a first portion of the video stream received within the predetermined duration before the current time, whether a visual attention of a user is directed to the electronic device while the user is speaking, wherein the first portion of the video stream includes a plurality of video frames, and wherein determining whether the visual attention of the user is directed to the electronic device while the user is speaking includes: determining a first confidence score for a first video frame of the plurality of video frames, wherein the first video frame corresponds to a first time, wherein the first confidence score indicates, for the first video frame, whether the visual attention of the user is directed to the electronic device while the user is speaking, and wherein determining the first confidence score includes: determining an initial first confidence score based on the first video frame; and adjusting, based on processing a second video frame of the plurality of video frames, the initial first confidence score to obtain the first confidence score, wherein the second video frame corresponds to a second time after the first time; and in accordance with a determination that the visual attention of the user is directed to the electronic device while the user is speaking: identifying a second portion of the audio stream to include user speech intended for the electronic device; initiating, by a digital assistant operating on the electronic device, a task based on the second portion of the audio stream; and providing an output indicative of the initiated task.
14. The electronic device of claim 13, wherein determining whether the visual attention of the user is directed to the electronic device while the user is speaking includes: determining whether the visual attention of the user is directed to a display of the electronic device.
15. The electronic device of claim 13, wherein determining whether the visual attention of the user is directed to the electronic device while the user is speaking includes: determining whether the visual attention of the user is directed to an affordance displayed by the electronic device.
16. The electronic device of claim 13, wherein determining the initial first confidence score includes processing a third video frame of the plurality of video frames, wherein the third video frame corresponds to a third time before the first time.
17. The electronic device of claim 13, wherein: determining the first confidence score for the first video frame of the plurality of video frames includes determining a respective confidence score for each video frame of the plurality of video frames to obtain a plurality of respective confidence scores; the plurality of respective confidence scores include a fourth confidence score for a fourth video frame of the plurality of video frames; and identifying the second portion of the audio stream to include user speech intended for the electronic device includes: in accordance with a determination that the fourth confidence score exceeds a threshold, determining that a fourth time corresponding to the fourth video frame is the start time of the second portion of the audio stream.
18. The electronic device of claim 13, wherein: the plurality of video frames include a fifth video frame and a sixth video frame consecutive to the fifth video frame; and identifying the second portion of the audio stream to include user speech intended for the electronic device includes: in accordance with a determination that a fifth confidence score for the fifth video frame is above a second threshold and that a sixth confidence score for the sixth video frame is below the second threshold: determining that a sixth time corresponding to the sixth video frame is the end time of the second portion of the audio stream.
19. The electronic device of claim 13, wherein determining whether the visual attention of the user is directed to the electronic device while the user is speaking includes: determining, using a machine learned model, whether the visual attention of the user is directed to the electronic device while the user is speaking, including: processing a representation of the first portion of the audio stream and the first portion of the video stream using parameters of the machine learned model representing a correlation between mouth movement of the user and speech input.
20. The electronic device of claim 13, wherein determining that the visual attention of the user is directed to the electronic device while the user is speaking includes: determining that the user faces the electronic device while the user is speaking.
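The per-frame scoring and segmentation recited in the claims above (an initial confidence score adjusted using a later frame, a start time where a score exceeds a threshold, and an end time where consecutive scores cross from above to below a second threshold) can be illustrated with the sketch below. It is only one possible reading: the claims do not say how the adjustment is computed, so plain averaging with the next frame's initial score is an assumption, and the function names `adjust_with_lookahead` and `find_segment` and the 0.5 default thresholds are hypothetical.

```python
from typing import List, Optional, Tuple


def adjust_with_lookahead(initial: List[float]) -> List[float]:
    """Adjust each frame's initial score using the following frame.

    Assumption: the adjustment is a plain average with the next frame's
    initial score; the source does not specify the actual method.
    """
    adjusted = []
    for i, score in enumerate(initial):
        if i + 1 < len(initial):
            adjusted.append((score + initial[i + 1]) / 2.0)
        else:
            adjusted.append(score)  # no later frame available yet
    return adjusted


def find_segment(
    times: List[float],
    scores: List[float],
    start_threshold: float = 0.5,  # hypothetical value
    end_threshold: float = 0.5,    # hypothetical "second threshold"
) -> Tuple[Optional[float], Optional[float]]:
    """Return (start_time, end_time) of the device-directed speech segment.

    The start is the time of the first frame whose score exceeds
    `start_threshold`. The end is the time of the first later frame whose
    score is below `end_threshold` while the preceding frame was above it.
    Either value is None if the corresponding boundary was not found.
    """
    start = None
    for i, (t, s) in enumerate(zip(times, scores)):
        if start is None:
            if s > start_threshold:
                start = t
        elif scores[i - 1] > end_threshold and s < end_threshold:
            return start, t
    return start, None
```

With frame times `[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]` and initial scores `[0.0, 0.2, 1.0, 1.0, 0.8, 0.0, 0.0]`, the adjusted scores cross above 0.5 at time 1.0 and drop below it at time 4.0, so `find_segment` returns `(1.0, 4.0)`. The look-ahead pulls the start one frame earlier than the raw scores would, which is one plausible motivation for adjusting a frame's score with later frames.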

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Provisional Application Ser. No. 63/456,639, entitled “DETECTING VISUAL ATTENTION DURING USER SPEECH,” filed Apr. 3, 2023, and claims priority to U.S. Provisional Application Ser. No. 63/346,693, entitled “DETECTING VISUAL ATTENTION DURING USER SPEECH,” filed May 27, 2022, the contents of which are hereby incorporated by reference in their entireties.

FIELD

This relates generally to determining whether a user's visual attention is directed to an electronic device while the user is speaking.

BACKGROUND

Intelligent automated assistants (or digital assistants) can provide a beneficial interface between human users and electronic devices. Such assistants can allow users to interact with devices or systems using natural language in spoken and/or text forms. For example, a user can provide a speech input containing a user request to a digital assistant operating on an electronic device. The digital assistant can interpret the user's intent from the speech input and operationalize the user's intent into tasks. The tasks can then be performed by executing one or more services of the electronic device, and a relevant output responsive to the user request can be returned to the user.

SUMMARY

Example methods are disclosed herein.
An example method includes, at an electronic device having one or more processors and memory: concurrently receiving an audio stream and a video stream; determining, based on a first portion of the audio stream received within a predetermined duration before a current time and a first portion of the video stream received within the predetermined duration before the current time, whether a visual attention of a user is directed to the electronic device while the user is speaking; and in accordance with a determination that the visual attention of the user is directed to the electronic device while the user is speaking: identifying a second portion of the audio stream to include user speech intended for the electronic device; initiating, by a digital assistant operating on the electronic device, a task based on the second portion of the audio stream; and providing an output indicative of the initiated task.

Example non-transitory computer-readable media are disclosed herein. An example non-transitory computer-readable storage medium stores one or more programs.
The one or more programs include instructions, which when executed by one or more processors of an electronic device, cause the electronic device to: concurrently receive an audio stream and a video stream; determine, based on a first portion of the audio stream received within a predetermined duration before a current time and a first portion of the video stream received within the predetermined duration before the current time, whether a visual attention of a user is directed to the electronic device while the user is speaking; and in accordance with a determination that the visual attention of the user is directed to the electronic device while the user is speaking: identify a second portion of the audio stream to include user speech intended for the electronic device; initiate, by a digital assistant operating on the electronic device, a task based on the second portion of the audio stream; and provide an output indicative of the initiated task.

Example electronic devices are disclosed herein.
An example electronic device includes one or more processors; a memory; and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for: concurrently receiving an audio stream and a video stream; determining, based on a first portion of the audio stream received within a predetermined duration before a current time and a first portion of the video stream received within the predetermined duration before the current time, whether a visual attention of a user is directed to the electronic device while the user is speaking; and in accordance with a determination that the visual attention of the user is directed to the electronic device while the user is speaking: identifying a second portion of the audio stream to include user speech intended for the electronic device; initiating, by a digital assistant operating on the electronic device, a task based on the second portion of the audio stream; and providing an output indicative of the initiated task.

An example electronic device comprises means for: concurrently receiving an audio stream and a video stream; determining, based on a first portion of the audio stream received within a predetermined duration before a current time and a first portion of the video stream received within the predetermined duration before the current time, whether a visual attention of a user is directed to the electronic device while the user is speaking; and in accordance with a determination that the visual attention of