US-20260127805-A1 - MULTIMODAL DIGITAL HUMAN INTERACTION SYSTEM

US 20260127805 A1

Abstract

Disclosed are apparatuses, systems, and techniques for a multimodal interaction system for digital humans with real-time engagement and pose analysis, which receive a video stream comprising a plurality of frames depicting at least a portion of a user, wherein the video stream is associated with an interaction of the user with an avatar; determine, for at least one frame of the plurality of frames, a pose orientation corresponding to at least one of one or more body landmarks of the user represented in the corresponding frame; determine, based on at least one of a series of pose orientations corresponding to the plurality of frames, an engagement metric of the user; and cause a representation of the avatar performing an action based on the engagement metric to be generated.

Inventors

  • Guilhem Marie Andre Pierre Bau
  • Tarun Jawahar Rathor
  • Rohit Ramesh Vaswani
  • Severin Achill Klingler
  • Pascal Joël Bérard

Assignees

  • NVIDIA CORPORATION

Dates

Publication Date
2026-05-07
Application Date
2025-11-06

Claims (20)

  1. A method comprising: receiving a video stream comprising a plurality of frames depicting at least a portion of a user, wherein the video stream is associated with an interaction of the user with an avatar; determining, for at least one frame of the plurality of frames, a pose orientation corresponding to one or more body landmarks of the user represented in the corresponding frame; determining, based on at least one pose orientation of a series of pose orientations corresponding to the plurality of frames, an engagement metric of the user; and causing a representation of the avatar performing an action based on the engagement metric to be generated.
  2. The method of claim 1, further comprising: identifying the one or more body landmarks of the user by providing each frame of the at least one frame of the plurality of frames to a machine learning model that processes the frame and outputs a set of coordinates for each body landmark of the one or more body landmarks.
  3. The method of claim 2, wherein the machine learning model further outputs a confidence score for each body landmark of the one or more body landmarks of the user represented in the corresponding frame, the method further comprising: determining that the confidence score of at least one of the one or more body landmarks exceeds a threshold value; and including the pose orientation of the corresponding frame in the series of pose orientations.
  4. The method of claim 1, further comprising: applying an exponential weighted average on the series of pose orientations corresponding to the plurality of frames to smooth one or more user movements across the series of pose orientations.
  5. The method of claim 1, wherein determining the engagement metric of the user comprises: comparing the at least one pose orientation of the series of pose orientations to a predetermined user engagement condition, wherein the engagement metric corresponds to a result of the comparison.
  6. The method of claim 1, further comprising: identifying an audio stream corresponding to the video stream; determining speech timing data of the audio stream; and determining a correlation between the series of pose orientations corresponding to the plurality of frames and the speech timing data of the audio stream, wherein the engagement metric of the user is further based on the correlation.
  7. The method of claim 6, wherein determining the correlation between the series of pose orientations corresponding to the plurality of frames and the speech timing data of the audio stream comprises: determining an audio processing latency based on an utterance length of the audio stream and a voice activity detection delay associated with the audio stream; and aligning a timestamp of the engagement metric with a portion of the audio stream corresponding to the utterance length using the audio processing latency to fuse the engagement metric with the speech timing data of the audio stream.
  8. The method of claim 6, further comprising: determining a statistical distribution of the series of pose orientations corresponding to the audio stream; and determining, based on the statistical distribution, a percentage of time during the audio stream that the engagement metric satisfies a threshold, wherein the engagement metric corresponds to the percentage of time.
  9. The method of claim 1, wherein at least one pose orientation of the series of pose orientations represents rotational parameters of a head of the user.
  10. The method of claim 1, wherein responsive to determining that the engagement metric satisfies a disengaged criterion, the action causes the representation of the avatar to not respond to the interaction of the user corresponding to the video stream.
  11. The method of claim 1, wherein responsive to determining that the engagement metric satisfies a distracted criterion, the action causes the representation of the avatar to (1) request clarification regarding an intent of the user behind the interaction, (2) implement an attention-recovery strategy conversation, or (3) implement temporal buffering.
  12. The method of claim 1, wherein responsive to determining that the engagement metric satisfies an attentive criterion, the action causes the representation of the avatar to maintain conversational flow with a standard response timing.
  13. A system comprising: one or more processing units to: receive a video stream comprising a plurality of frames depicting at least a portion of a user, wherein the video stream is associated with an interaction of the user with an avatar; determine, for at least one frame of the plurality of frames, a pose orientation corresponding to one or more body landmarks of the user represented in the corresponding frame; determine, based on at least one pose orientation of a series of pose orientations corresponding to the plurality of frames, an engagement metric of the user; and cause a representation of the avatar performing an action based on the engagement metric to be generated.
  14. The system of claim 13, wherein the one or more processing units further to: identify the one or more body landmarks of the user by providing each of the at least one frame of the plurality of frames to a machine learning model that processes the frame and outputs a set of coordinates for each body landmark of the one or more body landmarks, wherein the machine learning model further outputs a confidence score for each body landmark of the one or more body landmarks of the user represented in the corresponding frame; determine that the confidence score of at least one of the one or more body landmarks exceeds a threshold value; and include the pose orientation of the corresponding frame in the series of pose orientations.
  15. The system of claim 13, wherein the one or more processing units further to: apply an exponential weighted average on the series of pose orientations corresponding to the plurality of frames to smooth one or more user movements across the series of pose orientations.
  16. The system of claim 13, wherein the one or more processing units further to: identify an audio stream corresponding to the video stream; determine speech timing data of the audio stream; determine a correlation between the series of pose orientations corresponding to the plurality of frames and the speech timing data of the audio stream, wherein the engagement metric of the user is further based on the correlation; determine a statistical distribution of the series of pose orientations corresponding to the audio stream; and determine, based on the statistical distribution, a percentage of time during the audio stream that the engagement metric satisfies a threshold, wherein the engagement metric corresponds to the percentage of time.
  17. The system of claim 16, wherein to determine the correlation between the series of pose orientations corresponding to the plurality of frames and the speech timing data of the audio stream, the one or more processing units further to: determine an audio processing latency based on an utterance length of the audio stream and a voice activity detection delay associated with the audio stream; and align a timestamp of the engagement metric with a portion of the audio stream corresponding to the utterance length using the audio processing latency to fuse the engagement metric with the speech timing data of the audio stream.
  18. The system of claim 13, wherein responsive to determining that the engagement metric satisfies a disengaged criterion, the action causes the representation of the avatar to not respond to the interaction of the user corresponding to the video stream.
  19. The system of claim 13, wherein responsive to determining that the engagement metric satisfies a distracted criterion, the action causes the representation of the avatar to (1) request clarification regarding an intent of the user behind the interaction, (2) implement an attention-recovery strategy conversation, or (3) implement temporal buffering.
  20. One or more processors comprising: circuitry to control a digital human interaction based on a user engagement metric determined based on at least one of a series of pose orientations corresponding to a plurality of frames of a video stream received during an interaction of a user with the digital human.
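
The following is a minimal Python sketch (not part of the publication) of the landmark confidence gating recited in claims 2-3: a pose-estimation model is assumed to return per-landmark coordinates and a confidence score, and a frame's pose orientation enters the series only when at least one landmark clears a threshold. The Landmark class, the threshold value, and the estimate_orientation callable are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Landmark:
    x: float           # image-space coordinates output by the model
    y: float
    confidence: float  # per-landmark confidence score

CONFIDENCE_THRESHOLD = 0.6  # assumed value; the publication only recites "a threshold value"

def build_orientation_series(
    frames: List[Dict[str, Landmark]],
    estimate_orientation: Callable[[Dict[str, Landmark]], Tuple[float, float, float]],
) -> List[Tuple[float, float, float]]:
    """Keep a frame's pose orientation only if at least one landmark is confident enough."""
    series = []
    for landmarks in frames:
        if any(lm.confidence > CONFIDENCE_THRESHOLD for lm in landmarks.values()):
            series.append(estimate_orientation(landmarks))
    return series
```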
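
Claim 4 recites applying an exponential weighted average over the series of pose orientations to smooth user movements. A minimal sketch, assuming the orientation is expressed as (yaw, pitch, roll) angles and using a smoothing factor alpha chosen purely for illustration:

```python
from typing import List, Tuple

def smooth_orientations(
    series: List[Tuple[float, float, float]],
    alpha: float = 0.3,  # assumed smoothing factor; larger values track recent frames more closely
) -> List[Tuple[float, ...]]:
    """Exponentially weighted average over a series of (yaw, pitch, roll) orientations."""
    smoothed: List[Tuple[float, ...]] = []
    state = None
    for orientation in series:
        if state is None:
            state = tuple(orientation)  # initialize with the first observation
        else:
            state = tuple(
                alpha * new + (1.0 - alpha) * old
                for new, old in zip(orientation, state)
            )
        smoothed.append(state)
    return smoothed
```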
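
Claims 5, 8, and 9 describe comparing head rotation against a predetermined engagement condition and reducing the result to a percentage of time during the utterance. In this illustrative sketch the condition is a pair of assumed angular limits on yaw and pitch; the actual condition and thresholds are not fixed by the publication.

```python
MAX_YAW_DEG = 25.0    # assumed angular limits for "facing the avatar"
MAX_PITCH_DEG = 20.0

def is_facing_avatar(yaw_deg: float, pitch_deg: float) -> bool:
    """Predetermined engagement condition: head roughly oriented toward the camera/avatar."""
    return abs(yaw_deg) <= MAX_YAW_DEG and abs(pitch_deg) <= MAX_PITCH_DEG

def engagement_percentage(orientations) -> float:
    """Percentage of frames during the utterance in which the condition holds."""
    if not orientations:
        return 0.0
    facing = sum(1 for yaw, pitch, _roll in orientations if is_facing_avatar(yaw, pitch))
    return 100.0 * facing / len(orientations)
```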
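
Claims 6 and 7 fuse the visual engagement signal with speech timing by compensating for audio processing latency. A rough sketch under the assumption that the latency is modeled as the utterance length plus the voice-activity-detection delay; the variable names are illustrative, not taken from the publication.

```python
def align_engagement_to_utterance(
    metric_timestamp_s: float,   # time at which the engagement metric was produced
    utterance_length_s: float,   # duration of the detected utterance
    vad_delay_s: float,          # voice-activity-detection delay
) -> float:
    """Return the start time of the audio segment the engagement metric should be fused with."""
    audio_processing_latency = utterance_length_s + vad_delay_s
    return metric_timestamp_s - audio_processing_latency
```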
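
Finally, claims 10-12 map the engagement metric onto avatar behavior: suppress the response for a disengaged user, attempt recovery for a distracted user, and keep the standard conversational flow for an attentive user. The thresholds and action identifiers below are assumptions for illustration only.

```python
DISENGAGED_BELOW_PCT = 20.0   # assumed thresholds on the engagement percentage
ATTENTIVE_ABOVE_PCT = 70.0

def choose_avatar_action(engagement_pct: float) -> str:
    """Select an avatar action from the engagement percentage (illustrative policy)."""
    if engagement_pct < DISENGAGED_BELOW_PCT:
        return "suppress_response"        # claims 10/18: do not respond to the utterance
    if engagement_pct < ATTENTIVE_ABOVE_PCT:
        return "request_clarification"    # claims 11/19: one of the recovery options
    return "standard_response"            # claim 12: maintain conversational flow
```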

Description

RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 63/717,883, filed Nov. 7, 2024, the entire contents of which are incorporated herein by reference.

TECHNICAL FIELD

At least one embodiment pertains to systems and techniques for implementing a multimodal interaction system for digital humans.

BACKGROUND

Digital human and conversational AI systems have become increasingly prevalent in applications ranging from customer service and healthcare to entertainment and education. These systems typically rely on voice-based interactions in which users speak to digital avatars or chatbots that process audio input through automatic speech recognition and respond with synthesized speech or text. Current solutions focus primarily on understanding the semantic content of user utterances and generating appropriate textual or vocal responses based on natural language processing algorithms.

BRIEF DESCRIPTION OF DRAWINGS

Various embodiments in accordance with the present disclosure will be described with reference to the drawings, in which: FIG. 1 is a block diagram of an example architecture of a computing system capable of performing real-time multimodal interaction for digital humans, according to at least one embodiment; FIG. 2 is a flow diagram of an example method of determining an engagement metric of a user during an interaction with a digital human, according to at least one embodiment; FIGS. 3A-B illustrate example video frames and graphs showing face angle measurements over time, according to at least one embodiment; FIG. 4 illustrates a block diagram of an example multimodal interaction system architecture, according to at least one embodiment; FIG. 5A illustrates inference and/or training logic, according to at least one embodiment; FIG. 5B illustrates inference and/or training logic, according to at least one embodiment; FIG. 6 illustrates an example data center system, according to at least one embodiment; FIG. 7 illustrates a computer system, according to at least one embodiment; FIG. 8 illustrates a computer system, according to at least one embodiment; FIG. 9 illustrates at least portions of a graphics processor, according to one or more embodiments; FIG. 10 illustrates at least portions of a graphics processor, according to one or more embodiments; FIG. 11 is an example data flow diagram for an advanced computing pipeline, in accordance with at least one embodiment; FIG. 12 is a system diagram for an example system for training, adapting, instantiating and deploying machine learning models in an advanced computing pipeline, in accordance with at least one embodiment; and FIGS. 13A and 13B illustrate a data flow diagram for a process to train a machine learning model, as well as client-server architecture to enhance annotation tools with pre-trained annotation models, in accordance with at least one embodiment.

DETAILED DESCRIPTION

Modern voice and chatbot systems deployed as digital humans, interactive kiosks, or cloud-based agents lack a reliable mechanism to determine whether an utterance captured from a shared acoustic environment is actually directed to the agent or to another person or subject in the vicinity of the speaker. Current digital human interaction systems lack the ability to perceive and interpret visual cues from users, resulting in unnatural and ineffective conversations.
In real-world settings, users routinely speak while glancing away, shift attention to bystanders mid-utterance, or carry on side conversations in proximity to a microphone. Existing digital avatars and chatbots cannot distinguish when users are actively engaged versus distracted, leading to inappropriate responses when users are looking away, talking to third parties, or otherwise not focused on the interaction. Traditional systems often rely solely on audio input without understanding the user's visual context. When users engage in cross-talk or speak to someone else in a room, digital avatars cannot detect this and may inappropriately respond to conversations not directed at them. Similarly, when users become visually distracted or turn away during an interaction, the system traditionally continues to operate as if the user remains fully engaged, missing important contextual cues that would inform a more natural response. The absence of real-time visual perception capabilities in digital human (also referred to as digital avatar) systems maintains a gap between human-to-human interactions, where visual engagement cues are naturally understood, and human-to-digital interactions, where such cues are ignored. This limitation reduces the effectiveness of digital humans in applications such as healthcare, customer service, and other scenarios where understanding user attention and engagement state may be important for providing appropriate responses. Furthermore, existing computer vision solutions for human pose detection often perform poorly when