US-12620173-B2 - Hand-related data annotations with an augmented reality device
Abstract
An augmented reality (AR) device generates hand annotations for an image depicting a user's hand. The device includes a display, a processor, and a memory storing instructions for performing operations. The device performs a calibration operation to generate a 3-D model of the user's hand based on measurements. During calibration, the user is prompted to mimic a hand gesture presented by an animated virtual representation of a hand, so that images well suited to hand pose estimation can be captured. A 3-D virtual representation of a hand in a hand pose corresponding with a hand gesture is then generated based on the 3-D model. The device presents the 3-D virtual representation of the hand in AR via the display. During presentation, the device detects an input and captures an image of the user's hand positioned to correspond with the 3-D virtual representation. The captured image is stored with corresponding hand annotations based on the 3-D model.
Inventors
- Laura Rosalia Luidolt
- Kai Zhou
- Adrian Schoisengeier
Assignees
- SNAP INC.
Dates
- Publication Date: 2026-05-05
- Application Date: 2023-08-16
Claims (19)
- 1 . An augmented reality (AR) device configured to generate hand annotations for an image depicting a hand of a user, the AR device comprising: a display; a processor; two or more image sensors; a memory storing instructions thereon, which, when executed by the processor, cause the AR device to perform operations comprising: performing a calibration operation to generate a three-dimensional (3-D) data model of the hand of the user based on measurements relating to the hand of the user, the calibration operation comprising: presenting, in AR via the display, an animated hand gesture; prompting the user to position and move a hand to correspond with the animated hand gesture; while the animated hand gesture is being presented, capturing one or more images of the hand of the user; and using the one or more images as input to a pre-trained hand pose estimation model to generate the 3-D data model of the hand of the user; generating, based on the 3-D data model of the hand of the user, a 3-D virtual representation of a hand in a hand pose corresponding with a hand gesture different from the animated hand gesture; presenting, in AR via the display, the 3-D virtual representation of the hand in the hand pose corresponding with the hand gesture; during presentation of the 3-D virtual representation of the hand in the hand pose corresponding with the hand gesture in AR, detecting an input; responsive to detecting the input, invoking at least one image sensor to capture an image of the hand of the user positioned to correspond with the 3-D virtual representation of the hand in the hand pose corresponding with the hand gesture; and storing the captured image of the hand of the user with corresponding hand annotations based on the 3-D data model of the hand of the user.
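For readers who want a concrete picture of the flow recited in claim 1, the following is a minimal Python sketch of the calibration and capture loop. It is an illustration only: the `device` object and every one of its methods (`record_hand_frames`, `pose_estimator.fit`, `render_in_ar`, `wait_for_trigger`, `capture_image`, `store`) are hypothetical placeholders, not APIs defined by the patent or any real AR SDK.

```python
# Sketch of the claim 1 flow; all device-facing names are hypothetical.

def calibrate(device):
    """Present an animated hand gesture, capture frames of the user's
    hand while it mimics the animation, and fit a 3-D data model of
    the hand with a pre-trained hand pose estimation model."""
    frames = device.record_hand_frames(animation="calibration_gesture")
    return device.pose_estimator.fit(frames)

def capture_annotated_image(device, hand_model, gesture):
    """Pose the calibrated model into a target gesture, present it in
    AR, and store an annotated image once the user triggers capture."""
    posed = hand_model.apply_pose(gesture.joint_angles)
    device.render_in_ar(posed)       # user aligns their hand with it
    device.wait_for_trigger()        # voice, gesture, or button press
    image = device.capture_image()
    # The annotations are the posed model's 3-D landmark positions at
    # the moment of capture.
    device.store(image, annotations=posed.landmark_positions())
```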
- 2 . The AR device of claim 1 , wherein the pre-trained hand pose estimation model generates the 3-D data model of the hand as a skeleton-based hand mesh representation by computing a fixed number of mesh vertices to represent the hand of the user, based on 3-D location and rotation data derived from one or more captured images depicting the hand of the user.
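Claim 2's "fixed number of mesh vertices" describes a parametric, skeleton-driven mesh. Below is a toy numpy sketch of that idea using linear blend skinning: a fixed template is deformed by per-joint rotations, so every pose yields the same vertex count. The template, joint centers, and skinning weights here are random illustrative data, not the patent's model; real systems typically use a learned template such as MANO, which uses 778 vertices.

```python
# Toy skeleton-driven hand mesh with a fixed vertex count (claim 2).
import numpy as np

N_VERTICES, N_JOINTS = 778, 21   # illustrative sizes

rng = np.random.default_rng(0)
template = rng.normal(size=(N_VERTICES, 3))   # rest-pose vertices
joints = rng.normal(size=(N_JOINTS, 3))       # rest-pose joint centers
weights = rng.dirichlet(np.ones(N_JOINTS), size=N_VERTICES)  # skinning weights

def axis_angle_to_matrix(aa):
    """Rodrigues' formula: 3-vector axis-angle -> 3x3 rotation matrix."""
    theta = np.linalg.norm(aa)
    if theta < 1e-8:
        return np.eye(3)
    k = aa / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def pose_mesh(joint_rotations):
    """Simplified linear blend skinning: each vertex is a weighted blend
    of the template rotated about each joint center (no kinematic chain)."""
    posed = np.zeros_like(template)
    for j in range(N_JOINTS):
        R = axis_angle_to_matrix(joint_rotations[j])
        moved = (template - joints[j]) @ R.T + joints[j]
        posed += weights[:, j:j + 1] * moved
    return posed

vertices = pose_mesh(rng.normal(scale=0.1, size=(N_JOINTS, 3)))
print(vertices.shape)  # (778, 3): the vertex count is fixed across poses
```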
- 3 . The AR device of claim 1 , wherein the 3-D virtual representation of the hand comprises: a synthetic hand mesh; a point cloud; or a skeletal model.
- 4 . The AR device of claim 1 , wherein generating the 3-D virtual representation of the hand in the hand pose corresponding with the hand gesture, based on the 3-D data model of the hand of the user, further comprises: mapping motion data of an animation of a hand performing a gesture to corresponding positional data for joints or bones, as represented by the 3-D data model of the hand of the user; adjusting joint angles of the 3-D data model of the hand according to the motion data of the animation; and rendering a plurality of two-dimensional (2-D) images of the 3-D virtual representation of the hand in the hand pose corresponding with the hand gesture, each 2-D image in the plurality of 2-D images representing a viewpoint of the 3-D virtual representation of the hand.
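The final step of claim 4 (rendering a plurality of 2-D images, one per viewpoint) can be illustrated with a pinhole projection of the posed landmarks from several camera positions. A real renderer would rasterize the full mesh; this numpy sketch, with toy intrinsics and viewpoints, shows only the multi-view geometry.

```python
# Multi-viewpoint projection sketch for claim 4's rendering step.
import numpy as np

def look_at(eye, target=np.zeros(3), up=np.array([0.0, 1.0, 0.0])):
    """World-to-camera rotation/translation for a camera at `eye`
    looking toward `target`."""
    z = target - eye
    z = z / np.linalg.norm(z)        # camera forward axis
    x = np.cross(z, up)
    x = x / np.linalg.norm(x)        # camera right axis
    y = np.cross(x, z)               # camera up axis
    R = np.stack([x, y, z])          # rows are the camera axes
    return R, -R @ eye

def project(points, eye, focal=500.0, center=(320.0, 240.0)):
    """Pinhole projection of (N, 3) world points into pixel coordinates."""
    R, t = look_at(eye)
    cam = points @ R.T + t           # transform into the camera frame
    return focal * cam[:, :2] / cam[:, 2:3] + np.asarray(center)

# Posed hand landmarks near the origin (toy data), seen from three views.
posed_landmarks = np.random.default_rng(1).normal(size=(21, 3)) * 0.05
viewpoints = [np.array([0.0, 0.0, -0.5]),
              np.array([0.3, 0.1, -0.4]),
              np.array([-0.3, 0.1, -0.4])]
views_2d = [project(posed_landmarks, eye) for eye in viewpoints]
print([v.shape for v in views_2d])   # three (21, 2) arrays, one per viewpoint
```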
- 5 . The AR device of claim 1 , wherein presenting, in AR via the display, the 3-D virtual representation of the hand in the hand pose corresponding with the hand gesture further comprises: presenting the 3-D virtual representation of the hand fixed in AR space, thereby enabling the user to move around in physical space to observe the 3-D virtual representation of the hand from different angles and different perspectives.
- 6 . The AR device of claim 1 , wherein the corresponding hand annotations based on the 3-D data model of the hand of the user are hand annotations representing 3-D positional data for various points of the 3-D data model of the hand of the user, wherein the 3-D positional data of each point has been adjusted to reflect the hand pose corresponding with the hand gesture as depicted by the 3-D virtual representation of the hand when the image of the hand of the user was captured.
- 7 . The AR device of claim 1 , wherein the memory stores additional instructions thereon, which, when executed by the processor, cause the AR device to perform additional operations comprising: prior to storing the captured image of the hand of the user with corresponding hand annotations based on the 3-D data model of the hand of the user, performing a mesh fitting operation to improve accuracy of 3-D positional data of various points of the 3-D data model of the hand, based on analysis of the image of the hand of the user.
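Claim 7's mesh fitting operation is not spelled out here; one common approach is to minimize reprojection error between the model's projected 3-D points and 2-D keypoints detected in the captured image. The sketch below uses plain gradient descent on camera-space landmark positions as a stand-in; a production system would more likely run a Gauss-Newton or Levenberg-Marquardt solver over the mesh parameters, and the step size below is tuned only for the toy data shown.

```python
# Toy reprojection-error refinement (stand-in for claim 7's mesh fitting).
import numpy as np

def refine_landmarks(landmarks_3d, detected_2d, focal=500.0,
                     center=(320.0, 240.0), steps=500, lr=5e-7):
    """landmarks_3d: (N, 3) points in camera coordinates (z > 0).
    detected_2d: (N, 2) pixel keypoints found by analyzing the image.
    Returns points nudged to reduce reprojection error."""
    p = landmarks_3d.copy()
    c = np.asarray(center)
    for _ in range(steps):
        proj = focal * p[:, :2] / p[:, 2:3] + c   # pinhole projection
        r = proj - detected_2d                    # (N, 2) pixel residuals
        grad = np.zeros_like(p)                   # d(0.5*|r|^2)/dp, analytic
        grad[:, :2] = r * focal / p[:, 2:3]
        grad[:, 2] = -(r * focal * p[:, :2] / p[:, 2:3] ** 2).sum(axis=1)
        p -= lr * grad
    return p

# Synthetic check: perturb ground-truth landmarks, then refine against
# their true projections; the reprojection error shrinks.
rng = np.random.default_rng(2)
gt = np.concatenate([rng.normal(scale=0.05, size=(21, 2)),
                     rng.uniform(0.4, 0.6, size=(21, 1))], axis=1)
detected = 500.0 * gt[:, :2] / gt[:, 2:3] + np.array([320.0, 240.0])
refined = refine_landmarks(gt + rng.normal(scale=0.005, size=gt.shape),
                           detected)
```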
- 8 . The AR device of claim 1 , wherein the detected input results from a triggering event comprising: an audible command; a hand gesture; a press of a button located on the AR device; or a press of a button located on a device communicatively coupled with the AR device.
- 9 . The AR device of claim 1 , wherein generating the 3-D virtual representation of a hand in the hand pose corresponding with the hand gesture comprises: selecting motion data for a particular hand gesture from a library of animation files, each animation file in the library containing pre-captured motion data for a different hand gesture; and using the selected motion data to generate the 3-D virtual representation of the hand in the hand pose corresponding with the particular hand gesture.
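Claim 9's animation library can be pictured as a set of files, each holding pre-captured motion data for one gesture. The file layout and field names below ("name", "frames") are illustrative assumptions, not a format the patent defines.

```python
# Sketch of claim 9's gesture animation library.
import json
from dataclasses import dataclass

@dataclass
class GestureAnimation:
    name: str
    frames: list   # per frame: per-joint rotation data for the gesture

def load_library(paths):
    """Load each animation file into a gesture-name-keyed library.
    Assumed JSON layout: {"name": "pinch", "frames": [...]}."""
    library = {}
    for path in paths:
        with open(path) as f:
            data = json.load(f)
        library[data["name"]] = GestureAnimation(data["name"], data["frames"])
    return library

def select_motion_data(library, gesture_name):
    """Pick the pre-captured motion data used to pose the virtual hand."""
    return library[gesture_name].frames
```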
- 10 . A computer-implemented method comprising: performing a calibration operation to generate a three-dimensional (3-D) data model of a hand of a user based on measurements relating to the hand of the user, the calibration operation comprising: presenting, in augmented reality (AR) via a display, an animated hand gesture; prompting the user to position and move a hand to correspond with the animated hand gesture; while the animated hand gesture is being presented, capturing one or more images of the hand of the user; and using the one or more images as input to a pre-trained hand pose estimation model to generate the 3-D data model of the hand of the user; generating, based on the 3-D data model of the hand of the user, a 3-D virtual representation of a hand in a hand pose corresponding with a hand gesture different from the animated hand gesture; presenting, in AR via a display, the 3-D virtual representation of the hand in the hand pose corresponding with the hand gesture; during presentation of the 3-D virtual representation of the hand in the hand pose corresponding with the hand gesture in AR, detecting an input; responsive to detecting the input, invoking at least one image sensor to capture an image of the hand of the user positioned to correspond with the 3-D virtual representation of the hand in the hand pose corresponding with the hand gesture; and storing the captured image of the hand of the user with corresponding hand annotations based on the 3-D data model of the hand of the user.
- 11 . The computer-implemented method of claim 10 , wherein the pre-trained hand pose estimation model generates the 3-D data model of the hand as a skeleton-based hand mesh representation by computing a fixed number of mesh vertices to represent the hand of the user, based on 3-D location and rotation data derived from one or more captured images depicting the hand of the user.
- 12 . The computer-implemented method of claim 10 , wherein the 3-D virtual representation of the hand comprises: a synthetic hand mesh; a point cloud; or a skeletal model.
- 13 . The computer-implemented method of claim 10 , wherein generating the 3-D virtual representation of the hand in the hand pose corresponding with the hand gesture, based on the 3-D data model of the hand of the user, further comprises: mapping motion data of an animation of a hand performing a gesture to corresponding positional data for joints or bones, as represented by the 3-D data model of the hand of the user; adjusting joint angles of the 3-D data model of the hand according to the motion data of the animation; and rendering a plurality of two-dimensional (2-D) images of the 3-D virtual representation of the hand in the hand pose corresponding with the hand gesture, each 2-D image in the plurality of 2-D images representing a viewpoint of the 3-D virtual representation of the hand.
- 14 . The computer-implemented method of claim 10 , wherein presenting, in AR via the display, the 3-D virtual representation of the hand in the hand pose corresponding with the hand gesture further comprises: presenting the 3-D virtual representation of the hand fixed in AR space, thereby enabling the user to move around in physical space to observe the 3-D virtual representation of the hand from different angles and different perspectives.
- 15 . The computer-implemented method of claim 10 , wherein the corresponding hand annotations based on the 3-D data model of the hand of the user are hand annotations representing 3-D positional data for various points of the 3-D data model of the hand of the user, wherein the 3-D positional data of each point has been adjusted to reflect the hand pose corresponding with the hand gesture as depicted by the 3-D virtual representation of the hand when the image of the hand of the user was captured.
- 16 . The computer-implemented method of claim 10 , further comprising: prior to storing the captured image of the hand of the user with corresponding hand annotations based on the 3-D data model of the hand of the user, performing a mesh fitting operation to improve accuracy of 3-D positional data of various points of the 3-D data model of the hand, based on analysis of the image of the hand of the user.
- 17 . The computer-implemented method of claim 10 , wherein the detected input results from a triggering event comprising: an audible command; a hand gesture; a press of a button located on an AR device; or a press of a button located on a device communicatively coupled with the AR device.
- 18 . The computer-implemented method of claim 10 , wherein generating the 3-D virtual representation of a hand in the hand pose corresponding with the hand gesture comprises: selecting motion data for a particular hand gesture from a library of animation files, each animation file in the library containing pre-captured motion data for a different hand gesture; and using the selected motion data to generate the 3-D virtual representation of the hand in the hand pose corresponding with the particular hand gesture.
- 19 . An augmented reality (AR) device configured to generate hand annotations for an image depicting a hand of a user, the AR device comprising: means for performing a calibration operation to generate a three-dimensional (3-D) data model of the hand of the user based on measurements relating to the hand of the user, the calibration operation comprising: presenting, in AR via a display, an animated hand gesture; prompting the user to position and move a hand to correspond with the animated hand gesture; while the animated hand gesture is being presented, capturing one or more images of the hand of the user; and using the one or more images as input to a pre-trained hand pose estimation model to generate the 3-D data model of the hand of the user; means for generating, based on the 3-D data model of the hand of the user, a 3-D virtual representation of a hand in a hand pose corresponding with a hand gesture different from the animated hand gesture; means for presenting, in AR, the 3-D virtual representation of the hand in the hand pose corresponding with the hand gesture; during presentation of the 3-D virtual representation of the hand in the hand pose corresponding with the hand gesture in AR, means for detecting an input; responsive to detecting the input, means for invoking at least one image sensor to capture an image of the hand of the user positioned to correspond with the 3-D virtual representation of the hand in the hand pose corresponding with the hand gesture; and means for storing the captured image of the hand of the user with corresponding hand annotations based on the 3-D data model of the hand of the user.
Description
TECHNICAL FIELD

The present application relates to the field of computer vision systems. More specifically, the subject matter of the present application relates to an automated technique for generating hand-related data annotations for images depicting various hand poses.

BACKGROUND

Hand-related data annotations, sometimes referred to more simply as hand annotations, play a crucial role in computer vision, augmented reality (AR), and gesture recognition systems. These systems rely on accurate and detailed information about hand poses, hand movements, and hand gestures to provide immersive user experiences, enable natural user interfaces, and enhance human-computer interaction. Hand annotations serve as training data for machine learning algorithms, enabling these systems to understand and interpret hand-related information in real time.

Generating hand annotations involves identifying and labeling landmarks, that is, specific points or marks on a hand (e.g., on an image of a hand), to provide detailed information about the hand's pose, movements, or other characteristics. In a hand model, the number of points or marks used can vary depending on the level of detail required and the specific application. However, a common approach is to use a set of key landmark points that capture the essential features and articulations of the hand. These landmark points typically include:

- Fingertips: Points located at the tips of each finger, representing the positions of the fingertips.
- Finger joints: Points along the length of each finger, representing the joint positions, such as the metacarpophalangeal (MCP), proximal interphalangeal (PIP), and distal interphalangeal (DIP) joints.
- Knuckles: Points located at the joints where the fingers meet the hand, capturing the positions of the knuckles.
- Palm center: A point at the center of the palm, representing the overall position of the hand.
- Wrist: A point at the base of the hand, indicating the position of the wrist.

Each point or mark in the hand model is associated with specific information that helps describe the hand's pose or movements. This information typically includes three-dimensional (3-D) positional data, or coordinates (e.g., X, Y, and Z coordinates), of each point or mark in 3-D space, defining the position of the point or mark relative to a reference point. These hand annotations, which collectively represent a 3-D model of a hand, provide valuable data for various applications. For instance, in hand gesture recognition, the positions of the fingertips and other landmark points can be used to classify and interpret different gestures. In AR, the hand annotations help track and overlay virtual objects onto the user's hand accurately. By obtaining images with hand-related data annotations for specific points or marks, it becomes possible to analyze and understand the intricacies of hand poses, and thus hand movements and hand gestures, enabling a wide range of applications that rely on accurate hand tracking and interpretation.
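As an illustration of the annotation structure described above, a single captured image might be paired with a record like the following. The field names and the per-finger MCP/PIP/DIP/tip naming are an assumed schema; the patent does not prescribe a serialization format.

```python
# One annotation record for a captured image; every landmark carries
# 3-D coordinates relative to a reference point (here, the wrist).
# The schema is an illustrative assumption, not defined by the patent.
annotation = {
    "image": "capture_0001.png",
    "landmarks": {
        "wrist":       {"x": 0.00, "y": 0.00, "z": 0.00},  # reference point
        "palm_center": {"x": 0.01, "y": 0.04, "z": 0.00},
        "index_mcp":   {"x": 0.02, "y": 0.09, "z": 0.00},  # knuckle
        "index_pip":   {"x": 0.02, "y": 0.12, "z": 0.01},
        "index_dip":   {"x": 0.02, "y": 0.14, "z": 0.02},
        "index_tip":   {"x": 0.02, "y": 0.16, "z": 0.02},  # fingertip
        # ... thumb, middle, ring, and little fingers follow the same pattern
    },
}
```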
BRIEF DESCRIPTION OF THE DRAWINGS

In the drawings, which are not necessarily drawn to scale, like numerals may describe similar components in different views. To easily identify the discussion of any particular element or operation, the most significant digit or digits in a reference number refer to the figure number in which that element is first introduced. Some non-limiting examples are illustrated in the figures of the accompanying drawings, in which:

FIG. 1 is a flow diagram illustrating an example of method operations involved in a technique for using an augmented reality (AR) device, such as AR glasses, to generate images of a hand in a hand pose, where each image is associated with corresponding hand-related data annotations, consistent with some examples.

FIG. 2 is a diagram illustrating an example of a calibration operation, during which a user is prompted to position and move his or her hand to replicate a hand gesture presented, in AR, via an animation of a 3-D virtual representation of a hand performing the gesture, according to some examples.

FIG. 3 is a diagram illustrating an example of a technique for generating a 3-D virtual representation of a hand in a hand pose that is associated with a hand gesture, where the virtual representation of the hand is generated at a size that corresponds with a 3-D model of the hand of the user, according to some examples.

FIG. 4 is a user interface diagram showing an example of a 3-D virtual representation of a hand in a hand pose associated with a hand gesture, as presented in AR space, according to some examples.

FIG. 5 is a user interface diagram showing a user attempting to replicate, with his or her hand, the hand pose presented by the 3-D virtual representation of a hand in AR space, according to some examples.

FIG. 6 is a user interface diagram showing two-dimensional (2-D) landmarks mapped to an image of a hand of a user, according to