US-12620188-B2 - Avatar generation from digital media content items
Abstract
A system for generating avatars from user self-images is disclosed, whereby the system accesses a media content item of a user that includes a face of the user, analyzes data associated with the media content item using a first machine learning model to generate a first modified media content item, parses a portion of the first modified media content item corresponding to the face of the user, and analyzes data associated with the portion of the first modified media content item using a second machine learning model to generate a digital avatar for the user.
Inventors
- Arnab Ghosh
- Sergei Gorbatiuk
- Pavel Savchenkov
- Sergey Smetanin
Assignees
- SNAP INC.
Dates
- Publication Date
- 20260505
- Application Date
- 20230705
Claims (19)
- 1 . A system comprising: at least one processor; and at least one memory component storing instructions that, when executed by the at least one processor, cause the at least one processor to perform operations comprising: accessing a media content item of a user that includes a face of the user; analyzing data associated with the media content item using a first machine learning model to generate a first modified media content item; parsing a portion of the first modified media content item corresponding to the face of the user; analyzing data associated with the portion of the first modified media content item using a second machine learning model to generate a digital avatar for the user; and removing one or more artifacts of the digital avatar by analyzing the media content item and the digital avatar using a third machine learning model to receive a third modified media content item, wherein the third machine learning model is trained to compare media content items and modified media content items to remove artifacts in the modified media content items.
- 2 . The system of claim 1 , wherein the operations further comprise identifying a prompt of the user indicating an intent for the media content item, wherein analyzing data using the first or second machine learning model further comprises processing data associated with the identified prompt.
- 3 . The system of claim 2 , wherein identifying the prompt for the user comprises receiving a question or request from the user via text or speech.
- 4 . The system of claim 2 , wherein identifying the prompt for the user comprises automatically generating the prompt based on an intent identified from real-time interaction data captured by an interaction client of the user.
- 5 . The system of claim 2 , wherein the operations further comprise identifying keywords from the prompt and applying weights to each of the identified keywords, wherein analyzing the data comprises applying the identified keywords and corresponding weights to the first or second machine learning model.
- 6 . The system of claim 1 , wherein the first machine learning model is trained to maintain one or more first facial features or reduce an amount of modification to the one or more first facial features, while modifying one or more second facial features.
- 7 . The system of claim 6 , wherein the second machine learning model is trained to modify the one or more first facial features.
- 8 . The system of claim 1 , wherein the digital avatar includes a modified face of the user in a same pose as the face of the user in the media content item.
- 9 . The system of claim 1 , wherein removing the one or more artifacts of the digital avatar is based on a comparison between the face in the media content item and the face in the digital avatar.
- 10 . The system of claim 1 , wherein parsing the portion of the first modified media content item corresponding to the face of the user includes parsing hair from the face, wherein the digital avatar includes hair for a face in the digital avatar generated by the second machine learning model.
- 11 . The system of claim 1 , wherein the operations further comprise: causing display of a first selectable user interface element associated with the digital avatar; and in response to a user selection of the first selectable user interface element: applying a first content augmentation of the digital avatar to a camera feed from a camera system; and displaying the camera feed with the applied first content augmentation on a user interface for the user.
- 12 . The system of claim 11 , wherein the operations further comprise: displaying a second selectable user interface element; and in response to a user selection of the second selectable user interface element: capturing a picture or video of the camera feed with the applied first content augmentation; displaying a third selectable user interface element; and in response to a user selection of the third selectable user interface element, transmitting the captured picture or video to a second user.
- 13 . The system of claim 1 , wherein parsing the portion of the first modified media content item corresponding to the face of the user comprises extracting facial features from the first modified media content item.
- 14 . The system of claim 1 , wherein the first machine learning model is trained using unsupervised learning to generate first modified media content items based on original media content items, wherein the first machine learning model is trained to map facial landmarks on input media content items to landmarks of certain modifications to generate the modified media content items.
- 15 . The system of claim 14 , wherein the second machine learning model is trained using supervised learning to generate digital avatars based on portions of the first modified media content items.
- 16 . The system of claim 1 , wherein the first machine learning model is trained to apply a discriminator network that takes as input media content items and modified media content items and outputs a determination whether the modified media content item is a real or fake image.
- 17 . A method comprising: accessing a media content item of a user that includes a face of the user; analyzing data associated with the media content item using a first machine learning model to generate a first modified media content item, wherein the first machine learning model is trained using unsupervised learning to generate first modified media content items based on original media content items, wherein the first machine learning model is trained to map facial landmarks on input media content items to landmarks of certain modifications to generate the modified media content items; parsing a portion of the first modified media content item corresponding to the face of the user; and analyzing data associated with the portion of the first modified media content item using a second machine learning model to generate a digital avatar for the user, wherein the second machine learning model is trained to apply noise to input media content items and then to remove the noise inputted into the media content items to generate modified media content items, wherein the second machine learning model is trained using supervised learning to generate digital avatars based on portions of the first modified media content items.
- 18 . The method of claim 17 , further comprising removing one or more artifacts of the digital avatar based on a comparison between the face in the media content item and the face in the digital avatar.
- 19 . A non-transitory computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform operations comprising: accessing a media content item of a user that includes a face of the user; analyzing data associated with the media content item using a first machine learning model to generate a first modified media content item; parsing a portion of the first modified media content item corresponding to the face of the user; analyzing data associated with the portion of the first modified media content item using a second machine learning model to generate a digital avatar for the user; and training the second machine learning model by: identifying modified media content items and expected media content items; applying the modified media content items to receive output media content items; comparing the output media content items with the expected media content items to determine a loss function for the second machine learning model; and updating one or more parameters of the second machine learning model based on the loss function.
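The training steps recited in claim 19 (apply the model to modified media content items, compare the outputs with expected items to determine a loss function, then update the model's parameters) follow a standard supervised-learning loop. The sketch below illustrates that loop with a toy one-parameter linear model in place of the claimed avatar model; the function names, the scalar model, and the learning rate are illustrative assumptions, not part of the patent.

```python
# Minimal sketch of the supervised training loop in claim 19, using a toy
# scalar model y = w * x as a stand-in for the second machine learning model.
# All identifiers here are illustrative, not taken from the patent.

def mse_loss(outputs, expected):
    """Compare output items with expected items (claim 19's loss step)."""
    return sum((o - e) ** 2 for o, e in zip(outputs, expected)) / len(outputs)

def train_step(w, modified_items, expected_items, lr=0.05):
    # Apply the model to the modified media content items to get outputs.
    outputs = [w * x for x in modified_items]
    loss = mse_loss(outputs, expected_items)
    # Analytic gradient of the MSE loss w.r.t. w for this linear toy model.
    grad = sum(2 * (w * x - e) * x
               for x, e in zip(modified_items, expected_items)) / len(modified_items)
    # Update the model parameter based on the loss gradient (claim 19's update step).
    return w - lr * grad, loss

w = 0.0
xs = [1.0, 2.0, 3.0]   # stand-ins for modified media content items
ys = [2.0, 4.0, 6.0]   # stand-ins for expected media content items
for _ in range(200):
    w, loss = train_step(w, xs, ys)
print(round(w, 2))  # converges toward 2.0, since ys = 2 * xs
```

In the claimed system the model, loss, and update rule would be far richer (e.g., image-space or perceptual losses over media content items), but the loop structure — forward pass, loss comparison, parameter update — is the same.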
Description
CLAIM OF PRIORITY

This application claims the benefit of priority to U.S. Provisional Application Ser. No. 63/504,984, filed on May 30, 2023, which is incorporated herein by reference in its entirety.

TECHNICAL FIELD

The present disclosure relates generally to avatars, and more specifically to the generation of avatars from digital media content items.

BACKGROUND

Avatars have gained popularity in recent years due to several factors that cater to the evolving needs of users in the digital world. Avatars enable users to create a digital representation of themselves, offering a unique and customized presence in online spaces. This personal touch allows users to express their identity and personality in a way that static images or text cannot. Also, by representing themselves through an avatar, users can protect their real-life identity while still engaging with others in a meaningful way. Moreover, immersive experiences, such as virtual reality (VR) and augmented reality (AR) technologies, have driven demand for avatars that can interact in these environments. Their increasing adoption in various applications and platforms reflects the growing demand for more engaging and immersive digital experiences.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

In the drawings, which are not necessarily drawn to scale, like numerals may describe similar components in different views. To identify the discussion of any particular element or act more easily, the most significant digit or digits in a reference number refer to the figure number in which that element is first introduced. Some non-limiting examples are illustrated in the figures of the accompanying drawings in which:
- FIG. 1 is a diagrammatic representation of a networked environment in which the present disclosure may be deployed, according to some examples.
- FIG. 2 is a diagrammatic representation of an interaction system that has both client-side and server-side functionality, according to some examples.
- FIG. 3 is a diagrammatic representation of a data structure as maintained in a database, according to some examples.
- FIG. 4 illustrates an example flowchart for generating a modified self-image of a user, according to some examples.
- FIG. 5 illustrates an example for generating an avatar, according to some examples.
- FIG. 6 illustrates the generation of stylized media content items, according to some examples.
- FIG. 7 illustrates generation of avatars for the user, according to some examples.
- FIG. 8 is a diagrammatic representation of a message, according to some examples.
- FIG. 9 illustrates a system including a head-wearable apparatus with a selector input device, according to some examples.
- FIG. 10 is a diagrammatic representation of a machine in the form of a computer system within which a set of instructions may be executed to cause the machine to perform any one or more of the methodologies discussed herein, according to some examples.
- FIG. 11 is a block diagram showing a software architecture within which examples may be implemented.
- FIG. 12 illustrates a machine-learning pipeline, according to some examples.
- FIG. 13 illustrates training and use of a machine-learning program, according to some examples.

DETAILED DESCRIPTION

Avatars are widely used in various applications to provide a more personalized and engaging user experience. For example, avatars are used as profile pictures and can be customized to resemble users' appearances, with a range of facial features, hairstyles, and clothing options. Moreover, users can create their own personalized stickers using avatars, offering more engaging communication options. Players can also customize their in-game avatars with various skins, emotes, and accessories, enabling them to express their personalities in the virtual world. Content augmentations and filters are also used by users to transform their faces into avatars, creating fun and interactive multimedia experiences.
However, traditional methods and systems for avatar generation pose significant challenges. These systems train machine learning models in real time using self-images inputted by a user. Real-time training fine-tunes the machine learning model to generate modified self-images custom-tailored to the user. However, such real-time training is time-consuming, especially if the models are complex and the datasets are large. This can lead to latency issues that degrade the user experience, especially in applications where an immediate response is expected. Another challenge is the preservation of the user's identity in the generated avatar. Because the models are trained in real time on a limited set of user-inputted self-images, traditional systems may fail to accurately capture the unique features of the user's face, making the user unrecognizable in the avatar. Real-time training of models is also resource-intensive, requiring high computational power, which can limit the accessibility of such features to users with less powerful devices.