US-20260127831-A1 - PIXEL-BASED MULTI-VIEW GARMENT TRANSFER

Abstract

Methods and systems are disclosed for using machine learning models to perform pixel-based deformation of fashion items. The methods and systems receive one or more images depicting a person in an individual pose and receive a first source image depicting a first view of a target fashion item and a second source image depicting a second view of the target fashion item. The methods and systems process, using one or more machine learning models, the one or more images that depict the person in the individual pose together with the first and second source images to generate a flow field, the flow field indicating a likelihood of existence and location of each pixel of the one or more images relative to the first and second source images. The methods and systems modify a portion of the one or more images to overlay the target fashion item on the person.
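The per-pixel selection that the abstract describes — using the flow field's likelihoods to decide, for each output pixel, which source view to sample from — can be sketched as follows. This is an illustrative reconstruction, not the patented implementation; all function and variable names, array shapes, and the 0.5 threshold are assumptions:

```python
import numpy as np

def composite_from_flow(person_img, src_a, src_b,
                        flow_a, flow_b, conf_a, conf_b,
                        threshold=0.5):
    """Choose each output pixel from one of two source views.

    flow_a / flow_b : (H, W, 2) integer (row, col) sampling coordinates
        into the corresponding source view.
    conf_a / conf_b : (H, W) likelihood that the pixel exists in that view.
    All names, shapes, and the threshold are illustrative assumptions.
    """
    out = person_img.copy()
    h, w = conf_a.shape
    inpaint_mask = np.zeros((h, w), dtype=bool)  # pixels left for a generative model
    for y in range(h):
        for x in range(w):
            ca, cb = conf_a[y, x], conf_b[y, x]
            if max(ca, cb) < threshold:
                # Neither source view covers this pixel:
                # leave it for generative fill-in.
                inpaint_mask[y, x] = True
            elif ca >= cb:
                sy, sx = flow_a[y, x]            # higher likelihood in view A
                out[y, x] = src_a[sy, sx]
            else:
                sy, sx = flow_b[y, x]            # higher likelihood in view B
                out[y, x] = src_b[sy, sx]
    return out, inpaint_mask
```

Pixels that neither view covers are collected in `inpaint_mask` so that a separate generative model can fill them in, mirroring the fallback the claims describe.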

Inventors

  • Avihay Assouline
  • Amir Fruchtman
  • Jonathan Heimann
  • Nir Malbin

Assignees

  • SNAP INC.

Dates

Publication Date
2026-05-07
Application Date
2025-12-30

Claims (20)

  1. A method comprising: processing, using one or more machine learning models, one or more images that depict an object in an individual pose together with first and second source images of a target to generate a flow field, the flow field indicating a likelihood of existence and location of each pixel of the one or more images relative to the first and second source images; and modifying, based on the flow field, a portion of the one or more images to overlay the target on the object in the individual pose, including a first portion of the target depicted in the first source image and a second portion of the target depicted in the second source image.
  2. The method of claim 1, further comprising: applying a pose estimation machine learning model to the one or more images to generate first pose estimation information representing the individual pose of the object; and applying the pose estimation machine learning model to the first and second source images to generate, respectively, second and third pose estimation information representing the poses of the target depicted in the first and second source images.
  3. The method of claim 2, wherein the first pose estimation information comprises an identification of parts of the object that are depicted in the one or more images.
  4. The method of claim 2, wherein the second pose estimation information comprises an identification of parts associated with a first view, and the third pose estimation information comprises an identification of parts associated with a second view.
  5. The method of claim 2, further comprising: processing, by a flow estimation machine learning model, the first pose estimation information, the second pose estimation information, and the third pose estimation information to generate the flow field indicating the likelihood of existence and the location of each pixel of the one or more images relative to the first and second source images.
  6. The method of claim 5, wherein the flow field indicates that a first pixel of the one or more images is associated with a first likelihood of existence in the first source image and the location of the first pixel, and wherein the flow field indicates that the first pixel of the one or more images is associated with a second likelihood of existence in the second source image and the location of the first pixel.
  7. The method of claim 6, further comprising: determining that the first likelihood is greater than the second likelihood; and in response to determining that the first likelihood is greater than the second likelihood, replacing the first pixel with a pixel value associated with the location of the first pixel in the first source image.
  8. The method of claim 6, further comprising: determining that the first likelihood is lower than the second likelihood; and in response to determining that the first likelihood is lower than the second likelihood, replacing the first pixel with a pixel value associated with the location of the first pixel in the second source image.
  9. The method of claim 6, further comprising: determining that the first and second likelihoods fail to transgress a minimum likelihood of existence threshold; and in response to determining that the first and second likelihoods fail to transgress the minimum likelihood of existence threshold, applying a generative machine learning model to the one or more images, the first source image, and the second source image to generate a pixel value for the first pixel.
  10. The method of claim 9, further comprising: replacing the first pixel with the pixel value generated by the generative machine learning model; and replacing a first set of pixels of the one or more images with a first portion of the first source image and a second set of pixels of the one or more images with a second portion of the second source image.
  11. The method of claim 1, further comprising: generating a first confidence map for the first source image that indicates the likelihood of existence and location of each pixel of the one or more images in the first source image; and generating a second confidence map for the second source image that indicates the likelihood of existence and location of each pixel of the one or more images in the second source image.
  12. The method of claim 11, further comprising: selectively replacing individual pixels in the one or more images with either pixels of the first source image or pixels of the second source image based on the first and second confidence maps.
  13. The method of claim 1, further comprising: sampling one or more pixels of the first and second source images based on the flow field to extract and adjust a pose of the target to match the individual pose of the object depicted in the one or more images.
  14. The method of claim 1, wherein the one or more machine learning models comprise a convolutional neural network associated with an extended reality (XR) experience.
  15. The method of claim 14, wherein the one or more machine learning models are trained by performing training operations comprising: accessing training data comprising a first training image depicting a first training object in a training pose, a second training image depicting a first training view of a second training object, a third training image depicting a second training view of the second training object, and a ground truth flow field for the second and third training images; analyzing, using the one or more machine learning models, the first, second, and third training images to estimate a flow field for the second and third training images; computing a loss based on a deviation between the estimated flow field for the second and third training images and the ground truth flow field; and updating one or more parameters of the one or more machine learning models based on the computed loss.
  16. The method of claim 15, further comprising: repeating the training operations for additional training data until a stopping criterion is met, wherein the ground truth flow field indicates whether a part depicted in the first training object exists in the second or third training images.
  17. The method of claim 1, wherein the target is a virtual object.
  18. The method of claim 1, wherein the target is a real-world object.
  19. A system comprising: at least one processor; and at least one memory component having instructions stored thereon that, when executed by the at least one processor, cause the at least one processor to perform operations comprising: processing, using one or more machine learning models, one or more images that depict an object in an individual pose together with first and second source images of a target to generate a flow field, the flow field indicating a likelihood of existence and location of each pixel of the one or more images relative to the first and second source images; and modifying, based on the flow field, a portion of the one or more images to overlay the target on the object in the individual pose, including a first portion of the target depicted in the first source image and a second portion of the target depicted in the second source image.
  20. A non-transitory computer-readable storage medium having stored thereon instructions that, when executed by at least one processor, cause the at least one processor to perform operations comprising: processing, using one or more machine learning models, one or more images that depict an object in an individual pose together with first and second source images of a target to generate a flow field, the flow field indicating a likelihood of existence and location of each pixel of the one or more images relative to the first and second source images; and modifying, based on the flow field, a portion of the one or more images to overlay the target on the object in the individual pose, including a first portion of the target depicted in the first source image and a second portion of the target depicted in the second source image.
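The training procedure of claim 15 — estimate a flow field, compute a loss from its deviation against a ground-truth flow field, and update model parameters — can be sketched with a toy linear model standing in for the convolutional network. Everything below (the model, learning rate, data shapes, and names) is an illustrative assumption, not the patented method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the flow-estimation model: one linear map from
# per-pixel features to a 2-D flow vector. Only the training loop
# itself is illustrated here.
W = 0.1 * rng.normal(size=(4, 2))

def estimate_flow(features, W):
    return features @ W                      # (N, 2) estimated flow vectors

def train_step(features, gt_flow, W, lr=0.1):
    pred = estimate_flow(features, W)
    residual = pred - gt_flow
    loss = np.mean(residual ** 2)            # deviation from ground-truth flow
    grad = features.T @ residual * (2.0 / residual.size)
    return W - lr * grad, loss               # updated parameters

# Synthetic "ground truth" flow produced by a known linear map.
features = rng.normal(size=(32, 4))
true_W = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.0], [0.0, -0.5]])
gt_flow = features @ true_W

losses = []
for _ in range(200):                         # repeat the training operations
    W, loss = train_step(features, gt_flow, W)
    losses.append(loss)
```

Repeating `train_step` until the loss stops improving corresponds to the stopping criterion of claim 16.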

Description

CLAIM OF PRIORITY

This application is a continuation of U.S. patent application Ser. No. 18/399,285, filed on Dec. 28, 2023, which is incorporated herein by reference in its entirety.

TECHNICAL FIELD

The present disclosure relates generally to processing images and videos using machine learning (ML) models, such as for performing augmented reality (AR) operations.

BACKGROUND

AR is a modification of a virtual environment. For example, in virtual reality (VR), a user is completely immersed in a virtual world, whereas in AR, the user is immersed in a world where virtual objects are combined with, or superimposed on, the real world. An AR system aims to generate and present virtual objects that interact realistically with a real-world environment and with each other. Examples of AR applications can include single- or multi-player video games, instant messaging systems, and the like. In general, these systems are referred to as extended reality (XR) systems.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

In the drawings, which are not necessarily drawn to scale, like numerals may describe similar components in different views. To easily identify the discussion of any particular element or act, the most significant digit or digits in a reference number refer to the figure number in which that element is first introduced. Some non-limiting examples are illustrated in the figures of the accompanying drawings, in which:

FIG. 1 is a diagrammatic representation of a networked environment in which the present disclosure may be deployed, according to some examples.

FIG. 2 is a diagrammatic representation of a messaging system that has both client-side and server-side functionality, according to some examples.

FIG. 3 is a diagrammatic representation of a data structure as maintained in a database, according to some examples.

FIG. 4 is a diagrammatic representation of a message, according to some examples.

FIG. 5 is a diagrammatic representation of a pixel-based deformation system, according to some examples.

FIG. 6 is a diagrammatic representation of example inputs and outputs of the pixel-based deformation system, according to some examples.

FIG. 7 is a flowchart illustrating example operations and methods of the pixel-based deformation system, according to some examples.

FIG. 8 is a diagrammatic representation of a machine in the form of a computer system within which a set of instructions may be executed to cause the machine to perform any one or more of the methodologies discussed herein, according to some examples.

FIG. 9 is a block diagram showing a software architecture within which examples may be implemented.

FIG. 10 illustrates a system in which a head-wearable apparatus may be implemented, according to some examples.

DETAILED DESCRIPTION

The description that follows includes systems, methods, techniques, instruction sequences, and computing machine program products that embody illustrative examples of the disclosure. In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide an understanding of various examples. It will be evident, however, to those skilled in the art, that examples may be practiced without these specific details. In general, well-known instruction instances, protocols, structures, and techniques are not necessarily shown in detail.

Typically, communication platforms allow users to share content and create images for transmission to other users. These images can be used to promote products or services and/or to simply represent different real-world objects in simulated or real environments. However, these systems require a user to use expensive equipment and technology to create high-quality, appealing images. Also, users may spend a great deal of effort meticulously placing objects in different environments and manually adjusting lighting and other image attributes to enhance the presentation of the objects in the images. All of these factors can make the creation of high-quality images (e.g., for use in advertising) a significant expense and detract from the overall use and enjoyment of the system. In addition, because users may not have the resources needed to create high-quality images, opportunities to share and present objects in ideal settings are missed. Also, presenting lower-quality images of such objects can cause other users to overlook the value of the objects, which wastes the resources used to create and display the objects.

In some cases, machine learning models are applied to generate XR experiences in which the above images/videos are created and shared. For example, the machine learning models can receive an image, such as from a camera, and can also receive an image depicting a target fashion item. The machine learning models can be trained to generate a new image in which the image depicting a person wearing one fashion item is modified to have the person wearing the target fashion item. These mach