
US-12626434-B2 - Facial image swapping


Abstract

Techniques for generating modified images using facial content information are disclosed. First image data comprising first facial content information is received, and a facial content encoder generates a first embedding by extracting the first facial content information from the first image data. Second image data comprising second facial content information and non-facial content information (e.g., style information, pose, facial expression) is received, and a non-facial content encoder generates a second embedding comprising the non-facial content information. A decoder generates a modified image using the first embedding and the second embedding, the modified image comprising the first facial content information of the first image data and the non-facial content information of the second image data.
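For illustration only (this code is not part of the patent text), the following is a minimal Python/PyTorch sketch of the two-encoder/one-decoder arrangement the abstract describes; the module names, layer sizes, and the concatenation used to fuse the two embeddings are assumptions rather than details taken from the specification.

    import torch
    import torch.nn as nn

    class FacialContentEncoder(nn.Module):
        """Maps an image to a facial content (identity) embedding. Hypothetical architecture."""
        def __init__(self, embed_dim: int = 256):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, embed_dim),
            )
        def forward(self, image: torch.Tensor) -> torch.Tensor:
            return self.backbone(image)

    class NonFacialContentEncoder(nn.Module):
        """Maps an image to an embedding of its style, pose, expression, and background."""
        def __init__(self, embed_dim: int = 256):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, embed_dim),
            )
        def forward(self, image: torch.Tensor) -> torch.Tensor:
            return self.backbone(image)

    class Decoder(nn.Module):
        """Generates the modified image from the two embeddings (here: concatenate, then upsample)."""
        def __init__(self, embed_dim: int = 256):
            super().__init__()
            self.fc = nn.Linear(2 * embed_dim, 64 * 8 * 8)
            self.upsample = nn.Sequential(
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
            )
        def forward(self, facial_emb: torch.Tensor, non_facial_emb: torch.Tensor) -> torch.Tensor:
            x = self.fc(torch.cat([facial_emb, non_facial_emb], dim=1))
            return self.upsample(x.view(-1, 64, 8, 8))

    # Usage: facial content from image A, non-facial content (style/pose/background) from image B.
    face_enc, style_enc, decoder = FacialContentEncoder(), NonFacialContentEncoder(), Decoder()
    image_a = torch.rand(1, 3, 64, 64)   # source of facial content
    image_b = torch.rand(1, 3, 64, 64)   # source of non-facial content
    modified = decoder(face_enc(image_a), style_enc(image_b))  # -> (1, 3, 64, 64)

In practice the encoders would be much deeper networks and the embeddings could be fused by mechanisms other than concatenation; the sketch only mirrors the data flow described above: facial content from the first image, non-facial content from the second, combined by the decoder into the modified image.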

Inventors

  • Santiago Iglesias Navarro
  • Pablo Pernías Pascual de Pobil
  • Robert B. Moore
  • David N. Juboor

Assignees

  • DISNEY ENTERPRISES, INC.

Dates

Publication Date
2026-05-12
Application Date
2024-01-29

Claims (20)

  1. A computing system comprising: at least one processor; and at least one non-transitory memory carrying instructions that, when executed by the at least one processor, cause the computing system to perform operations comprising: receive first image data comprising first facial content information; generate, using a facial content encoder, a facial content embedding comprising the first facial content information, wherein the facial content encoder extracts the first facial content information from the first image data to generate the facial content embedding; receive second image data comprising second facial content information and non-facial content information; generate, using a non-facial content encoder, a non-facial content embedding comprising the non-facial content information; generate, by a decoder, a modified image using the facial content embedding and the non-facial content embedding, wherein the modified image comprises the first facial content information of the first image data and the non-facial content information of the second image data.
  2. The computing system of claim 1, wherein the first facial content information comprises shapes or dimensions of a set of facial features in the first image data.
  3. The computing system of claim 1, wherein the non-facial content information comprises style information of the second image data, background information of the second image data, and a facial pose or facial expression of the second facial content, and wherein the style information comprises color information and texture information.
  4. The computing system of claim 3, wherein the style information is associated with an animation style or an artistic style or technique.
  5. The computing system of claim 1, wherein the facial content encoder, the non-facial content encoder, and the decoder are included in a machine-learned (ML) model, and wherein the operations further comprise: receive a plurality of image data comprising a plurality of facial content information and a plurality of non-facial content information; generate a training dataset using the received plurality of image data, wherein the received plurality of image data is pre-processed by cropping and aligning the plurality of facial content information; and train the ML model using the generated training dataset, wherein the training comprises determining a set of loss functions and corresponding weights based on the loss functions.
  6. The computing system of claim 5, wherein the operations further comprise: evaluate accuracy of the ML model using a testing dataset comprising at least a portion of the training dataset; and retrain the ML model when the accuracy does not exceed a threshold accuracy, wherein the retraining comprises adjusting a set of weights or training the ML model using a different training dataset.
  7. A non-transitory computer-readable medium carrying instructions that, when executed by a computing system, cause the computing system to perform operations comprising: receive first image data comprising first facial content information; generate, using a facial content encoder, a facial content embedding comprising the first facial content information, wherein the facial content encoder extracts the first facial content information from the first image data to generate the facial content embedding; receive second image data comprising second facial content information and non-facial content information; generate, using a non-facial content encoder, a non-facial content embedding comprising the non-facial content information; generate, by a decoder, a modified image using the facial content embedding and the non-facial content embedding, wherein the modified image comprises the first facial content information of the first image data and the non-facial content information of the second image data.
  8. The non-transitory computer-readable medium of claim 7, wherein the first facial content information comprises shapes or dimensions of a set of facial features in the first image data.
  9. The non-transitory computer-readable medium of claim 7, wherein the non-facial content information comprises style information of the second image data, background information of the second image data, and a facial pose or facial expression of the second facial content information, and wherein the style information comprises color information and texture information.
  10. The non-transitory computer-readable medium of claim 9, wherein the style information is associated with an animation style or an artistic style or technique.
  11. The non-transitory computer-readable medium of claim 7, wherein the facial content encoder, the non-facial content encoder, and the decoder are included in a machine-learned (ML) model.
  12. The non-transitory computer-readable medium of claim 11, further comprising: receive a plurality of image data comprising a plurality of facial content information and a plurality of non-facial content information; generate a training dataset using the received plurality of image data, wherein the received plurality of image data is pre-processed by cropping and aligning the plurality of facial content information; and train the ML model using the generated training dataset, wherein the training comprises determining a set of loss functions and corresponding weights based on the loss functions.
  13. The non-transitory computer-readable medium of claim 12, further comprising: evaluate accuracy of the ML model using a testing dataset comprising at least a portion of the training dataset; and retrain the ML model when the accuracy does not exceed a threshold accuracy, wherein the retraining comprises adjusting a set of weights or training the ML model using a different training dataset.
  14. A computer-implemented method of generating modified images using facial content information, the method comprising: receiving first image data comprising first facial content information; generating, using a facial content encoder, a first embedding comprising the first facial content information, wherein the facial content encoder extracts the first facial content information from the first image data to generate the first embedding; receiving second image data comprising second facial content information and non-facial content information; generating, using a non-facial content encoder, a second embedding comprising the non-facial content information; generating, by a decoder, a modified image using the first embedding and the second embedding, wherein the modified image comprises the first facial content information of the first image data and the non-facial content information of the second image data.
  15. The computer-implemented method of claim 14, wherein the first facial content information comprises shapes or dimensions of a set of facial features in the first image data.
  16. The computer-implemented method of claim 14, wherein the non-facial content information comprises style information of the second image data, background information of the second image data, and a facial pose or facial expression of the second facial content, and wherein the style information comprises color information and texture information.
  17. The computer-implemented method of claim 16, wherein the style information is associated with an animation style or an artistic style or technique.
  18. The computer-implemented method of claim 14, wherein the facial content encoder, the non-facial content encoder, and the decoder are included in a machine-learned (ML) model.
  19. The computer-implemented method of claim 18, further comprising: receiving a plurality of image data comprising a plurality of facial content information and a plurality of non-facial content information; generating a training dataset using the received plurality of image data, wherein the received plurality of image data is pre-processed by cropping and aligning the plurality of facial content information; and training the ML model using the generated training dataset, wherein the training comprises determining a set of loss functions and corresponding weights based on the loss functions.
  20. The computer-implemented method of claim 19, further comprising: evaluating accuracy of the ML model using a testing dataset comprising at least a portion of the training dataset; and retraining the ML model when the accuracy does not exceed a threshold accuracy, wherein the retraining comprises adjusting a set of weights or training the ML model using a different training dataset.

Description

CROSS REFERENCE TO RELATED APPLICATION

This application is related to the Applicant's concurrently filed application titled “Image Style Transfer,” which is incorporated herein by reference in its entirety for all purposes.

FIELD

Described embodiments relate generally to generating modified images, such as modifying facial information in an image.

BACKGROUND

Digital images can be modified in various ways to generate modified images. For example, images can be digitally manipulated to add or remove content or to replace a person's likeness with that of a different person. Modified images can also be generated to combine characteristics or content of images. Image manipulations can be applied manually or using various algorithms. Current techniques may have only limited functionality to transfer selected information, e.g., facial information, or combine information from different images in a fast and accurate manner.

SUMMARY

The following Summary is for illustrative purposes only and does not limit the scope of the technology disclosed in this document.

In an embodiment, a computer-implemented method of generating modified images using facial content is disclosed. First image data is received including first facial content information. The first facial content information can include shapes or dimensions of a set of facial features in the first image data. A first embedding (e.g., a facial content embedding) is generated using a facial content encoder, the embedding including the first facial content information. To generate the embedding, the facial content encoder extracts the first facial content information from the first image data. Second image data is received including second facial content information and non-facial content information. The non-facial content information can include style information, a pose or facial expression of the second facial content information, background information (e.g., background content or style), color information, texture information, or the like. The style information can include an artistic style, an animation style, or the like. A second embedding (e.g., a non-facial content embedding) is generated using a non-facial content encoder, the second embedding including the non-facial content information. A modified image is generated by a decoder using the first embedding and the second embedding, the modified image including the first facial content information of the first image data and the non-facial content information of the second image data.

In various embodiments, the facial content encoder, the non-facial content encoder, and the decoder are included in a machine-learned (“ML”) model. In these and other embodiments, the method further includes receiving a plurality of image data including a plurality of facial content information and a plurality of non-facial content information, generating a training dataset using the received plurality of image data, the received plurality of image data being pre-processed by cropping and aligning the plurality of facial content information, and training the ML model using the generated training dataset, the training including determining a set of loss functions and corresponding weights based on the loss functions.
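As a purely illustrative companion to the training procedure just described (and continuing the hypothetical PyTorch encoders and decoder sketched earlier), the following shows face cropping and resizing as the pre-processing step, and a training step that combines several loss functions with per-loss weights. The particular losses (a reconstruction term and an identity term), their weights, and the landmark-free alignment are assumptions, not details from the specification.

    import torch
    import torch.nn.functional as F

    def preprocess(images, face_boxes, size=64):
        """Crop each detected face and resize it to a fixed resolution. A stand-in for the
        cropping-and-aligning step; a real pipeline would align using facial landmarks."""
        crops = []
        for img, (x0, y0, x1, y1) in zip(images, face_boxes):
            crop = img[:, y0:y1, x0:x1].unsqueeze(0)          # (1, C, h, w) face region
            crops.append(F.interpolate(crop, size=(size, size), mode="bilinear",
                                       align_corners=False))
        return torch.cat(crops, dim=0)                        # (N, C, size, size)

    def training_step(face_enc, style_enc, decoder, optimizer, image_a, image_b, loss_weights):
        """One optimization step combining weighted loss terms (illustrative choices)."""
        facial_emb = face_enc(image_a)                        # facial content from image A
        non_facial_emb = style_enc(image_b)                   # style/pose/background from image B
        output = decoder(facial_emb, non_facial_emb)
        # Reconstruction term: the output should preserve the non-facial content of image B.
        recon_loss = F.l1_loss(output, image_b)
        # Identity term: re-encoding the output should recover the facial embedding of image A.
        identity_loss = F.mse_loss(face_enc(output), facial_emb.detach())
        loss = loss_weights["recon"] * recon_loss + loss_weights["identity"] * identity_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

Evaluation against a held-out portion of the dataset and retraining when accuracy does not exceed a threshold, as described next, would sit in an outer loop around such a step.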
In various embodiments, the method further includes evaluating accuracy of the ML model using a testing dataset comprising at least a portion of the training dataset, and retraining the ML model when the accuracy does not exceed a threshold accuracy, the retraining including adjusting a set of weights or training the ML model using a different training dataset.

In another embodiment, a system is disclosed including one or more processors and one or more memories carrying instructions configured to cause the one or more processors to perform the foregoing methods. In yet another embodiment, a computer-readable medium is disclosed carrying instructions configured to cause one or more computing systems or one or more processors to perform the foregoing methods.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating a system flow for image content swapping.

FIG. 2 is a block diagram illustrating a system flow for training a machine-learned model for facial image swapping.

FIG. 3 is a flow diagram illustrating a process performed using a facial image swapping system.

FIG. 4 is a block diagram illustrating a computing device for implementing a facial image swapping system.

DETAILED DESCRIPTION

Conventional techniques to modify images may use manual processes or simple algorithms that may provide limited functionality to transfer content or style information between images. For example, while conventional techniques may allow facial content of a first image to be modified using facial content of a second image, such techniques are typically inefficient or do not provide satisfactory results, e.g., many are done by “cut and paste” techniques manually selected by a user. Conventional techniques may also