US-20260127799-A1 - TECHNIQUES FOR GENERATING DUBBED MEDIA CONTENT ITEMS

Abstract

In various embodiments, a dubbing application performs three-dimensional (3D) tracking of (1) the face of an actor within video frames of a first media content item to generate 3D geometry representing the face of the actor, and (2) the face of a dubber within video frames of a second media content item to generate 3D geometry representing the face of the dubber. The dubbing application also tracks the texture and lighting of the face of the actor in the first media content item. The dubbing application aligns the 3D geometry of the face of the dubber with the 3D geometry of the face of the actor. Then, the dubbing application performs neural rendering to generate dubbed video frames using a trained machine learning model, the aligned 3D geometry of the dubber, the texture and lighting of the face of the actor, and the video frames of the first media content item.
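
For orientation, the following is a minimal sketch, in Python, of the pipeline the abstract describes. Every function here is a hypothetical stand-in for the tracking, alignment, and rendering stages; the mesh size, resolutions, and placeholder return values are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

def track_face_3d(frames):
    """Hypothetical 3D face tracker: per-frame mesh vertices (placeholder)."""
    return np.zeros((len(frames), 5023, 3))       # mesh size is an assumption

def track_texture_and_lighting(frames):
    """Hypothetical texture/lighting tracker for the actor (placeholders)."""
    return np.zeros((256, 256, 3)), np.zeros((256, 256, 3))

def align(dubber_geo, actor_geo):
    """Hypothetical alignment: shift dubber geometry onto the actor's."""
    return dubber_geo + (actor_geo.mean(axis=(0, 1)) - dubber_geo.mean(axis=(0, 1)))

def neural_render(geo, texture, lighting, frames):
    """Stand-in for the trained neural renderer (returns input frames)."""
    return frames

actor_frames = np.zeros((100, 256, 256, 3))       # first media content item
dubber_frames = np.zeros((100, 256, 256, 3))      # second media content item

actor_geo = track_face_3d(actor_frames)           # 3D tracking of the actor
dubber_geo = track_face_3d(dubber_frames)         # 3D tracking of the dubber
texture, lighting = track_texture_and_lighting(actor_frames)
aligned = align(dubber_geo, actor_geo)            # align dubber to actor
dubbed = neural_render(aligned, texture, lighting, actor_frames)
```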

Inventors

  • Chao Pan
  • Yiwei Zhao

Assignees

  • NETFLIX, INC.

Dates

Publication Date
May 7, 2026
Application Date
Dec. 30, 2025

Claims (20)

  1. A computer-implemented method for generating a dubbed media content item, the method comprising: generating first three-dimensional (3D) geometry, a texture map, and a lighting map based on a face of an actor included in a first video frame of a first media content item; generating second 3D geometry of another face based on audio associated with a dubber included in a second media content item; performing one or more operations that align the second 3D geometry with the first 3D geometry to generate an aligned second 3D geometry; and performing one or more operations, via one or more machine learning models, to render a second video frame based on the aligned second 3D geometry, the texture map, the lighting map, and the first video frame of the first media content item.
  2. The computer-implemented method of claim 1, wherein generating the second 3D geometry comprises performing one or more operations to process the audio associated with the dubber using another trained machine learning model that outputs the second 3D geometry.
  3. The computer-implemented method of claim 2, wherein the another trained machine learning model comprises a sequential decoder.
  4. The computer-implemented method of claim 2, wherein the another trained machine learning model comprises an encoder that encodes the audio associated with the dubber into an embedding in an expression space and a decoder that decodes the embedding into the second 3D geometry (see the encoder/decoder sketch following the claims).
  5. The computer-implemented method of claim 2, further comprising performing one or more autoregressive operations to train a machine learning model to generate the another trained machine learning model.
  6. The computer-implemented method of claim 2, further comprising performing one or more operations to train a machine learning model to generate the another trained machine learning model based on audio from one or more media content items, wherein the audio from the one or more media content items includes speech in at least two different languages.
  7. The computer-implemented method of claim 2, further comprising performing one or more operations to re-train a previously trained machine learning model based on audio from one or more media content items to generate the another trained machine learning model.
  8. The computer-implemented method of claim 1, wherein generating the first 3D geometry comprises: detecting a plurality of landmarks on the face of the actor in the first video frame; performing one or more operations to fit an intermediate 3D geometry based on the plurality of landmarks, wherein the intermediate 3D geometry is defined using one or more parameters; and performing one or more operations to modify the one or more parameters of the intermediate 3D geometry based on the first video frame and a loss function (see the fitting-loop sketch following the claims).
  9. The computer-implemented method of claim 1, wherein performing the one or more operations that align the second 3D geometry with the first 3D geometry comprises: performing one or more operations to align a nose position and a mouth position of the second 3D geometry with a nose position and a mouth position of the first 3D geometry; performing one or more operations to equalize a scale of one or more expressions of the second 3D geometry with a scale of one or more expressions of the first 3D geometry; and performing one or more operations to align the second 3D geometry with the first 3D geometry when a bottom portion of the second 3D geometry is combined with a top portion of the first 3D geometry (see the alignment sketch following the claims).
  10. The computer-implemented method of claim 1, wherein performing the one or more operations via the one or more machine learning models to render the second video frame comprises: performing one or more operations to convert the texture map to a neural texture; performing one or more operations to convert the lighting map to a neural lighting; and processing, using a first trained machine learning model, the aligned second 3D geometry, the first video frame of the first media content item, a combination of the neural texture and the neural lighting, and an inpainting map that indicates one or more regions of the first video frame of the first media content item to inpaint to generate the second video frame (see the rendering sketch following the claims).
  11. One or more non-transitory computer-readable media storing instructions that, when executed by at least one processor, cause the at least one processor to perform steps comprising: generating first three-dimensional (3D) geometry, a texture map, and a lighting map based on a face of an actor included in a first video frame of a first media content item; generating second 3D geometry of another face based on audio associated with a dubber included in a second media content item; performing one or more operations that align the second 3D geometry with the first 3D geometry to generate an aligned second 3D geometry; and performing one or more operations, via one or more machine learning models, to render a second video frame based on the aligned second 3D geometry, the texture map, the lighting map, and the first video frame of the first media content item.
  12. The one or more non-transitory computer-readable media of claim 11, wherein generating the second 3D geometry comprises performing one or more operations to process the audio associated with the dubber using another trained machine learning model that outputs the second 3D geometry.
  13. The one or more non-transitory computer-readable media of claim 12, wherein the another trained machine learning model comprises a sequential decoder.
  14. The one or more non-transitory computer-readable media of claim 12, wherein the another trained machine learning model comprises an encoder that encodes the audio associated with the dubber into an embedding in an expression space and a decoder that decodes the embedding into the second 3D geometry.
  15. The one or more non-transitory computer-readable media of claim 12, wherein the instructions, when executed by the at least one processor, further cause the at least one processor to perform the step of performing one or more autoregressive operations to train a machine learning model to generate the another trained machine learning model.
  16. The one or more non-transitory computer-readable media of claim 12, wherein the instructions, when executed by the at least one processor, further cause the at least one processor to perform the step of performing one or more operations to train a machine learning model to generate the another trained machine learning model based on audio from one or more media content items, wherein the audio from the one or more media content items includes speech in at least two different languages.
  17. The one or more non-transitory computer-readable media of claim 12, wherein the instructions, when executed by the at least one processor, further cause the at least one processor to perform the step of performing one or more operations to re-train a previously trained machine learning model based on audio from one or more media content items to generate the another trained machine learning model.
  18. The one or more non-transitory computer-readable media of claim 11, wherein generating the first 3D geometry comprises: detecting a plurality of landmarks on the face of the actor in the first video frame; performing one or more operations to fit an intermediate 3D geometry based on the plurality of landmarks, wherein the intermediate 3D geometry is defined using one or more parameters; and performing one or more operations to modify the one or more parameters of the intermediate 3D geometry based on the first video frame and a loss function.
  19. The one or more non-transitory computer-readable media of claim 11, wherein the second video frame is rendered to include at least a portion of the face of the actor.
  20. A system, comprising: a memory storing instructions; and a processor that is coupled to the memory and, when executing the instructions, is configured to perform the steps of: generating first three-dimensional (3D) geometry, a texture map, and a lighting map based on a face of an actor included in a first video frame of a first media content item, generating second 3D geometry of another face based on audio associated with a dubber included in a second media content item, performing one or more operations that align the second 3D geometry with the first 3D geometry to generate an aligned second 3D geometry, and performing one or more operations, via one or more machine learning models, to render a second video frame based on the aligned second 3D geometry, the texture map, the lighting map, and the first video frame of the first media content item.
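
The encoder/decoder structure of claims 2-4 can be illustrated with a small PyTorch model: audio features are encoded into an expression-space embedding, and a sequential decoder turns the embedding sequence into per-frame 3D geometry. The mel-spectrogram input, layer sizes, GRU-based decoder, and 5023-vertex output mesh are all illustrative assumptions, not the patent's architecture.

```python
import torch
import torch.nn as nn

class AudioToGeometry(nn.Module):
    """Audio -> expression embedding -> 3D face geometry (illustrative)."""

    def __init__(self, n_mels=80, embed_dim=128, n_vertices=5023):
        super().__init__()
        # Encoder: per-frame audio features -> embedding in an "expression space".
        self.encoder = nn.Sequential(
            nn.Linear(n_mels, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )
        # Sequential decoder (cf. claim 3): a GRU makes each frame's geometry
        # depend on the frames that precede it.
        self.gru = nn.GRU(embed_dim, 256, batch_first=True)
        self.to_vertices = nn.Linear(256, n_vertices * 3)
        self.n_vertices = n_vertices

    def forward(self, mel):                 # mel: (batch, frames, n_mels)
        z = self.encoder(mel)               # (batch, frames, embed_dim)
        h, _ = self.gru(z)                  # (batch, frames, 256)
        v = self.to_vertices(h)             # (batch, frames, n_vertices * 3)
        return v.view(mel.shape[0], mel.shape[1], self.n_vertices, 3)

# Usage: two clips, 100 audio frames each, 80 mel bins per frame.
model = AudioToGeometry()
geometry = model(torch.randn(2, 100, 80))   # -> shape (2, 100, 5023, 3)
```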
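Next, a minimal sketch of claim 8's fitting loop, assuming PyTorch: landmarks are detected, an intermediate parametric geometry is fit to them, and the parameters are refined by gradient descent against a loss function. A hypothetical linear deformation model over 2D landmarks stands in for the 3D geometry to keep the example short, and the refinement loss here is a landmark reprojection loss; claim 8's refinement "based on the first video frame" would add, for example, a photometric term.

```python
import torch

def detect_landmarks(frame):
    """Stand-in for a real 68-point face-landmark detector (hypothetical)."""
    return torch.rand(68, 2)

mean_shape = torch.rand(68, 2)                # mean landmark positions (assumed)
basis = torch.rand(10, 68, 2)                 # 10 deformation components (assumed)
params = torch.zeros(10, requires_grad=True)  # the "one or more parameters"

target = detect_landmarks(frame=None)         # step 1: detect landmarks
optimizer = torch.optim.Adam([params], lr=1e-2)
for _ in range(200):                          # steps 2-3: fit, then refine params
    optimizer.zero_grad()
    predicted = mean_shape + torch.einsum("k,kij->ij", params, basis)
    loss = torch.nn.functional.mse_loss(predicted, target)
    # A frame-based refinement would add a photometric term here, e.g.
    # comparing a rendering of the geometry against the first video frame.
    loss.backward()
    optimizer.step()
```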
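Claim 9's three alignment operations might look like the following numpy sketch. The vertex index sets for the nose and mouth, the standard-deviation heuristic for equalizing expression scale, and the boolean bottom-portion mask are all assumptions; the claim names the operations but not their formulas.

```python
import numpy as np

def align_dubber_to_actor(dubber, actor, nose_idx, mouth_idx):
    """Translate and rescale dubber geometry so it matches the actor's."""
    anchor = np.concatenate([nose_idx, mouth_idx])
    # 1) Align nose and mouth positions via a rigid translation.
    dubber = dubber + (actor[anchor].mean(axis=0) - dubber[anchor].mean(axis=0))
    # 2) Equalize expression scale: match the spread of the mouth region.
    scale = np.linalg.norm(actor[mouth_idx].std(axis=0)) / (
        np.linalg.norm(dubber[mouth_idx].std(axis=0)) + 1e-8)
    center = dubber[mouth_idx].mean(axis=0)
    return (dubber - center) * scale + center

def combine(actor, dubber_aligned, bottom_mask):
    """3) Bottom portion from the dubber, top portion from the actor."""
    out = actor.copy()
    out[bottom_mask] = dubber_aligned[bottom_mask]
    return out

# Usage with hypothetical vertex indices on a 5023-vertex mesh.
rng = np.random.default_rng(0)
actor, dubber = rng.random((5023, 3)), rng.random((5023, 3))
nose, mouth = np.arange(0, 50), np.arange(50, 150)
bottom = np.arange(5023) >= 2500              # hypothetical bottom-portion mask
result = combine(actor, align_dubber_to_actor(dubber, actor, nose, mouth), bottom)
```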
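Finally, a minimal sketch of the rendering stage in claim 10, assuming PyTorch: 1x1 convolutions stand in for the conversions to a neural texture and neural lighting, and a small convolutional network stands in for the first trained machine learning model. Channel counts, the 256x256 resolution, and the idea of rasterizing the aligned geometry into an image are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Conversions to a neural texture and neural lighting (1x1 convs, assumed).
texture_to_neural = nn.Conv2d(3, 16, kernel_size=1)
lighting_to_neural = nn.Conv2d(3, 16, kernel_size=1)

# Stand-in for the first trained machine learning model of claim 10.
renderer = nn.Sequential(
    nn.Conv2d(16 + 16 + 3 + 3 + 1, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, kernel_size=3, padding=1),
)

H = W = 256
texture = torch.rand(1, 3, H, W)     # texture map tracked from the actor
lighting = torch.rand(1, 3, H, W)    # lighting map tracked from the actor
geometry = torch.rand(1, 3, H, W)    # aligned dubber geometry, rasterized (assumed)
frame = torch.rand(1, 3, H, W)       # first video frame of the first item
inpaint = torch.rand(1, 1, H, W)     # inpainting map: regions to fill in

features = torch.cat([
    texture_to_neural(texture),      # neural texture
    lighting_to_neural(lighting),    # neural lighting
    geometry, frame, inpaint,
], dim=1)
dubbed_frame = renderer(features)    # -> (1, 3, 256, 256) second video frame
```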

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of the co-pending U.S. patent application titled, “TECHNIQUES FOR GENERATING DUBBED MEDIA CONTENT ITEMS,” filed on Dec. 26, 2023, and having Ser. No. 18/396,578. The subject matter of this related application is hereby incorporated herein by reference.

BACKGROUND

Field of the Invention

Embodiments of the present disclosure relate generally to video processing, computer science, and machine learning and, more specifically, to techniques for generating dubbed media content items.

Description of the Related Art

Dubbing is a process in which the audio of a media content item that also includes video, such as a film or television show, is replaced with audio in a different language. One conventional approach for dubbing is to carefully select words in the different language that, when spoken, roughly match the facial movements of an actor in a given media content item. However, because the actor in the media content item is not speaking the same language as the audio in the different language, there are invariably noticeable disparities between the facial movements of the actor in the media content item and the audio in the different language.

Another conventional approach for dubbing is to capture the face of an actor in a media content item using a facial capture system. A graphics rendering engine can then render images of the captured face making different expressions that correspond to audio in a different language. One drawback of this approach, however, is that conventional graphics engines oftentimes require considerable amounts of time to render images. A further drawback of this approach is that, as a general matter, conventional graphics engines are unable to render images of faces that look photorealistic. Accordingly, the face of an actor depicted in a media content item that includes such renderings can end up resembling the face of a character in a video game. Yet another drawback of this approach is that the face of an actor needs to be captured using a complex facial capture system, which may not be available to the producer of a given media content item.

As the foregoing illustrates, what is needed in the art are more effective techniques for generating dubbed media content items.

SUMMARY OF THE EMBODIMENTS

One embodiment of the present disclosure sets forth a computer-implemented method for generating a dubbed media content item. The method includes generating first three-dimensional (3D) geometry, a texture map, and a lighting map based on a face of an actor included in a first video frame of a first media content item. The method further includes generating second 3D geometry based on a face of a dubber included in a second video frame of a second media content item. The method also includes performing one or more operations that align the second 3D geometry with the first 3D geometry to generate an aligned second 3D geometry. In addition, the method includes performing one or more operations via one or more trained machine learning models to render a third video frame based on the aligned second 3D geometry, the texture map, the lighting map, and the first video frame of the first media content item.

Other embodiments of the present disclosure include, without limitation, one or more computer-readable media including instructions for performing one or more aspects of the disclosed techniques as well as a computing device for performing one or more aspects of the disclosed techniques.
At least one technical advantage of the disclosed techniques relative to the prior art is that the disclosed techniques can be used to generate dubbed media content items that include photorealistic videos that closely match dubbed audio in a different language. The disclosed techniques are also, as a general matter, faster than conventional graphics rendering techniques for rendering faces. In addition, the disclosed techniques do not require a facial capture system to generate dubbed media content items. Accordingly, the disclosed techniques can be implemented in post-production to generate photorealistic dubbed media content that is more enjoyable to viewers than traditional dubbed media content. These technical advantages represent one or more technological improvements over prior art approaches.

BRIEF DESCRIPTION OF THE DRAWINGS

So that the manner in which the above recited features of the present disclosure can be understood in detail, a more particular description of the disclosure, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this disclosure and are therefore not to be considered limiting of its scope, for the disclosure may admit to other equally effective embodiments.

FIG. 1 illustrates a system configured to implement one or more aspects of the various embodiments