KR-20260065918-A - Generative neural network lip reproduction driven by controllable facial landmarks
Abstract
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for adapting a first video of a speaker speaking with a first speech variant characterized by a first lip movement trajectory into a second video of the speaker appearing to speak with a second, target speech variant characterized by a second lip movement trajectory. In one embodiment, the method comprises processing each individual input video frame of an input video depicting a speaker speaking with the first speech variant using a speaker-independent generative model, and combining the one or more frame images generated for each video frame of the input video to generate an output video of the first speaker appearing to speak with the second speech variant.
Inventors
- McCartney Jr., Terrence Paul
- Kwatra, Vivek
- Colonna, Brian Ronald
- Bae, Sungmin
Assignees
- Google LLC
Dates
- Publication Date: 2026-05-11
- Application Date: 2023-10-04
Claims (20) (illustrative, non-limiting code sketches of selected claims follow the claim list)
- 1. A method comprising: receiving an input video frame from an input video depicting a speaker speaking with a first speech variant, the first speech variant being characterized by a first lip movement trajectory; generating a predicted facial landmark coordinate geometry of a second speech variant characterized by a second lip movement trajectory; obtaining a plurality of reference images for the input video frame; and generating, using a generator neural network, an output video frame of the speaker having a target lip geometry that matches the second lip movement trajectory, based on the predicted facial landmark coordinate geometry and the plurality of reference images.
- 2. The method of claim 1, further comprising: receiving a second input video frame from a second input video depicting a second speaker speaking with a third speech variant, the third speech variant being characterized by a third lip movement trajectory; generating a predicted facial landmark coordinate geometry of a fourth speech variant characterized by a fourth lip movement trajectory; obtaining a plurality of reference images for the second input video frame; and generating, using the generator neural network, an output video frame of the second speaker having a target lip geometry that matches the fourth lip movement trajectory, based on the plurality of reference images for the second input video frame and the predicted facial landmark coordinate geometry of the fourth speech variant.
- 3. The method of claim 1 or 2, wherein the second speaker is the same speaker as the speaker.
- 4. The method of any one of claims 1 to 3, wherein the second speech variant is a translation of the first speech variant.
- 5. The method of any one of claims 1 to 4, wherein the output video frame of the speaker is cropped relative to the input video frame.
- 6. The method of any one of claims 1 to 5, wherein obtaining the plurality of reference images for the input video frame comprises: obtaining a reference pose image characterizing a target head pose of the speaker within the input video frame; obtaining a reference lip image of the speaker characterizing the target lip geometry of the speaker within the input video frame; and obtaining a sequence of N context images that temporally contextualize the input video frame from a sequence of N video frames adjacent to the input video frame.
- 7. The method of claim 6, wherein the output video frame is conditioned on the predicted facial landmark coordinate geometry so as to resolve one or more discrepancies between the reference lip image of the speaker and the target lip geometry.
- 8. The method of any one of claims 1 to 7, wherein the predicted facial landmark coordinate geometry is encoded into a two-dimensional matrix having the same width and height as the input video frame and comprises geometric representations of a plurality of visible speaker facial landmarks.
- 9. The method of claim 8, wherein encoding the geometric representations of the plurality of visible speaker facial landmarks comprises: representing a plurality of visible left-side speaker facial landmarks by assigning a plurality of negative values; and representing a plurality of visible right-side speaker facial landmarks by assigning a plurality of positive values.
- 10. The method of claim 9, wherein the number of positive or negative values assigned around a center point representing the center of a particular visible speaker facial landmark represents the uncertainty of that facial landmark.
- 11. The method of any one of claims 1 to 10 when dependent on claim 6, wherein obtaining the sequence of N context images, when the sequence of N video frames adjacent to the input video frame does not depict the speaker speaking, comprises: repeating a first adjacent video frame N times; or regenerating a sequence of K preceding or succeeding video frames adjacent to the input video frame by interpolating lip movement trajectories into that sequence.
- 12. The method of any one of claims 1 to 11, wherein the generator neural network is adversarially trained using a plurality of discriminators, each generating an individual discrimination prediction as generator feedback characterizing whether a plurality of individual discriminator inputs, including the input video frame, are real discriminator inputs or fake discriminator inputs, wherein real discriminator inputs match the input video and fake discriminator inputs do not match the input video.
- 13. The method of claim 12 when dependent on claim 6, wherein the plurality of discriminators comprises two or more of: an expression discriminator configured to process a plurality of real or fake expression discriminator inputs, including the input video frame, to generate an expression discrimination prediction as generator feedback defining a probability that the input video frame is characterized by the target lip geometry; a fusion discriminator configured to process a plurality of real or fake fusion discriminator inputs, including the input video frame, to generate a fusion discrimination prediction as generator feedback defining a probability that the input video frame matches the reference images; or a sequence discriminator configured to process a plurality of real or fake sequence discriminator inputs, including the input video frame, to generate a sequence discrimination prediction as generator feedback defining a probability that the input video frame matches the sequence of N context images.
- 14. The method of claim 13, wherein the plurality of fake expression discriminator inputs comprises the generated output video frame and a fake facial landmark coordinate geometry computed from facial landmark detection of the speaker's face in a different video frame of the input video; and wherein the plurality of real expression discriminator inputs comprises the input video frame and a real facial landmark coordinate geometry computed from facial landmark detection of the speaker's face in the input video.
- 15. The method of claim 13 or 14, wherein the plurality of fake fusion discriminator inputs comprises the generated output video frame together with a fake reference pose image and a fake reference lip image, the fake reference pose and lip images being the reference pose and reference lip images used by the generator to generate the output video frame; and wherein the plurality of real fusion discriminator inputs comprises the input video frame together with a real reference pose image and a real reference lip image, the real reference pose and lip images being taken from a video frame near the input video frame or from a deformed frame that closely matches the head pose and lip geometry of the input video frame.
- 16. The method of any one of claims 13 to 15, wherein the plurality of fake sequence discriminator inputs comprises the generated output video frame and the sequence of N context images; and wherein the plurality of real sequence discriminator inputs comprises the input video frame and the sequence of N context images.
- 17. The method of any one of claims 13 to 16, wherein the generated output video frame and the input video frame processed by the expression and fusion discriminators comprise a real or fake mouth-area image cropped around the speaker's mouth area in the generated output video frame or the input video frame, respectively.
- 18. The method of any one of claims 12 to 17, wherein, during adversarial training, the discriminator with the highest score provides the generator feedback to the generator neural network.
- 19. A system comprising one or more computers and one or more storage devices, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform the method of any one of claims 1 to 18.
- 20. A computer storage medium encoded with a computer program, the program comprising instructions that, when executed by a data processing apparatus, cause the data processing apparatus to perform the method of any one of claims 1 to 18.
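To make the per-frame data flow of claims 1, 6, and 11 concrete, the following minimal Python sketch assembles the claimed inputs for a single frame. It is illustrative only and forms no part of the claims: the names (`ReferenceImages`, `gather_context`, `render_frame`), the callable `generator`, and the `is_speaking` test are hypothetical stand-ins for elements the claims leave abstract.

```python
# Illustrative sketch only; all identifiers are hypothetical, not from the patent.
from dataclasses import dataclass
from typing import Callable, List, Sequence

import numpy as np

Frame = np.ndarray  # a single video frame as an image array


@dataclass
class ReferenceImages:
    pose: Frame           # reference pose image: target head pose (claim 6)
    lips: Frame           # reference lip image: target lip geometry (claim 6)
    context: List[Frame]  # N context images adjacent to the input frame (claim 6)


def gather_context(frames: Sequence[Frame], idx: int, n: int,
                   is_speaking: Callable[[Frame], bool]) -> List[Frame]:
    """Collect N context frames adjacent to frame `idx`. If the adjacent
    frames do not depict the speaker speaking, fall back to repeating the
    first adjacent frame N times (one branch of claim 11)."""
    neighbours = [frames[i] for i in range(max(0, idx - n), idx)]
    if neighbours and not any(is_speaking(f) for f in neighbours):
        return [neighbours[0]] * n
    return neighbours


def render_frame(generator: Callable, frame: Frame,
                 predicted_geometry: Frame, refs: ReferenceImages) -> Frame:
    """Claim 1: the generator network produces an output frame whose lip
    geometry matches the target lip movement trajectory, conditioned on the
    predicted facial landmark geometry and the reference images."""
    return generator(frame, predicted_geometry, refs)
```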
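Claims 8 through 10 describe encoding the predicted landmark coordinates into a signed two-dimensional matrix. The sketch below is one plausible reading, assuming Gaussian splatting: left-side landmarks are written as negative values, right-side landmarks as positive values, and the spatial spread of values around each center encodes detection uncertainty. The function name, the per-landmark `sigma`, and the splatting scheme are assumptions, not disclosed implementation details.

```python
# Hypothetical encoding sketch; not the patent's reference implementation.
import numpy as np


def encode_landmarks(landmarks, height, width):
    """Encode visible facial landmarks into one 2D matrix with the same
    height and width as the input video frame (claim 8).

    `landmarks` is an iterable of (x, y, sigma, side) tuples, side in
    {"left", "right"}. Each landmark is splatted as a Gaussian blob around
    its center; a larger sigma spreads more values around the center point,
    representing higher landmark uncertainty (claim 10). Left-side landmarks
    receive negative values, right-side landmarks positive values (claim 9).
    """
    canvas = np.zeros((height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for cx, cy, sigma, side in landmarks:
        blob = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
        sign = -1.0 if side == "left" else 1.0
        # where blobs overlap, keep the strongest signed response
        canvas = np.where(blob > np.abs(canvas), sign * blob, canvas)
    return canvas


# Example: a confident left-eye corner and a less certain right-eye corner
# on a 128x128 frame.
geometry = encode_landmarks([(40, 60, 1.5, "left"), (88, 60, 3.0, "right")], 128, 128)
```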
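Claims 12 through 18 train the generator against several discriminators, with the highest-scoring discriminator supplying the generator feedback (claim 18). Below is a minimal sketch of that selection rule, with toy lambdas standing in for trained expression, fusion, and sequence discriminators; all names and scores are hypothetical, and losses and optimizers are elided.

```python
# Hypothetical selection-rule sketch for the multi-discriminator feedback
# of claims 12-18; not the patent's training code.

def adversarial_feedback(discriminators, fake_inputs):
    """discriminators: mapping name -> callable returning a realness score
    for its own input tuple; fake_inputs: mapping name -> that
    discriminator's (generated) input tuple. Returns the name and score of
    the highest-scoring discriminator (claim 18)."""
    scores = {name: d(fake_inputs[name]) for name, d in discriminators.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Usage with toy scoring functions standing in for trained discriminators:
discs = {
    "expression": lambda inp: 0.7,  # P(frame has target lip geometry), claim 13
    "fusion":     lambda inp: 0.4,  # P(frame matches the reference images)
    "sequence":   lambda inp: 0.9,  # P(frame matches the N context images)
}
fakes = {name: ("generated_frame",) for name in discs}
print(adversarial_feedback(discs, fakes))  # -> ('sequence', 0.9)
```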
Description
Generative neural network lip reproduction driven by controllable facial landmarks

This specification relates to processing data using machine learning models. A machine learning model receives an input and generates an output (e.g., a predicted output) based on the received input. Some machine learning models are parametric models that generate an output based on the received input and the values of the model's parameters. Some machine learning models are deep models that use multiple layers to generate an output for a received input. For example, a deep neural network is a deep machine learning model that includes an output layer and one or more hidden layers, each of which generates an output by applying a non-linear transformation to its received input.

FIG. 1 illustrates an overview of a speaker-independent lip reproduction system. FIG. 2 shows an exemplary representation of a facial landmark geometry. FIG. 3 provides an overview of exemplary inputs to a generative speaker-independent lip rendering model for generating output frame images. FIG. 4 illustrates an overview of a speaker-independent lip rendering model with a generative adversarial network architecture and an exemplary training process. FIG. 5 illustrates an exemplary expression discriminator that ensures the generated output image has the correct facial landmark geometry. FIG. 6 illustrates an exemplary sequence discriminator that ensures the generated output image corresponds to the temporal context of the surrounding frames. FIG. 7 illustrates an exemplary fusion discriminator that ensures the generated output image combines its inputs in a manner consistent with the input video. FIGS. 8a, 8b, and 8c show exemplary methods for achieving a smooth lip reproduction transition when the speaker is off camera or remains silent before or after speaking. FIG. 9 is a flowchart of an exemplary process for generating a lip reproduction video from an input video. Like reference symbols and names in the various drawings indicate like elements.

FIG. 1 illustrates a speaker-independent lip reproduction system (100). The speaker-independent lip reproduction system (100) is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below are implemented. In the specific example illustrated, the speaker-independent lip reproduction system (100) includes a geometry prediction model (130), a lip rendering model (140), and an upscaling model (150). The models (130, 140, 150) are not speaker-specific and can provide an end-to-end system for speaker-independent lip reproduction. That is, the models of the system can be reused for lip reproduction across different speakers and do not need to be swapped out for speaker-specific models to generate reproduction videos for different speakers. This specification focuses on functions and techniques related to training the lip rendering model (140). In particular, FIG. 1 illustrates the lip rendering model (140) within a broader speaker-independent architecture that includes the geometry prediction model (130) and the upscaling model (150).
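As a rough sketch of the FIG. 1 data flow, assuming each of the three models is a callable (the function name and signatures below are illustrative assumptions, not interfaces disclosed in the patent):

```python
# Hypothetical wiring of the three-stage pipeline in FIG. 1; element numbers
# in comments refer to the figure.

def lip_reproduction_pipeline(geometry_model, rendering_model, upscaling_model,
                              input_video, target_audio):
    """Speaker-independent pipeline (100): predict the facial landmark
    geometry (130), render lip-synced frames (140), then upscale (150)."""
    output_frames = []
    for frame in input_video:
        geometry = geometry_model(frame, target_audio)  # landmark geometry (135)
        rendered = rendering_model(frame, geometry)     # output image (145)
        output_frames.append(upscaling_model(rendered))
    return output_frames
```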
More specifically, the geometry prediction model (130) can generate a facial landmark geometry (135) of the target lip movement trajectory for the lip rendering model (140), which in turn generates an output image (145) of the speaker speaking with the target lip movement trajectory. In particular, the geometry prediction model (130) can process the raw audio of the target speech variant (120) and the input video (110) of the speaker to generate an intended facial landmark geometry (135) for the speaker. In some examples, the geometry prediction model (130) can compute the facial landmark geometry (135) from facial landmark detection of the speaker's face in different parts of the input video (110). As an example, the model can process a video of a speaker speaking in English together with raw audio of the speech in French to generate a facial landmark geometry (135) that matches the speaker speaking in French. In some cases, the facial landmark geometry (135) is a numerical representation of the facial landmarks of the speaker's face in the input video (110) in three dimensions (3D coordinates). Facial landmarks are essential attributes of the human face, such as the eyes, nose, and mouth, which can be used to distinguish different faces. As an additional example, the facial landmark geometry (135) may include details such as the corners of the eyes or dimples. In this specification, the facial landmark geometry (135) may be used to characterize the lip shape and facial expression of the frame intended to be reproduced. As an example, the facial landma