EP-3912085-B1 - SYSTEMS AND METHODS FOR FACE REENACTMENT
Inventors
- SAVCHENKOV, Pavel
- MATOV, Dmitry
- MASHRABOV, Aleksandr
- PCHELNIKOV, Alexey
Dates
- Publication Date: 2026-05-06
- Application Date: 2020-01-18
Claims (14)
- A computer-implemented method for face reenactment, the method comprising: receiving (505) a target video (125), the target video including at least one target frame (320), the at least one target frame (320) including a target face (140); receiving (510) a source video (305), the source video (305) including a source face (315); determining (515), using a parametric face model (405) and a texture model (410), a target set of target parameters based on the target face (140) in the at least one target frame (320) of the target video (125), the target set of target parameters including at least a target facial expression; determining (520), using the parametric face model (405) and the texture model (410), a source set of source parameters based on the source face (315) in a frame (310) of the source video (305), the source set of source parameters including at least a source facial expression; synthesizing (525), using the parametric face model (405) and the texture model (410), an output face (350), the output face (350) including the target face (140), wherein the target facial expression is modified to imitate the source facial expression by applying a deformation transfer to the target set of target parameters to account for the source facial expression of the source face (315); generating (530), based on a deep neural network (415), DNN, the source facial expression, and at least one previous frame of the target video (125), a mouth region and an eyes region, the mouth region and the eyes region generated using the DNN (415) being more photorealistic than a mouth region and an eyes region synthesized by the parametric face model (405) and the texture model (410); and combining (535) the output face (350), the mouth region, and the eyes region to generate a frame (345) of an output video (340), wherein the mouth region and the eyes region generated using the DNN (415) are used to replace the mouth region and the eyes region synthesized by the parametric face model (405) and the texture model (410); wherein the parametric face model (405) includes a template mesh pre-generated based on historical images of faces of a plurality of individuals, the template mesh including a pre-determined number of vertices.
- The method of claim 1, wherein the parametric face model (405) depends on a facial expression, a facial identity, and a facial texture.
- The method of claim 1 or 2, wherein the texture model (410) includes a set of colors associated with the vertices.
- The method of claim 1, 2 or 3, wherein the individuals are of different ages, genders, and ethnicities.
- The method of any of the preceding claims, wherein the historical images of faces include at least one set of pictures belonging to a single individual having a pre-determined number of facial expressions.
- The method of claim 5, wherein the facial expressions include at least one of a neutral expression, a mouth-open expression, a smile, and an angry expression.
- The method of claim 6, wherein the parametric face model (405) includes a set of blend shapes, the blend shapes representing the facial expressions.
- The method of any of the preceding claims, wherein an input of the DNN includes at least parameters associated with the parametric face model (405).
- The method of any of the preceding claims, wherein an input of the DNN includes a previous mouth region and a previous eyes region, the previous mouth region and the previous eyes region being associated with at least one previous frame of the target video (125).
- The method of any of the preceding claims, wherein the DNN is trained using historical images of faces of a plurality of individuals.
- The method of any of the preceding claims, wherein the method is performed by a computing device (110, 900).
- A computing system comprising at least one processor (210, 910) configured to perform the method of any of claims 1 to 10.
- A non-transitory processor-readable medium comprising instructions, which when executed by a computer (110, 900) cause the computer (110, 900) to perform the method of any of claims 1 to 10.
- A computer program product comprising instructions which, when the program is executed by a computer (110, 900), cause the computer (110, 900) to carry out the method of any of claims 1 to 10.
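For orientation before the description, the following is a minimal Python sketch of the pipeline recited in claim 1; the step reference numerals (505-535) are cited in the comments. All function and class names, array shapes, and the toy random "fitting" are illustrative assumptions, not the patented implementation.

```python
# Minimal, hypothetical sketch of the claim-1 pipeline (steps 505-535).
# Names, shapes, and the toy random "fitting" are illustrative assumptions.
from dataclasses import dataclass
import numpy as np

N_VERTS, N_ID, N_EXPR = 100, 8, 4          # toy template-mesh dimensions
rng = np.random.default_rng(0)
TEMPLATE = rng.normal(size=(N_VERTS, 3))        # pre-generated template mesh
ID_BASIS = rng.normal(size=(N_ID, N_VERTS, 3))  # facial-identity basis
EXPR_BASIS = rng.normal(size=(N_EXPR, N_VERTS, 3))  # expression blend shapes

@dataclass
class FaceParams:
    identity: np.ndarray    # (N_ID,) facial-identity coefficients
    expression: np.ndarray  # (N_EXPR,) facial-expression coefficients
    texture: np.ndarray     # (N_VERTS, 3) per-vertex colors (texture model)

def fit_parameters(frame) -> FaceParams:
    """Stand-in for fitting the parametric face and texture models (515/520)."""
    return FaceParams(rng.normal(size=N_ID), rng.normal(size=N_EXPR),
                      rng.uniform(size=(N_VERTS, 3)))

def synthesize(params: FaceParams) -> np.ndarray:
    """Linear blend-shape synthesis (525): template + identity + expression."""
    return (TEMPLATE
            + np.tensordot(params.identity, ID_BASIS, axes=1)
            + np.tensordot(params.expression, EXPR_BASIS, axes=1))

def composite(face_geometry, mouth, eyes):
    """Stand-in for combining (535); a real system would render and blend."""
    return {"face": face_geometry, "mouth": mouth, "eyes": eyes}

class DummyDNN:
    """Stand-in for the DNN (415) that generates mouth and eye regions (530)."""
    def generate(self, expression, prev_frame):
        return np.zeros((48, 96, 3)), np.zeros((32, 160, 3))

def reenact_frame(target_frame, source_frame, dnn, prev_target_frame):
    tgt = fit_parameters(target_frame)   # (515) target parameters
    src = fit_parameters(source_frame)   # (520) source parameters
    # Deformation transfer (toy version): keep the target identity and
    # texture, imitate the source facial expression.
    tgt.expression = src.expression.copy()
    output_face = synthesize(tgt)        # (525) synthesized output face
    mouth, eyes = dnn.generate(src.expression, prev_target_frame)  # (530)
    return composite(output_face, mouth, eyes)                     # (535)

frame = reenact_frame(None, None, DummyDNN(), None)  # toy end-to-end run
```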
Description
TECHNICAL FIELD

This disclosure generally relates to digital image processing. More particularly, this disclosure relates to methods and systems for face reenactment.

BACKGROUND

Face reenactment may include transferring a facial expression of a source individual in a source video to a target individual in a target video or a target image. Face reenactment can be used for the manipulation and animation of faces in many applications, such as entertainment shows, computer games, video conversations, virtual reality, augmented reality, and the like.

Some current techniques for face reenactment utilize morphable face models to re-render the target face with a different facial expression. While generation of a face with a morphable face model can be fast, the generated face may not be photorealistic. Other current techniques for face reenactment are based on deep learning methods to re-render the target face. Deep learning methods can produce photorealistic results; however, they are time-consuming and may not be suitable for real-time face reenactment on regular mobile devices.

The scientific paper "Warp-Guided GANs for Single-Photo Facial Animation" by Geng et al. discloses a method that generates an animated image from a single portrait photo and a set of facial landmarks derived from a driving source. A global 2D warp is performed on the target portrait photo, and the displacements of the control points are transferred from the motion parameters of the driving source. The Displaced Dynamic Expression (DDE) algorithm is used to track the face of the source person; it simultaneously detects the 2D facial landmarks and recovers the 3D blendshapes, as well as the corresponding expression and 3D pose. DDE is also used to recover these initial properties for the target image. The expression and 3D pose are transferred from the source to the target. The transformed target 3D facial landmarks on the face mesh are projected onto the target image to obtain the displaced 2D landmarks. Confidence-aware warping is applied to the target image. The facial region is extracted, and the 2D facial landmarks are interpolated to generate a per-pixel displacement map which carries the fine movements of the face under the global 2D warp. The displacement map is fed, together with the warped face image, into a generative adversarial neural network to generate a final detail-refined facial image.

The scientific paper "Real-time Face Capture and Reenactment of RGB Videos" by Thies et al. discloses the Displaced Dynamic Expression (DDE) algorithm in more detail and relates to a method for reenacting a monocular target video sequence (e.g., from YouTube) based on the expressions of a source actor who is recorded live with a commodity webcam. The shape identity of the target actor is reconstructed using a global non-rigid model-based bundling approach based on a prerecorded video training sequence of the actor.

The scientific paper "Face2Face: Real-Time Face Capture and Reenactment of RGB Videos" by Thies et al. discloses a method to animate the facial expressions of the target video by a source actor and re-render the manipulated output video in a photorealistic fashion.
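To make the expression-and-pose transfer described in the Geng et al. pipeline above concrete, here is a minimal sketch under assumed conventions (a linear blendshape rig and a weak-perspective camera): expression weights and head pose recovered from the source drive the target's 3D landmarks, which are then projected to displaced 2D landmarks for warping. Names and dimensions are illustrative; this is not the published DDE algorithm.

```python
# Toy illustration of transferring expression and 3D pose to a target and
# projecting the displaced landmarks to 2D; not the published DDE algorithm.
import numpy as np

rng = np.random.default_rng(1)
N_LANDMARKS, N_EXPR = 68, 5
TARGET_NEUTRAL = rng.normal(size=(N_LANDMARKS, 3))       # target neutral shape
BLENDSHAPES = rng.normal(size=(N_EXPR, N_LANDMARKS, 3))  # expression offsets

def apply_expression(neutral, blendshapes, weights):
    """3D landmarks = neutral shape + weighted sum of expression blendshapes."""
    return neutral + np.tensordot(weights, blendshapes, axes=1)

def project(points_3d, rotation, scale=1.0, translation=(0.0, 0.0)):
    """Weak-perspective projection of 3D landmarks onto the image plane."""
    rotated = points_3d @ rotation.T
    return scale * rotated[:, :2] + np.asarray(translation)

# Expression weights and head pose recovered from the *source* actor (random
# stand-ins here) drive the *target* landmark configuration.
source_weights = rng.uniform(size=N_EXPR)
source_rotation = np.eye(3)  # identity pose for simplicity
landmarks_2d = project(
    apply_expression(TARGET_NEUTRAL, BLENDSHAPES, source_weights),
    source_rotation)
print(landmarks_2d.shape)  # (68, 2): displaced 2D landmarks used for warping
```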
SUMMARY

The invention is defined in the independent claims. Particular embodiments are set out in the dependent claims. This section is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description section. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

According to one embodiment of the disclosure, a method for face reenactment is provided. The method may include receiving, by a computing device, a target video. The target video may include at least one target frame. The at least one target frame may include a target face. The method may further include receiving, by the computing device, a source video. The source video may include a source face. The method may further include determining, by the computing device and based on the target face in the at least one frame of the target video, at least a target facial expression. The method may further include determining, by the computing device and based on the source face in a frame of the source video, at least a source facial expression. The method may further include synthesizing, by the computing device and using a parametric face model, an output face. The output face may include the target face, wherein the target facial expression is modified to imitate the source facial expression. The method may further include generating, by the computing device and based on a deep neural network (DNN), a mouth region and an eyes region. The method may further include combining, by the computing device, the output face, the mouth region, and the eyes region to generate a frame of an output video.
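As a closing illustration of the combining step, below is a minimal sketch of replacing the model-synthesized mouth and eyes regions of a rendered face image with DNN-generated patches. The bounding-box coordinates, image sizes, and constant-valued patches are assumptions for illustration; the disclosure does not fix them.

```python
# Hypothetical compositing step: DNN-generated mouth and eye patches replace
# the corresponding regions of the model-rendered face. All coordinates and
# shapes are illustrative assumptions.
import numpy as np

def paste(canvas: np.ndarray, patch: np.ndarray, top: int, left: int) -> None:
    """Overwrite a rectangular region of the canvas with the patch."""
    h, w = patch.shape[:2]
    canvas[top:top + h, left:left + w] = patch

rendered_face = np.zeros((256, 256, 3), dtype=np.float32)  # from the face model
dnn_eyes = np.ones((32, 160, 3), dtype=np.float32)   # generated by the DNN
dnn_mouth = np.ones((48, 96, 3), dtype=np.float32)   # generated by the DNN

output_frame = rendered_face.copy()
paste(output_frame, dnn_eyes, top=90, left=48)    # eyes region replaced
paste(output_frame, dnn_mouth, top=170, left=80)  # mouth region replaced
```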