CN-121999404-A - Video processing method, device, electronic equipment and storage medium
Abstract
The embodiments of the present application disclose a video processing method and apparatus, an electronic device, and a storage medium. The method comprises: extracting a canonical image from a reference face video; extracting, from the reference face video, a temporal deformation field of each video frame relative to the canonical image, wherein the temporal deformation field comprises facial motion information; performing specified style processing on the canonical image to obtain a stylized image; and generating a facial animation video according to the stylized image and the temporal deformation field. The embodiments improve the generation efficiency of the facial animation video, avoid incoherence and unnatural artifacts in the facial animation video, and improve the overall fluency of the video.
Inventors
- SUN WENZHANG
- DI DONGLIN
- MA YONGJIA
- LI HAO
- CHEN WEI
Assignees
- 北京罗克维尔斯科技有限公司
Dates
- Publication Date: 2026-05-08
- Application Date: 2024-11-08
Claims (10)
- 1. A video processing method, comprising: extracting a canonical image from a reference face video; extracting, from the reference face video, a temporal deformation field of each video frame relative to the canonical image, wherein the temporal deformation field comprises facial motion information; performing specified style processing on the canonical image to obtain a stylized image; and generating a facial animation video according to the stylized image and the temporal deformation field.
- 2. The method of claim 1, wherein said extracting a temporal deformation field of each video frame relative to the canonical image from the reference face video comprises: extracting image features from each video frame of the reference face video; identifying, from the image features, implicit keypoints that characterize facial motion; and determining the temporal deformation field based on the implicit keypoints of each video frame and the canonical image.
- 3. The method of claim 2, wherein said determining the temporal deformation field based on the implicit keypoints of each video frame and the canonical image comprises: determining a motion vector of each implicit keypoint in each video frame relative to the corresponding implicit keypoint in the canonical image, to obtain sparse motion cues; and generating the temporal deformation field according to the sparse motion cues.
- 4. The method of claim 2, further comprising, prior to said determining the temporal deformation field based on the implicit keypoints of each video frame and the canonical image: acquiring a manual control trajectory of a specified facial organ in the reference face video; wherein said determining the temporal deformation field comprises: determining the temporal deformation field according to the implicit keypoints of each video frame, the manual control trajectory, and the canonical image.
- 5. The method of claim 4, wherein said determining the temporal deformation field based on the implicit keypoints of each video frame, the manual control trajectory, and the canonical image comprises: discretizing the manual control trajectory to obtain a first motion vector for each discrete point on the time axis; determining a second motion vector of each implicit keypoint in each video frame relative to the corresponding implicit keypoint in the canonical image; superimposing the first motion vectors onto those implicit keypoints, among the second motion vectors, that represent the specified facial organ, to obtain sparse motion cues; and generating the temporal deformation field according to the sparse motion cues.
- 6. The method of claim 3 or 5, wherein the temporal deformation field of each video frame is a two-dimensional vector field characterizing the displacement of each pixel in the video frame relative to the corresponding pixel in the canonical image; and wherein said generating the temporal deformation field according to the sparse motion cues comprises: generating the temporal deformation field from the sparse motion cues through a sparse-to-dense motion generation network.
- 7. The method of claim 6, wherein a training objective of the sparse-to-dense motion generation network comprises minimizing the square of the difference between a first difference and a motion-field variation value at two adjacent time instants, the first difference being the difference between the dense motion fields at the two adjacent time instants.
- 8. A video processing apparatus, comprising: a canonical image extraction module, configured to extract a canonical image from a reference face video; a temporal deformation field extraction module, configured to extract, from the reference face video, a temporal deformation field of each video frame relative to the canonical image, wherein the temporal deformation field comprises facial motion information; a stylization module, configured to perform specified style processing on the canonical image to obtain a stylized image; and a video generation module, configured to generate a facial animation video according to the stylized image and the temporal deformation field.
- 9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the video processing method of any one of claims 1 to 7 when executing the computer program.
- 10. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the video processing method of any one of claims 1 to 7.
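The keypoint arithmetic of claims 3 and 5 can be illustrated with a minimal numpy sketch. All names and array shapes here are illustrative assumptions, not part of the patent: implicit keypoints are taken as 2-D coordinates, and the optional manual-control trajectory is assumed already discretized into one offset per time step.

```python
import numpy as np

def sparse_motion_cues(frame_kp, canonical_kp, traj_offsets=None, organ_idx=None):
    """Illustrative sketch of claims 3 and 5 (names/shapes are assumptions).

    frame_kp     : (T, K, 2) implicit keypoints of each video frame
    canonical_kp : (K, 2)    implicit keypoints of the canonical image
    traj_offsets : (T, 2)    optional discretized manual-control trajectory,
                             one "first motion vector" per time step
    organ_idx    : indices of the keypoints representing the specified organ
    """
    # "Second motion vectors": displacement of each frame's keypoints
    # relative to the canonical keypoints.
    cues = frame_kp - canonical_kp[None, :, :]          # (T, K, 2)
    if traj_offsets is not None and organ_idx is not None:
        # Superimpose the "first motion vectors" onto the specified
        # organ's keypoints only, as in claim 5.
        cues[:, organ_idx, :] += traj_offsets[:, None, :]
    return cues
```

The resulting sparse cues would then be fed to a sparse-to-dense motion generation network (claim 6) to produce the dense temporal deformation field.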
Description
Video processing method, device, electronic equipment and storage medium
Technical Field
The embodiments of the present application relate to the technical field of video processing, and in particular to a video processing method, a video processing apparatus, an electronic device, and a storage medium.
Background
In the field of facial animation video generation, although current technology has advanced considerably, many challenges and limitations remain in practical applications. First, long generation times and low computational efficiency are a significant problem, because the entire video needs to be processed. Although some methods can generate high-quality video content, they often consume a large amount of computing resources, so generation is slow and the requirements of real-time applications are difficult to meet. This computational bottleneck severely limits the wide application of the technology, and improving the efficiency and real-time performance of such models is a problem that currently needs to be solved. Furthermore, the prior art often exhibits video discontinuities and unnatural artifacts when handling large-amplitude facial movements. Because such models are limited in capturing the details of complex motion, especially severe head movements or multi-angle transformations, the generated video often suffers from misalignment, image distortion, and similar problems. Unnatural inter-frame motion not only affects the overall fluency of the video but also degrades the viewing experience.
Disclosure of Invention
The embodiments of the present application provide a video processing method, a video processing apparatus, an electronic device, and a storage medium, which help improve the generation efficiency of a facial animation video and improve the overall fluency of the video.
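The fluency concern described above is what the training objective of claim 7 addresses: the change of the dense motion field between adjacent time instants is penalized in a least-squares sense. A minimal numpy sketch, under the assumption that the dense fields are stored as a (T, H, W, 2) array and that the "motion-field variation value" defaults to zero (a pure temporal-smoothness penalty):

```python
import numpy as np

def temporal_smoothness_loss(dense_fields, variation=None):
    """Squared penalty on the change of the dense motion field between
    adjacent time steps, in the spirit of claim 7 (shapes are assumptions).

    dense_fields : (T, H, W, 2) dense temporal deformation field per frame
    variation    : (T-1, H, W, 2) expected motion-field variation between
                   adjacent times; defaults to zero.
    """
    # "First difference": dense field at time t+1 minus the field at time t.
    first_diff = dense_fields[1:] - dense_fields[:-1]
    if variation is None:
        variation = np.zeros_like(first_diff)
    # Mean of the squared difference between the first difference and the
    # motion-field variation value.
    return float(np.mean((first_diff - variation) ** 2))
```

With a zero variation target, minimizing this term pushes adjacent dense motion fields toward each other, discouraging inter-frame jitter.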
In order to solve the above problems, in a first aspect, an embodiment of the present application provides a video processing method, including: extracting a canonical image from a reference face video; extracting, from the reference face video, a temporal deformation field of each video frame relative to the canonical image, wherein the temporal deformation field comprises facial motion information; performing specified style processing on the canonical image to obtain a stylized image; and generating a facial animation video according to the stylized image and the temporal deformation field.
In a second aspect, an embodiment of the present application provides a video processing apparatus, including: a canonical image extraction module, configured to extract a canonical image from a reference face video; a temporal deformation field extraction module, configured to extract, from the reference face video, a temporal deformation field of each video frame relative to the canonical image, wherein the temporal deformation field comprises facial motion information; a stylization module, configured to perform specified style processing on the canonical image to obtain a stylized image; and a video generation module, configured to generate a facial animation video according to the stylized image and the temporal deformation field.
In a third aspect, an embodiment of the present application further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the video processing method according to the embodiments of the present application is implemented when the processor executes the computer program.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the video processing method disclosed in the embodiments of the present application.
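The final generation step can be pictured as backward warping: claim 6 defines the temporal deformation field of a frame as a two-dimensional vector field giving the displacement of each pixel relative to the corresponding pixel in the canonical image, so each output frame can be obtained by sampling the stylized image at displaced locations. A minimal numpy sketch with bilinear, edge-clamped sampling; the function name and the (dy, dx) channel order are illustrative assumptions:

```python
import numpy as np

def warp_by_deformation_field(stylized, field):
    """Backward-warp the stylized image with a per-frame 2-D deformation
    field (illustrative sketch; channel order (dy, dx) is an assumption).

    stylized : (H, W, C) stylized canonical image
    field    : (H, W, 2) per-pixel displacement relative to the canonical image
    """
    H, W, _ = stylized.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Each output pixel samples the stylized image at its displaced source
    # location, clamped to the image borders.
    sy = np.clip(ys + field[..., 0], 0, H - 1)
    sx = np.clip(xs + field[..., 1], 0, W - 1)
    y0, x0 = np.floor(sy).astype(int), np.floor(sx).astype(int)
    y1, x1 = np.minimum(y0 + 1, H - 1), np.minimum(x0 + 1, W - 1)
    wy, wx = (sy - y0)[..., None], (sx - x0)[..., None]
    # Bilinear interpolation between the four neighbouring pixels.
    top = stylized[y0, x0] * (1 - wx) + stylized[y0, x1] * wx
    bot = stylized[y1, x0] * (1 - wx) + stylized[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

Applying this per frame, with the deformation fields produced from the reference video, yields a facial animation video in the appearance of the single stylized image, which is why only one image needs to pass through the style-processing step.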
According to the video processing method, the video processing apparatus, the electronic device, and the storage medium provided by the embodiments of the present application, a canonical image is extracted from a reference face video; a temporal deformation field of each video frame relative to the canonical image, comprising facial motion information, is extracted from the reference face video; the canonical image is subjected to specified style processing to obtain a stylized image; and a facial animation video is generated according to the stylized image and the temporal deformation field.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the embodiments or in the description of the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained from these drawings without inventive effort by a person skilled in the art.
Fig. 1 is a flowchart of a video processing method according to an embodime