CN-121982063-A - Speaker face video generation method based on neural radiance field spatial position optimization
Abstract
The invention relates to a speaker face video generation method based on neural radiance field spatial position optimization. The method extracts voice information and a face image region from a reference video, samples the face image region of the first video frame, and tracks the positions of the sampling points across the later video frames, determining N sampling points as two-dimensional key points. N three-dimensional key points are then defined in three-dimensional space and mapped onto the image plane of the sampling points, using the camera extrinsic parameters, the camera intrinsic parameters and the rotation-translation parameters of the spatial points, to generate N two-dimensional plane points. The Euclidean distances between the N two-dimensional key points and the N two-dimensional plane points are computed, and the relevant parameters and three-dimensional key point positions are adjusted accordingly, yielding a three-dimensional face reconstruction model of the target face. Finally, a speaker face video of the target face is generated based on a target viewing angle, target voice information and the three-dimensional face reconstruction model of the target face.
Inventors
- ZHANG NING
- HE QIN
- ZHAO SHUAI
Assignees
- 北京中科睿鉴科技有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20251231
Claims (7)
- 1. A speaker face video generation method based on neural radiance field spatial position optimization, characterized by comprising the following steps: extracting each video frame and the voice information corresponding to each video frame from a reference video containing a target face, and extracting a face image region from each video frame; sampling the face image region of the first video frame with a point tracking algorithm, and tracking the positions of the sampling points across the later video frames; determining N sampling points as two-dimensional key points based on the tracking results of the point tracking algorithm; defining N trainable three-dimensional key points in three-dimensional space, and mapping the three-dimensional key points onto the image plane of the sampling points, using the camera extrinsic parameters, the camera intrinsic parameters and the rotation-translation parameters of the spatial points, to generate N two-dimensional plane points; computing the Euclidean distances between the N two-dimensional key points and the N two-dimensional plane points, and adjusting the camera extrinsic parameters, the camera intrinsic parameters, the spatial point rotation-translation parameters and the three-dimensional key point positions based on the Euclidean distances; obtaining a three-dimensional face reconstruction model of the target face based on the voice information, camera extrinsic parameters, camera intrinsic parameters, spatial point rotation-translation parameters and three-dimensional key point positions corresponding to the video frames; and generating a speaker face video of the target face under a target viewing angle and target voice information based on the target viewing angle, the target voice information and the three-dimensional face reconstruction model of the target face.
- 2. The speaker face video generation method based on neural radiance field spatial position optimization according to claim 1, wherein determining N sampling points as two-dimensional key points based on the tracking results of the point tracking algorithm comprises: performing semantic segmentation on the first frame of the reference video to obtain the position of the face region; selecting all sampling points that fall on the face region in the first frame; further screening these points, keeping only those that remain visible in every subsequent frame; and, among the sampling points satisfying the above conditions, randomly extracting 50 points as valid points.
- 3. The speaker face video generation method based on neural radiance field spatial position optimization according to claim 1, wherein adjusting the camera extrinsic parameters, the camera intrinsic parameters, the spatial point rotation-translation parameters and the three-dimensional key point positions based on the Euclidean distances comprises: adjusting the positions of the three-dimensional key points based on the Euclidean distances to obtain first positions of the three-dimensional key points; adjusting the camera extrinsic parameters and the positions of the three-dimensional key points based on the Euclidean distances to obtain the adjusted camera extrinsic parameters and second positions of the three-dimensional key points; and adjusting the camera intrinsic parameters, the spatial point rotation-translation parameters and the three-dimensional key point positions based on the Euclidean distances to obtain the adjusted camera intrinsic parameters, the adjusted spatial point rotation-translation parameters and third positions of the three-dimensional key points.
- 4. The speaker face video generation method based on neural radiance field spatial position optimization according to claim 1, wherein obtaining a three-dimensional face reconstruction model of the target face based on the voice information, camera extrinsic parameters, camera intrinsic parameters, spatial point rotation-translation parameters and three-dimensional key point positions corresponding to the video frames comprises: training a three-dimensional model of the target face with the neural radiance field technique, based on the camera extrinsic parameters, camera intrinsic parameters and spatial point rotation-translation parameters corresponding to each video frame, combined with the spatial position information of the three-dimensional key points; and taking the voice information, camera extrinsic parameters, camera intrinsic parameters and spatial point rotation-translation parameters corresponding to the video frames as inputs and the corresponding video frames as outputs, and, in combination with the three-dimensional face model, obtaining by inference a video of the target person with consistent voice and lip motion.
- 5. A speaker face video generation device based on neural radiance field spatial position optimization, comprising: an information extraction module for extracting each video frame and the voice information corresponding to each video frame from a reference video containing a target face, and extracting a face image region from each video frame; a point tracking module for sampling the face image region of the first video frame with a point tracking algorithm and tracking the positions of the sampling points across the subsequent video frames; a key point module for determining N sampling points as two-dimensional key points based on the tracking results of the point tracking algorithm; a point mapping module for defining N trainable three-dimensional key points in three-dimensional space and mapping the three-dimensional key points onto the image plane of the sampling points, using the camera extrinsic parameters, the camera intrinsic parameters and the rotation-translation parameters of the spatial points, to generate N two-dimensional plane points; a parameter adjustment module for computing the Euclidean distances between the N two-dimensional key points and the N two-dimensional plane points and adjusting the camera extrinsic parameters, camera intrinsic parameters, spatial point rotation-translation parameters and three-dimensional key point positions based on the Euclidean distances; a model construction module for obtaining a three-dimensional face reconstruction model of the target face based on the voice information, camera extrinsic parameters, camera intrinsic parameters, spatial point rotation-translation parameters and three-dimensional key point positions corresponding to the video frames; and a video generation module for generating a speaker face video of the target face under a target viewing angle and target voice information based on the target viewing angle, the target voice information and the three-dimensional face reconstruction model of the target face.
- 6. A storage medium storing a computer program executable by a processor, wherein the computer program, when executed, performs the steps of the speaker face video generation method based on neural radiance field spatial position optimization of any one of claims 1 to 4.
- 7. A face video generation device having a memory and a processor, the memory storing a computer program executable by the processor, wherein the computer program, when executed, implements the steps of the speaker face video generation method based on neural radiance field spatial position optimization of any one of claims 1 to 4.
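The mapping and distance computation of claim 1 can be sketched as a standard pinhole projection followed by a mean Euclidean distance. This is a minimal sketch under assumed conventions: the function names are illustrative, and the claim's separate spatial point rotation-translation parameters are folded into a single extrinsic (R, t) pair here for brevity.

```python
import numpy as np

def project_points(X, R, t, K):
    """Map N three-dimensional key points onto the image plane.

    X: (N, 3) trainable 3D key points in world coordinates
    R, t: camera extrinsics (world-to-camera rotation matrix and translation)
    K: (3, 3) camera intrinsic matrix
    Returns the N two-dimensional plane points, shape (N, 2).
    """
    Xc = X @ R.T + t              # world -> camera coordinates
    uvw = Xc @ K.T                # camera -> homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3]

def keypoint_loss(kp2d, X, R, t, K):
    """Mean Euclidean distance between tracked 2D key points and projections."""
    diff = project_points(X, R, t, K) - kp2d
    return np.linalg.norm(diff, axis=1).mean()
```

When the tracked two-dimensional key points coincide with the projections, the loss is zero; adjusting the parameters to reduce this loss is the optimization described in the claim.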
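The four screening steps of claim 2 can be sketched as follows. The function name and the array layouts (frame-0 pixel coordinates, a per-frame visibility matrix from the point tracker, a boolean mask from semantic segmentation) are illustrative assumptions, not the patent's actual interfaces.

```python
import numpy as np

def select_valid_points(pts0, visible, face_mask, n_valid=50, seed=0):
    """Filter tracked sampling points down to the valid set used as 2D key points.

    pts0: (P, 2) integer (x, y) pixel positions of sampling points in frame 0
    visible: (P, F) boolean visibility of each point over the F later frames
    face_mask: (H, W) boolean face region from segmenting the first frame
    Returns indices of up to n_valid randomly chosen valid points.
    """
    on_face = face_mask[pts0[:, 1], pts0[:, 0]]   # steps 1-2: points on the face region
    always_seen = visible.all(axis=1)             # step 3: visible in every later frame
    candidates = np.flatnonzero(on_face & always_seen)
    rng = np.random.default_rng(seed)             # step 4: random draw of valid points
    n = min(n_valid, candidates.size)
    return rng.choice(candidates, size=n, replace=False)
```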
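The three-stage adjustment of claim 3 can be sketched as staged gradient descent on the mean Euclidean distance, where each stage enlarges the set of optimized parameters. This is a simplified sketch: gradients are taken numerically, the camera rotation is held fixed (a real implementation would parameterize it, e.g. as axis-angle, and would also include the separate spatial point rotation-translation parameters), and the stage schedule only approximates the claim.

```python
import numpy as np

def project(X, R, t, K):
    """Pinhole projection of 3D key points to 2D plane points."""
    Xc = X @ R.T + t
    uvw = Xc @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def mean_dist(kp2d, X, R, t, K):
    """Mean Euclidean distance between tracked and projected key points."""
    return np.linalg.norm(project(X, R, t, K) - kp2d, axis=1).mean()

def num_grad(f, p, eps=1e-5):
    """Central-difference gradient of scalar f() with respect to array p (in place)."""
    g = np.zeros_like(p)
    for i in np.ndindex(p.shape):
        old = p[i]
        p[i] = old + eps
        up = f()
        p[i] = old - eps
        dn = f()
        p[i] = old
        g[i] = (up - dn) / (2 * eps)
    return g

def staged_fit(kp2d, X, R, t, K, steps=100, lr=1e-5):
    """Stage 1: key points only; stage 2: + extrinsic translation; stage 3: + intrinsics."""
    stages = [[X], [X, t], [X, t, K]]
    for active in stages:
        for _ in range(steps):
            for p in active:
                p -= lr * num_grad(lambda: mean_dist(kp2d, X, R, t, K), p)
    return mean_dist(kp2d, X, R, t, K)
```

Because each stage starts from the previous stage's result, the three key point positions of the claim correspond to snapshots of `X` after each stage.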
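The training step of claim 4 conditions the radiance field on per-frame audio. A generic sketch of how such a conditional input can be assembled (NeRF-style positional encoding of a sample position, concatenated with the view direction and an audio feature, in the spirit of AD-NeRF; the patent's actual network and feature dimensions are not specified, so all sizes here are assumptions):

```python
import numpy as np

def positional_encoding(x, n_freq=6):
    """NeRF-style sinusoidal encoding of 3D sample positions, shape (S, 3 + 6*n_freq)."""
    freqs = (2.0 ** np.arange(n_freq)) * np.pi
    parts = [x]
    for fr in freqs:
        parts += [np.sin(x * fr), np.cos(x * fr)]
    return np.concatenate(parts, axis=-1)

def conditioned_input(pos, view_dir, audio_feat):
    """Assemble the per-sample MLP input: encoded position + view direction + audio.

    pos: (S, 3) sample positions along camera rays
    view_dir: (S, 3) viewing directions
    audio_feat: (A,) audio feature for the current frame, shared by all samples
    """
    a = np.broadcast_to(audio_feat, (pos.shape[0], audio_feat.shape[-1]))
    return np.concatenate([positional_encoding(pos), view_dir, a], axis=-1)
```

The MLP then regresses an RGB color and a density value per sample, as in a standard neural radiance field.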
Description
Speaker face video generation method based on neural radiance field spatial position optimization
Technical Field
The invention relates to a speaker face video generation method based on neural radiance field spatial position optimization, and is applicable to the fields of machine learning and computer vision.
Background
The rapid development of artificial intelligence generation algorithms has greatly advanced video generation, a trend particularly prominent in face video generation technology. In the speaker face video generation task, given the audio of a person and an image or video of a target person, a model generates a lip-synced speaking video of that person. The technology already has a number of practical applications, such as AI interviews and digital live streaming. In recent years, the neural radiance field (NeRF) technique has received widespread attention in computer vision. The technique implicitly models a three-dimensional scene from multi-view images and the corresponding camera parameters, learning the RGB color value and density value of each point in the scene to achieve accurate three-dimensional reconstruction. Thanks to its excellent visual quality, NeRF is gradually expanding to fields such as speaker face video generation. In 2021, Guo et al. proposed AD-NeRF, which uses audio information as a conditional feature, concatenates it with the image position and view direction vectors, and feeds them together into a multi-layer perceptron for modeling, thereby achieving semantic matching of audio-visual signals.
That work also trains the head and the torso of the person separately with two neural radiance fields, achieving a better rendering effect. Li et al. proposed ER-NeRF, which optimizes the spatial region division with three planar hash encoders, reducing the scale of the feature grid and markedly improving the training efficiency and rendering speed of the model. Meanwhile, to address the separation of head and torso, ER-NeRF introduces an adaptive pose encoding strategy. Unlike conventional methods that directly use the entire image or pose matrix as a condition, this technique maps complex head pose transformations into spatial coordinates with more explicit position information, improving the pose of the torso and significantly increasing the accuracy and naturalness of the synthesized result. Yu et al. found that early neural radiance field-based generation methods exhibit poor generalization when processing out-of-domain audio, owing to the limited size of the training data. In response, they proposed GeneFace, which trains a variational motion generator on a large corpus and uses a SyncNet network to supervise the synchronization of audio and key points, generating diverse features that contain rich audio-motion mapping information. To ensure that the audio-motion mapping features adapt to a specific person, the work also applies adversarial domain adaptation training to the features, migrating features learned on the large-scale dataset to the target person's domain and achieving a more realistic generation effect. Existing neural radiance field-based speaker face video generation methods still have certain shortcomings in the control of head motion.
When the head motion of the speaker is large, existing methods generally struggle to capture accurate spatial position information, so blurred regions appear in the generated video and the visual effect suffers. In addition, talking face videos generated with neural radiance fields are prone to imperfections in the scaling of the person's head.
Disclosure of Invention
To address these problems, the invention provides a speaker face video generation method based on neural radiance field spatial position optimization. The technical scheme adopted by the invention is a speaker face video generation method based on neural radiance field spatial position optimization, comprising the following steps: extracting each video frame and the voice information corresponding to each video frame from a reference video containing a target face, and extracting a face image region from each video frame; sampling the face image region of the first video frame with a point tracking algorithm, and tracking the positions of the sampling points across the later video frames; determining N sampling points as two-dimensional key points based on the tracking results of the point tracking algorithm; defining N trainable three-dime