CN-121768058-B - Face information recognition optimization method and system
Abstract
The invention provides a facial information recognition optimization method and system in the technical field of computer vision. The method comprises: acquiring a facial video of a target person and video editing content input by a user; processing the facial video through a video depth estimator to obtain a depth map corresponding to each frame of the facial video; mapping the facial video and the depth maps into a camera coordinate system to obtain a dynamic point cloud of the target person; projecting the dynamic point cloud back onto the camera plane through a perspective projection function to obtain a video-mask pair consisting of a rendered video and a visibility mask; and processing the video-mask pair through a pre-trained video inpainting model, based on the video editing content, to obtain a 4D facial video of the target person. Because the video inpainting model is trained with a composite mask as supervision, it can both repair occlusions of the target person's face and apply user edits, which greatly improves the precision of face recognition for the target person.
Inventors
- ZHANG XINHUA
- OUYANG BO
Assignees
- 合肥工业大学 (Hefei University of Technology)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-03-02
Claims (9)
- 1. A facial information recognition optimization method, the method comprising: acquiring a facial video of a target person and video editing content input by a user; processing the facial video through a video depth estimator to obtain a depth map corresponding to each frame of the facial video; mapping the facial video and the depth maps into a camera coordinate system to obtain a dynamic point cloud of the target person; projecting the dynamic point cloud back onto the camera plane through a perspective projection function to obtain a video-mask pair consisting of a rendered video and a visibility mask; and processing the video-mask pair through a pre-trained video inpainting model based on the video editing content to obtain a 4D facial video of the target person, wherein the video inpainting model is constructed on a diffusion model and is obtained by adaptive iterative tuning with a composite mask as the supervision signal, the composite mask comprising a point cloud mask and an editing mask; the video inpainting model comprises temporal-reasoning wrapping, and the temporal-reasoning wrapping step comprises: computing the repair region of the 4D video, extracting a selection frame, and inferring a new rendered video from a new camera view angle; encoding the selection frame and the new rendered video through a pre-trained 3D-VAE to obtain selection-frame tokens and hole-video tokens; and splicing the selection-frame tokens and the hole-video tokens along the time dimension to obtain the joint input X_input of the video inpainting model.
- 2. The facial information recognition optimization method of claim 1, wherein processing the video-mask pair through a pre-trained video inpainting model based on the video editing content to obtain the 4D facial video of the target person comprises: selecting video frames and video regions based on the video editing content to obtain the corresponding selection frame and editing region; and editing the editing region in the selection frame based on the video editing content, the edited selection frame serving as a guide frame that guides the video inpainting model's generation of subsequent frames.
- 3. The facial information recognition optimization method according to claim 1, wherein the construction of the point cloud mask comprises: acquiring a training video input by a user; processing the training video to obtain a training video-mask pair consisting of a training rendered video and a training visibility mask; back-projecting the training rendered video to obtain a training dynamic point cloud; and applying an inverse transformation to the training dynamic point cloud to obtain a new video and the point cloud mask.
- 4. The facial information recognition optimization method of claim 3, wherein the adaptive iterative tuning comprises: step S210, performing one round of fine-tuning on the video inpainting model using low-rank adaptation (LoRA) based on the training video-mask pair to obtain updated low-rank parameters; step S220, processing the training video-mask pair with the low-rank parameters and the video inpainting model to obtain a 4D video; step S230, enlarging the camera view angle corresponding to the 4D video to obtain a new training camera view angle, and inferring a new training video from the new training camera view angle; step S240, processing the new training video to obtain a new training video-mask pair; and step S250, repeating steps S210-S240 until the new training camera view angle reaches the preset training camera view angle, then stopping the adaptive iterative tuning and taking the updated low-rank parameters together with the video inpainting model as the trained video inpainting model.
- 5. The facial information recognition optimization method of claim 4, wherein the adaptive iterative tuning minimizes a spatio-temporal consistency mean squared error loss $\mathcal{L}_{\mathrm{mse}} = \frac{1}{N}\sum_{i=1}^{N}\left\| M_j \odot \left( V_j^{i} - \psi(P_i, T_i, K) \right) \right\|_2^2$ and applies the low-rank parameter update $\theta_{\mathrm{LoRA}} \leftarrow \theta_{\mathrm{LoRA}} + \Delta\theta$, wherein $\psi(\cdot)$ represents the perspective projection function used to extrapolate the view angle; $\mathcal{L}_{\mathrm{mse}}$ represents the spatio-temporal consistency mean squared error loss; $N$ represents the number of video frames; $i$ indexes the selected video frame; $P$ represents the point cloud sequence; $T$ represents the extrinsic matrix of the camera; $K$ represents the intrinsic matrix of the camera; $\psi(P_i, T_i, K)$ represents the rendered video frame obtained by the perspective projection function; $V_j^{i}$ represents the rendered video obtained in iteration round $j$, with $j$ the corresponding iteration round; $M_j$ represents the mask in iteration round $j$; $\Delta\theta$ represents the low-rank parameter update; and $\theta_{\mathrm{LoRA}}$ represents the LoRA weights.
- 6. The facial information recognition optimization method of claim 1, further comprising verifying identity consistency between the facial video and the 4D facial video using an ArcFace module and a Face Encoder module.
- 7. A facial information recognition optimization system applying the facial information recognition optimization method according to claim 1, comprising: a data acquisition module for acquiring a facial video of a target person and video editing content input by a user; a first processing module for processing the facial video through a video depth estimator to obtain a depth map corresponding to each frame of the facial video; a second processing module for mapping the facial video and the depth maps into a camera coordinate system to obtain a dynamic point cloud of the target person; a third processing module for projecting the dynamic point cloud back onto the camera plane through a perspective projection function to obtain a video-mask pair consisting of a rendered video and a visibility mask; and a video inpainting module for processing the video-mask pair through a pre-trained video inpainting model based on the video editing content to obtain the 4D facial video of the target person, wherein the video inpainting model is constructed on a diffusion model and is obtained by adaptive iterative tuning with a composite mask as the supervision signal, the composite mask comprising a point cloud mask and an editing mask.
- 8. A computer-readable storage medium storing a computer program for facial information recognition optimization, wherein the computer program causes a computer to execute the facial information recognition optimization method according to any one of claims 1 to 6.
- 9. An electronic device, comprising: one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the facial information recognition optimization method according to any one of claims 1 to 6.
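As an illustrative aside, the temporal-reasoning wrapping of claim 1 (encoding the selection frame and the newly rendered video into token sequences, then splicing them along the time dimension into the joint input X_input) can be sketched as follows. This is a minimal sketch in numpy: the `encode_stub` function, the latent dimension, and the array shapes are assumptions standing in for the claim's pre-trained 3D-VAE, which is not reproduced here.

```python
import numpy as np

def encode_stub(video, latent_dim=8):
    """Stand-in for the pre-trained 3D-VAE encoder (hypothetical): maps a
    video of shape (T, H, W, C) to a token sequence of shape (T, latent_dim).
    A real 3D-VAE would also compress spatially and temporally."""
    t = video.shape[0]
    return video.reshape(t, -1)[:, :latent_dim].astype(np.float32)

def build_joint_input(selection_frame, new_rendered_video):
    """Encode the selection frame and the hole (newly rendered) video, then
    splice the two token sequences along the time dimension to form the
    joint input X_input of the video inpainting model."""
    sel_tokens = encode_stub(selection_frame[None])   # (1, D) selection-frame tokens
    hole_tokens = encode_stub(new_rendered_video)     # (T, D) hole-video tokens
    return np.concatenate([sel_tokens, hole_tokens], axis=0)  # concat along time

# Toy data: one 16x16 RGB selection frame and a 5-frame rendered video.
frame = np.random.rand(16, 16, 3)
video = np.random.rand(5, 16, 16, 3)
x = build_joint_input(frame, video)
print(x.shape)  # (6, 8): 1 selection-frame token row + 5 hole-video token rows
```

The design point illustrated is only the splicing itself: the guide frame occupies the first time slot of the joint input, so the generative model can condition all subsequent frames on it.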
Description
Face information recognition optimization method and system

Technical Field

The invention relates to the technical field of computer vision, and in particular to a facial information recognition optimization method and system.

Background

Facial expression recognition is becoming increasingly important in medical diagnostics as artificial intelligence and deep learning techniques develop. Especially for patients who cannot actively express themselves (e.g., ICU patients), automatic analysis of facial expressions can provide critical diagnostic support for the physician. In prior-art face recognition methods, a detection camera is usually fixed on a mobile robot directly above the patient, or installed in an intensive care ward, to acquire the patient's facial information. However, when the patient turns their head, such a method, which relies on a fixed camera, can have detection blind spots and may fail to capture the patient's facial information effectively.

Disclosure of Invention

(I) Technical problems to be solved

In view of the defects of the prior art, the invention provides a facial information recognition optimization method and system, which solve the problem that conventional detection methods cannot effectively capture a patient's facial information because of detection blind spots.
(II) Technical scheme

To achieve the above purpose, the invention is realized by the following technical scheme. In a first aspect, the present invention provides a facial information recognition optimization method, comprising: acquiring a facial video of a target person and video editing content input by a user; processing the facial video through a video depth estimator to obtain a depth map corresponding to each frame of the facial video; mapping the facial video and the depth maps into a camera coordinate system to obtain a dynamic point cloud of the target person; projecting the dynamic point cloud back onto the camera plane through a perspective projection function to obtain a video-mask pair consisting of a rendered video and a visibility mask; and processing the video-mask pair through a pre-trained video inpainting model based on the video editing content to obtain the 4D facial video of the target person, wherein the video inpainting model is constructed on a diffusion model and is obtained by adaptive iterative tuning with a composite mask as the supervision signal, the composite mask comprising a point cloud mask and an editing mask.

Preferably, processing the video-mask pair through a pre-trained video inpainting model based on the video editing content to obtain the 4D facial video of the target person includes: selecting video frames and video regions based on the video editing content to obtain the corresponding selection frame and editing region; and editing the editing region in the selection frame based on the video editing content, the edited selection frame serving as a guide frame that guides the video inpainting model's generation of subsequent frames.
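The depth-to-point-cloud mapping and perspective projection described above can be sketched numerically with a standard pinhole camera model. This is a minimal sketch, not the invention's implementation: the intrinsic matrix values, image size, and constant depth map below are illustrative assumptions, and camera extrinsics are taken as identity for simplicity.

```python
import numpy as np

# Illustrative pinhole intrinsics (fx, fy, cx, cy values are assumptions).
K = np.array([[500.0,   0.0, 64.0],
              [  0.0, 500.0, 64.0],
              [  0.0,   0.0,  1.0]])

def back_project(depth, K):
    """Map a depth map (H, W) into a 3D point cloud in camera coordinates:
    each pixel ray K^{-1} [u, v, 1]^T is scaled by its depth value."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)  # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T           # normalized camera rays
    return rays * depth.reshape(-1, 1)        # point cloud, shape (H*W, 3)

def project(points, K):
    """Perspective-project 3D points back onto the camera plane; return pixel
    coordinates and a visibility mask for points in front of the camera."""
    visible = points[:, 2] > 0
    proj = points @ K.T
    uv = proj[:, :2] / proj[:, 2:3]           # perspective divide
    return uv, visible

depth = np.full((128, 128), 2.0)              # flat toy depth map: 2 units everywhere
cloud = back_project(depth, K)
uv, mask = project(cloud, K)
# Round trip: re-projected pixels match the original pixel grid exactly.
print(np.allclose(uv[:, 0], np.tile(np.arange(128), 128)))  # True
print(mask.all())  # True
```

The visibility mask here only checks that points lie in front of the camera; the invention's visibility mask additionally marks regions that become unobserved (holes) when the point cloud is rendered from a new view, which is what the inpainting model fills.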
Preferably, the method for constructing the point cloud mask includes: acquiring a training video input by a user; processing the training video to obtain a training video-mask pair consisting of a training rendered video and a training visibility mask; back-projecting the training rendered video to obtain a training dynamic point cloud; and applying an inverse transformation to the training dynamic point cloud to obtain a new video and the point cloud mask.

Preferably, the adaptive iterative tuning includes: step S210, performing one round of fine-tuning on the video inpainting model using low-rank adaptation (LoRA) based on the training video-mask pair to obtain updated low-rank parameters; step S220, processing the training video-mask pair with the low-rank parameters and the video inpainting model to obtain a 4D video; step S230, enlarging the camera view angle corresponding to the 4D video to obtain a new training camera view angle, and inferring a new training video from the new training camera view angle; step S240, processing the new training video to obtain a new training video-mask pair; and step S250, repeating steps S210-S240 until the new training camera view angle reaches the preset training camera view angle, then stopping the adaptive iterative tuning and taking the updated low-rank parameters together with the video inpainting model as the trained video inpainting model.

Preferably, the adaptive iterative tuning minimizes a spatio-temporal consistency mean squared error loss $\mathcal{L}_{\mathrm{mse}} = \frac{1}{N}\sum_{i=1}^{N}\left\| M_j \odot \left( V_j^{i} - \psi(P_i, T_i, K) \right) \right\|_2^2$, wherein $\psi(\cdot)$ represents the perspective projection function used to extrapolate the view angle; $\mathcal{L}_{\mathrm{mse}}$ represents the spatio-temporal consistency mean squared error loss; $N$ represents the number of video frames; $i$ indexes the selected video frame; $P$ represents the point cloud sequence; $T$ represents the extrinsic matrix of the camera; $K$ represents the intrinsic matrix of the camera; and $\psi(P_i, T_i, K)$ represents the rendered video frame obtained by the perspective projection function
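The LoRA fine-tuning of step S210 can be illustrated with a minimal numpy sketch: a frozen weight matrix W is adapted through a low-rank update ΔW = B·A, so only the small factors A and B need updating in each tuning round. The dimensions, scaling factor, and initialization below are illustrative assumptions, not values from the invention.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 64, 4                      # full dimension and low rank (r << d), illustrative
alpha = 8.0                       # LoRA scaling factor (assumed value)

W = rng.standard_normal((d, d))           # frozen pre-trained weight
A = rng.standard_normal((r, d)) * 0.01    # trainable down-projection
B = np.zeros((d, r))                      # trainable up-projection, zero-initialized

def effective_weight(W, A, B, alpha, r):
    """LoRA: the adapted weight is W + (alpha / r) * B @ A.
    Only A and B (2*d*r parameters) change during each tuning round,
    while W itself stays frozen."""
    return W + (alpha / r) * B @ A

W_eff = effective_weight(W, A, B, alpha, r)
print(np.allclose(W_eff, W))  # True: with B = 0 the adaptation starts as a no-op

# After a simulated tuning round B is non-zero; the update has rank <= r.
B = rng.standard_normal((d, r))
delta = effective_weight(W, A, B, alpha, r) - W
print(np.linalg.matrix_rank(delta) <= r)  # True
```

This captures why each iteration round is cheap (steps S210-S250 retune only the low-rank parameters against the newly rendered view) while the diffusion backbone stays fixed.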