CN-121842342-B - Virtual anchor real-time driving system based on facial motion capture
Abstract
The invention discloses a virtual anchor real-time driving system based on facial motion capture, aiming to solve the problems of insufficient facial-motion accuracy, poor naturalness, and weak scene adaptation of virtual anchors in the prior art. The system acquires audio and video streams in parallel on multiple threads and improves data quality through adaptive preprocessing. It then applies multi-dimensional collaborative facial feature extraction, dual-branch phoneme recognition, semantic segmentation of facial functional regions, and dynamic confidence-based weight adjustment to achieve accurate representation and collaborative fusion of expression and mouth-shape features, and finally outputs real-time driving features for the avatar through a mouth-shape and expression collaborative driving mechanism. The system effectively improves the accuracy and naturalness of the virtual anchor's facial motion, enhances adaptability to complex scenes, and balances real-time response with low-cost deployment, making it suitable for scenarios such as virtual live streaming and online education, with good application value.
Inventors
- LIU XINGZHU
Assignees
- 贵州师范大学 (Guizhou Normal University)
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-03-12
Claims (7)
- 1. A virtual anchor real-time driving system based on facial motion capture, characterized by comprising a processor and a memory, wherein the memory stores a computer program which, when executed by the processor, realizes the following steps: S1, acquiring an audio stream and a facial video stream of a user in parallel, and performing computer vision preprocessing on the facial video stream; S2, extracting visual features from the facial video stream to obtain dynamic visual features of facial key regions, visual features of the overall facial expression, and visual semantic features of the mouth-shape region; S3, performing phoneme recognition on the audio stream to obtain a phoneme sequence; S4, inputting the facial key region dynamic visual features, the overall facial expression visual features, the mouth-shape region visual semantic features, and the phoneme sequence into a facial region visual semantic segmentation and function discrimination network, and outputting a facial functional region semantic segmentation map and a visual discrimination confidence map; S5, adjusting the weights of the overall facial expression visual features through a visual smoothing algorithm based on the facial functional region semantic segmentation map and the visual discrimination confidence map, to generate driving features; S6, driving the avatar model based on the driving features, and generating and outputting a virtual anchor video stream with coordinated mouth shapes and expressions; wherein a neural network model is trained using training data with regional semantic annotation, the training data comprising speaking videos covering various phonemes and expressions together with labels manually annotated on video frames indicating whether the motion of a specific facial region mainly serves pronunciation or emotional expression; the facial region visual semantic segmentation and function discrimination network comprises a visual feature alignment layer, an improved U-Net segmentation layer, and a confidence calculation layer; the visual feature alignment layer maps the facial key region dynamic visual features, the overall facial expression visual features, the mouth-shape region visual semantic features, and the phoneme sequence, which have different dimensions, into the same feature space; the improved U-Net segmentation layer outputs a functional attribute label for each region through an encoding-decoding structure; the confidence calculation layer computes a visual confidence for each region's functional attribute label based on an IoU loss function, and generates the facial functional region semantic segmentation map and the visual discrimination confidence map; the facial functional region semantic segmentation map characterizes, for each small region into which the face is divided, the probability distribution of that region being judged an expression-dominant region or a mouth-shape-dominant region at the current moment; the visual discrimination confidence map characterizes, for conflict detection, how confident the facial region visual semantic segmentation and function discrimination network is in each region judgment of the facial functional region semantic segmentation map; the facial functional region semantic segmentation map is expressed in matrix form, each element of the matrix corresponding to one predefined small patch region of the face, and the value of each element representing the probability that the corresponding predefined patch region belongs to the expression-dominant region (a network sketch follows the claims).
- 2. The virtual anchor real-time driving system based on facial motion capture as set forth in claim 1, wherein the computer vision preprocessing comprises: S101, suppressing intra-frame noise in the facial video stream using a Gaussian bilateral filtering algorithm, and performing illumination compensation through adaptive histogram equalization; S102, locating and cropping the facial ROI region based on the MTCNN algorithm, and extracting RGB three-channel pixel data of the ROI region; S103, normalizing the RGB three-channel pixel data into video frames of a fixed size, which serve as the input data of the visual feature extraction step (a preprocessing sketch follows the claims).
- 3. The virtual anchor real-time driving system based on facial motion capture of claim 1, wherein the process of generating the driving features comprises: S501, combining the probability value of each region belonging to the expression-dominant region in the facial functional region semantic segmentation map with the confidence value of the corresponding region in the visual discrimination confidence map, to obtain a fusion weight coefficient for each region; S502, performing spatial alignment and feature slicing on the overall facial expression visual features according to the division of the facial regions, to obtain a local expression feature vector corresponding to each region; S503, multiplying the local expression feature vector of each region by the fusion weight coefficient of that region to obtain weighted local expression features; S504, aggregating the weighted local expression features of all regions to generate the driving features (a fusion sketch follows the claims).
- 4. The virtual anchor real-time driving system based on facial motion capture as set forth in claim 1, wherein the process of obtaining the phoneme sequence comprises: S301, framing the audio stream and performing feature extraction to obtain an audio feature sequence; S302, inputting the audio feature sequence into a phoneme recognition model and outputting the phoneme sequence, wherein the phoneme sequence represents phoneme categories arranged in temporal order together with their time boundary information (an audio feature sketch follows the claims).
- 5. The virtual anchor real-time driving system based on facial motion capture as recited in claim 4, wherein the phoneme recognition model has a dual-branch encoder structure comprising: a first encoding branch for recognizing and outputting the phoneme sequence; and a second encoding branch, sharing part of the underlying features with the first encoding branch, for extracting generalized pronunciation action features related to facial muscle movements; the generalized pronunciation action features and the phoneme sequence serve as inputs to the facial region visual semantic segmentation and function discrimination network in S4 (an encoder sketch follows the claims).
- 6. The virtual anchor real-time driving system based on facial motion capture as recited in claim 1, wherein the visual feature extraction process comprises: S201, processing each frame of the facial video stream with a face detection and alignment model to obtain a standardized facial image; S202, inputting the facial image into an expression feature encoder and extracting the overall facial expression features; S203, cropping the facial key region dynamic visual features, the overall facial expression visual features, and the mouth-shape region visual semantic features from the facial image according to the coordinates of the facial key points (a cropping sketch follows the claims).
- 7. The virtual anchor real-time driving system based on facial motion capture of claim 1, wherein the mouth-shape region visual semantic features and the phoneme sequence are input into a mouth-shape driving generation network to generate mouth-shape driving parameters synchronized with pronunciation, and the driving features and the mouth-shape driving parameters are combined to jointly drive the avatar model (a combination sketch follows the claims).
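Illustrative sketches

A minimal sketch of the claim 1 segmentation and discrimination network, assuming PyTorch and a face divided into an 8x8 grid of predefined patches. The class name FaceFunctionNet, all layer sizes, and the choice to pool the phoneme sequence into a fixed vector before alignment are illustrative assumptions, not patent text.

```python
import torch
import torch.nn as nn

class FaceFunctionNet(nn.Module):
    """Facial region visual semantic segmentation and function discrimination."""

    def __init__(self, d_key=128, d_expr=256, d_mouth=128, d_phon=64,
                 d=64, grid=8):
        super().__init__()
        self.grid = grid
        # Visual feature alignment layer: project the four modalities, which
        # have different dimensions, into one shared d-dimensional space.
        self.align = nn.ModuleList(
            [nn.Linear(d_in, d) for d_in in (d_key, d_expr, d_mouth, d_phon)])
        # Heavily reduced U-Net-style encoding-decoding structure over the
        # grid of predefined face patches.
        self.enc = nn.Sequential(nn.Conv2d(4 * d, 128, 3, padding=1),
                                 nn.ReLU(), nn.MaxPool2d(2))
        self.dec = nn.Sequential(nn.Upsample(scale_factor=2),
                                 nn.Conv2d(128, 64, 3, padding=1), nn.ReLU())
        # 1x1 heads: per-patch probability of being expression-dominant, and
        # a per-patch confidence for that judgment.
        self.seg_head = nn.Conv2d(64, 1, 1)
        self.conf_head = nn.Conv2d(64, 1, 1)

    def forward(self, f_key, f_expr, f_mouth, f_phon):
        # f_phon is assumed already pooled from the phoneme sequence into a
        # fixed-length vector; each aligned feature is broadcast over the grid.
        feats = [a(f) for a, f in zip(self.align, (f_key, f_expr, f_mouth, f_phon))]
        x = torch.cat(feats, dim=-1)[:, :, None, None]
        x = x.expand(-1, -1, self.grid, self.grid)
        h = self.dec(self.enc(x))
        seg_map = torch.sigmoid(self.seg_head(h))    # segmentation map (matrix)
        conf_map = torch.sigmoid(self.conf_head(h))  # discrimination confidence
        return seg_map, conf_map
```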
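A hedged sketch of the claim 2 preprocessing chain (S101 to S103), assuming OpenCV; the face box is assumed to come from an MTCNN detector such as the one in facenet-pytorch, and the 224x224 target size is an illustrative choice.

```python
import cv2
import numpy as np

def preprocess_frame(frame_bgr, face_box, size=224):
    # S101: bilateral filtering suppresses intra-frame noise while keeping edges.
    denoised = cv2.bilateralFilter(frame_bgr, d=9, sigmaColor=75, sigmaSpace=75)
    # S101: adaptive histogram equalization (CLAHE) on the L channel for
    # illumination compensation.
    lab = cv2.cvtColor(denoised, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    compensated = cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)),
                               cv2.COLOR_LAB2BGR)
    # S102: crop the facial ROI located by the MTCNN detector.
    x1, y1, x2, y2 = [int(v) for v in face_box]
    roi = compensated[max(y1, 0):y2, max(x1, 0):x2]
    # S103: normalize to a fixed-size RGB frame in [0, 1].
    roi = cv2.resize(roi, (size, size))
    rgb = cv2.cvtColor(roi, cv2.COLOR_BGR2RGB)
    return rgb.astype(np.float32) / 255.0
```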
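A minimal numpy sketch of the claim 3 fusion (S501 to S504). Using an element-wise product as the combining operation in S501 and a sum as the aggregation in S504 are assumptions; the claim only states that the values are combined and the weighted features aggregated.

```python
import numpy as np

def fuse_driving_features(seg_map, conf_map, expr_features):
    """seg_map, conf_map: (R,) per-region values; expr_features: (R, D)."""
    # S501: fusion weight per region = expression-dominance probability
    # combined with the discrimination confidence of that region.
    weights = seg_map * conf_map                 # (R,)
    # S503: weight each region's local expression feature vector (the
    # alignment and slicing of S502 are assumed done upstream).
    weighted = expr_features * weights[:, None]  # (R, D)
    # S504: aggregate all regions into a single driving feature vector.
    return weighted.sum(axis=0)                  # (D,)
```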
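A sketch of the claim 4 framing and feature extraction (S301), assuming librosa and MFCC features; the claim does not name a feature type, so MFCCs with a 25 ms window and 10 ms hop are illustrative choices. The S302 recognition model itself is sketched below for claim 5.

```python
import librosa

def audio_features(wav, sr=16000, frame_ms=25, hop_ms=10):
    """S301: frame the audio stream and extract an MFCC feature sequence."""
    n_fft = int(sr * frame_ms / 1000)   # 25 ms analysis window
    hop = int(sr * hop_ms / 1000)       # 10 ms frame shift
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop)
    return mfcc.T                       # (num_frames, 13), fed to S302

# S302's output, per claim 4, is phoneme categories in temporal order with
# time boundaries, e.g. [("sil", 0.00, 0.12), ("AH", 0.12, 0.21), ...]
```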
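A sketch of the claim 5 dual-branch encoder: both branches share a lower trunk; one emits per-frame phoneme logits, the other a generalized pronunciation action feature related to facial muscle movement. The GRU trunk and all sizes (13 input features, 48 phoneme classes, a 64-dimensional action feature) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualBranchPhonemeEncoder(nn.Module):
    def __init__(self, n_feat=13, n_phonemes=48, d_action=64, d_hidden=128):
        super().__init__()
        # Shared underlying feature extractor (the "shared part" of claim 5).
        self.trunk = nn.GRU(n_feat, d_hidden, batch_first=True)
        # First branch: per-frame phoneme classification.
        self.phoneme_head = nn.Linear(d_hidden, n_phonemes)
        # Second branch: generalized pronunciation action features.
        self.action_head = nn.Sequential(nn.Linear(d_hidden, d_action),
                                         nn.Tanh())

    def forward(self, feats):            # feats: (B, T, n_feat)
        h, _ = self.trunk(feats)         # (B, T, d_hidden)
        return self.phoneme_head(h), self.action_head(h)
```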
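A sketch of the claim 6 region cropping (S203), assuming 68-point landmarks in dlib's ordering, where indices 48 to 67 outline the mouth; the margin and the landmark convention are assumptions, and only the mouth-shape region is shown for brevity.

```python
import numpy as np

def crop_mouth_region(face_img, landmarks, margin=8):
    """face_img: (H, W, 3); landmarks: (68, 2) array of (x, y) key points."""
    mouth = landmarks[48:68]
    x1, y1 = mouth.min(axis=0).astype(int) - margin
    x2, y2 = mouth.max(axis=0).astype(int) + margin
    h, w = face_img.shape[:2]
    # Clamp to the image bounds before slicing out the region.
    return face_img[max(y1, 0):min(y2, h), max(x1, 0):min(x2, w)]
```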
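A minimal sketch of the claim 7 combination: mouth-shape driving parameters generated from the mouth-region visual features and the phoneme sequence are joined with the expression driving features before driving the avatar. Concatenation is an assumed combining operation, and mouth_drive_net is a hypothetical stand-in for the claimed mouth-shape driving generation network.

```python
import numpy as np

def combine_driving(driving_features, mouth_visual, phoneme_seq, mouth_drive_net):
    # Generate lip-sync parameters synchronized with pronunciation.
    mouth_params = mouth_drive_net(mouth_visual, phoneme_seq)
    # Joint vector handed to the avatar model for coordinated driving.
    return np.concatenate([driving_features, mouth_params])
```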
Description
Virtual anchor real-time driving system based on facial motion capture

Technical Field

The invention relates to the field of computer vision, and in particular to a virtual anchor real-time driving system based on facial motion capture.

Background

With the rapid development of the digital economy, virtual anchors serve as a core carrier of human-machine interaction, and their technical maturity directly determines user experience and commercial viability. Real-time virtual anchor driving based on facial motion capture is gradually replacing traditional optical-marker motion capture and has become the mainstream technical route. Its core principle is to extract facial features and pronunciation-related information from the facial video streams and audio streams of a real person, thereby driving an avatar to produce synchronized expressions and mouth-shape actions. Mainstream industry schemes mostly collect data with monocular or depth cameras and integrate feature extraction, phoneme recognition, and avatar driving through deep learning algorithms; for example, the monocular-camera real-time expression driving schemes proposed by enterprises such as Tencent and Baidu have seen initial deployment in scenarios such as live-stream e-commerce and virtual customer service. However, investigation shows that existing virtual anchor real-time driving technology based on facial motion capture still suffers from several technical bottlenecks. Specifically, in existing avatar driving technology, multi-modal fusion relies on static rules to reconcile conflicts between expression and mouth shape, with poor results. Current computer vision techniques lack dynamic adaptation in recognizing the functions of facial regions, and struggle to distinguish, in real time, the emotional expression attributes from the pronunciation action attributes of each facial region during speech; as a result, functional confusion arises when visual and audio features are fused, degrading the naturalness of the driving.

Disclosure of Invention

The invention overcomes the defects of the prior art and provides a virtual anchor real-time driving system based on facial motion capture.
In order to achieve the above purpose, the technical scheme adopted by the invention is a virtual anchor real-time driving system based on facial motion capture, comprising a processor and a memory, wherein the memory stores a computer program which, when executed by the processor, realizes the following steps: S1, acquiring an audio stream and a facial video stream of a user in parallel (a parallel-acquisition sketch follows this passage), and performing computer vision preprocessing on the facial video stream; S2, extracting visual features from the facial video stream to obtain dynamic visual features of facial key regions, visual features of the overall facial expression, and visual semantic features of the mouth-shape region; S3, performing phoneme recognition on the audio stream to obtain a phoneme sequence; S4, inputting the facial key region dynamic visual features, the overall facial expression visual features, the mouth-shape region visual semantic features, and the phoneme sequence into a facial region visual semantic segmentation and function discrimination network, and outputting a facial functional region semantic segmentation map and a visual discrimination confidence map; S5, adjusting the weights of the overall facial expression visual features through a visual smoothing algorithm based on the facial functional region semantic segmentation map and the visual discrimination confidence map, to generate driving features; and S6, driving the avatar model based on the driving features, and generating and outputting a virtual anchor video stream with coordinated mouth shapes and expressions.

In a preferred embodiment of the present invention, the computer vision preprocessing includes: S101, suppressing intra-frame noise in the facial video stream using a Gaussian bilateral filtering algorithm, and performing illumination compensation through adaptive histogram equalization; S102, locating and cropping the facial ROI region based on the MTCNN algorithm, and extracting RGB three-channel pixel data of the ROI region; S103, normalizing the RGB three-channel pixel data into video frames of a fixed size, which serve as the input data of the visual feature extraction step.

In a preferred embodiment of the invention, a neural network model is trained using training data with regional semantic annotation, wherein the training data comprises speaking videos covering various phonemes and expressions together with labels manually annotated on video frames indicating whether the motion of a specific facial region is mainly used for pronunciation or for emotional expression.
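A sketch of the S1 parallel acquisition described above, assuming OpenCV for video capture and the sounddevice library for audio; two daemon threads push frames and 100 ms audio blocks into bounded queues that the downstream steps (S2, S3) would consume. Library choices and queue sizes are illustrative.

```python
import queue
import threading

import cv2
import sounddevice as sd

# Bounded queues decouple capture from the downstream S2/S3 consumers.
video_q = queue.Queue(maxsize=64)
audio_q = queue.Queue(maxsize=64)

def capture_video(device=0):
    cap = cv2.VideoCapture(device)
    while cap.isOpened():
        ok, frame = cap.read()
        if ok:
            video_q.put(frame)          # blocks if the consumer falls behind

def capture_audio(sr=16000, block=1600):           # 100 ms blocks at 16 kHz
    def callback(indata, frames, time, status):
        audio_q.put(indata.copy())
    with sd.InputStream(samplerate=sr, blocksize=block,
                        channels=1, callback=callback):
        threading.Event().wait()        # keep the stream open indefinitely

threading.Thread(target=capture_video, daemon=True).start()
threading.Thread(target=capture_audio, daemon=True).start()
```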