CN-121999100-A - Multi-modal expression generating system and dynamic optimization method under virtual-real fusion scene
Abstract
The invention belongs to the technical field of computer graphics and human-machine interaction, and discloses a multi-modal expression generating system and a dynamic optimization method for virtual-real fusion scenes. The method analyzes audio prosody to generate a phase reference signal that drives visual acquisition and computation, achieving audio-visual synchronization in the microscopic time domain; it computes a dense optical flow field, decouples facial muscle movement change vectors using muscle movement direction templates, and maps audio acoustic features into audio driving values that represent vocal intensity. By comparing the total visual displacement with the audio driving energy, energy-conserving attenuation or compensation enhancement is applied to the expression vectors to produce corrected expression vectors. Finally, the corrected expression change vectors are superimposed on the state parameters of the previous frame, and a control instruction stream is generated to drive real-time rendering of the virtual avatar. The invention solves the problems of audio-visual timing misalignment and missing physical constraints, and improves the realism and expressiveness of virtual expressions.
Inventors
- CHEN KUN
- GAO XINLIANG
- WANG XING
Assignees
- 深圳星火互娱数字科技有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20260114
Claims (10)
- 1. A multi-modal expression generating system in a virtual-real fusion scene, characterized by comprising: a prosody reference module, which acquires a digital audio stream, performs prosody analysis on the digital audio stream to extract periodic features, and generates an audio prosody phase reference signal comprising a time window and sampling trigger points; a visual acquisition module, which synchronously acquires original face video frames aligned with the digital audio stream and divides the face region in each original face video frame into a plurality of expression-function micro-regions based on facial anatomy; a feature calculation module, which calculates a dense optical flow vector field in the corresponding expression-function micro-region according to the sampling trigger points, performs vector dot-product operations between the dense optical flow vector field and a preset muscle movement direction template, and decouples and extracts facial muscle movement change vectors; a driving energy calculation module, which analyzes the digital audio stream to extract acoustic feature parameters, performs normalization based on the statistical distribution of the acoustic feature parameters within a dynamic time window, and maps the acoustic feature parameters into an audio driving value; an energy correction module, which compares the total geometric displacement represented by the facial muscle movement change vectors with the audio driving value, performs attenuation or compensation enhancement processing on the facial muscle movement change vectors according to the comparison result, and generates corrected expression change vectors; and an expression rendering module, which acquires the expression state parameters of the previous rendering frame of the virtual avatar, superimposes the corrected expression change vectors on the expression state parameters of the previous rendering frame, applies numerical boundary constraints to generate final expression state parameters, and converts the final expression state parameters into a control instruction stream to drive real-time rendering output of the virtual avatar's expression.
- 2. The system for generating a multi-modal expression in a virtual-real fusion scene according to claim 1, wherein performing prosody analysis on the digital audio stream to extract periodic features and generating an audio prosody phase reference signal comprising a time window and sampling trigger points comprises: dividing the digital audio stream into short-time frames having a preset length and overlap rate; calculating the short-time energy of each short-time frame to form a short-time energy sequence, and identifying voice-active segments and silence segments from the short-time energy sequence; taking the time interval between the starting points of adjacent voice-active segments as a periodic speech-rate feature, and, when the time interval exceeds a preset speech-rate decision threshold, taking the length of the last valid voice-active segment as a modified periodic speech-rate feature; dividing the digital audio stream into corresponding time windows based on the boundaries of the periodic speech-rate feature and the boundaries of the periodic pause feature; for a time window determined by the periodic speech-rate feature, defining sampling trigger points at uniform intervals according to a preset sampling density factor, the interval duration being the speech-rate period divided by the preset sampling density factor; for a time window determined by the periodic pause feature, defining a sampling trigger point at each of the start and end times of the pause; and integrating all time windows and the sampling trigger point information within them into structured data serving as the audio prosody phase reference signal.
- 3. The system for generating a multi-modal expression in a virtual-real fusion scene according to claim 1, wherein dividing the face region in the original face video frame into a plurality of expression-function micro-regions based on facial anatomy comprises: applying a preset face detection model to the original face video frame to obtain a face bounding box; cropping a face region image according to the face bounding box; applying a preset facial key point detection model to the face region image to obtain a set of facial feature point coordinates; and, according to a preset connection rule based on facial muscle anatomy, connecting the facial feature point coordinates to form a plurality of polygonal sub-regions, each polygonal sub-region being defined as an expression-function micro-region.
- 4. The multi-modal expression generating system in a virtual-real fusion scene according to claim 3, wherein calculating a dense optical flow vector field in the corresponding expression-function micro-region according to the sampling trigger points comprises: for the current sampling trigger point, acquiring the current face video frame and its expression-function micro-regions according to the timestamp of the current sampling trigger point; acquiring the video frame corresponding to the previous sampling trigger point as a reference frame; for each expression-function micro-region, extracting the corresponding image regions from the reference frame and the current frame; and, for each pixel in each expression-function micro-region, applying a preset optical flow estimation algorithm to calculate a two-dimensional displacement vector from the reference frame to the current frame, thereby generating the dense optical flow vector field of that expression-function micro-region.
- 5. The system for generating a multi-modal expression in a virtual-real fusion scene according to claim 4, wherein decoupling and extracting the facial muscle movement change vectors comprises: calculating, based on a vector dot-product formula, the collinearity and orthogonality between the dense optical flow vector field at each pixel and a muscle movement direction template, wherein the muscle movement direction template is a unit vector field preset for each expression-function micro-region that describes the theoretical movement direction of the skin surface when the muscles of that micro-region contract; comparing the collinearity of each pixel with a preset first threshold, and comparing its orthogonality with a preset second threshold; if the collinearity of a pixel is higher than the first threshold and its orthogonality is lower than the second threshold, determining that the motion of that pixel is caused by facial muscle movement and adding its optical flow vector to a valid set; and performing a weighted average over all optical flow vectors in the valid set of each expression-function micro-region to obtain the facial muscle movement change vector of that micro-region.
- 6. The system for generating a multi-modal expression in a virtual-real fusion scene according to claim 1, wherein mapping the acoustic feature parameters into an audio driving value comprises: extracting the short-time volume and short-time pitch of the digital audio stream over the corresponding time period; acquiring the volume extremum and the pitch average of the sound signal within a sliding window of preset duration before the current moment, and normalizing the short-time volume and short-time pitch against them respectively to obtain a normalized volume intensity sequence and a normalized pitch change rate sequence; linearly combining the normalized volume intensity sequence and the normalized pitch change rate sequence according to preset weighting coefficients to obtain a preliminary energy scalar sequence; and summing the preliminary energy scalar sequence over the time period and multiplying the sum by a preset dimension-adaptive scaling factor to obtain the audio driving value.
- 7. The system for generating a multi-modal expression in a virtual-real fusion scene according to claim 6, wherein performing attenuation or compensation enhancement processing on the facial muscle movement change vectors according to the comparison result to generate corrected expression change vectors comprises: if the total geometric displacement exceeds the product of the audio driving value and a preset upper-limit proportionality coefficient, calculating a unified attenuation factor according to the ratio of the audio driving value to the preset upper-limit proportionality coefficient, and multiplying all facial muscle movement change vectors by the attenuation factor; if the total geometric displacement is lower than the product of the audio driving value and a preset lower-limit proportionality coefficient, calculating the residual energy budget; allocating the residual energy budget to the facial muscle movement change vectors of the respective expression-function micro-regions according to a preset allocation weight table; and, based on the energy value assigned to each micro-region and the direction of its facial muscle movement change vector, generating an enhancement vector and adding it to the original vector.
- 8. The system of claim 7, wherein superimposing the corrected expression change vectors onto the expression state parameters of the previous rendering frame and applying numerical boundary constraints to generate the final expression state parameters comprises: according to a preset first mapping table, mapping the corrected expression change vector of each expression-function micro-region into incremental transformation parameters of the corresponding facial bones or weight increments of the corresponding blend shapes; if the avatar uses a skeletal animation model, performing a matrix or quaternion multiplication between the incremental transformation parameters and the previous-frame transformation state of each bone to generate the current-frame bone transformation; if the avatar uses a blend shape model, adding the weight increments to the previous-frame blend shape weights to generate the current-frame blend shape weights, and truncating the current-frame blend shape weights to the [0,1] interval to prevent accumulated drift; and integrating all updated bone transformations or blend shape weights to form the final expression state parameters.
- 9. The system for generating a multi-modal expression in a virtual-real fusion scene according to claim 1, wherein converting the final expression state parameters into a control instruction stream to drive real-time rendering output of the virtual avatar's expression comprises: after generating the final expression state parameters, acquiring the sequence of final expression state parameters of the historical frames within a preset time window; applying Kalman filtering to the historical sequence of final expression state parameters to predict the smoothed expression state parameters of the current frame; according to a preset second mapping table, associating the audio driving value with the virtual avatar style tag selected by the current user to generate a style enhancement coefficient; and performing a weighted fusion of the smoothed expression state parameters with the corrected expression change vectors adjusted by the style enhancement coefficient to generate stylized expression state parameters, the stylized expression state parameters replacing the final expression state parameters when generating the control instruction stream.
- 10. A dynamic optimization method performed by the multi-modal expression generation system in a virtual-real fusion scene according to any one of claims 1-9, comprising: S1, acquiring a digital audio stream, performing prosody analysis on the digital audio stream to extract periodic features, and generating an audio prosody phase reference signal comprising a time window and sampling trigger points; S2, synchronously acquiring original face video frames aligned with the digital audio stream, and dividing the face region in each original face video frame into a plurality of expression-function micro-regions based on facial anatomy; S3, calculating a dense optical flow vector field in the corresponding expression-function micro-region according to the sampling trigger points, performing vector dot-product operations between the dense optical flow vector field and a preset muscle movement direction template, and decoupling and extracting facial muscle movement change vectors; S4, analyzing the digital audio stream to extract acoustic feature parameters, performing normalization based on the statistical distribution of the acoustic feature parameters within a dynamic time window, and mapping the acoustic feature parameters into an audio driving value; S5, comparing the total geometric displacement represented by the facial muscle movement change vectors with the audio driving value, and performing attenuation or compensation enhancement processing on the facial muscle movement change vectors according to the comparison result to generate corrected expression change vectors; S6, acquiring the expression state parameters of the previous rendering frame of the virtual avatar, superimposing the corrected expression change vectors on the expression state parameters of the previous rendering frame, applying numerical boundary constraints to generate final expression state parameters, and converting the final expression state parameters into a control instruction stream to drive real-time rendering output of the virtual avatar's expression.
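The prosody analysis of claim 2 can be pictured with the minimal Python sketch below. The frame length, overlap rate, energy threshold, pause threshold and sampling density factor are illustrative assumptions, not values taken from the patent; the function simply returns the structured "audio prosody phase reference signal" as a list of window/trigger dictionaries.

```python
import numpy as np

def prosody_reference(audio, sr, frame_len=0.025, overlap=0.5,
                      energy_ratio=0.05, pause_thresh=0.8, density=4):
    """Sketch of claim 2: short-time energy -> voice-active segments ->
    time windows with sampling trigger points (all parameters illustrative)."""
    win = int(frame_len * sr)
    hop = max(1, int(win * (1.0 - overlap)))
    # Short-time energy over overlapping frames.
    energy = np.array([float(np.mean(audio[i:i + win] ** 2))
                       for i in range(0, len(audio) - win, hop)])
    active = energy > energy_ratio * energy.max()

    # Start times (seconds) of voice-active segments.
    act = np.concatenate(([0], active.astype(int)))
    starts = np.flatnonzero(np.diff(act) == 1) * hop / sr

    signal = []
    for t0, t1 in zip(starts[:-1], starts[1:]):
        period = t1 - t0
        if period > pause_thresh:
            # Long gap between segment starts: treat as a pause window,
            # trigger at its start and end times.
            signal.append({"window": (t0, t1), "triggers": [t0, t1]})
        else:
            # Speech-rate window: uniform triggers spaced period / density apart.
            step = period / density
            signal.append({"window": (t0, t1),
                           "triggers": [t0 + k * step for k in range(density)]})
    return signal
```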
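For the micro-region partition of claim 3, the following sketch assumes the facial feature point coordinates have already been produced by some key point detector and are given as an (N, 2) array; the connection-rule table and its landmark indices are placeholders (they loosely follow a common 68-point layout but are not specified by the patent), and OpenCV is assumed only for polygon rasterization.

```python
import numpy as np
import cv2  # assumed dependency, used only to rasterize polygon masks

# Hypothetical connection rule: micro-region name -> landmark indices forming a polygon.
CONNECTION_RULE = {
    "left_brow":  [17, 18, 19, 20, 21],
    "right_brow": [22, 23, 24, 25, 26],
    "mouth":      [48, 54, 57, 51],
}

def micro_region_masks(landmarks, frame_shape, rule=CONNECTION_RULE):
    """Sketch of claim 3: connect landmark coordinates into polygonal
    expression-function micro-regions and rasterize one binary mask each."""
    h, w = frame_shape[:2]
    masks = {}
    for name, idx in rule.items():
        poly = np.round(landmarks[idx]).astype(np.int32)   # (K, 2) pixel coords
        mask = np.zeros((h, w), dtype=np.uint8)
        cv2.fillPoly(mask, [poly], 1)                       # fill polygon interior
        masks[name] = mask.astype(bool)
    return masks
```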
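Claims 4 and 5 (dense optical flow plus dot-product decoupling) can be sketched as follows. OpenCV's Farneback estimator stands in for the unspecified "preset optical flow estimation algorithm", the direction template is assumed to be an (H, W, 2) unit vector field, and both thresholds are illustrative.

```python
import numpy as np
import cv2  # Farneback flow as a stand-in for the preset optical flow algorithm

def muscle_motion_vector(prev_gray, curr_gray, region_mask, direction_template,
                         collinear_thresh=0.7, orthogonal_thresh=0.5):
    """Sketch of claims 4-5: dense flow in one micro-region, decoupled against
    a unit direction template, then magnitude-weighted averaged."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)   # (H, W, 2)
    ys, xs = np.nonzero(region_mask)
    v = flow[ys, xs]                          # per-pixel displacement vectors
    d = direction_template[ys, xs]            # per-pixel unit direction vectors

    norm = np.linalg.norm(v, axis=1) + 1e-8
    collinear = np.sum(v * d, axis=1) / norm                     # cos of angle to template
    orthogonal = np.abs(v[:, 0] * d[:, 1] - v[:, 1] * d[:, 0]) / norm  # |sin| of angle

    valid = (collinear > collinear_thresh) & (orthogonal < orthogonal_thresh)
    if not np.any(valid):
        return np.zeros(2)
    # Weight valid flow vectors by magnitude and average them.
    return np.average(v[valid], axis=0, weights=norm[valid])
```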
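The audio driving value of claim 6 might be computed as in the sketch below, assuming RMS as the short-time volume, a crude autocorrelation estimate as the short-time pitch, and illustrative weighting coefficients and scaling factor.

```python
import numpy as np

def audio_driving_value(frames, hist_volume, hist_pitch, sr,
                        w_vol=0.7, w_pitch=0.3, scale=1.0):
    """Sketch of claim 6. `frames` holds the short-time sample arrays of the
    current time window; `hist_volume` / `hist_pitch` are values observed in the
    preceding sliding window. Weights and `scale` are illustrative."""
    def pitch(x):
        # Crude autocorrelation pitch estimate in Hz (a stand-in only).
        ac = np.correlate(x, x, mode="full")[len(x):]
        lag_min, lag_max = int(sr / 400), int(sr / 60)
        lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
        return sr / lag

    vol = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])   # short-time volume
    pit = np.array([pitch(f) for f in frames])                   # short-time pitch

    # Normalize against statistics of the preceding sliding window.
    vol_n = vol / (np.max(hist_volume) + 1e-8)
    pit_rate = np.abs(pit - np.mean(hist_pitch)) / (np.mean(hist_pitch) + 1e-8)

    prelim = w_vol * vol_n + w_pitch * pit_rate    # preliminary energy scalar sequence
    return float(np.sum(prelim) * scale)           # audio driving value for the window
```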
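The energy-conservation correction of claim 7 is sketched below under one plausible reading: when visual displacement exceeds the audio-justified upper band the vectors are attenuated back onto that band, and when it falls below the lower band the residual budget is distributed along each vector's direction. The band coefficients and the uniform allocation table are assumptions.

```python
import numpy as np

def energy_correct(vectors, audio_drive, upper=1.2, lower=0.6, weights=None):
    """Sketch of claim 7. `vectors` maps micro-region name -> 2-D motion vector."""
    total_disp = sum(np.linalg.norm(v) for v in vectors.values())
    if weights is None:
        weights = {k: 1.0 / len(vectors) for k in vectors}   # uniform allocation table

    if total_disp > audio_drive * upper:
        # Visual motion exceeds what the audio energy can justify: attenuate uniformly.
        factor = (audio_drive * upper) / (total_disp + 1e-8)
        return {k: v * factor for k, v in vectors.items()}

    if total_disp < audio_drive * lower:
        # Visual motion falls short: distribute the residual energy budget.
        budget = audio_drive * lower - total_disp
        corrected = {}
        for k, v in vectors.items():
            direction = v / (np.linalg.norm(v) + 1e-8)
            corrected[k] = v + direction * budget * weights[k]   # enhancement vector
        return corrected

    return dict(vectors)   # within band: no correction needed
```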
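For the blend-shape branch of claim 8, a minimal sketch of superimposing the corrected vectors onto the previous frame's weights with [0, 1] truncation is given below. The blend shape names, the first mapping table and the gain are all hypothetical.

```python
import numpy as np

# Hypothetical first mapping table: micro-region -> affected blend shape names.
DELTA_MAP = {"mouth": ["jawOpen", "mouthSmile"], "left_brow": ["browUpLeft"]}

def update_blendshapes(prev_weights, corrected_vectors, delta_map=DELTA_MAP, gain=0.05):
    """Sketch of claim 8: map each corrected micro-region vector to weight
    increments, add them to the previous-frame weights and truncate to [0, 1]
    to prevent accumulated drift."""
    weights = dict(prev_weights)
    for region, vec in corrected_vectors.items():
        increment = gain * float(np.linalg.norm(vec))   # scalar weight increment
        for shape in delta_map.get(region, []):
            weights[shape] = float(np.clip(weights.get(shape, 0.0) + increment, 0.0, 1.0))
    return weights   # part of the final expression state parameters
```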
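Claim 9's temporal smoothing and stylization can be illustrated with a per-parameter 1-D Kalman filter and a simple weighted fusion. The noise values, the `style_gain` (standing in for the coefficient looked up from the second mapping table) and the blending weight are assumptions.

```python
class ScalarKalman:
    """Minimal 1-D Kalman filter for smoothing one expression parameter over
    the historical frames of claim 9 (noise values illustrative)."""
    def __init__(self, q=1e-3, r=1e-2):
        self.x, self.p, self.q, self.r = 0.0, 1.0, q, r

    def step(self, z):
        self.p += self.q                      # predict
        k = self.p / (self.p + self.r)        # Kalman gain
        self.x += k * (z - self.x)            # update with measurement z
        self.p *= (1.0 - k)
        return self.x

def stylized_parameters(history, audio_drive, style_gain, corrected_delta, blend=0.5):
    """Sketch of claim 9: Kalman-smooth each parameter over `history` (a list of
    per-frame parameter dicts), then fuse with the style-scaled corrected delta."""
    smoothed = {}
    for name in history[-1]:
        kf = ScalarKalman()
        for frame in history:
            smoothed[name] = kf.step(frame.get(name, 0.0))
    out = {}
    for name, s in smoothed.items():
        styled = style_gain * audio_drive * corrected_delta.get(name, 0.0)
        out[name] = (1.0 - blend) * s + blend * (s + styled)   # weighted fusion
    return out
```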
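Finally, the S1-S6 loop of claim 10 can be wired together as below. The module callables, the renderer and the frame-rate handling are all assumed names used purely to show the control flow, not an implementation of the patented method.

```python
def drive_avatar(signal, frames, landmarks, modules, prev_state, renderer, fps=30):
    """End-to-end sketch of the dynamic optimization method (claim 10, S1-S6).
    `modules` is a dict of stand-in callables for the claimed modules;
    `renderer` consumes the resulting control instruction stream."""
    state = dict(prev_state)
    for window in signal:                                   # S1: prosody reference signal
        for t in window["triggers"]:
            i = min(int(t * fps), len(frames) - 1)          # video frame at this trigger
            masks = modules["partition"](landmarks[i], frames[i].shape)        # S2
            vectors = modules["flow_decouple"](frames[i - 1], frames[i], masks)  # S3
            drive = modules["audio_energy"](window)                            # S4
            corrected = modules["energy_correct"](vectors, drive)              # S5
            state = modules["superimpose"](state, corrected)                   # S6
            renderer(state)                                 # control instruction stream
    return state
```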
Description
Multi-modal expression generating system and dynamic optimization method under virtual-real fusion scene

Technical Field
The invention belongs to the technical field of computer graphics and human-machine interaction, and relates to a multi-modal expression generating system and a dynamic optimization method for virtual-real fusion scenes.

Background
In virtual-real fusion application scenarios such as metaverse social networking, virtual streaming, remote conferencing and digital-human customer service, driving a virtual avatar to produce vivid, natural expressions in real time is a key factor in user immersion. Existing expression driving techniques generally fall into pure audio driving and pure visual driving. Audio-driven schemes focus mainly on lip synchronization, predicting lip motion by analyzing the speech signal, but they struggle to infer emotional changes in non-mouth areas such as the eyebrows, so the avatar's face appears stiff and fails to match the voice. Visual-driven schemes rely on a camera to capture the user's facial motion, which is mapped directly onto the virtual model through facial key point detection or face capture algorithms. In the prior art, for example, Chinese patent publication No. CN120318890A discloses an expression generating method and system based on intelligent facial action units: an audio signal is first processed into discrete linguistic symbols such as phoneme codes, and these symbols are then fed into a pre-trained expression predictor to generate facial action unit parameters. The focus of this approach is to establish a semantic mapping between the speech content and the facial expression. Another Chinese patent publication, No. CN120259501A, discloses a method for generating high-fidelity audio-driven character expressions based on a multi-modal large-model feedback mechanism, which uses a deep neural network to extract a multi-dimensional representation directly from the acoustic features of the audio and drives a three-dimensional facial geometric model to generate an initial expression. To compensate for the shortcomings of the initial result in realism and naturalness, the method further introduces a powerful multi-modal large model as an external feedback mechanism and iteratively optimizes the generated expression sequence to improve the fidelity of the final effect. It can be seen that industry exploration of high-fidelity audio-driven expression generation relies mainly on data-driven mapping models, improving generation quality by introducing more complex feature representations or more powerful external optimizers. However, these methods share two inherent defects: first, they lack a mechanism for synchronizing audio prosody and visual motion in the microscopic time domain, so the rhythm of the generated expression is temporally misaligned with the prosodic characteristics of the speech, and micro-level audio-visual synchronization is difficult to achieve; second, the generation process lacks intrinsic constraints on the physical process of vocalization, so the motion amplitude of the expression does not match the speech energy and the result lacks physical realism.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a multi-modal expression generating system for virtual-real fusion scenes, which comprises a prosody reference module, a visual acquisition module, a feature calculation module, a driving energy calculation module, an energy correction module and an expression rendering module. The prosody reference module acquires a digital audio stream, performs prosody analysis on the digital audio stream to extract periodic features, and generates an audio prosody phase reference signal comprising a time window and sampling trigger points. The visual acquisition module synchronously acquires original face video frames aligned with the digital audio stream and divides the face region in each original face video frame into a plurality of expression-function micro-regions based on facial anatomy. The feature calculation module calculates a dense optical flow vector field in the corresponding expression-function micro-region according to the sampling trigger points, performs vector dot-product operations between the dense optical flow vector field and a preset muscle movement direction template, and decouples and extracts facial muscle movement change vectors. The driving energy calculation module analyzes the digital audio stream to extract acoustic feature parameters, performs normalization based on the statistical distribution of the acoustic feature parameters within a dynamic time window, and maps the acoustic feature parameters into an audio driving value. The energy correction module compares the total geometric displacement represented by the facial muscle movement change vectors with the audio driving value, performs attenuation or compensation enhancement processing on the facial muscle movement change vectors according to the comparison result, and generates corrected expression change vectors.
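As a compact picture of the data passed between the modules described above, the following sketch defines per-window and per-trigger containers; all field and type names are assumptions made for illustration, not terms from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ProsodyReference:
    """Audio prosody phase reference signal for one time window."""
    window: Tuple[float, float]          # (start, end) in seconds
    triggers: List[float]                # sampling trigger timestamps

@dataclass
class FramePacket:
    """Per-trigger data flowing between the modules described above."""
    muscle_vectors: Dict[str, Tuple[float, float]]   # micro-region -> motion vector
    audio_driving_value: float                        # from the driving energy module
    corrected_vectors: Dict[str, Tuple[float, float]] = field(default_factory=dict)
    expression_state: Dict[str, float] = field(default_factory=dict)  # bone / blend-shape params
```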