CN-122024756-A - Real-time voice-driven human body posture generation method based on a variational autoencoder
Abstract
The invention discloses a real-time voice-driven human body posture generation method based on a variational autoencoder, belonging to the technical field of digital human interaction and comprising the following steps: S1, preprocessing and aligning multi-modal data to construct three-dimensional training samples comprising audio features, emotion labels and pose data; S2, constructing a cross-modal generation model based on the variational autoencoder, the model comprising an encoder and a decoder; S3, designing a multi-objective loss function to train and optimize the cross-modal generation model; S4, inputting the preprocessed audio into the trained cross-modal generation model and obtaining a temporally coherent human body pose sequence through real-time inference. The variational autoencoder is applied to a voice-to-human-pose cross-modal generation scenario, breaking its traditional application boundary: the complex mapping between audio features and human poses is encoded into a low-dimensional probability distribution through the probability modeling capability of the latent space, providing a new technical path for pose generation.
Inventors
- Zhao Wanqing
- Han Denglu
- Zhang Shaobo
- Zhang Xiang
- Wang Lin
- Peng Jinye
Assignees
- Northwest University (西北大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-01-29
Claims (8)
- 1. A real-time voice-driven human body posture generation method based on a variational autoencoder, characterized by comprising the following steps: S1, preprocessing and aligning multi-modal data to construct three-dimensional training samples comprising audio features, emotion labels and pose data; S2, constructing a cross-modal generation model based on the variational autoencoder, wherein the cross-modal generation model comprises an encoder and a decoder; S3, designing a multi-objective loss function to train and optimize the cross-modal generation model; S4, inputting preprocessed audio to be processed into the trained cross-modal generation model and obtaining a temporally coherent human body pose sequence through real-time inference.
- 2. The method for generating real-time voice-driven human body poses based on a variational autoencoder according to claim 1, wherein the preprocessing and aligning of the multi-modal data in step S1 comprises: S11, audio feature extraction, namely extracting Mel spectrogram, MFCC and beat-detection features from the raw audio; S12, pose data normalization, namely converting joint angles in BVH files into 3D coordinates and normalizing bone lengths and the coordinate system; and S13, multi-modal alignment, namely achieving frame-level synchronization of the audio features, pose data and emotion labels based on timestamps.
- 3. The method for generating real-time voice-driven human body poses based on a variational autoencoder according to claim 1, wherein the encoder in step S2 adopts a composite structure for extracting audio spatio-temporal features and fusing multi-modal condition information embedding emotion labels and text semantics; the composite structure comprises, in sequence, a CNN layer, a bidirectional LSTM layer and a multi-head attention layer, wherein the CNN extracts spatial features of the audio, the bidirectional LSTM models the temporal dependencies of the audio to extract audio spatio-temporal features, and the multi-head attention mechanism focuses on audio key frames.
- 4. The real-time voice-driven human body posture generation method based on a variational autoencoder according to claim 1, wherein the decoder in step S2 is designed based on a GRU and a temporal attention mechanism, takes as input the latent variable of the variational autoencoder together with the emotion/text embedding information, and outputs a temporally coherent 3D pose sequence.
- 5. The method for generating real-time voice-driven human body poses based on a variational autoencoder according to claim 1, wherein the multi-objective loss function in step S3 is a total loss calculated as $L_{total} = L_{recon} + \beta L_{KL} + L_{sync} + L_{smooth}$, where $\beta$ is the KL-divergence loss weight coefficient, $L_{recon}$ is the reconstruction loss, $L_{KL}$ is the KL-divergence loss, $L_{sync}$ is the synchronization loss, and $L_{smooth}$ is the smoothing loss.
- 6. The real-time voice-driven human body posture generation method based on a variational autoencoder according to claim 5, characterized in that the reconstruction loss combines a joint-coordinate mean-square error with a joint-rotation-angle cosine-similarity loss; the KL-divergence loss is trained with a KL annealing strategy, with $\beta = 0$ at the start of training and gradually increased to a target value as training progresses; the synchronization loss is computed with the FastDTW algorithm; and the smoothing loss is implemented with a differential penalty mechanism (a minimal sketch of this loss appears after the claims).
- 7. The method for generating real-time voice-driven human body poses based on a variational autoencoder according to claim 1, further comprising converting the human body pose sequence obtained by real-time inference into a visualized human body pose animation through skeleton rendering, joint binding and dynamic frame refreshing.
- 8. The method of any one of claims 1-7, wherein training and optimization of the cross-modal generation model builds audio-pose aligned samples based on the BEAT dataset, which comprises 10 classes of semantic relevance, 8 classes of emotional gestures, 4 groups of modal data, and 72 hours of multilingual speech data from 30 speakers.
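For concreteness, the following is a minimal PyTorch sketch of the multi-objective loss of claims 5 and 6. Tensor shapes, the linear annealing schedule, and all function and parameter names are illustrative assumptions, not taken from the patent; the synchronization term is only indicated, since the patent names FastDTW but does not give its exact formulation.

```python
import torch
import torch.nn.functional as F

def kl_beta(step, warmup_steps=10000, target=1.0):
    """KL annealing (claim 6): beta starts at 0 and grows linearly to the target."""
    return min(target, target * step / warmup_steps)

def total_loss(pred_xyz, true_xyz, pred_rot, true_rot, mu, logvar, beta):
    """L_total = L_recon + beta * L_KL + L_sync + L_smooth.
    Assumed shapes: poses (batch, frames, joints, 3), rotations
    (batch, frames, joints, d), latents (batch, frames, latent_dim)."""
    # Reconstruction: joint-coordinate MSE plus joint-rotation cosine loss
    l_recon = F.mse_loss(pred_xyz, true_xyz) \
        + (1.0 - F.cosine_similarity(pred_rot, true_rot, dim=-1)).mean()
    # KL divergence of the approximate posterior against a standard Gaussian
    l_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # Smoothing: differential penalty on frame-to-frame velocity
    l_smooth = (pred_xyz[:, 1:] - pred_xyz[:, :-1]).pow(2).mean()
    # L_sync (audio-pose alignment, per the patent via FastDTW) would be
    # added here, e.g. computed with the fastdtw package on detached sequences.
    return l_recon + beta * l_kl + l_smooth
```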
Description
Real-time voice-driven human body posture generation method based on a variational autoencoder

Technical Field
The invention relates to the technical field of digital human interaction, and in particular to a real-time voice-driven human body pose generation method based on a variational autoencoder.

Background
In digital human interaction systems, voice-driven human body pose generation is one of the key links for realizing natural interaction; its core requirement is to convert an input voice signal, accurately and in real time, into a human body pose sequence that conforms to the semantic and emotional characteristics of the speech. Current human body pose generation technologies fall mainly into two types: traditional pose generation methods and deep-learning-based pose generation methods. Traditional methods rely on manual keyframe editing or motion-capture data; they require extensive manual work by professionals, are inefficient, and cannot meet the demands of real-time interaction scenarios. With the development of deep learning, deep-learning-based pose generation has become mainstream, with typical representatives such as Wav2Pose, Speech2Gesture and EMAGE, most of which build their models on a deterministic Transformer framework. However, the existing deep-learning-based voice-driven pose generation technology still has two core pain points. First, real-time performance is insufficient: because of the redundant multi-head self-attention computation of the Transformer framework, model inference latency generally exceeds 200 ms and can reach several seconds, which cannot satisfy the real-time interaction requirements of digital humans. Second, pose diversity and physical plausibility are unbalanced: existing methods struggle to achieve both diversity of the generated poses and physical plausibility consistent with human motion laws, frequently producing joint angles that violate physical constraints and jittery, incoherent motion. In addition, the prior art mostly uses single audio features as input and does not fully fuse multi-modal information such as emotion and semantics, so the generated poses have low semantic relevance and emotional matching with the speech. The variational autoencoder (VAE), as a classical generative model, has strong latent-space modeling capability and is widely applied to generation tasks in fields such as images, text and music, but at present no technical scheme applies it to cross-modal generation of voice-driven human body poses. Its latent-space probability-distribution modeling is expected to solve the problems of pose diversity and continuity, while a lightweight temporal structure design can improve the real-time performance of the model.

Disclosure of Invention
The invention aims to solve the technical problems of insufficient real-time performance, imbalance between pose diversity and physical plausibility, and insufficient multi-modal information fusion in the prior art, and provides a real-time voice-driven human body pose generation method based on a variational autoencoder.
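As an illustration of the preprocessing step S1 (claim 2), the following is a minimal sketch of the audio-feature extraction using librosa. The sampling rate, hop length, and feature dimensions are assumptions for illustration, not values specified in the patent.

```python
import librosa
import numpy as np

def extract_audio_features(wav_path, sr=16000, hop=512):
    """Extract the per-frame features named in claim 2: log-Mel spectrogram,
    MFCCs, and a beat-detection flag, stacked into one (frames, dims) matrix."""
    y, _ = librosa.load(wav_path, sr=sr)
    log_mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop, n_mels=80))
    mfcc = librosa.feature.mfcc(y=y, sr=sr, hop_length=hop, n_mfcc=13)
    _, beat_frames = librosa.beat.beat_track(y=y, sr=sr, hop_length=hop)
    beats = np.zeros(log_mel.shape[1], dtype=np.float32)
    beats[beat_frames] = 1.0  # mark frames that fall on a detected beat
    return np.concatenate([log_mel, mfcc, beats[None, :]], axis=0).T
```

Frame-level alignment with the pose data and emotion labels (S13) can then be done by matching these feature frames to pose frames via their timestamps.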
In order to solve the above technical problems, the invention provides the following technical scheme: a real-time voice-driven human body pose generation method based on a variational autoencoder, comprising the following steps: S1, preprocessing and aligning multi-modal data to construct three-dimensional training samples comprising audio features, emotion labels and pose data; S2, constructing a cross-modal generation model based on the variational autoencoder, wherein the cross-modal generation model comprises an encoder and a decoder; S3, designing a multi-objective loss function to train and optimize the cross-modal generation model; S4, inputting preprocessed audio to be processed into the trained cross-modal generation model and obtaining a temporally coherent human body pose sequence through real-time inference. By adopting this technical scheme, the variational autoencoder is applied to the voice-to-human-pose cross-modal generation scenario, breaking its traditional application boundary; the complex mapping between audio features and human poses is encoded into a low-dimensional probability distribution through the probability modeling capability of the VAE's latent space, providing a brand-new technical path for pose generation; diverse human poses are generated by means of the probabilistic sampling property of the latent space, avoiding the pose monotony of the prior art, mean…
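To make the S2 architecture concrete, here is a minimal PyTorch sketch of the cross-modal VAE described in claims 3 and 4: a CNN → bidirectional LSTM → multi-head attention encoder, and a GRU decoder conditioned on the latent code and an emotion embedding. All dimensions, the per-frame latent, and the omission of the decoder's temporal-attention and text-embedding branches are simplifying assumptions.

```python
import torch
import torch.nn as nn

class SpeechPoseVAE(nn.Module):
    """Sketch: CNN -> BiLSTM -> multi-head attention encoder (claim 3);
    GRU decoder conditioned on latent code plus emotion embedding (claim 4,
    without its temporal attention). Dimensions are illustrative."""
    def __init__(self, feat_dim=94, hidden=256, latent=64, joints=55, emotions=8):
        super().__init__()
        self.cnn = nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
        self.to_mu = nn.Linear(2 * hidden, latent)
        self.to_logvar = nn.Linear(2 * hidden, latent)
        self.emo_emb = nn.Embedding(emotions, 32)
        self.gru = nn.GRU(latent + 32, hidden, batch_first=True)
        self.to_pose = nn.Linear(hidden, joints * 3)

    def forward(self, audio_feats, emotion_id):
        # audio_feats: (batch, frames, feat_dim); emotion_id: (batch,)
        h = torch.relu(self.cnn(audio_feats.transpose(1, 2))).transpose(1, 2)
        h, _ = self.lstm(h)                    # temporal dependencies of the audio
        h, _ = self.attn(h, h, h)              # focus on audio key frames
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        emo = self.emo_emb(emotion_id)[:, None, :].expand(-1, z.size(1), -1)
        out, _ = self.gru(torch.cat([z, emo], dim=-1))
        return self.to_pose(out), mu, logvar   # pose: (batch, frames, joints*3)
```

In this reading, the real-time inference of S4 amounts to encoding the incoming audio frames and sampling z, so that repeated sampling yields diverse yet temporally coherent pose sequences, which is the probabilistic-sampling property the description relies on.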