
CN-122023612-A - Digital human video generation method and system based on reinforcement learning

CN122023612A

Abstract

The invention discloses a method and a system for generating digital human video based on reinforcement learning, relating to the technical field of video generation. By introducing style feature extraction, speech emotion vector modeling and a unified action control mechanism, it addresses the missing action styles, emotion desynchronization and multi-modal fragmentation of the prior art. The method extracts the appearance, timbre, action, expression and mouth-shape styles of a person from a reference video; injects the appearance style into a three-dimensional digital human model; generates speech from the input text in the target timbre style while simultaneously extracting a speech emotion vector; uses that emotion vector as the unified control signal driving whole-body action, expression and mouth-shape sequence generation, ensuring style consistency and emotion synchronization; and finally produces high-fidelity, highly consistent digital human video through unified timing alignment and differentiable rendering.

Inventors

  • LI YANG
  • LV XUDONG

Assignees

  • Hefei University of Technology (合肥工业大学)

Dates

Publication Date
2026-05-12
Application Date
2026-01-28

Claims (10)

  1. A method for generating digital human video based on reinforcement learning, characterized by comprising the following steps: extracting style features of a target video according to a preset style, wherein the style features comprise appearance style features, timbre style features, action style features, expression style features and mouth-shape style features; importing the appearance style features into a three-dimensional digital human model to obtain a target three-dimensional digital human model; generating audio from the text lines input by a user and the timbre style features, extracting audio information to generate a speech emotion vector, generating an initial action sequence based on the speech emotion vector, and optimizing the initial action sequence to obtain an optimized action sequence, wherein the action sequence comprises an expression sequence, a mouth-shape sequence and a whole-body action sequence; and carrying out unified timing alignment on the optimized action sequence and driving the target three-dimensional digital human model to execute differentiable rendering to generate the digital human video.
  2. The reinforcement-learning-based digital human video generation method according to claim 1, wherein, in the step of generating audio from the text lines input by a user and the timbre style features: the timbre style features comprise pitch, articulation, rising and falling intonation, speaking rate and rhythm; and the text lines input by the user are converted into speech of the target character, such that the generated speech is consistent with the timbre style features, and the audio is output.
  3. The reinforcement-learning-based digital human video generation method according to claim 2, wherein the steps of extracting audio information to generate a speech emotion vector and generating an initial action sequence based on the speech emotion vector are: extracting the rhythm features, pitch changes and pause timing in the audio to generate the speech emotion vector of the target person; generating an initial expression sequence through Blendshape weight models, and generating a mouth-shape sequence synchronized with the speech content through a phoneme-to-mouth-shape mapping model; and taking the initial whole-body action sequence, the initial expression sequence and the initial mouth-shape sequence as the initial action sequence (a Python sketch of this step follows the claims).
  4. The reinforcement-learning-based digital human video generation method of claim 1, wherein the step of optimizing the initial action sequence to obtain an optimized action sequence comprises: according to the action style features, the expression style features and the mouth-shape style features, iteratively optimizing the generated initial action sequence with a proximal policy optimization algorithm; calculating an optimization stop index from state information during the iteration; comparing the optimization stop index with a preset threshold; judging whether to stop the iterative optimization according to the comparison result; and outputting the optimized action sequence (the optimization loop is sketched after the claims).
  5. The reinforcement-learning-based digital human video generation method of claim 4, wherein the step of calculating the optimization stop index from the state information during the iteration is: calculating a style residual disturbance inversion index and a phase cooperative stability index from the state information during the iteration, and calculating the optimization stop index from the style residual disturbance inversion index and the phase cooperative stability index.
  6. The reinforcement-learning-based digital human video generation method according to claim 5, wherein the style residual disturbance inversion index is calculated as follows: in each iteration round, representing the action sequence and the corresponding style feature of that round as vectors, dividing their dot product by the product of their moduli to obtain a cosine similarity, and subtracting the cosine similarity from 1 to obtain the style residual of that round, the residuals forming a style residual sequence ordered by round; for the style residual sequence, taking the sign of the difference between any two adjacent rounds, marking the disturbance direction sign positive if the difference is greater than zero and negative otherwise, to obtain a disturbance direction sign sequence; among the disturbance direction signs of every three consecutive rounds, if the first sign is opposite to the middle sign and the middle sign is opposite to the last sign, determining that the three rounds form a disturbance inversion jump point and marking those rounds with an inversion mark, to generate a disturbance inversion jump point sequence; counting the frequency of disturbance inversion jump points over all rounds to obtain a disturbance inversion density value, the density being the proportion of rounds carrying an inversion mark to the total number of rounds; extracting the round indices of all disturbance inversion jump points and computing the round interval between each pair of consecutive jump points to form a disturbance inversion period sequence; for the period sequence, computing the relative change rate of adjacent period lengths, namely dividing the difference between each period length and the preceding one by the absolute value of the preceding length plus one, and averaging these rates over rounds to obtain the fluctuation stability factor of the disturbance inversion period; and taking the product of the disturbance inversion density value and the fluctuation stability factor as the numerator and that product plus 1 as the denominator, the quotient being the style residual disturbance inversion index (see the matching sketch after the claims).
  7. The reinforcement-learning-based digital human video generation method according to claim 5, wherein the phase cooperative stability index is calculated as follows: in each iteration round, aligning the action sequence, the expression sequence and the speech emotion vector sequence generated in that round to the same frame length, constructing an action time series, an expression time series and an emotion time series in frame-sequence form; performing a fast Fourier transform on each of the three time series, extracting the frequency components of each signal, selecting the several leading dominant frequency components by amplitude together with the maximum frequency amplitude, and constructing a frequency envelope feature sequence; for the frequency envelope feature sequence of each modality, identifying the positions of the dominant frequency peak points in the frame sequence, obtaining an action main-peak frame index sequence, an expression main-peak frame index sequence and an emotion main-peak frame index sequence; based on the frame positions of the dominant frequency peaks of the three modalities in each round, computing the inter-frame offset sequences between action and emotion, between expression and emotion, and between action and expression, each offset being the absolute difference between the main-peak frame positions of the two modalities concerned; and calculating the phase cooperative stability index from these inter-frame offsets.
  8. The reinforcement-learning-based digital human video generation method according to claim 7, wherein the step of calculating the phase cooperative stability index from the inter-frame offsets is: for each inter-frame offset sequence, computing in order the absolute difference between adjacent offset values and dividing it by the preceding offset value plus one, to obtain the corresponding relative disturbance change rate sequence; counting, for each rate sequence, the number of elements greater than the median of that sequence and dividing this count by the total length of the group minus one, to obtain three disturbance density values representing the rhythm co-disturbance density between action and emotion, between expression and emotion, and between action and expression respectively; selecting the largest of the three as the cross-modal rhythm co-disturbance factor of the current training round; and taking 1 as a constant numerator and 1 plus the disturbance factor as the denominator, the resulting value being the phase cooperative stability index (see the matching sketch after the claims).
  9. The reinforcement-learning-based digital human video generation method of claim 4, wherein the steps of comparing the optimization stop index with the preset threshold, judging whether to stop the iterative optimization according to the comparison result, and outputting the optimized action sequence and expression sequence as the multi-modal driving signal are: if the optimization stop index is not smaller than the preset threshold, stopping the iterative optimization and outputting the optimized action sequence and expression sequence as the multi-modal driving signal; and if the optimization stop index is smaller than the preset threshold, continuing the iterative optimization until the optimization stop index is not smaller than the preset threshold, and then outputting the optimized action sequence and expression sequence as the multi-modal driving signal.
  10. A reinforcement-learning-based digital human video generation system for implementing the reinforcement-learning-based digital human video generation method of any one of claims 1-9, characterized in that the system comprises: an extraction module for extracting style features of a target video according to a preset style type, wherein the style features comprise appearance style features, timbre style features, action style features, expression style features and mouth-shape style features; a model module for importing the appearance style features into the three-dimensional digital human model to obtain a target three-dimensional digital human model; an optimization module for generating audio from the text lines input by a user and the timbre style features, extracting audio information to generate a speech emotion vector, generating an initial action sequence based on the speech emotion vector, and optimizing the initial action sequence to obtain an optimized action sequence, wherein the action sequence comprises an expression sequence, a mouth-shape sequence and a whole-body action sequence; and a generation module for carrying out unified timing alignment on the optimized action sequence and driving the target three-dimensional digital human model to execute differentiable rendering to generate the digital human video.
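
The claims above are method steps, not code; the sketches below illustrate one possible reading of them in Python with NumPy. First, a minimal sketch of claim 3's speech emotion vector and phoneme-to-mouth-shape mapping. The concrete feature set (RMS-based rhythm variability, a zero-crossing pitch-change proxy, a pause ratio) and the tiny viseme table are illustrative assumptions; the patent fixes none of them.

import numpy as np

def speech_emotion_vector(wave, sr, frame_sec=0.02):
    """Hypothetical claim-3 sketch: pack rhythm, pitch-change and pause
    statistics from raw audio into a small speech emotion vector."""
    hop = int(sr * frame_sec)
    n = (len(wave) // hop) * hop
    frames = np.asarray(wave[:n], dtype=float).reshape(-1, hop)
    energy = np.sqrt((frames ** 2).mean(axis=1))          # per-frame RMS
    voiced = energy > 0.1 * energy.max()                  # crude pause mask
    # Zero-crossing rate per frame: a cheap proxy for pitch change.
    zcr = (np.diff(np.sign(frames), axis=1) != 0).mean(axis=1)
    return np.array([
        energy.std() / (energy.mean() + 1e-8),  # rhythm variability
        np.abs(np.diff(zcr)).mean(),            # pitch-change roughness
        1.0 - voiced.mean(),                    # pause ratio
    ])

# Toy phoneme-to-viseme table standing in for claim 3's
# phoneme-to-mouth-shape mapping model (entries are illustrative).
PHONEME_TO_VISEME = {"AA": "open", "IY": "spread", "UW": "round",
                     "M": "closed", "F": "teeth-lip"}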
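Next, a sketch of claim 6's style residual disturbance inversion index. Two points are ambiguous in the translated claim and are resolved here by assumption: the inversion mark is placed on the middle round of each alternating sign triple, and the final index is read as (density x factor) / (density x factor + 1).

import numpy as np

def style_residual_inversion_index(actions, styles):
    """Claim-6 sketch. actions: (T, D) flattened action sequence per round;
    styles: (T, D) matching style-feature vectors; T = iteration rounds."""
    # Style residual per round: 1 - cosine similarity.
    dot = (actions * styles).sum(axis=1)
    norms = np.linalg.norm(actions, axis=1) * np.linalg.norm(styles, axis=1)
    residual = 1.0 - dot / np.maximum(norms, 1e-12)

    # Sign of the residual difference between adjacent rounds.
    signs = np.sign(np.diff(residual))            # +1 rising, -1 falling

    # Alternating sign triples mark a disturbance inversion jump point;
    # the mark is placed on the middle round (an assumption).
    jumps = [t + 1 for t in range(1, len(signs) - 1)
             if signs[t] != 0
             and signs[t - 1] == -signs[t] and signs[t + 1] == -signs[t]]

    density = len(jumps) / len(residual)          # inversion density

    # Period lengths between consecutive jump points and their mean
    # relative change rate: the 'fluctuation stability factor'.
    if len(jumps) >= 3:
        periods = np.diff(jumps).astype(float)
        rates = np.diff(periods) / (np.abs(periods[:-1]) + 1.0)
        factor = float(rates.mean())
    else:
        factor = 0.0

    # One reading of the claim: numerator = density * factor,
    # denominator = that same product + 1.
    prod = density * factor
    return prod / (prod + 1.0)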
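A companion sketch for claims 7-8's phase cooperative stability index. The number of dominant frequency components kept (k), the peak pairing rule, and the reading of claim 8's "divide by the group length minus one" clause are assumptions flagged in the comments.

import numpy as np

def phase_cooperative_stability_index(action, expression, emotion, k=3):
    """Claims 7-8 sketch. Inputs are 1-D per-frame signals already aligned
    to a common frame length (claim 7's alignment step); k, the number of
    dominant frequency components kept, is a free choice here."""
    def main_peak_frames(sig):
        # FFT -> keep the k largest-amplitude components -> rebuild a
        # smooth envelope -> local maxima are the main-peak frame indices.
        spec = np.fft.rfft(sig - sig.mean())
        mask = np.zeros_like(spec)
        top = np.argsort(np.abs(spec))[-k:]
        mask[top] = spec[top]
        env = np.abs(np.fft.irfft(mask, n=len(sig)))
        return np.array([t for t in range(1, len(env) - 1)
                         if env[t] > env[t - 1] and env[t] >= env[t + 1]])

    pa, pe, pm = (main_peak_frames(np.asarray(s, dtype=float))
                  for s in (action, expression, emotion))

    def offsets(p, q):
        # Pair the first min(len) peaks of the two modalities; the
        # offset is the absolute frame difference (claim 7).
        n = min(len(p), len(q))
        return np.abs(p[:n] - q[:n]).astype(float)

    def disturbance_density(off):
        if len(off) < 3:
            return 0.0
        rates = np.abs(np.diff(off)) / (off[:-1] + 1.0)
        med = np.median(rates)
        # Claim 8's trailing 'minus one' is read as dividing by the
        # group length minus one; other readings are possible.
        return float((rates > med).sum()) / max(len(rates) - 1, 1)

    factor = max(disturbance_density(offsets(pa, pm)),  # action vs emotion
                 disturbance_density(offsets(pe, pm)),  # expression vs emotion
                 disturbance_density(offsets(pa, pe)))  # action vs expression
    return 1.0 / (1.0 + factor)  # claim 8: constant numerator 1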
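Finally, the claim-4/9 control flow that ties the two indices into a PPO-style loop. The generate and ppo_update callables are caller-supplied placeholders, and the weighted-sum combination of the two sub-indices is an assumption (claim 5 does not specify one).

def optimization_stop_index(style_idx, phase_idx, w=0.5):
    # Claim 5 only says the stop index is computed *from* the two
    # sub-indices; a weighted sum is an assumption.
    return w * style_idx + (1.0 - w) * phase_idx

def optimize_action_sequence(generate, ppo_update, threshold=0.9,
                             max_rounds=500):
    """Claim-4/9 sketch. `generate` returns the current round's sequences
    plus the signals needed by the two index functions above; `ppo_update`
    performs one proximal policy optimization step."""
    seqs = None
    for _ in range(max_rounds):
        seqs = generate()
        s_idx = style_residual_inversion_index(seqs["actions"], seqs["styles"])
        p_idx = phase_cooperative_stability_index(
            seqs["action_signal"], seqs["expression_signal"],
            seqs["emotion_signal"])
        # Claim 9: stop once the index is not smaller than the threshold.
        if optimization_stop_index(s_idx, p_idx) >= threshold:
            break
        ppo_update(seqs)
    return seqs  # optimized action/expression sequences: the driving signal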

Description

Digital human video generation method and system based on reinforcement learning

Technical Field

The invention relates to the technical field of video generation, in particular to a method and a system for generating digital human video based on reinforcement learning.

Background

Currently, technologies for generating digital human video from multimodal information mainly include image- or video-driven methods, audio-driven methods, and the text-driven large-model generation methods developed in recent years. Image/video-driven technology generally extracts key points, optical flow or face meshes from an input video or photo and generates a new video output through motion transfer and rendering networks; typical methods include FOMM, Face Reenactment and AD-NeRF. However, this technology relies on an additional driving video, cannot flexibly generate actions from text or audio, and cannot independently control action style and expression style. Audio-driven technologies such as Wav2Lip, SadTalker and Audio2Face predict mouth shapes and partial expressions from audio features to generate talking-head video of a person, but they can only drive head expressions, are limited in action dimensions, and cannot perform stylized control. Text-driven multi-modal large-model methods generate speech from text and then regenerate expressions and actions, but the actions they generate are generally generic, cannot embody the personal style present in the character's original video, and suffer from unstable emotion control and the lack of a consistent multi-modal fusion mechanism. In general, the prior art either relies on a driving video and lacks flexibility, fails to achieve personalized action style control, or lacks emotion vector modeling, and still falls short of the requirements of high-consistency digital human video generation.

Viewed along the main pipeline of digital human video generation, the prior art has obvious deficiencies in visual style feature construction, timbre style modeling, action generation and multi-modal fusion: image/video-driven technology can keep high appearance consistency but cannot generate actions autonomously from text or audio; audio-driven models can generate mouth shapes and limited expressions but lack the ability to model action styles; current TTS and voice-cloning models can generate speech in the target timbre, but most provide no explicit emotion vector usable for action and expression control, so speech emotion cannot be unified with video emotion; models such as T2M-GPT and Audio2Gesture can generate actions from text or audio, but the actions are generic, the action rhythm and behavioral style of a specific person cannot be learned and transferred, and facial expressions and body actions cannot be unified; and the traditional multi-modal generation chain produces mouth shapes, expressions and actions separately, lacking a unified modeling and timing alignment mechanism, so the final synthesized video is inconsistent in emotion, style and timing.
Therefore, although existing research lays a foundation for digital human video generation, the fragmentation of the overall technology chain, the disunity of actions and expressions, and the loss of character style remain outstanding problems.

Disclosure of Invention

The invention aims to solve the problems of the prior art mentioned in the background, namely the fragmentation of the overall technology chain, the disunity of actions and expressions, and the loss of character style, and provides a method and a system for generating digital human video based on reinforcement learning. In a first aspect of the invention, there is provided a method for generating digital human video based on reinforcement learning, the method comprising: S1, extracting style features of a target video according to a preset style, wherein the style features comprise appearance style features, timbre style features, action style features, expression style features and mouth-shape style features; S2, importing the appearance style features into a three-dimensional digital human model to obtain a target three-dimensional digital human model; S3, generating audio from the text lines input by a user and the timbre style features, extracting audio information to generate a speech emotion vector, generating an initial action sequence based on the speech emotion vector, and optimizing the initial action sequence to obtain an optimized action sequence, wherein the action sequence comprises an expression sequence, a mouth-shape sequence and a whole-body action sequence
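
For orientation, a dataflow-only skeleton of steps S1-S3 plus the final rendering step of claim 1. Every callee here is a named placeholder for a component the patent leaves to the implementer, not a real API.

def generate_digital_human_video(reference_video, text_lines):
    # Placeholder pipeline: each callee stands in for a component the
    # patent leaves open.
    styles = extract_style_features(reference_video)         # S1
    avatar = build_3d_model(styles["appearance"])            # S2
    audio = synthesize_speech(text_lines, styles["timbre"])  # S3: target-timbre TTS
    emotion = extract_speech_emotion_vector(audio)           # unified control signal
    initial = generate_initial_sequences(emotion)            # expression/mouth/body
    optimized = ppo_optimize(initial, styles)                # claims 4-9 loop
    return render(avatar, align_timelines(optimized))        # differentiable rendering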