
CN-121985198-A - Emotion talking head video generation method, device, equipment and medium


Abstract

The invention relates to the technical field of artificial intelligence and is applicable to the fields of financial technology and medical health. It discloses a method, device, equipment and medium for generating emotion talking head video. The method comprises: acquiring driving audio and an identity image, and encoding the driving audio with a pre-trained audio encoder to obtain an audio emotion vector; processing the identity image and the audio emotion vector based on an emotion face representation model to obtain facial identity features, expression features and an emotion intensity value; generating a talking head video segment through a pre-trained generation model according to the facial identity features, the expression features and the emotion intensity value; and generating the emotion talking head video from the talking head video segment by adopting a time extrapolation strategy. The method improves the emotion expression accuracy and realism of emotion talking head video generation.
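As a rough illustration of the data flow the abstract describes, the following is a minimal, non-authoritative Python sketch of the four stages. Every function name, dimension, and shape here is an invented stand-in for illustration, not the patent's actual models.

```python
# Toy end-to-end pipeline: audio -> emotion vector -> face representation
# -> conditioned frame generation. All components are hypothetical stubs.
import numpy as np

def encode_audio(audio):
    """Stand-in for the pre-trained audio encoder: audio -> emotion vector."""
    return np.tanh(audio.reshape(-1)[:128])          # hypothetical 128-d emotion vector

def represent_face(identity_image, emotion_vec):
    """Stand-in for the emotion face representation model."""
    identity = identity_image.mean(axis=(0, 1))       # emotion-free identity features (toy)
    expression = emotion_vec[:64]                     # expression features (toy)
    intensity = float(np.linalg.norm(emotion_vec))    # scalar emotion intensity
    return identity, expression, intensity

def generate_clip(identity, expression, intensity, n_frames=40):
    """Stand-in for the pre-trained generation model: condition -> frames."""
    rng = np.random.default_rng(0)
    return rng.random((n_frames, 64, 64, 3)) * min(intensity, 1.0)

audio = np.random.default_rng(1).random(16000)        # 1 s of fake 16 kHz audio
image = np.random.default_rng(2).random((64, 64, 3))  # fake identity image
emotion_vec = encode_audio(audio)
identity, expression, intensity = represent_face(image, emotion_vec)
clip = generate_clip(identity, expression, intensity)
print(clip.shape)  # (40, 64, 64, 3): one talking head video segment
```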

Inventors

  • WANG JIANZONG
  • ZHANG XULONG
  • LU RENJIE

Assignees

  • Ping An Technology (Shenzhen) Co., Ltd.

Dates

Publication Date
2026-05-05
Application Date
2026-01-16

Claims (10)

  1. An emotion talking head video generation method, characterized by comprising the following steps: acquiring driving audio and an identity image, and encoding the driving audio with a pre-trained audio encoder to obtain an audio emotion vector; processing the identity image and the audio emotion vector based on an emotion face representation model to obtain facial identity features, expression features and an emotion intensity value; generating a talking head video segment through a pre-trained generation model according to the facial identity features, the expression features and the emotion intensity value; and generating an emotion talking head video from the talking head video segment by adopting a time extrapolation strategy.
  2. The emotion talking head video generation method of claim 1, wherein the step of encoding the driving audio with a pre-trained audio encoder to obtain an audio emotion vector comprises: inputting the processed driving audio into the audio encoder, so that a feature extraction network in the audio encoder extracts a semantic feature vector; and mapping the semantic feature vector into a preset emotion feature space of the audio encoder through a projection layer or an adapter network, so as to output the audio emotion vector.
  3. The emotion talking head video generation method of claim 1, wherein the emotion face representation model comprises a neutral encoder and a recurrent neural network, and the step of processing the identity image and the audio emotion vector based on the emotion face representation model to obtain facial identity features, expression features and an emotion intensity value comprises: eliminating emotion expression components in the identity image through the neutral encoder to obtain facial identity features independent of emotion; and decoding the audio emotion vector with the recurrent neural network, supervised by 2D continuous emotion labels, to obtain the expression features and the emotion intensity value.
  4. The emotion talking head video generation method of claim 3, wherein the step of eliminating emotion expression components in the identity image through the neutral encoder to obtain facial identity features independent of emotion comprises: inputting the identity image into an image encoder to obtain a mixed feature vector fusing identity information and expression information; and inputting the mixed feature vector into the neutral encoder for processing, so as to separate the identity information and the expression information in the mixed feature vector and remove the expression information, obtaining the facial identity features.
  5. The emotion talking head video generation method of claim 1, wherein the step of generating a talking head video segment through a pre-trained generation model according to the facial identity features, the expression features and the emotion intensity value comprises: fusing the facial identity features, the expression features and the emotion intensity value to obtain a conditional feature vector; inputting the conditional feature vector into the generation model, so that the generation model generates face image frames under the control of the conditional feature vector; and combining the face image frames in sequence to generate the talking head video segment.
  6. The emotion talking head video generation method of claim 1, wherein generating an emotion talking head video from the talking head video segment by adopting a time extrapolation strategy comprises: setting an initial segment length to a first preset frame number and a sliding generation length to a second preset frame number, wherein the second preset frame number is smaller than the first preset frame number; generating a video segment of the first preset frame number from the talking head video segment; taking the last frames of the video segment, numbering the first preset frame number minus the second preset frame number, as context, and iteratively generating a new video segment containing the next second preset frame number of frames; and smoothly splicing the overlapping part of the video segment and the new video segment, of the first preset frame number minus the second preset frame number frames, to obtain the emotion talking head video (a toy sketch of this sliding-window scheme follows the claims).
  7. The emotion talking head video generation method of any one of claims 1-6, further comprising, after the step of generating an emotion talking head video from the talking head video segment by adopting a time extrapolation strategy: automatically generating a time mask based on identity coordinates corresponding to the facial identity features; and regenerating or modifying the lip area and/or the eye area in the emotion talking head video for a specified period by using the time mask, to obtain a target emotion talking head video.
  8. An emotion talking head video generation device, comprising: an acquisition and encoding unit, configured to acquire driving audio and an identity image, and encode the driving audio with a pre-trained audio encoder to obtain an audio emotion vector; a processing unit, configured to process the identity image and the audio emotion vector based on an emotion face representation model to obtain facial identity features, expression features and an emotion intensity value; a first generation unit, configured to generate a talking head video segment through a pre-trained generation model according to the facial identity features, the expression features and the emotion intensity value; and a second generation unit, configured to generate an emotion talking head video from the talking head video segment by adopting a time extrapolation strategy.
  9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the emotion talking head video generation method of any one of claims 1 to 7 when executing the computer program.
  10. A computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the emotion talking head video generation method of any one of claims 1 to 7.
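The sliding-window extrapolation of claim 6 can be illustrated with a toy numpy sketch. The frame counts `N1`/`N2`, the stand-in generator, and the linear cross-fade used for the "smooth splicing" are all assumptions for illustration; the patent does not specify the blending function.

```python
# Toy reading of claim 6: generate an initial N1-frame segment, then slide
# forward N2 frames at a time, conditioning on the last N1-N2 frames and
# cross-fading the N1-N2 overlapping frames.
import numpy as np

N1, N2 = 16, 8                      # first/second preset frame numbers, N2 < N1
CTX = N1 - N2                       # context / overlap length

def generate_segment(context, length):
    """Hypothetical generator: emits `length` frames, conditioned on `context`."""
    seed = 0 if context is None else int(context.sum()) % 2**32
    return np.random.default_rng(seed).random((length, 64, 64, 3))

def extrapolate(total_frames):
    video = generate_segment(None, N1)                 # initial N1-frame segment
    while video.shape[0] < total_frames:
        context = video[-CTX:]                         # last N1-N2 frames as context
        segment = generate_segment(context, N1)        # N1 frames; first CTX overlap
        w = np.linspace(0, 1, CTX)[:, None, None, None]
        blended = (1 - w) * video[-CTX:] + w * segment[:CTX]  # smooth splice
        video = np.concatenate([video[:-CTX], blended, segment[CTX:]])
    return video[:total_frames]

print(extrapolate(64).shape)  # (64, 64, 64, 3)
```

The linear ramp is only one plausible choice for the splice; any monotone cross-fade over the overlap would serve the same purpose of hiding the segment boundary.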

Description

Emotion talking head video generation method, device, equipment and medium

Technical Field

The invention relates to the technical field of artificial intelligence, is applicable to the fields of financial technology and medical health, and in particular relates to a method, device, equipment and medium for generating emotion talking head video.

Background

In recent years, audio-driven talking head generation technology has made remarkable progress in multimedia applications and is widely used in film production, the metaverse, financial technology (e.g., intelligent emotional interaction by a virtual client manager), medical health (e.g., remote mental health coaching and rehabilitation training), and other fields. However, conventional methods still have obvious limitations in emotion expression. On the one hand, most of them can only generate neutral lip movement and neglect the importance of emotional interaction, so the generated facial expression depends heavily on the inherent emotional state of the input image and lacks dynamic change. On the other hand, although some emotion-aware methods introduce emotion information, they still depend on external emotion reference videos or discrete emotion labels, which causes input bias and emotion intensity saturation; as a result, the emotion expressed in the generated video is inconsistent with the emotional rhythm contained in the audio, or the expression is monotonous and lacks natural variation. In addition, because current datasets are deficient in emotion type and intensity labeling, models struggle to learn fine-grained and diversified emotion expression, which further restricts the naturalness and expressiveness of the generated video.

Disclosure of the Invention

The invention provides a method, a device, computer equipment and a medium for generating emotion talking head video, which are used to solve the technical problem of low emotion expression accuracy in existing talking head video generation.

In a first aspect, a method for generating emotion talking head video is provided, including: acquiring driving audio and an identity image, and encoding the driving audio with a pre-trained audio encoder to obtain an audio emotion vector; processing the identity image and the audio emotion vector based on an emotion face representation model to obtain facial identity features, expression features and an emotion intensity value; generating a talking head video segment through a pre-trained generation model according to the facial identity features, the expression features and the emotion intensity value; and generating an emotion talking head video from the talking head video segment by adopting a time extrapolation strategy.
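Claims 3 and 4 elaborate the emotion face representation stage of this first aspect. Below is a hedged PyTorch sketch of one plausible reading: a "neutral" encoder strips the expression component from a mixed image embedding, while a GRU decodes per-frame audio emotion vectors into expression features and a scalar intensity under 2D continuous supervision (read here as valence-arousal). All module names, layer sizes, and the intensity-as-norm interpretation are assumptions, not the patent's architecture.

```python
# Hypothetical sketch of the emotion face representation model:
# neutral encoder (claims 3-4) + recurrent emotion decoder (claim 3).
import torch
import torch.nn as nn

class NeutralEncoder(nn.Module):
    """Maps a mixed identity+expression embedding to emotion-free identity."""
    def __init__(self, dim=128):
        super().__init__()
        self.strip = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, mixed):
        return self.strip(mixed)          # trained to drop the expression component

class EmotionDecoder(nn.Module):
    """GRU over audio emotion vectors -> expression features + intensity."""
    def __init__(self, dim=128, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(dim, hidden, batch_first=True)
        self.expr = nn.Linear(hidden, 64)     # per-frame expression features
        self.va = nn.Linear(hidden, 2)        # 2D head for continuous emotion labels

    def forward(self, emo_seq):
        h, _ = self.rnn(emo_seq)
        va = self.va(h)                       # supervised by 2D continuous labels
        intensity = va.norm(dim=-1)           # one reading of "emotion intensity"
        return self.expr(h), intensity, va

mixed = torch.randn(1, 128)                   # image-encoder output (toy)
emo_seq = torch.randn(1, 25, 128)             # per-frame audio emotion vectors (toy)
identity = NeutralEncoder()(mixed)
expr, intensity, va = EmotionDecoder()(emo_seq)
print(identity.shape, expr.shape, intensity.shape)
```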
In a second aspect, an emotion talking head video generation device is provided, including: an acquisition and encoding unit, configured to acquire driving audio and an identity image, and encode the driving audio with a pre-trained audio encoder to obtain an audio emotion vector; a processing unit, configured to process the identity image and the audio emotion vector based on an emotion face representation model to obtain facial identity features, expression features and an emotion intensity value; a first generation unit, configured to generate a talking head video segment through a pre-trained generation model according to the facial identity features, the expression features and the emotion intensity value; and a second generation unit, configured to generate an emotion talking head video from the talking head video segment by adopting a time extrapolation strategy.

In a third aspect, a computer device is provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above emotion talking head video generation method when executing the computer program.

In a fourth aspect, a computer readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps of the above emotion talking head video generation method.

According to the scheme, driving audio and an identity image are obtained; the driving audio is encoded by a pre-trained audio encoder to obtain an audio emotion vector; the identity image and the audio emotion vector are processed based on an emotion face representation model to obtain facial identity features, expression features and an emotion intensity value; a talking head video segment is generated through a pre-trained generation model according to these features; and the emotion talking head video is generated from the talking head video segment by adopting a time extrapolation strategy.
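For the conditioning step of claim 5, a minimal numpy sketch follows: the facial identity features, expression features and emotion intensity value are fused into one conditional feature vector, which drives a stand-in frame generator, and the frames are stacked in order into a clip. The concatenation-based fusion and every name below are illustrative assumptions; the patent does not fix a particular fusion operator.

```python
# Toy reading of claim 5: fuse condition inputs, render frames under the
# control of the conditional feature vector, and combine them in sequence.
import numpy as np

def fuse(identity, expression, intensity):
    """Conditional feature vector = [identity ; expression ; intensity]."""
    return np.concatenate([identity, expression, [intensity]])

def render_frame(cond, t):
    """Hypothetical generator step: condition vector + frame index -> image."""
    rng = np.random.default_rng(t)
    return rng.random((64, 64, 3)) * np.tanh(np.abs(cond).mean())

identity = np.random.default_rng(0).random(128)   # facial identity features (toy)
expression = np.random.default_rng(1).random(64)  # expression features (toy)
cond = fuse(identity, expression, intensity=0.7)  # scalar emotion intensity
clip = np.stack([render_frame(cond, t) for t in range(25)])  # frames in order
print(cond.shape, clip.shape)  # (193,) (25, 64, 64, 3)
```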