CN-115984434-B - Method and device for generating emotion-expressive facial animation, and readable storage medium

CN 115984434 B

Abstract

The application provides a method and device for generating emotion-expressive facial animation, and a readable storage medium. The method comprises: acquiring speech input by a user; inputting the speech into a trained expression animation generation model to output PCA coefficients of a predicted three-dimensional facial expression animation, wherein the trained expression animation generation model is obtained by training the expression animation generation model with a speech sample set; projecting the predicted PCA coefficients of the expression animation into expression animation data of a three-dimensional face; and retargeting the projected expression animation data onto a target digital human.
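The projection step in this pipeline is standard linear-model reconstruction: predicted PCA coefficients are combined with a basis fit offline on registered 3D face meshes. The sketch below is a minimal illustration under that assumption; the names and array shapes (`pca_mean`, `pca_components`) are illustrative, not taken from the patent.

```python
import numpy as np

def pca_to_vertices(coeffs, pca_mean, pca_components):
    """Reconstruct 3D face mesh vertices from PCA coefficients.

    coeffs:         (T, K)  predicted PCA coefficients, one row per frame
    pca_mean:       (3V,)   mean face shape, flattened (x, y, z per vertex)
    pca_components: (K, 3V) principal components fit on registered face meshes
    Returns:        (T, V, 3) per-frame vertex positions
    """
    flat = pca_mean + coeffs @ pca_components  # linear PCA reconstruction
    return flat.reshape(coeffs.shape[0], -1, 3)
```

Retargeting then maps the reconstructed motion (or, per claim 8, the blendshape coefficients derived from the PCA coefficients) onto the target digital human's rig.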

Inventors

  • LIU YIYING
  • LI RONG
  • LI MENGJIAN

Assignees

  • 之江实验室 (Zhejiang Lab)

Dates

Publication Date
2026-05-12
Application Date
2023-01-03

Claims (10)

  1. A method for generating emotion-expressive facial animation, characterized by comprising the following steps: acquiring speech input by a user; inputting the speech into a trained expression animation generation model to output PCA coefficients of a predicted three-dimensional facial expression animation, wherein the trained expression animation generation model is obtained by training the expression animation generation model with a speech sample set, the expression animation generation model comprises an encoder and a decoder, the encoder encodes speech features into a latent representation that is input to the decoder, the decoder outputs the PCA coefficients of the three-dimensional facial expression animation, and the decoder comprises four 1-dimensional temporal convolution layers and three fully connected layers for learning the mapping from speech features to PCA coefficients of the expression animation, so as to obtain the PCA coefficients of the predicted three-dimensional facial expression animation; projecting the predicted PCA coefficients of the expression animation into expression animation data of a three-dimensional face; and retargeting the projected expression animation data onto a target digital human; wherein the trained expression animation generation model is obtained by the following steps: determining a mean square error or root mean square error between the distances of the upper and lower lip key point pairs of a real three-dimensional face mesh model and the distances of the upper and lower lip key point pairs of a predicted three-dimensional face mesh model as a lip-closure loss function; linearly weighting a vertex distance loss function and the lip-closure loss function to obtain a first total loss function; and adjusting the training parameters of the expression animation generation model according to the first total loss function until a preset ending condition is met, so as to obtain the trained expression animation generation model, wherein the upper and lower lip key point pairs refer to 5 key points on the inner ring of the upper lip and the 5 corresponding key points on the inner ring of the lower lip (a sketch of this decoder and the lip-closure loss follows the claims).
  2. The method for generating emotion-expressive facial animation according to claim 1, characterized in that the trained expression animation generation model is obtained by the following steps: acquiring a speech sample set, wherein the speech sample set comprises vertex coordinates of a real three-dimensional face mesh model, and the vertex coordinates of the real three-dimensional face mesh model comprise feature key points of the real three-dimensional face mesh model; inputting the speech sample set into the expression animation generation model to predict PCA coefficients of the three-dimensional facial expressions corresponding to the speech sample set; converting the predicted PCA coefficients into vertex coordinates of a predicted three-dimensional face mesh model; comparing the degree of difference between the vertex coordinates of the real three-dimensional face mesh model and the vertex coordinates of the predicted three-dimensional face mesh model to obtain a vertex distance loss function; and adjusting the training parameters of the expression animation generation model according to the vertex distance loss function until a preset ending condition is met, so as to obtain the trained expression animation generation model.
  3. The method for generating emotion-expressive facial animation according to claim 2, wherein the vertex coordinates of the predicted three-dimensional face mesh model comprise feature key points of the predicted three-dimensional face mesh model; before the training parameters of the expression animation generation model are adjusted according to the vertex distance loss function until the preset ending condition is met to obtain the trained expression animation generation model, the method further comprises: comparing the degree of difference between the feature key points of the real three-dimensional face mesh model and the feature key points of the predicted three-dimensional face mesh model to obtain a feature-related loss function; and obtaining a total loss function according to the vertex distance loss function and the feature-related loss function; and the adjusting of the training parameters of the expression animation generation model according to the vertex distance loss function until the preset ending condition is met to obtain the trained expression animation generation model comprises: adjusting the training parameters of the expression animation generation model according to the total loss function until the preset ending condition is met, so as to obtain the trained expression animation generation model.
  4. The method for generating emotion-expressive facial animation according to claim 3, wherein the feature key points of the predicted three-dimensional face mesh model comprise vertex displacements between adjacent frames of the predicted three-dimensional face mesh model; the comparing of the degree of difference between the feature key points of the real three-dimensional face mesh model and the feature key points of the predicted three-dimensional face mesh model to obtain a feature-related loss function comprises: determining a mean square error or root mean square error between the vertex displacements of adjacent frames of the real three-dimensional face mesh model and the vertex displacements of adjacent frames of the predicted three-dimensional face mesh model as a temporal continuity loss function; the obtaining of a total loss function according to the vertex distance loss function and the feature-related loss function comprises: obtaining a second total loss function according to the vertex distance loss function and the temporal continuity loss function; and the adjusting of the training parameters of the expression animation generation model according to the total loss function until the preset ending condition is met comprises: adjusting the training parameters of the expression animation generation model according to the second total loss function until the preset ending condition is met, so as to obtain the trained expression animation generation model.
  5. The method for generating emotion-expressive facial animation according to any one of claims 2 to 4, wherein before the training parameters of the expression animation generation model are adjusted according to the vertex distance loss function until the preset ending condition is met, the method further comprises: predicting the emotion classification probability of the real expression animation and the emotion classification probability of the predicted expression animation respectively by using a trained expression classification model and the predicted PCA coefficients, wherein the trained expression classification model is obtained by training an expression classification model with real three-dimensional facial expression animation samples corresponding to the speech sample set, and the expression animation samples carry emotion labels; comparing the degree of difference between the emotion classification probability of the real expression animation and the emotion classification probability of the predicted expression animation to obtain an emotion consistency loss function; and obtaining a third total loss function according to the vertex distance loss function and the emotion consistency loss function; and the adjusting of the training parameters of the expression animation generation model according to the vertex distance loss function until the preset ending condition is met to obtain the trained expression animation generation model comprises: adjusting the training parameters of the expression animation generation model according to the third total loss function until the preset ending condition is met, so as to obtain the trained expression animation generation model.
  6. The method for generating emotion-expressive facial animation according to claim 5, wherein the predicting of the emotion classification probability of the real expression animation and the emotion classification probability of the predicted expression animation respectively by using the trained expression classification model and the predicted PCA coefficients comprises: inputting the predicted PCA coefficients into the trained expression classification model to output the emotion classification probability of the predicted expression animation; acquiring a real three-dimensional expression animation sample set corresponding to the speech sample set, wherein the real three-dimensional expression animation sample set comprises real three-dimensional facial expressions; projecting the real three-dimensional facial expressions into PCA coefficients of a real three-dimensional deformable face model; and inputting the PCA coefficients of the real three-dimensional deformable face model into the trained expression classification model to output the emotion classification probability of the real expression animation; and the obtaining of a third total loss function according to the vertex distance loss function and the emotion consistency loss function comprises: determining the mean square error or root mean square error between the emotion classification probability of the predicted expression animation and the emotion classification probability of the real expression animation as the emotion consistency loss function.
  7. The method for generating emotion-expressive facial animation according to claim 5, characterized in that the trained expression classification model is obtained in the following manner: acquiring a real three-dimensional expression animation sample set corresponding to the speech sample set, wherein the real three-dimensional expression animation sample set comprises real three-dimensional facial expressions; projecting the real three-dimensional facial expressions into PCA coefficients of a real three-dimensional deformable face model; inputting the PCA coefficients of the real three-dimensional deformable face model into the expression classification model to output the emotion classification probability of the real expression animation; determining a current loss function from the emotion labels and the output emotion classification probability by using cross entropy; and adjusting the training parameters of the expression classification model according to the current loss function until a preset ending condition is met, so as to obtain the trained expression classification model.
  8. The method for generating emotion-expressive facial animation according to any one of claims 1 to 4, wherein the projecting of the predicted PCA coefficients of the three-dimensional facial expression into three-dimensional facial expression animation data comprises: converting the predicted PCA coefficients of the three-dimensional facial expression into blendshape-coefficient expression bases according to a pre-constructed correspondence between PCA coefficients of three-dimensional facial expressions and blendshape-coefficient expression bases, wherein the expression bases are used to reflect facial motion, and each expression base of the three-dimensional deformable face model corresponds to a group of blendshape coefficients; and multiplying the PCA coefficients of the expression bases by the PCA coefficients predicted by the expression animation generation model and summing, to obtain the coefficients of the expression bases; and the retargeting of the projected expression animation data onto the target digital human comprises: retargeting the coefficients of the expression bases onto the target digital human.
  9. A device for generating emotion-expressive facial animation, comprising: a speech acquisition module, configured to acquire speech input by a user; a prediction module, configured to input the speech into a trained expression animation generation model to output PCA coefficients of a predicted three-dimensional facial expression animation, wherein the trained expression animation generation model is obtained by training the expression animation generation model with a speech sample set, the expression animation generation model comprises an encoder and a decoder, the encoder encodes speech features into a latent representation that is input to the decoder, the decoder outputs the PCA coefficients of the three-dimensional facial expression animation, and the decoder comprises four 1-dimensional temporal convolution layers and three fully connected layers for learning the mapping from speech features to PCA coefficients of the expression animation, so as to obtain the PCA coefficients of the predicted three-dimensional facial expression animation; an expression animation data projection module, configured to project the predicted PCA coefficients of the expression animation into expression animation data of a three-dimensional face; and a retargeting module, configured to retarget the projected expression animation data onto a target digital human; wherein the trained expression animation generation model is obtained by the following steps: determining a mean square error or root mean square error between the distances of the upper and lower lip key point pairs of a real three-dimensional face mesh model and the distances of the upper and lower lip key point pairs of a predicted three-dimensional face mesh model as a lip-closure loss function; linearly weighting a vertex distance loss function and the lip-closure loss function to obtain a first total loss function; and adjusting the training parameters of the expression animation generation model according to the first total loss function until a preset ending condition is met, so as to obtain the trained expression animation generation model, wherein the upper and lower lip key point pairs refer to 5 key points on the inner ring of the upper lip and the 5 corresponding key points on the inner ring of the lower lip.
  10. A computer-readable storage medium having stored thereon a program which, when executed by a processor, implements the method for generating emotion-expressive facial animation according to any one of claims 1 to 8.
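Claims 1 and 9 pin down the decoder topology (four 1-D temporal convolution layers followed by three fully connected layers) and the lip-closure loss (MSE or RMSE over the distances of the 5 inner-upper-lip/inner-lower-lip key point pairs). The PyTorch sketch below is one plausible reading of those claims; the channel widths, kernel sizes, latent dimension, and key point indices are assumptions, since the claims do not specify them.

```python
import torch
import torch.nn as nn

class ExpressionDecoder(nn.Module):
    """Decoder per claims 1/9: four 1-D temporal conv layers + three FC layers,
    mapping a latent speech representation to per-frame PCA coefficients.
    Channel widths, kernel size, and dimensions are illustrative guesses."""
    def __init__(self, latent_dim=128, num_pca=64):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(latent_dim, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.fcs = nn.Sequential(
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, num_pca),
        )

    def forward(self, latent):                  # latent: (B, T, latent_dim)
        h = self.convs(latent.transpose(1, 2))  # convolve over time: (B, C, T)
        return self.fcs(h.transpose(1, 2))      # (B, T, num_pca)

def lip_closure_loss(pred_verts, real_verts, upper_idx, lower_idx):
    """MSE between the 5 upper/lower inner-lip key-point-pair distances of the
    predicted and real meshes. Vertex tensors: (B, T, V, 3); the index lists
    (5 entries each) are hypothetical and depend on the mesh topology."""
    d_pred = (pred_verts[:, :, upper_idx] - pred_verts[:, :, lower_idx]).norm(dim=-1)
    d_real = (real_verts[:, :, upper_idx] - real_verts[:, :, lower_idx]).norm(dim=-1)
    return ((d_pred - d_real) ** 2).mean()
```

The first total loss of claim 1 would then be a linear weighting of this term with the vertex distance loss, e.g. `loss = w_v * vertex_loss + w_lip * lip_loss`, with the weights left unspecified by the claims.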

Description

Method and device for generating emotion-expressive facial animation, and readable storage medium

Technical Field

The invention relates to the technical field of intelligent digital humans, and in particular to a method and device for generating emotion-expressive facial animation and a readable storage medium.

Background

With the expanding application of artificial intelligence technologies such as natural language processing, speech recognition, and computer vision, virtual digital human technology is developing in more intelligent and diversified directions. Early digital humans were mainly used in entertainment, such as the film, animation, and game industries; today, digital humans are successfully applied across industries including banking, medical care, education, government affairs, and communications. Emotion expression and interactive communication capabilities are the basis for interaction between a digital human and the real world. However, the facial animation of digital humans in the related art is stiff and unnatural, which degrades the user experience during interaction.

Disclosure of the Invention

The application provides a method and device for generating emotion-expressive facial animation and a readable storage medium, with which the facial animation of a digital human is natural and the user experience during interaction is improved. The application provides a method for generating emotion-expressive facial animation, comprising: acquiring speech input by a user; inputting the speech into a trained expression animation generation model to output PCA coefficients of a predicted three-dimensional facial expression animation, wherein the trained expression animation generation model is obtained by training the expression animation generation model with a speech sample set; projecting the predicted PCA coefficients of the expression animation into expression animation data of a three-dimensional face; and retargeting the projected expression animation data onto a target digital human.

Further, the trained expression animation generation model is obtained by the following steps: acquiring a speech sample set, wherein the speech sample set comprises vertex coordinates of a real three-dimensional face mesh model, and the vertex coordinates of the real three-dimensional face mesh model comprise feature key points of the real three-dimensional face mesh model; inputting the speech sample set into the expression animation generation model to predict PCA coefficients of the three-dimensional facial expressions corresponding to the speech sample set; converting the predicted PCA coefficients into vertex coordinates of a predicted three-dimensional face mesh model; comparing the degree of difference between the vertex coordinates of the real three-dimensional face mesh model and the vertex coordinates of the predicted three-dimensional face mesh model to obtain a vertex distance loss function; and adjusting the training parameters of the expression animation generation model according to the vertex distance loss function until a preset ending condition is met, so as to obtain the trained expression animation generation model (a minimal sketch of this training loop follows).
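The training procedure described above reduces to a standard supervised loop: predict PCA coefficients from speech, reconstruct the predicted mesh vertices, and regress them against the ground-truth vertices. The sketch below is a minimal illustration under those assumptions; `model`, the data layout, the optimizer, and the helper names are illustrative, not from the patent.

```python
import torch

def vertex_distance_loss(pred_verts, real_verts):
    """Mean squared distance between predicted and real mesh vertices, i.e.
    the degree of difference between the two sets of vertex coordinates."""
    return ((pred_verts - real_verts) ** 2).sum(dim=-1).mean()

def train_epoch(model, loader, pca_mean, pca_components, optimizer):
    """One epoch of the loop sketched in the description: speech features in,
    PCA coefficients out, supervised by real 3D face mesh vertices."""
    for speech_feats, real_verts in loader:           # real_verts: (B, T, V, 3)
        coeffs = model(speech_feats)                  # (B, T, K) PCA coefficients
        flat = pca_mean + coeffs @ pca_components     # linear PCA reconstruction
        pred_verts = flat.reshape(*coeffs.shape[:2], -1, 3)
        loss = vertex_distance_loss(pred_verts, real_verts)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Training stops once the preset ending condition (e.g. a loss threshold or an iteration budget) is met; the additional losses of the claims (lip closure, temporal continuity, emotion consistency) would be linearly weighted into `loss` at the same point.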
Further, the vertex coordinates of the predicted three-dimensional face mesh model comprise feature key points of the predicted three-dimensional face mesh model; the method further comprises: comparing the degree of difference between the feature key points of the real three-dimensional face mesh model and the feature key points of the predicted three-dimensional face mesh model to obtain a feature-related loss function; obtaining a total loss function according to the vertex distance loss function and the feature-related loss function; and the adjusting of the training parameters of the expression animation generation model according to the vertex distance loss function until the preset ending condition is met to obtain the trained expression animation generation model comprises: adjusting the training parameters of the expression animation generation model according to the total loss function until the preset ending condition is met, so as to obtain the trained expression animation generation model. Further, the feature key points of the predicted three-dimensional face mesh model comprise upper and lower lip key point pairs of the predicted three-dimensional face mesh model; the method comprises determining a mean square error or root mean square error between the distances of the upper and lower lip key point pairs of the real three-dimensional face mesh model and the distances o