CN-115861461-B - Voice-driven face generation method based on deep learning
Abstract
The invention discloses a voice-driven face generation method based on deep learning, relating to the technical field of pattern recognition. It solves the technical problem that faces conforming to textual descriptions cannot be generated automatically and accurately. The key points of the technical scheme are: the system framework is improved specifically for the input type that drives face generation; a voice recognition module is innovatively added, so that the basic facial features are obtained directly through voice recognition and the face generation flow is simplified. Meanwhile, the similarity of the mapping relations among different face generation models in the StyleGAN model is fully exploited, so that a face of a given style can be generated in a targeted manner. The method has important practical significance in the field of criminal investigation.
Inventors
- LI PEILIN
- WU ZHUOJUN
- XIA SIYU
Assignees
- Southeast University
Dates
- Publication Date: 2026-05-12
- Application Date: 2022-09-27
Claims (7)
- 1. A voice-driven face generation method based on deep learning, characterized by comprising the following steps: S1, recognizing a voice signal and converting the voice signal into corresponding text information; S2, inverting the generator of a StyleGAN model to obtain an image encoder, and training the text encoder of a CLIP model so that the distance between a text vector $w_l$ and an image vector $w_v$ is minimized, to obtain a trained text encoder; connecting the image encoder and the trained text encoder with the synthesis network of the StyleGAN model to form a progressive generative adversarial network; S3, inputting the text information into the progressive generative adversarial network to obtain a latent code W corresponding to the text information; S4, inputting the latent code W into the generator of the StyleGAN model to generate a human face; wherein in step S2, inverting the generator of the StyleGAN model to obtain the image encoder comprises: introducing an additional image encoder $E_v$ and re-encoding a real image $x$ with the image encoder $E_v$, so that the real image $x$ is mapped near $Z_s$, expressed as: $Z_s \approx E_v(G(Z_s))$ (1), wherein $Z_s$ represents the vector $Z$ in s-space, $G(\cdot)$ represents the generator of the StyleGAN model, $G(Z_s)$ represents the generator of the StyleGAN model producing an image from the vector $Z_s$, and $E_v(G(Z_s))$ represents re-encoding the image $G(Z_s)$ with the image encoder $E_v$; and additionally adding text information of the real image $x$ and mapping the text information to the latent space w, so as to obtain semantic information of the real image $x$, expressed as: $L_{E_v} = \lVert x - G(E_v(x)) \rVert_2 + \lambda_1 \lVert F(x) - F(G(E_v(x))) \rVert_2 - \lambda_2\, \mathbb{E}[D(G(E_v(x)))]$ (2) and $L_D = \mathbb{E}[D(G(E_v(x)))] - \mathbb{E}[D(x)] + \tfrac{\lambda_3}{2}\, \mathbb{E}\big[\lVert \nabla_x D(x) \rVert_2^2\big]$ (3), wherein $F(\cdot)$ represents the feature extraction network VGG, $\lVert \cdot \rVert_2$ represents the L2 distance, $\mathbb{E}[\cdot]$ represents the expectation operator, $D(\cdot)$ represents an image discriminator, and $\lambda_1$, $\lambda_2$ and $\lambda_3$ all represent hyperparameters; the image encoder $E_v$ is trained through formulas (2) to (3) so that it can infer $Z_s$ from the real image $x$, thereby yielding the image encoder.
- 2. The method of claim 1, wherein the step S1 is implemented by a feature extraction network, an acoustic model, a language model, a dictionary, and decoding.
- 3. The method of claim 2, wherein the speech signal is pre-processed, including filtering and framing, prior to being fed into the feature extraction network for feature extraction.
- 4. The method of claim 1, wherein the latent code W is a vector of dimension (18, 512), and wherein the generator of the StyleGAN model establishes a mapping from the (18, 512)-dimensional vector to a (1024, 1024, 3)-dimensional image.
- 5. The method of claim 1, wherein the StyleGAN model is trained on a dataset of 10,000 Asian-style face images.
- 6. The method of claim 1, wherein in step S2, training the text encoder of the CLIP model so that the distance between the text vector $w_l$ and the image vector $w_v$ is minimized, thereby obtaining the trained text encoder, comprises: encoding the text and the image respectively through the CLIP model, wherein the text is passed through the text encoder of the CLIP model to obtain the vector $w_l$ of the latent space w, and the image is passed through the image encoder of the CLIP model to obtain the vector $w_v$ of the latent space w; and training the text encoder of the CLIP model so that the distance between $w_l$ and $w_v$ is minimized, the trained text encoder being obtained by equation (4), expressed as: $\min_{E_t} \sum_{i=1}^{18} \alpha_i \lVert w_l^{(i)} - w_v^{(i)} \rVert_2$ (4), wherein $E_t$ represents the text encoder and $\alpha_i$ represents the weight of the generator's i-th input layer (see the sketch following the claims).
- 7. The method according to claim 1, wherein in step S1, the voice signal is recognized based on the Baidu API, and the API is called through an API Key and a Secret Key.
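As a reading aid for the alignment objective of claim 6, the following is a minimal sketch of a per-layer weighted distance between the text latent $w_l$ and the image latent $w_v$. The weight vector `alpha`, the tensor shapes, and the exact form of the distance are assumptions for illustration; the claim does not fix them at code level.

```python
# Hedged sketch of the alignment objective behind equation (4) in claim 6:
# pull the text latent w_l toward the image latent w_v, weighting each of the
# generator's 18 input layers. `alpha` and both latents are illustrative.
import torch

def alignment_loss(w_l: torch.Tensor, w_v: torch.Tensor,
                   alpha: torch.Tensor) -> torch.Tensor:
    """w_l, w_v: (batch, 18, 512) latent codes; alpha: (18,) per-layer weights."""
    per_layer = torch.norm(w_l - w_v, dim=-1)       # (batch, 18) L2 distances
    return (alpha * per_layer).sum(dim=-1).mean()   # weighted sum over layers

# Usage: only the text encoder is optimized, so the image latent is fixed.
w_l = torch.randn(4, 18, 512, requires_grad=True)  # stands in for E_t(text)
w_v = torch.randn(4, 18, 512)                       # stands in for the image latent
loss = alignment_loss(w_l, w_v, torch.ones(18))
loss.backward()
```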
Description
Voice-driven face generation method based on deep learning

Technical Field

The application relates to the technical field of pattern recognition, in particular to a voice-driven face generation method based on deep learning.

Background

Voice-driven face generation technology arose from the urgent need for composite portraits, the drawing of which would be greatly aided if a computer could help render more accurate likenesses. A composite portrait pursues a characteristic likeness, ensuring that the eyes, nose, mouth and eyebrows closely resemble the subject. Beyond restoring the subject's looks, attention is also paid to capturing the subject's demeanor and the changes in facial features at different ages. Automatically generating an image containing the corresponding descriptive information from dictation therefore has important practical significance.

Disclosure of Invention

The application provides a voice-driven face generation method based on deep learning, which aims to automatically and accurately generate a face with the corresponding characteristics from dictated information. The technical aim of the application is achieved by the following technical scheme:

A voice-driven face generation method based on deep learning comprises the following steps: S1, recognizing a voice signal and converting the voice signal into corresponding text information; S2, inverting the generator of a StyleGAN model to obtain an image encoder, training the text encoder of a CLIP model so that the distance between a text vector $w_l$ and an image vector $w_v$ is minimized to obtain a trained text encoder, and connecting the image encoder and the trained text encoder with the synthesis network of the StyleGAN model to form a progressive generative adversarial network; S3, inputting the text information into the progressive generative adversarial network to obtain a latent code W corresponding to the text information; and S4, inputting the latent code W into the generator of the StyleGAN model to generate a human face.

The application has the advantages that it makes targeted improvements to the system framework for the input type that drives face generation, innovatively adds a voice recognition module, obtains the basic facial features directly through voice recognition, and simplifies the face generation process; it also fully exploits the similarity of the mapping relations between different face generation models in StyleGAN2, so that a face of a given style can be generated in a targeted way.

Drawings

FIG. 1 is a flow chart of the method of the present application; FIG. 2 is a face effect diagram generated in an embodiment.

Detailed Description

The technical scheme of the application will be described in detail with reference to the accompanying drawings. As shown in FIG. 1, the voice-driven face generation method based on deep learning according to the present application includes:

S1, recognizing the voice signal and converting it into corresponding text information. Specifically, the speech recognition process is realized by a feature extraction network, an acoustic model, a language model, a dictionary, and decoding.
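As a rough illustration of how steps S1 to S4 chain together, the following Python sketch wires a speech-recognition call into the text-to-latent-to-image path. Only the speech call follows a real SDK (the Baidu AIP `baidu-aip` package, corresponding to the specific embodiment mentioned below); `text_to_latent` and `stylegan_generate` are hypothetical placeholders for the trained progressive network and the StyleGAN generator, whose internals the text does not give at code level.

```python
# Sketch of the S1-S4 pipeline. `text_to_latent` and `stylegan_generate`
# are hypothetical placeholders; only the Baidu ASR call follows a real SDK.
from aip import AipSpeech  # pip install baidu-aip

APP_ID, API_KEY, SECRET_KEY = "...", "...", "..."  # Baidu console credentials

def recognize_speech(wav_path: str) -> str:
    """S1: convert the dictated description into text via the Baidu ASR API."""
    client = AipSpeech(APP_ID, API_KEY, SECRET_KEY)
    with open(wav_path, "rb") as f:
        resp = client.asr(f.read(), "wav", 16000, {"dev_pid": 1537})  # Mandarin
    return resp["result"][0]

def text_to_latent(text: str):
    """S2-S3: the trained progressive network maps the description to a
    latent code W of shape (18, 512). Placeholder stub."""
    raise NotImplementedError

def stylegan_generate(latent):
    """S4: the StyleGAN synthesis network maps W (18, 512) to a
    (1024, 1024, 3) face image. Placeholder stub."""
    raise NotImplementedError

if __name__ == "__main__":
    text = recognize_speech("description.wav")  # S1
    w = text_to_latent(text)                    # S2-S3
    face = stylegan_generate(w)                 # S4
```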
To extract features more effectively, the collected sound signal must also undergo audio preprocessing such as filtering and framing, so that the audio segments to be analyzed are properly extracted from the original signal. The audio to be processed should describe the basic facial features of the particular person, including but not limited to gender, age, nose shape, lip thickness, hair color and length, and whether glasses are worn. As a specific embodiment, the voice signal is recognized with the Baidu API, which is called through an API Key and a Secret Key.

S2, inverting the generator of the StyleGAN model to obtain an image encoder, training the text encoder of the CLIP model so that the distance between the text vector $w_l$ and the image vector $w_v$ is minimized to obtain a trained text encoder, and connecting the image encoder and the trained text encoder with the synthesis network of the StyleGAN model to form a progressive generative adversarial network.

Specifically, the idea of inversion is to introduce an additional image encoder $E_v$ and encode the real image $x$ again with $E_v$, so that the real image $x$ is mapped near $Z_s$, which is expressed as:

$Z_s \approx E_v(G(Z_s))$ (1)

where $Z_s$ represents the vector $Z$ in s-space, $G(\cdot)$ represents the generator of the StyleGAN model, $G(Z_s)$ represents the generator of the StyleGAN model producing an image from the vector $Z_s$, and $E_v(G(Z_s))$ represents re-encoding the image $G(Z_s)$ with the image encoder $E_v$. Text information of the real image $x$ is then additionally added and mapped to the latent space w, thereby obtaining semantic information of the real image $x$.
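The encoder training described by formulas (1) to (3) can be sketched as follows. The exact loss composition is not fully recoverable from the text, so this assumes an in-domain-inversion-style objective (pixel L2 term, VGG perceptual term, adversarial term) with hypothetical callables (`encoder`, `generator`, `discriminator`, `vgg_features`) and illustrative weights `lam1`, `lam2`.

```python
# Hedged sketch of the image-encoder objective (formulas (2)-(3)). The loss
# composition is an assumption modeled on in-domain GAN inversion; all four
# network callables are hypothetical stand-ins.
import torch
import torch.nn.functional as fn

def encoder_loss(x, encoder, generator, discriminator, vgg_features,
                 lam1: float = 5e-5, lam2: float = 0.1) -> torch.Tensor:
    """Training loss for the image encoder E_v on a batch of real images x."""
    z_hat = encoder(x)         # E_v(x): infer the latent code from the image
    x_rec = generator(z_hat)   # G(E_v(x)): reconstruct the image
    loss_pix = fn.mse_loss(x_rec, x)                              # pixel term
    loss_per = fn.mse_loss(vgg_features(x_rec), vgg_features(x))  # VGG term
    loss_adv = -discriminator(x_rec).mean()  # push reconstructions toward real
    return loss_pix + lam1 * loss_per + lam2 * loss_adv
```

In this reading, the discriminator is trained alternately with its own loss (formula (3)), while the encoder minimizes the combination above so that it learns to infer $Z_s$ from a real image $x$.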