CN-122002095-A - Video generation method, device, equipment, medium and product
Abstract
The invention discloses a video generation method, device, equipment, medium, and product. The method comprises: acquiring a driving voice and a reference image, wherein the reference image comprises a target object; inputting the reference image and the driving voice into a target network to obtain a target mouth motion sequence corresponding to the target object; and inputting the target mouth motion sequence and the reference image into a denoising network to generate a target video corresponding to the target object. This technical scheme addresses the problems of mouth shapes inconsistent with the audio content and inaccurate generation results in existing voice-driven digital human video generation schemes. The method is not only suitable for the traditional digital media field but can also be extended to virtual reality, augmented reality, game development, film production, and other fields, providing users with a richer experience.
Inventors
- Di Donglin
- Liu Huaize
- Li Hao
- Chen Wei
Assignees
- 北京罗克维尔斯科技有限公司
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2024-11-07
Claims (12)
- 1. A video generation method, comprising: acquiring a driving voice and a reference image, wherein the reference image comprises a target object; inputting the reference image and the driving voice into a target network to obtain a target mouth motion sequence corresponding to the target object; and inputting the target mouth motion sequence and the reference image into a denoising network to generate a target video corresponding to the target object.
- 2. The method of claim 1, wherein inputting the target mouth motion sequence and the reference image into the denoising network to generate the target video corresponding to the target object comprises: inputting the reference image into a reference network to obtain target visual features and target semantic information corresponding to the target person in the reference image; and inputting the target mouth motion sequence, the target visual features, and the target semantic information into the denoising network to generate the target video corresponding to the target object.
- 3. The method of claim 2, wherein inputting the target mouth motion sequence, the target visual features, and the target semantic information into the denoising network to generate the target video corresponding to the target object comprises: acquiring a noise image sequence; splicing the target mouth motion sequence and the noise image sequence to obtain a target noise image sequence; and inputting the target noise image sequence, the target visual features, and the target semantic information into the denoising network to generate the target video corresponding to the target object.
- 4. The method of claim 3, wherein acquiring the noise image sequence comprises: acquiring a target mean and a target standard deviation; and generating a noise image sequence of the same length as the target mouth motion sequence according to the target mean and the target standard deviation. Correspondingly, splicing the target mouth motion sequence and the noise image sequence to obtain the target noise image sequence comprises: splicing each frame of the target mouth motion sequence with the same-index frame of the noise image sequence along the feature dimension to obtain the target noise image sequence.
- 5. The method of claim 3, wherein the denoising network comprises a first spatial attention layer, and inputting the target noise image sequence, the target visual features, and the target semantic information into the denoising network to generate the target video corresponding to the target object comprises: inputting the target visual features and the target semantic information into the denoising network to generate a plurality of video frames; and merging each video frame with the corresponding sequence frame in the target noise image sequence along the spatial dimension to generate the target video corresponding to the target object.
- 6. The method of claim 5, wherein the denoising network further comprises a temporal attention layer, and merging each video frame with the corresponding sequence frame in the target noise image sequence along the spatial dimension to generate the target video corresponding to the target object comprises: merging each video frame with the corresponding sequence frame in the target noise image sequence along both the spatial dimension and the temporal dimension to generate the target video corresponding to the target object.
- 7. The method of claim 2, wherein the reference network comprises a second spatial attention layer and a cross-attention layer, and inputting the reference image into the reference network to obtain the target visual features and the target semantic information corresponding to the target person in the reference image comprises: inputting the reference image into a visual feature extraction encoder to obtain initial visual features corresponding to the target person in the reference image; inputting the initial visual features into the second spatial attention layer of the reference network and performing a spatial attention operation to obtain the target visual features; inputting the reference image into a facial feature extraction encoder to obtain initial semantic information corresponding to the target person in the reference image; and inputting the initial semantic information into the cross-attention layer of the reference network and performing a cross-attention operation to obtain the target semantic information.
- 8. The method of claim 1, wherein inputting the reference image and the driving voice into the target network to obtain the target mouth motion sequence corresponding to the target object comprises: performing feature extraction on the reference image to obtain mouth keypoint information corresponding to the target object; screening mouth movement position information corresponding to the target object from the mouth keypoint information; determining an initial mouth motion sequence corresponding to the target object according to the mouth movement position information; performing feature extraction on the driving voice to obtain an audio feature sequence; and inputting the initial mouth motion sequence and the audio feature sequence into the target network to perform a diffusion operation and a denoising operation, thereby obtaining the target mouth motion sequence corresponding to the target object.
- 9. A video generation apparatus, comprising: an acquisition module configured to acquire a driving voice and a reference image, wherein the reference image comprises a target object; a first input module configured to input the reference image and the driving voice into a target network to obtain a target mouth motion sequence corresponding to the target object; and a second input module configured to input the target mouth motion sequence and the reference image into a denoising network to generate a target video corresponding to the target object.
- 10. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the video generation method of any one of claims 1-8.
- 11. A computer-readable storage medium storing computer instructions which, when executed, cause a processor to perform the video generation method of any one of claims 1-8.
- 12. A computer program product comprising a computer program which, when executed by a processor, implements the video generation method of any one of claims 1-8.
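Claims 3 and 4 describe sampling a noise image sequence from a target mean and target standard deviation, then splicing it frame-by-frame with the mouth motion sequence along the feature dimension. A minimal NumPy sketch of that splicing step, assuming (T, H, W, C) frame tensors; the function and argument names are illustrative, not from the patent:

```python
import numpy as np

def splice_with_noise(mouth_seq, mean=0.0, std=1.0, seed=0):
    """Concatenate each mouth-motion frame with a same-index Gaussian
    noise frame along the feature (channel) dimension (claims 3-4 sketch)."""
    rng = np.random.default_rng(seed)
    # One noise frame per motion frame, sampled from N(mean, std^2),
    # so the noise sequence has the same length as the motion sequence.
    noise_seq = rng.normal(loc=mean, scale=std, size=mouth_seq.shape)
    # Per-frame splice along the channel axis: (T, H, W, C) -> (T, H, W, 2C).
    return np.concatenate([mouth_seq, noise_seq], axis=-1)

seq = np.zeros((8, 64, 64, 3), dtype=np.float32)  # 8 dummy mouth-motion frames
spliced = splice_with_noise(seq)
print(spliced.shape)  # (8, 64, 64, 6)
```

The result doubles the channel count while keeping the sequence length, which is what lets the denoising network consume motion conditioning and noise in a single per-frame tensor.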
Description
Technical Field
The embodiments of the invention relate to the technical field of computer vision, and in particular to a video generation method, device, equipment, medium, and product.
Background
Voice-driven digital human video generation is defined as follows: given a piece of driving data (audio, video, etc.) and an image or video containing a target person, the information contained in these inputs is used to synthesize a video in which the target person naturally expresses the driving data and conveys richer information. Although the prior art has made some progress in digital human speaking video, many problems remain, which limits the wide application of digital human video generation technology in fields such as virtual reality, game development, and social media.
Disclosure of the Invention
The embodiments of the invention provide a video generation method, device, equipment, medium, and product, which can solve the problems of mouth shapes inconsistent with the audio content and inaccurate generation results in existing voice-driven digital human video generation schemes. According to one aspect of the invention, there is provided a video generation method comprising: acquiring a driving voice and a reference image, wherein the reference image comprises a target object; inputting the reference image and the driving voice into a target network to obtain a target mouth motion sequence corresponding to the target object; and inputting the target mouth motion sequence and the reference image into a denoising network to generate a target video corresponding to the target object.
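The steps above can be sketched as a minimal data-flow skeleton, assuming the three networks exist as callables; every name and the stub networks below are illustrative assumptions, not from the patent:

```python
# Sketch of the claimed pipeline: target network -> reference network
# -> denoising network. Stand-in callables model the data flow only.
def generate_video(driving_voice, reference_image,
                   target_network, reference_network, denoising_network):
    # Step 1: reference image + driving voice -> mouth motion sequence.
    mouth_seq = target_network(reference_image, driving_voice)
    # Step 2: reference image -> visual features and semantic information.
    visual_feats, semantic_info = reference_network(reference_image)
    # Step 3: denoising network renders the final target video.
    return denoising_network(mouth_seq, visual_feats, semantic_info)

# Stub networks showing the end-to-end flow (one video frame per motion frame).
target_net = lambda img, voice: ["mouth_frame"] * 3
reference_net = lambda img: ("visual_feats", "semantic_info")
denoise_net = lambda seq, feats, sem: [f"video_frame[{i}]" for i in range(len(seq))]

video = generate_video("voice.wav", "ref.png", target_net, reference_net, denoise_net)
print(len(video))  # 3
```

The key design point carried over from the claims is the split of responsibilities: motion is decided by the target network before rendering, so lip-sync and appearance are handled by separate modules.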
According to another aspect of the invention, there is provided a video generation apparatus comprising: an acquisition module configured to acquire a driving voice and a reference image, wherein the reference image comprises a target object; a first input module configured to input the reference image and the driving voice into a target network to obtain a target mouth motion sequence corresponding to the target object; and a second input module configured to input the target mouth motion sequence and the reference image into a denoising network to generate a target video corresponding to the target object. According to another aspect of the invention, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the video generation method of any one of the embodiments of the invention. According to another aspect of the invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to execute the video generation method of any one of the embodiments of the invention. According to another aspect of the invention, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the video generation method of any one of the embodiments of the invention.
According to the embodiments of the invention, a driving voice and a reference image containing a target object are acquired; the reference image and the driving voice are input into the target network for diffusion and denoising, which ensures that the mouth shape is consistent with the audio content and yields the target mouth motion sequence corresponding to the target object; the target mouth motion sequence and the reference image are then input into the denoising network, realizing continuity in video generation and finally producing the target video corresponding to the target object. This technical scheme solves the problems of mouth shapes inconsistent with the audio content and inaccurate generation results in existing voice-driven digital human video generation schemes; it is not only suitable for the traditional digital media field but can also be extended to virtual reality, augmented reality, game development, film production, and other fields, providing users with a richer experience. It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention; other features of the invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments will be briefly described below, it being understood that the following dra