CN-122001997-A - Video frame interpolation method, video generation method, storage medium and electronic device
Abstract
The invention provides a video frame interpolation method, a video generation method, a storage medium, and an electronic device in the field of artificial intelligence. In the method, an interpolation network based on temporal self-attention and multi-scale convolution generates a modulated residual between a starting motion latent vector and an ending motion latent vector, and the modulated residual is added to a linear interpolation vector to form continuous, smooth intermediate motion latent vectors. This enables seamless joining of video motion latent vector sequences driven by multiple audio segments, effectively solving the problems of unnatural interpolation, abrupt transitions, and insufficient real-time performance when a digital-human video switches between states, and improving the continuity and real-time interactive experience of video generation.
Inventors
- LIU JINSONG
- SHI YANG
- ZHENG RUIFENG
Assignees
- 成都灵动毕方科技有限公司 (Chengdu Lingdong Bifang Technology Co., Ltd.)
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-02-09
Claims (10)
- 1. A video frame interpolation method, comprising: obtaining a starting motion latent vector and an ending motion latent vector; performing linear interpolation between the starting motion latent vector and the ending motion latent vector to obtain a linear interpolation vector corresponding to each of at least one interpolation time point; generating a conditional feature sequence from the starting motion latent vector, the ending motion latent vector, and a temporal embedding vector for each interpolation time point; inputting the conditional feature sequence into an interpolation network, so that the interpolation network processes the conditional feature sequence using temporal self-attention and multi-scale one-dimensional convolution and outputs a residual vector corresponding to each interpolation time point; and adding the linear interpolation vector corresponding to each interpolation time point to the modulated corresponding residual vector to generate at least one intermediate motion latent vector lying between the starting motion latent vector and the ending motion latent vector.
- 2. The video frame interpolation method of claim 1, wherein generating the conditional feature sequence from the starting motion latent vector, the ending motion latent vector, and the temporal embedding vector for each interpolation time point comprises: determining at least one interpolation time point within a designated interpolation interval; obtaining the temporal embedding vector corresponding to each interpolation time point through a positional encoding function; concatenating the starting motion latent vector, the ending motion latent vector, and the temporal embedding vector of each interpolation time point to obtain a conditional feature for that interpolation time point; and assembling the conditional features of all interpolation time points into the conditional feature sequence.
- 3. The video frame interpolation method of claim 2, wherein, before adding the linear interpolation vector corresponding to each interpolation time point to the modulated corresponding residual vector, the method further comprises: multiplying the residual vector corresponding to each interpolation time point by the value of a boundary constraint shape function at that interpolation time point to obtain the modulated residual vector, wherein the boundary constraint shape function evaluates to zero at the start point and the end point of the interpolation interval.
- 4. The video frame interpolation method of claim 3, wherein the boundary constraint shape function is , where is the interpolation time point.
- 5. The video frame interpolation method of claim 1, wherein the training process of the interpolation network comprises: obtaining a sequence of continuous motion latent vectors generated by a flow model; selecting from the continuous motion latent vector sequence the motion latent vector corresponding to a start frame as a starting motion latent vector sample, and the motion latent vector corresponding to an end frame as an ending motion latent vector sample; selecting from the continuous motion latent vector sequence the motion latent vectors of at least one intermediate frame between the start frame and the end frame as supervision samples; and training a network comprising a temporal self-attention module and a multi-scale one-dimensional convolution module, with the starting and ending motion latent vector samples as inputs and the supervision samples as ground-truth labels, to obtain the trained interpolation network.
- 6. The video frame interpolation method of claim 5, wherein the loss function used in training the interpolation network comprises at least one of a reconstruction loss, a first-order smoothness loss, a second-order smoothness loss, and a residual energy constraint loss.
- 7. A video generation method, comprising: obtaining a character prototype reference image and at least two audio segments; generating an identity code characterizing identity features based on the character prototype reference image; generating, through a flow model, a first motion latent vector sequence corresponding to a first audio segment based on the identity code and the first audio segment; generating, through the flow model, a second motion latent vector sequence corresponding to a second audio segment based on the identity code and the second audio segment; performing, by the video frame interpolation method of any one of claims 1 to 6, interpolation with the last-frame motion latent vector of the first motion latent vector sequence as the starting motion latent vector and the first-frame motion latent vector of the second motion latent vector sequence as the ending motion latent vector, to generate a transitional motion latent vector sequence comprising a plurality of intermediate motion latent vectors; inserting the transitional motion latent vector sequence between the first motion latent vector sequence and the second motion latent vector sequence to obtain a continuous target motion latent vector sequence; and synthesizing a continuous video stream based on the target motion latent vector sequence and the identity code.
- 8. The video generation method of claim 7, wherein, after obtaining the character prototype reference image and the at least two audio segments, the method further comprises: generating, through the flow model, a motion latent vector sequence in an idle state based on the character prototype reference image and blank audio; and wherein generating, through the flow model, the first motion latent vector sequence corresponding to the first audio segment based on the identity code and the first audio segment comprises: when the first audio segment is received, determining an interpolation starting point from the last-frame motion latent vector of the idle-state motion latent vector sequence; performing, by the video frame interpolation method of any one of claims 1 to 6, interpolation with the motion latent vector at the interpolation starting point as the starting motion latent vector and the first-frame motion latent vector generated from the first audio segment as the ending motion latent vector, to generate a transitional motion latent vector sequence from the idle state to a talking state; and generating the remaining motion latent vectors, other than the first frame, based on the first audio segment, and splicing them after the transitional motion latent vector sequence to form the first motion latent vector sequence.
- 9. A computer-readable storage medium having a program stored thereon, wherein the program, when executed by a processor, implements the video frame interpolation method of any one of claims 1 to 6 and/or the video generation method of any one of claims 7 to 8.
- 10. An electronic device comprising at least one processor, at least one memory, and a bus connected to the processor, wherein the processor and the memory communicate with each other via the bus, and wherein the processor is configured to invoke program instructions in the memory to perform the video frame interpolation method of any one of claims 1 to 6 and/or the video generation method of any one of claims 7 to 8.
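The core step of claims 1 and 3 — linear interpolation between two latents plus a boundary-modulated network residual — can be sketched as follows. This is a minimal illustration, not the patented implementation: the interpolation network is replaced by a random-weight placeholder, the embedding dimension and `sin(πt)` shape function are assumptions (the patent's exact shape function is not reproduced in this text), and only the data flow of the claim is shown.

```python
import numpy as np

def sinusoidal_time_embedding(t: float, dim: int = 16) -> np.ndarray:
    """Positional-encoding-style temporal embedding of a scalar t in [0, 1]."""
    freqs = 2.0 ** np.arange(dim // 2)
    angles = t * freqs * np.pi
    return np.concatenate([np.sin(angles), np.cos(angles)])

def interpolate_latents(z_start, z_end, times, residual_net):
    """Lerp between latents and add a boundary-modulated residual per time point."""
    frames = []
    for t in times:
        lerp = (1.0 - t) * z_start + t * z_end          # linear interpolation vector
        cond = np.concatenate([z_start, z_end, sinusoidal_time_embedding(t)])
        residual = residual_net(cond)                    # network-predicted residual
        shape = np.sin(np.pi * t)                        # assumed shape fn, zero at t=0 and t=1
        frames.append(lerp + shape * residual)
    return frames

# Usage with a placeholder "network": a single random linear map from the
# conditional feature back to the latent dimension.
rng = np.random.default_rng(0)
d = 8
W = rng.normal(scale=0.01, size=(d, 2 * d + 16))
net = lambda cond: W @ cond

z0, z1 = rng.normal(size=d), rng.normal(size=d)
mids = interpolate_latents(z0, z1, [0.25, 0.5, 0.75], net)
```

Because the shape function vanishes at the interval boundaries, the generated sequence joins the endpoints exactly, which is what makes the transition seamless regardless of what the residual network outputs.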
Description
Video frame interpolation method, video generation method, storage medium and electronic device

Technical Field

The present invention relates to the field of artificial intelligence, and in particular to a video frame interpolation method, a video generation method, a storage medium, and an electronic device.

Background

With the development of generative artificial intelligence, digital-human video dialogue systems driven by a single portrait are widely used in scenarios such as online customer service, virtual anchoring, and real-time assistants. Such a system generates digital-human video from an input reference picture and audio, achieving real-time interaction. However, the prior art has the following defects. First, it relies on an external frame interpolation model that is not specially optimized, so transitions are slow and unstable, harming real-time performance. Second, it uses an image-level cache, so each newly generated video requires the first frame to be re-encoded, reducing response speed. Third, it generally uses a serial generation pipeline in which all video frames must be generated before the transition is produced, increasing the delay before the user sees the first frame. Fourth, when switching between Idle and Talk states, no natural transition animation exists; the abrupt jump causes visual discontinuity that harms the continuity and realism of the video. Therefore, how to optimize frame interpolation to improve the fluency and interactive experience of digital-human video is a technical problem to be solved by those skilled in the art.
Disclosure of the Invention

In view of the above problems, the present invention provides a video frame interpolation method, a video generation method, a storage medium, and an electronic device that overcome, or at least partially solve, the above problems. The technical solutions are as follows.

A video frame interpolation method comprises: obtaining a starting motion latent vector and an ending motion latent vector; performing linear interpolation between the starting motion latent vector and the ending motion latent vector to obtain a linear interpolation vector corresponding to each of at least one interpolation time point; generating a conditional feature sequence from the starting motion latent vector, the ending motion latent vector, and a temporal embedding vector for each interpolation time point; inputting the conditional feature sequence into an interpolation network, so that the interpolation network processes the conditional feature sequence using temporal self-attention and multi-scale one-dimensional convolution and outputs a residual vector corresponding to each interpolation time point; and adding the linear interpolation vector corresponding to each interpolation time point to the modulated corresponding residual vector to generate at least one intermediate motion latent vector lying between the starting motion latent vector and the ending motion latent vector.
Optionally, generating the conditional feature sequence from the starting motion latent vector, the ending motion latent vector, and the temporal embedding vector for each interpolation time point includes: determining at least one interpolation time point within a designated interpolation interval; obtaining the temporal embedding vector corresponding to each interpolation time point through a positional encoding function; concatenating the starting motion latent vector, the ending motion latent vector, and the temporal embedding vector of each interpolation time point to obtain a conditional feature for that interpolation time point; and assembling the conditional features of all interpolation time points into the conditional feature sequence.

Optionally, before adding the linear interpolation vector corresponding to each interpolation time point to the modulated corresponding residual vector, the method further comprises: multiplying the residual vector corresponding to each interpolation time point by the value of a boundary constraint shape function at that interpolation time point to obtain the modulated residual vector, wherein the boundary constraint shape function evaluates to zero at the start point and the end point of the interpolation interval.

Optionally, the boundary constraint shape function is , where is the interpolation time point.

Optionally, the training process of the interpolation network includes: obtaining a sequence of continuous motion latent vectors generated by a flow model; selecting from the continuous motion latent vector sequence the motion latent vector corresponding to a start frame as a starting motion latent vector sample, and the motion latent vector corresponding to an end frame as an ending motion latent vector sample; and selecting the motion latent vectors of at least one intermediate frame between the start frame and the end frame as supervision samples.
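Claim 6 names four candidate training losses without giving formulas. As a hedged illustration, assuming standard squared-error forms (the exact definitions are not reproduced in this text), they could be combined as follows:

```python
import numpy as np

def training_losses(pred, target, residual):
    """Illustrative L2 forms of the four losses named in claim 6.

    pred, target: (T, D) arrays of predicted / ground-truth intermediate latents
    residual:     (T, D) array of network residuals before modulation
    """
    recon = np.mean((pred - target) ** 2)     # reconstruction loss
    vel = np.diff(pred, axis=0)               # first-order temporal differences
    smooth1 = np.mean(vel ** 2)               # first-order smoothness loss
    acc = np.diff(pred, n=2, axis=0)          # second-order temporal differences
    smooth2 = np.mean(acc ** 2)               # second-order smoothness loss
    energy = np.mean(residual ** 2)           # residual energy constraint loss
    return recon + smooth1 + smooth2 + energy
```

The smoothness terms penalize jerky latent trajectories, and the energy term keeps the learned residual small so the output stays close to the linear interpolation baseline.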