CN-115222859-B - Image animation method, electronic device and computer program product
Abstract
According to an implementation of the present disclosure, a scheme for generating video from an image is presented. In this scheme, an input image and a reference video are acquired. Based on the reference video, a motion pattern of a reference object in the reference video is determined. An output video is generated with the input image as a starting frame, and the motion of a target object of the input image in the output video has the motion pattern of the reference object. In this way, the scheme can intuitively apply the motion pattern of the reference object in the reference video to the input image to generate the output video, in which the motion of the target object follows the motion pattern of the reference object.
Inventors
- LIU BEI
- YANG HUAN
- FU JIANLONG
Assignees
- Microsoft Technology Licensing, LLC (微软技术许可有限责任公司)
Dates
- Publication Date: 2026-05-05
- Application Date: 2021-04-16
Claims (20)
- 1. A computer-implemented method, comprising: acquiring an input image and a reference video, wherein the input image comprises a plurality of target objects and the reference video comprises a plurality of reference objects corresponding to the target objects; determining a motion pattern of the plurality of reference objects in the reference video based on the reference video; and generating an output video taking the input image as a starting frame, wherein the motion of a target object of the plurality of target objects in the output video has the motion pattern of the reference object corresponding to the target object; wherein determining the motion pattern of the reference objects comprises: determining a first reference motion feature based on a first frame and a subsequent second frame of the reference video, the first reference motion feature characterizing motion from the first frame to the second frame; determining a first set of reference objects in the first frame by generating a first set of semantic segmentation masks for the first frame, the first set of semantic segmentation masks indicating respective locations of the first set of reference objects in the first frame; and partially convolving the first reference motion feature with the first set of semantic segmentation masks to determine a motion pattern of the first set of reference objects (a sketch of this partial-convolution step follows the claims).
- 2. The method of claim 1, wherein generating the output video comprises: migrating the motion pattern of the reference object to the target object based on a semantic mapping of the reference object to the target object (a sketch of such a mapping follows the claims).
- 3. The method of claim 1, wherein generating the output video comprises at least one of: migrating the motion pattern of the reference object to the target object based on a predetermined rule indicating a mapping of the reference object to the target object; and migrating the motion pattern of an additional reference object of an additional reference video to an additional target object in the input image based on a mapping of the additional reference object to the additional target object.
- 4. The method of claim 1, wherein generating the output video comprises: determining at least one object in the input image by generating at least one semantic segmentation mask for the input image, the at least one object comprising the target object and the at least one semantic segmentation mask indicating a respective location of the at least one object in the input image; determining a combined motion pattern for the input image based on the respective motion patterns of at least one reference object in the first set of reference objects and on the at least one semantic segmentation mask, by determining a semantic mapping of the at least one reference object to the at least one object; generating a first predicted motion feature for the input image using a convolutional neural network based on the combined motion pattern and the input image; and performing an image transformation on the input image using the first predicted motion feature to generate a first output frame located after the starting frame in the output video (a sketch of this combine-predict-warp step follows the claims).
- 5. The method of claim 4, wherein generating the output video further comprises: determining a second reference motion feature based on the second frame and a subsequent third frame of the reference video, the second reference motion feature characterizing motion from the second frame to the third frame; generating a second set of semantic segmentation masks for the second frame, the second set of semantic segmentation masks indicating respective locations of the first set of reference objects in the second frame; partially convolving the second reference motion feature with the second set of semantic segmentation masks to determine a second motion pattern of the first set of reference objects; and generating a second output frame in the output video that follows the first output frame, wherein motion of the target object from the first output frame to the second output frame has the corresponding second motion pattern of the reference object in the first set of reference objects.
- 6. A computer-implemented method, comprising: acquiring a training video, wherein the training video comprises a first training frame and a subsequent second training frame; determining a respective motion pattern for each training object of a set of training objects in the first training frame based on the first training frame and the second training frame; generating a predicted video for the training video using a machine learning model based on the respective motion patterns, wherein motion of an object in the predicted video has the motion pattern of the object in the training video, and the predicted video comprises the first training frame and a subsequent predicted frame for the second training frame; and training the machine learning model based at least on the predicted frame and the second training frame; wherein determining the respective motion patterns of the set of training objects comprises: determining training motion features based on the first training frame and the second training frame, the training motion features characterizing motion from the first training frame to the second training frame; determining the set of training objects in the first training frame by generating a set of training semantic segmentation masks for the first training frame, the set of training semantic segmentation masks indicating respective positions of the set of training objects in the first training frame; performing the same spatial transformation on the training motion features and on the set of training semantic segmentation masks, respectively (a sketch of this paired transformation follows the claims); and partially convolving the transformed training motion features with the transformed set of training semantic segmentation masks to determine the respective motion patterns of the set of training objects.
- 7. The method of claim 6, wherein generating the predicted video comprises: generating the predicted frame, wherein motion of the set of training objects from the first training frame to the predicted frame has the respective motion patterns.
- 8. The method of claim 6, wherein generating the predicted frame comprises: determining a combined motion pattern for the first training frame based on the respective motion patterns of the set of training objects and the set of training semantic segmentation masks; generating predicted motion features for the first training frame using a convolutional neural network based on the combined motion pattern and the first training frame; and performing an image transformation on the first training frame using the predicted motion features to generate the predicted frame.
- 9. The method of claim 8, wherein training the machine learning model comprises: determining at least one loss function for training the machine learning model; determining an objective loss function as a weighted sum of the at least one loss function; and training the machine learning model by minimizing the objective loss function (a sketch of such an objective follows the claims).
- 10. The method of claim 8, wherein determining the at least one loss function comprises at least one of: determining a frame loss function based on a difference between the predicted frame and the second training frame; determining a motion feature loss function based on a difference between the predicted motion features and the training motion features; and determining a smoothness loss function based on the spatial distribution of the predicted motion features and the spatial distribution of the predicted frame.
- 11. An electronic device, comprising: a processing unit; and a memory coupled to the processing unit and having instructions stored thereon which, when executed by the processing unit, cause the electronic device to perform actions comprising: acquiring an input image and a reference video, wherein the input image comprises a plurality of target objects and the reference video comprises a plurality of reference objects corresponding to the target objects; determining a motion pattern of the plurality of reference objects in the reference video based on the reference video; and generating an output video taking the input image as a starting frame, wherein the motion of a target object of the plurality of target objects in the output video has the motion pattern of the reference object corresponding to the target object; wherein determining the motion pattern of the reference objects comprises: determining a first reference motion feature based on a first frame and a subsequent second frame of the reference video, the first reference motion feature characterizing motion from the first frame to the second frame; determining a first set of reference objects in the first frame by generating a first set of semantic segmentation masks for the first frame, the first set of semantic segmentation masks indicating respective locations of the first set of reference objects in the first frame; and partially convolving the first reference motion feature with the first set of semantic segmentation masks to determine a motion pattern of the first set of reference objects.
- 12. The electronic device of claim 11, wherein generating the output video comprises: migrating the motion pattern of the reference object to the target object based on a semantic mapping of the reference object to the target object.
- 13. The electronic device of claim 12, wherein generating the output video comprises at least one of: migrating the motion pattern of the reference object to the target object based on a predetermined rule indicating a mapping of the reference object to the target object; and migrating the motion pattern of an additional reference object of an additional reference video to an additional target object in the input image based on a mapping of the additional reference object to the additional target object.
- 14. The electronic device of claim 11, wherein generating the output video comprises: determining at least one object in the input image by generating at least one semantic segmentation mask for the input image, the at least one object comprising the target object and the at least one semantic segmentation mask indicating a respective location of the at least one object in the input image; determining a combined motion pattern for the input image based on the respective motion patterns of at least one reference object in the first set of reference objects and on the at least one semantic segmentation mask, by determining a semantic mapping of the at least one reference object to the at least one object; generating a first predicted motion feature for the input image using a convolutional neural network based on the combined motion pattern and the input image; and performing an image transformation on the input image using the first predicted motion feature to generate a first output frame located after the starting frame in the output video.
- 15. An electronic device, comprising: a processing unit; and a memory coupled to the processing unit and having instructions stored thereon which, when executed by the processing unit, cause the electronic device to perform actions comprising: acquiring a training video, wherein the training video comprises a first training frame and a subsequent second training frame; determining a respective motion pattern for each training object of a set of training objects in the first training frame based on the first training frame and the second training frame; generating a predicted video for the training video using a machine learning model based on the respective motion patterns, wherein motion of an object in the predicted video has the motion pattern of the object in the training video, and the predicted video comprises the first training frame and a subsequent predicted frame for the second training frame; and training the machine learning model based at least on the predicted frame and the second training frame; wherein determining the respective motion patterns of the set of training objects comprises: determining training motion features based on the first training frame and the second training frame, the training motion features characterizing motion from the first training frame to the second training frame; determining the set of training objects in the first training frame by generating a set of training semantic segmentation masks for the first training frame, the set of training semantic segmentation masks indicating respective positions of the set of training objects in the first training frame; performing the same spatial transformation on the training motion features and on the set of training semantic segmentation masks, respectively; and partially convolving the transformed training motion features with the transformed set of training semantic segmentation masks to determine the respective motion patterns of the set of training objects.
- 16. The electronic device of claim 15, wherein generating the predicted video comprises: generating the predicted frame, wherein motion of the set of training objects from the first training frame to the predicted frame has the respective motion patterns.
- 17. The electronic device of claim 15, wherein generating the predicted frame comprises: determining a combined motion pattern for the first training frame based on the respective motion patterns of the set of training objects and the set of training semantic segmentation masks; generating predicted motion features for the first training frame using a convolutional neural network based on the combined motion pattern and the first training frame; and performing an image transformation on the first training frame using the predicted motion features to generate the predicted frame.
- 18. The electronic device of claim 17, wherein training the machine learning model comprises: determining at least one loss function for training the machine learning model; determining an objective loss function as a weighted sum of the at least one loss function; and training the machine learning model by minimizing the objective loss function.
- 19. A computer program product comprising machine-executable instructions that, when executed by an apparatus, cause the apparatus to perform actions comprising: acquiring an input image and a reference video, wherein the input image comprises a plurality of target objects and the reference video comprises a plurality of reference objects corresponding to the target objects; determining a motion pattern of the plurality of reference objects in the reference video based on the reference video; and generating an output video taking the input image as a starting frame, wherein the motion of a target object of the plurality of target objects in the output video has the motion pattern of the reference object corresponding to the target object; wherein determining the motion pattern of the reference objects comprises: determining a first reference motion feature based on a first frame and a subsequent second frame of the reference video, the first reference motion feature characterizing motion from the first frame to the second frame; determining a first set of reference objects in the first frame by generating a first set of semantic segmentation masks for the first frame, the first set of semantic segmentation masks indicating respective locations of the first set of reference objects in the first frame; and partially convolving the first reference motion feature with the first set of semantic segmentation masks to determine a motion pattern of the first set of reference objects.
- 20. A computer program product comprising machine-executable instructions that, when executed by an apparatus, cause the apparatus to perform actions comprising: acquiring a training video, wherein the training video comprises a first training frame and a subsequent second training frame; determining a respective motion pattern for each training object of a set of training objects in the first training frame based on the first training frame and the second training frame; generating a predicted video for the training video using a machine learning model based on the respective motion patterns, wherein motion of an object in the predicted video has the motion pattern of the object in the training video, and the predicted video comprises the first training frame and a subsequent predicted frame for the second training frame; and training the machine learning model based at least on the predicted frame and the second training frame; wherein determining the respective motion patterns of the set of training objects comprises: determining training motion features based on the first training frame and the second training frame, the training motion features characterizing motion from the first training frame to the second training frame; determining the set of training objects in the first training frame by generating a set of training semantic segmentation masks for the first training frame, the set of training semantic segmentation masks indicating respective positions of the set of training objects in the first training frame; performing the same spatial transformation on the training motion features and on the set of training semantic segmentation masks, respectively; and partially convolving the transformed training motion features with the transformed set of training semantic segmentation masks to determine the respective motion patterns of the set of training objects.
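The sketches below are editorial illustrations of the claimed steps; none of them is code from the patent. First, the claims leave the exact form of "partially convolving" a motion feature with a segmentation mask open. A minimal sketch of one common reading, in PyTorch: the motion feature is assumed to be a dense optical-flow field, and each object's motion pattern is obtained by a mask-normalized (partial) convolution that aggregates flow only over pixels belonging to that object. All function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def partial_conv_motion(flow: torch.Tensor, masks: torch.Tensor, k: int = 7):
    """Mask-normalized (partial) convolution of a motion feature.

    flow:  (2, H, W)  dense motion feature, e.g. optical flow from
           frame t to frame t+1 of the reference video.
    masks: (N, H, W)  binary semantic segmentation masks, one per
           reference object.
    Returns (N, 2, H, W): per-object motion patterns in which each
    location aggregates flow only from pixels of that object.
    """
    pad = k // 2
    kernel = torch.ones(1, 1, k, k)
    patterns = []
    for m in masks:                          # one object at a time
        m = m[None, None].float()            # (1, 1, H, W)
        masked = flow[None] * m              # zero out other objects
        # sum of in-mask flow within each k x k window, per channel
        num = F.conv2d(masked.transpose(0, 1), kernel, padding=pad)
        # number of valid (in-mask) pixels in each window
        den = F.conv2d(m, kernel, padding=pad).clamp(min=1.0)
        patterns.append((num / den).transpose(0, 1)[0])
    return torch.stack(patterns)             # (N, 2, H, W)
```

In practice the flow would come from an off-the-shelf optical-flow estimator and the masks from a semantic segmentation network; the patent does not prescribe particular models.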
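Claims 2, 3, 12 and 13 migrate motion patterns to target objects through a semantic mapping or a predetermined rule. A minimal sketch, assuming the mapping operates on class labels produced by the segmenters; the RULES table and the map_patterns helper are hypothetical.

```python
# Hypothetical rule: which reference class drives each target class.
# A semantic mapping (claim 2) pairs objects sharing a class; a
# predetermined rule (claim 3) can also map across classes, e.g.
# animating a target "lake" with a reference "water" pattern.
RULES = {"sky": "sky", "water": "water", "lake": "water"}

def map_patterns(ref_patterns, ref_labels, target_labels):
    """Assign each target object the motion pattern of the mapped
    reference object; unmapped target objects get None (static)."""
    by_class = dict(zip(ref_labels, ref_patterns))
    return [by_class.get(RULES.get(lbl)) for lbl in target_labels]
```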
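Claims 4, 8, 14 and 17 share a combine-predict-warp structure: per-object motion patterns are merged into one combined pattern via the masks, a convolutional network predicts a dense motion feature from the combined pattern and the image, and the image is transformed into the next frame. A hedged sketch, assuming the combined pattern is a mask-weighted sum and the image transformation is backward warping by grid sampling; the MotionPredictor network is illustrative, not the patented architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionPredictor(nn.Module):
    """Illustrative CNN mapping (image, combined motion pattern) to a
    dense predicted motion feature (a 2-channel flow field)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + 2, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 3, padding=1),
        )

    def forward(self, image, combined):        # (B,3,H,W), (B,2,H,W)
        return self.net(torch.cat([image, combined], dim=1))

def combine_patterns(patterns, masks):
    """Mask-weighted combination: (N,2,H,W), (N,H,W) -> (2,H,W)."""
    return (patterns * masks[:, None].float()).sum(dim=0)

def warp(image, flow):
    """Backward-warp image (B,3,H,W) with flow (B,2,H,W) in pixels."""
    B, _, H, W = image.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32), indexing="ij")
    gx = (xs[None] + flow[:, 0]) / (W - 1) * 2 - 1  # normalize to [-1, 1]
    gy = (ys[None] + flow[:, 1]) / (H - 1) * 2 - 1
    grid = torch.stack([gx, gy], dim=-1)            # (B, H, W, 2)
    return F.grid_sample(image, grid, align_corners=True)

# Usage (shapes illustrative): predict the first output frame.
# combined = combine_patterns(target_patterns, target_masks)[None]
# flow_pred = MotionPredictor()(image[None], combined)
# frame_1 = warp(image[None], flow_pred)
```

Claim 5 then iterates this step: the flow between the second and third reference frames yields the next combined pattern, and warping again produces the second output frame.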
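Claims 6, 15 and 20 require the same spatial transformation to be applied to the training motion features and to the training masks. A sketch, assuming the transformation is a random horizontal flip used as data augmentation; note that mirroring a flow field also requires negating its horizontal component, a detail the claims do not spell out.

```python
import torch

def paired_hflip(flow: torch.Tensor, masks: torch.Tensor, p: float = 0.5):
    """Apply one identical random horizontal flip to the motion
    feature (2, H, W) and the segmentation masks (N, H, W)."""
    if torch.rand(()) < p:
        flow = flow.flip(-1)
        flow = torch.stack([-flow[0], flow[1]])  # mirror the x-motion
        masks = masks.flip(-1)
    return flow, masks
```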
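Claims 9, 10 and 18 train the model by minimizing a weighted sum of losses. A sketch under stated assumptions: L1 for the frame and motion-feature losses, and an edge-aware total-variation term for the smoothness loss, which ties the spatial distribution of the predicted motion features to that of the predicted frame as claim 10 suggests. The exact loss forms and default weights are assumptions.

```python
import torch

def objective_loss(pred_frame, gt_frame, pred_flow, gt_flow,
                   w_frame=1.0, w_motion=1.0, w_smooth=0.1):
    """Weighted sum of the three losses named in claim 10.
    pred_frame, gt_frame: (B, 3, H, W); pred_flow, gt_flow: (B, 2, H, W)."""
    frame_loss = (pred_frame - gt_frame).abs().mean()
    motion_loss = (pred_flow - gt_flow).abs().mean()
    # Smoothness: penalize flow gradients, downweighted at image edges.
    dx = (pred_flow[..., :, 1:] - pred_flow[..., :, :-1]).abs()
    dy = (pred_flow[..., 1:, :] - pred_flow[..., :-1, :]).abs()
    ix = (pred_frame[..., :, 1:] - pred_frame[..., :, :-1]).abs().mean(1, keepdim=True)
    iy = (pred_frame[..., 1:, :] - pred_frame[..., :-1, :]).abs().mean(1, keepdim=True)
    smooth_loss = (dx * torch.exp(-ix)).mean() + (dy * torch.exp(-iy)).mean()
    return w_frame * frame_loss + w_motion * motion_loss + w_smooth * smooth_loss
```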
Description
Image animation method, electronic device and computer program product

Background

Image animation refers to generating dynamic video from a still image in an automatic manner. Dynamic video is more vivid and expressive than a static image and can therefore improve the user experience. Image animation has accordingly been widely used to generate dynamic backgrounds, dynamic wallpaper, and the like. However, the quality of the generated video still leaves room for improvement, so an image animation method capable of generating high-quality video is needed.

Disclosure of Invention

According to an implementation of the present disclosure, a scheme for generating video from an image is presented. In this scheme, an input image and a reference video are acquired. Based on the reference video, a motion pattern of a reference object in the reference video is determined. An output video is generated with the input image as a starting frame, and the motion of a target object of the input image in the output video has the motion pattern of the reference object. In this way, the scheme can intuitively generate an output video based on the input image and the reference video, with the motion of the target object in the output video following the motion pattern of the reference object in the reference video. This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. It is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

Drawings

- FIG. 1 illustrates a block diagram of a computing device capable of implementing various implementations of the disclosure.
- FIG. 2 illustrates an architecture diagram of a system for image animation according to an implementation of the present disclosure.
- FIG. 3 illustrates an architecture diagram of a system for training an image animation model according to an implementation of the present disclosure.
- FIG. 4 illustrates a flow chart of a method of image animation according to an implementation of the present disclosure.
- FIG. 5 illustrates a flow chart of a method of training an image animation model according to an implementation of the present disclosure.

In the drawings, the same or similar reference numerals designate the same or similar elements.

Detailed Description

The present disclosure will now be discussed with reference to several example implementations. It should be understood that these implementations are discussed only to enable those of ordinary skill in the art to better understand and thus practice the present disclosure, and are not meant to imply any limitation on the scope of the present disclosure. As used herein, the term "comprising" and variants thereof are to be interpreted as open-ended terms meaning "including but not limited to". The term "based on" is to be interpreted as "based at least in part on". The terms "one implementation" and "an implementation" are to be interpreted as "at least one implementation". The term "another implementation" is to be interpreted as "at least one other implementation". The terms "first," "second," and the like may refer to different or the same objects. Other explicit and implicit definitions may also be included below.
As used herein, a "neural network" is capable of processing an input and providing a corresponding output. It generally includes an input layer, an output layer, and one or more hidden layers between the input layer and the output layer. Neural networks used in deep learning applications typically include many hidden layers, which extend the depth of the network. The layers of a neural network are connected in sequence, so that the output of a previous layer is provided as the input to the subsequent layer; the input layer receives the input of the neural network, and the output of the output layer is provided as the final output of the neural network. Each layer of the neural network includes one or more nodes (also referred to as processing nodes or neurons), each of which processes the input from the previous layer. A convolutional neural network (CNN) is a type of neural network that includes one or more convolutional layers for performing convolution operations on respective inputs. CNNs can be used in a variety of scenarios and are particularly well suited to processing image or video data (a minimal illustration follows at the end of this section). The terms "neural network", "network" and "neural network model" are used interchangeably herein.

As described above, since dynamic video is more vivid and interesting than a still image, it can improve the user experience. However, conventional image animation methods generally apply a previously obtained motion pattern directly to the entire image, regardless of the semantic differences between objects in the image, so that the generated output video appears monotonous and rigid.
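To make the neural-network terminology above concrete, here is a minimal CNN in PyTorch, sketched for illustration only (it is not the architecture of the disclosed model):

```python
import torch
import torch.nn as nn

# A minimal CNN: layers are connected in sequence, so each layer's
# output is provided as the next layer's input, as described above.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # input conv layer
    nn.ReLU(),                                    # non-linearity
    nn.Conv2d(16, 16, kernel_size=3, padding=1),  # hidden conv layer
    nn.ReLU(),
    nn.Conv2d(16, 2, kernel_size=3, padding=1),   # output layer
)

frame = torch.randn(1, 3, 64, 64)   # one RGB image
out = cnn(frame)                    # (1, 2, 64, 64) output
```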