CN-122027859-A - Dynamic media generation method, apparatus, device, storage medium, and program product

CN 122027859 A

Abstract

The present disclosure relates to a dynamic media generation method, apparatus, device, storage medium, and program product. The method comprises: receiving a triggering operation for generating dynamic media and entering a to-be-processed data acquisition state; determining the to-be-processed data in response to its acquisition; performing a visual transformation on the to-be-processed data according to a temporal structure of the to-be-processed data; and generating the dynamic media based on the visual transformation, wherein the timeline of the dynamic media and the timeline of the to-be-processed data have an overlapping part. By using the temporal structure of the to-be-processed data for the visual transformation, the internal associations and consistency of that structure are preserved, so that the timeline of the dynamic media can overlap the timeline of the to-be-processed data, achieving a smooth, natural visual presentation and improving the quality of the generated dynamic media.

Inventors

  • LIU YU
  • XIE JIAYI
  • XIANG CHEN
  • SONG ZHIJUN
  • ZHANG PENG

Assignees

  • Xiaomi Technology (Wuhan) Co., Ltd.
  • Beijing Xiaomi Mobile Software Co., Ltd.
  • Beijing Xiaomi Pinecone Electronics Co., Ltd.

Dates

Publication Date
2026-05-12
Application Date
2026-01-21

Claims (20)

  1. A dynamic media generation method, comprising: receiving a triggering operation for generating dynamic media, and entering a to-be-processed data acquisition state; determining to-be-processed data in response to its acquisition, wherein the to-be-processed data comprises at least video data; performing a visual transformation on the to-be-processed data according to a temporal structure of the to-be-processed data; and generating dynamic media based on the visual transformation, wherein a timeline of the dynamic media and a timeline of the to-be-processed data have an overlapping part.
  2. The dynamic media generation method of claim 1, wherein determining the to-be-processed data in response to its acquisition comprises: displaying at least one input operation item for acquiring the to-be-processed data, wherein the input operation item is associated with the to-be-processed data; and determining at least one piece of to-be-processed data based on a trigger instruction of a user on the at least one input operation item.
  3. The dynamic media generation method of claim 2, wherein the input operation item comprises a visual input operation item.
  4. The dynamic media generation method of claim 3, wherein the input operation items further include at least one of: a text input operation item; and a voice input operation item.
  5. The dynamic media generation method of claim 4, wherein determining at least one piece of to-be-processed data based on the trigger instruction of the user on the at least one input operation item comprises: entering a data set based on a trigger instruction of the user on the visual input operation item, wherein the data set comprises at least one of videos or pictures, and determining at least one piece of to-be-processed data from the data set.
  6. The dynamic media generation method of claim 3 or 4, wherein performing the visual transformation on the to-be-processed data according to the temporal structure of the to-be-processed data comprises: displaying the to-be-processed data determined according to the input operation item, and performing the visual transformation on the to-be-processed data based on a generation instruction.
  7. The dynamic media generation method of claim 1, further comprising: displaying a style selection operation item, wherein the style selection operation item is associated with the dynamic media; determining at least one target style based on a trigger instruction of a user on the style selection operation item; and generating the dynamic media based on the to-be-processed data and the at least one target style.
  8. The dynamic media generation method of claim 1, wherein performing the visual transformation on the to-be-processed data according to the temporal structure of the to-be-processed data comprises: extracting temporal motion information of the video data, wherein the temporal motion information comprises a temporal structure of each frame and a motion trajectory of a target object; and performing the visual transformation according to the temporal motion information.
  9. The dynamic media generation method of claim 8, wherein the to-be-processed data further comprises at least one of picture data or text data, the method further comprising: fusing visual content of the picture data and/or semantic information of the text data with content features of the video data to obtain fused features, and performing the visual transformation based on the fused features to generate the dynamic media.
  10. The dynamic media generation method of claim 1, wherein the to-be-processed data further comprises at least one of picture data or text data, and wherein performing the visual transformation on the to-be-processed data according to the temporal structure of the to-be-processed data comprises: extracting feature vectors of each frame of the video data; extracting feature vectors of the picture data and/or feature vectors of the text data; fusing the feature vectors of each frame with the feature vectors of the picture data and/or the feature vectors of the text data to obtain fused features; and performing the visual transformation according to the fused features.
  11. The dynamic media generation method of claim 1, wherein performing the visual transformation on the to-be-processed data according to the temporal structure of the to-be-processed data and generating dynamic media based on the visual transformation comprises: processing the to-be-processed data with a video generation model, and generating and outputting the dynamic media; wherein the video generation model comprises a spatiotemporal attention module that processes features input to it using a spatiotemporal attention mechanism and outputs a temporally continuous feature sequence to generate the dynamic media.
  12. The dynamic media generation method of claim 11, wherein the video generation model comprises: at least one preprocessing encoder for encoding the to-be-processed data to obtain corresponding feature vectors; a fusion layer, connected to the preprocessing encoder, for fusing the feature vectors and outputting fused features; a diffusion model encoder, connected to the fusion layer, for converting and mapping the fused features into a latent feature representation processable by a diffusion network; the spatiotemporal attention module, connected to the diffusion model encoder, for capturing dependencies of the latent feature representation in the temporal and spatial dimensions through the spatiotemporal attention mechanism and outputting a temporally continuous feature sequence; a diffusion model decoder, connected to the spatiotemporal attention module, for decoding the feature sequence and outputting a sequence of image frames; and a post-processing module, connected to the diffusion model decoder, for optimizing the image frame sequence to obtain the dynamic media.
  13. The dynamic media generation method of claim 11, wherein the video generation model further comprises a style migration module for adjusting features input to it according to a target style and outputting stylized features.
  14. The dynamic media generation method of claim 13, wherein the video generation model comprises: at least one preprocessing encoder for encoding the to-be-processed data to obtain corresponding feature vectors; a fusion layer, connected to the preprocessing encoder, for fusing the feature vectors and outputting fused features; the style migration module, connected to the fusion layer, for performing style modulation on the fused features according to the target style and outputting stylized fused features; a diffusion model encoder, connected to the style migration module, for converting and mapping the stylized fused features into a latent feature representation processable by a diffusion network; the spatiotemporal attention module, connected to the diffusion model encoder, for capturing dependencies of the latent feature representation in the temporal and spatial dimensions through the spatiotemporal attention mechanism and outputting a temporally continuous feature sequence; a diffusion model decoder, connected to the spatiotemporal attention module, for decoding the feature sequence and outputting a sequence of image frames; and a post-processing module, connected to the diffusion model decoder, for optimizing the image frame sequence to obtain the dynamic media.
  15. The dynamic media generation method of claim 12 or 14, wherein the post-processing module is configured to perform at least one of: resolution optimization of the image frame sequence; frame interpolation of the image frame sequence; compression of the image frame sequence; and format conversion of the image frame sequence.
  16. The dynamic media generation method of claim 11, further comprising: determining a target scene according to the to-be-processed data; and invoking a video generation model corresponding to the target scene to process the to-be-processed data and generate dynamic media conforming to the target scene.
  17. A dynamic media generation apparatus, wherein the apparatus is configured to implement the steps of the dynamic media generation method of any one of claims 1-16.
  18. An electronic device, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to execute the executable instructions in the memory to implement the steps of the dynamic media generation method of any one of claims 1-16.
  19. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the dynamic media generation method of any one of claims 1-16.
  20. A computer program product comprising a computer program which, when executed by a processor, implements the steps of the dynamic media generation method of any one of claims 1-16.
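The model architecture recited in claims 11-14 (preprocessing encoders, a fusion layer, a diffusion encoder/decoder, and a spatiotemporal attention module) can be sketched as follows. This is a minimal numpy illustration of the data flow only, not the patented implementation: the diffusion encoder/decoder and post-processing are replaced by stand-in transforms, and all function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Scaled dot-product self-attention over the rows of x: (N, D) -> (N, D)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)          # (N, N) pairwise similarities
    return softmax(scores, axis=-1) @ x    # attended features

def spatiotemporal_attention(latent):
    """latent: (T, S, D) — T frames, S spatial positions, D channels.
    Temporal attention mixes each spatial position across frames, then
    spatial attention mixes positions within each frame (a common
    factorization of full spatiotemporal attention)."""
    T, S, _ = latent.shape
    out = np.stack([self_attention(latent[:, s, :]) for s in range(S)], axis=1)
    out = np.stack([self_attention(out[t]) for t in range(T)], axis=0)
    return out

def generate_dynamic_media(video_feats, picture_feats, text_feats):
    """Pipeline sketch: fuse per-modality features, apply spatiotemporal
    attention in latent space, then 'decode' to frame features."""
    # fusion layer: broadcast still-image and text features over all frames
    fused = video_feats + picture_feats[None, :, :] + text_feats[None, None, :]
    latent = fused * 0.5                        # stand-in for the diffusion encoder
    latent = spatiotemporal_attention(latent)   # temporally continuous features
    frames = latent * 2.0                       # stand-in for the diffusion decoder
    return frames                               # (T, S, D) image-frame features
```

Factorizing attention into a temporal pass followed by a spatial pass keeps the cost at O(T²S + S²T) rather than the O((TS)²) of attending over all frame-position pairs at once.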

Description

Dynamic media generation method, apparatus, device, storage medium, and program product

Technical Field

The present disclosure relates to the field of computer technology, and in particular to a dynamic media generation method, apparatus, device, storage medium, and program product.

Background

With users' growing demand for personalized, immersive experiences, dynamic wallpaper has become an important direction for visual interaction on electronic devices. However, the technology in this field is still immature: results generated by existing methods often suffer from temporal incoherence and stiff motion, making natural and smooth visual expression difficult to achieve.

Disclosure of Invention

To overcome the problems in the related art, the present disclosure provides a dynamic media generation method, apparatus, device, storage medium, and program product. According to a first aspect of an embodiment of the present disclosure, there is provided a dynamic media generation method, including: receiving a triggering operation for generating dynamic media, and entering a to-be-processed data acquisition state; determining to-be-processed data in response to its acquisition, wherein the to-be-processed data comprises at least video data; performing a visual transformation on the to-be-processed data according to a temporal structure of the to-be-processed data; and generating dynamic media based on the visual transformation, wherein a timeline of the dynamic media and a timeline of the to-be-processed data have an overlapping part.
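The first aspect requires the timeline of the dynamic media and the timeline of the to-be-processed data to "have an overlapping part". As a minimal illustration (a hypothetical helper, not part of the disclosure), treating both timelines as half-open [start, end) intervals in seconds:

```python
def timelines_overlap(media_span, source_span):
    """True if two [start, end) time intervals share any portion."""
    (a0, a1), (b0, b1) = media_span, source_span
    # the intersection [max(starts), min(ends)) is non-empty iff its
    # start lies strictly before its end
    return max(a0, b0) < min(a1, b1)
```

For example, `timelines_overlap((0.0, 5.0), (3.0, 10.0))` is `True`, while merely adjacent intervals such as `(0.0, 3.0)` and `(3.0, 10.0)` do not count as overlapping.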
In the above technical solution, receiving the corresponding triggering operation automatically enters the to-be-processed data acquisition state, enabling accurate acquisition of the to-be-processed data. The temporal structure of the to-be-processed data is used for the visual transformation, and by preserving the internal associations and consistency of that structure, the timeline of the dynamic media can overlap the timeline of the to-be-processed data, achieving a smooth, natural visual presentation and improving the quality of the generated dynamic media.

In some possible embodiments, determining the to-be-processed data in response to its acquisition includes: displaying at least one input operation item for acquiring the to-be-processed data, wherein the input operation item is associated with the to-be-processed data; and determining at least one piece of to-be-processed data based on a trigger instruction of a user on the at least one input operation item. Providing input operation items associated with the to-be-processed data lets a user trigger instructions precisely as needed, so the required data can be obtained quickly, improving operational convenience and data acquisition efficiency.

In some possible embodiments, the input operation item includes a visual input operation item, which allows the required video data to be acquired conveniently and efficiently. In some possible embodiments, the input operation item includes at least one of: a text input operation item and a voice input operation item.
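The visual, text, and voice input operation items described above could be dispatched to per-modality handlers along these lines. The handler table and return shapes here are invented purely for illustration and are not specified by the disclosure.

```python
# Hypothetical dispatch of an input operation item to its data source.
def determine_data_to_process(item_type, payload):
    handlers = {
        # the visual item enters a data set of videos/pictures
        "visual": lambda p: {"kind": "video_or_picture", "source": p},
        "text":   lambda p: {"kind": "text", "content": p},
        "voice":  lambda p: {"kind": "audio", "content": p},
    }
    if item_type not in handlers:
        raise ValueError(f"unknown input operation item: {item_type}")
    return handlers[item_type](payload)
```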
Multiple types of input operation items can meet different user needs, improving interaction flexibility and letting the user determine the to-be-processed data conveniently and efficiently. In some possible embodiments, determining at least one piece of to-be-processed data based on the trigger instruction of the user on the at least one input operation item includes: entering a data set based on a trigger instruction of the user on the visual input operation item, wherein the data set comprises at least one of videos or pictures, and determining at least one piece of to-be-processed data from the data set. Entering the data set via the visual trigger instruction makes it intuitive, convenient, quick, and accurate to select the video or picture data to be processed.

In some possible embodiments, performing the visual transformation on the to-be-processed data according to its temporal structure includes: displaying the to-be-processed data determined according to the input operation item, and performing the visual transformation on the to-be-processed data based on a generation instruction. Displaying the selected data and triggering the visual transformation with a generation instruction keeps the operation intuitive and the process controllable, further ensuring that the generated dynamic media is more