CN-122027867-A - Video generation method, electronic device and storage medium


Abstract

The specification provides a video generation method, an electronic device and a storage medium. In the video generation method, a server acquires a video generation request from a client, wherein the video generation request comprises material data, the material data comprises at least description text, and the description text describes the video content of a video to be generated. According to the description text, the server uses a large language model to acquire a corresponding target video template, so that the client or the server can render based on the target video template and the material data to generate a target video. The target video template comprises an editing template of the target video and a generation template of one or more video clips; the editing template comprises video clip information of a global dimension, and the generation template comprises generation information of the corresponding video clip.
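The two-part structure of the target video template described above can be pictured with a minimal sketch. The source does not define the fields of either template, so every class and field name below (`clip_order`, `durations_s`, `params`, and so on) is an assumption chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GenerationTemplate:
    """Generation information for one video clip (all fields are hypothetical)."""
    clip_id: str
    clip_type: str  # e.g. "digital_human" or "vector_animation"
    params: Dict[str, str] = field(default_factory=dict)


@dataclass
class EditingTemplate:
    """Global-dimension information; here reduced to clip order and durations."""
    clip_order: List[str]
    durations_s: Dict[str, float]


@dataclass
class TargetVideoTemplate:
    """One editing template plus one generation template per clip."""
    editing: EditingTemplate
    generation: List[GenerationTemplate]

    def validate(self) -> None:
        # Every clip named by the editing template needs a generation template.
        known = {g.clip_id for g in self.generation}
        missing = [c for c in self.editing.clip_order if c not in known]
        if missing:
            raise ValueError(f"editing template references unknown clips: {missing}")

    def total_duration_s(self) -> float:
        # Sum the durations in playback order.
        return sum(self.editing.durations_s[c] for c in self.editing.clip_order)
```

Keeping the global editing information separate from the per-clip generation information mirrors the split in the abstract: clips can be regenerated individually without touching the overall cut.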

Inventors

WANG SHUANGLI
HU HAOYANG
FAN WANLI
SUN SHIQUAN
LUO MINHUI

Assignees

支付宝(杭州)数字服务技术有限公司

Dates

Publication Date: 20260512
Application Date: 20260213

Claims (19)

1. A video generation method applied to a server, the method comprising: acquiring a video generation request from a client, wherein the video generation request comprises material data, the material data comprises at least description text, and the description text is used for describing video content of a video to be generated; and acquiring, according to the description text, a corresponding target video template by using a large language model, so that the client or the server performs rendering based on the target video template and the material data to generate a target video, wherein the target video template comprises an editing template of the target video and a generation template of one or more video clips, the editing template comprises video clip information of a global dimension, and the generation template comprises generation information of the corresponding video clip.
2. The method of claim 1, wherein the acquiring, according to the description text, a corresponding target video template by using a large language model, so that the client or the server performs rendering based on the target video template and the material data to generate a target video, comprises: acquiring the corresponding target video template by using the large language model according to the description text, and sending the target video template to the client; and generating the target video in response to receiving a rendering instruction from the client, wherein the rendering instruction is used for instructing the server to render based on the target video template and the material data corresponding to the target video template, and the rendering instruction is generated by the client based on an operation of a user on the target video template.
3. The method of claim 1, wherein the acquiring, according to the description text, a corresponding target video template by using a large language model, so that the client or the server performs rendering based on the target video template and the material data to generate a target video, comprises: acquiring the corresponding target video template by using the large language model according to the description text, and sending the target video template to the client, so that the client renders based on the target video template and the material data to generate the target video.
4. The method according to any one of claims 1-3, wherein the acquiring, according to the description text, a corresponding target video template by using a large language model comprises: performing video clip dimension reasoning on the description text by using the large language model to determine one or more video clips corresponding to the target video; acquiring, by using the large language model, generation templates corresponding to the one or more video clips, and acquiring an editing template of the target video; and obtaining the target video template based on the generation templates corresponding to the one or more video clips and the editing template of the target video, and sending the target video template to the client.
5. The method of claim 4, wherein the video clip comprises a digital human video clip, and the material data further comprises a digital human script; and the acquiring, by using the large language model, generation templates corresponding to the one or more video clips comprises: performing video generation dimension reasoning on the digital human script by using the large language model to obtain a generation template corresponding to the digital human video clip.
6. The method of claim 5, wherein the performing video generation dimension reasoning on the digital human script by using the large language model to obtain a generation template corresponding to the digital human video clip comprises: generating a plurality of extension scripts based on the digital human script by using the large language model; and performing video generation dimension reasoning on the plurality of extension scripts by using the large language model to obtain a generation template corresponding to each of a plurality of digital human video clips.
7. The method of claim 6, wherein the obtaining the target video template based on the generation templates corresponding to the one or more video clips and the editing template of the target video comprises: sending the generation templates corresponding to the digital human video clips to the client; determining a target generation template corresponding to the digital human video clip in response to receiving a selection instruction from the client, wherein the selection instruction is generated by the client in response to a user operation and is used for indicating that one of the plurality of generation templates is the target generation template corresponding to the digital human video clip; and obtaining the target video template based on the target generation template corresponding to the digital human video clip, the generation templates corresponding to the one or more video clips, and the editing template of the target video.
8. The method of claim 4, wherein the video clip comprises a vector animation clip, and the material data further comprises a description file of a vector animation; and the acquiring, by using the large language model, generation templates corresponding to the one or more video clips comprises: performing video generation dimension reasoning on the description file of the vector animation by using the large language model to obtain a generation template corresponding to the vector animation clip.
9. The method of claim 4, wherein the acquiring an editing template of the target video by using the large language model comprises: performing video clip dimension reasoning, by using the large language model, based on the generation templates corresponding to the one or more video clips and the description text, to obtain the editing template of the target video.
10. The method of claim 4, wherein generating the target video by rendering in response to receiving the rendering instruction from the client comprises: rendering based on the generation templates corresponding to the one or more video clips to obtain the one or more video clips; and rendering based on the one or more video clips, the material data and the editing template of the target video to generate the target video.
11. The method of claim 10, wherein the video clip comprises a digital human video clip; and rendering based on the generation template corresponding to the digital human video clip to obtain the digital human video clip comprises: inputting the generation template corresponding to the digital human video clip into a digital human video generation model to obtain the digital human video clip rendered and generated by the digital human video generation model, wherein the digital human video generation model is configured to have the capability of rendering and generating digital human video based on the generation template.
12. The method of claim 11, wherein the generation template corresponding to the digital human video clip comprises at least an editing template of the digital human video clip, a digital human tone configuration parameter, a digital human figure configuration parameter, a digital human speech configuration parameter and a digital human behavior configuration parameter; and the inputting the generation template corresponding to the digital human video clip into the digital human video generation model to obtain the digital human video clip rendered and generated by the digital human video generation model comprises: invoking, by using the digital human video generation model, an audio generation service and a caption generation service to generate audio data and caption data based on the digital human tone configuration parameter and the digital human speech configuration parameter; and rendering and generating, by using the digital human video generation model, the digital human video clip based on the editing template of the digital human video clip, the digital human behavior configuration parameter, the digital human figure configuration parameter, the audio data, the caption data and a preset mouth shape driving service.
13. A video generation method applied to a client, the method comprising: determining material data of a target video, generating a video generation request based on the material data, and sending the video generation request to a server, wherein the material data comprises at least description text, and the description text is used for describing video content of the target video; receiving a target video template from the server, wherein the target video template comprises an editing template of the target video and a generation template of one or more video clips, the editing template comprises video clip information of a global dimension, and the generation template comprises generation information of the corresponding video clip; and generating a rendering instruction in response to receiving an operation of a user on the target video template, and sending the rendering instruction to the server, wherein the rendering instruction is used for instructing the server to render based on the target video template and the material data corresponding to the target video template to generate the target video.
14. The method of claim 13, wherein the operation of the user on the target video template comprises an editing operation; and the generating a rendering instruction in response to receiving an operation of a user on the target video template, and sending the rendering instruction to the server, comprises: editing the target video template in response to the received editing operation, wherein the editing operation comprises one or more of modifying the target video template, replacing material in the target video template, and importing material into the target video template; and generating the rendering instruction based on the edited target video template, and sending the rendering instruction to the server.
15. The method of claim 13, wherein the video clip comprises a digital human video clip, the material data further comprises a digital human script, and the generation template corresponding to the video clip comprises a generation template corresponding to the digital human video clip, the method further comprising: receiving from the server, and displaying, a plurality of generation templates corresponding to the digital human video clip, wherein the plurality of generation templates are generated by the server based on the digital human script by using the large language model, and are obtained by the server expanding the digital human script into a plurality of extension scripts by using the large language model and performing video generation dimension reasoning on the plurality of extension scripts; and in response to a received selection operation, sending a selection instruction to the server, wherein the selection instruction is used for indicating that one of the plurality of generation templates is the target generation template corresponding to the digital human video clip.
16. The method of claim 13, wherein a browser application runs on the client; the determining material data of a target video comprises: displaying a target page by using the browser application, and determining the material data of the target video based on a response of the target page to a user operation; and correspondingly, the generating a rendering instruction in response to receiving an operation of a user on the target video template, and sending the rendering instruction to the server, comprises: receiving the operation of the user on the target video template via the target page, generating the rendering instruction based on the target video template, and sending the rendering instruction to the server.
17. An electronic device, serving as a server, comprising: at least one storage medium storing at least one instruction set for generating a target video; and at least one processor in communication with the at least one storage medium, wherein, when the electronic device is operating, the at least one processor reads the at least one instruction set and implements the method of any of claims 1-12 as indicated by the at least one instruction set.
18. An electronic device, serving as a client, comprising: at least one storage medium storing at least one instruction set for generating a target video; and at least one processor in communication with the at least one storage medium, wherein, when the electronic device is operating, the at least one processor reads the at least one instruction set and implements the method of any of claims 13-16 as indicated by the at least one instruction set.
19. A computer readable non-volatile storage medium having stored therein at least one set of instructions which, when executed by at least one processor, implements the method of any of claims 1-12 or 13-16.
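The template acquisition steps recited in claims 4 and 9 can be read as a three-stage pipeline: determine the clips, acquire a generation template per clip, then acquire the editing template from those generation templates plus the description text. The following is only a sketch of that ordering. The `LLM` callable type, the `SEGMENT`/`GENERATE`/`EDIT` prompt prefixes, the JSON shapes and the `fake_llm` stand-in are all assumptions made so the example runs without a real model; none of them come from the source.

```python
import json
from typing import Callable, Dict, List

# Hypothetical interface: a prompt string in, JSON text out.
LLM = Callable[[str], str]


def build_target_video_template(description: str, llm: LLM) -> Dict:
    # Stage 1: reason over the description text to determine the clips.
    clips: List[Dict] = json.loads(llm(f"SEGMENT\n{description}"))
    # Stage 2: acquire one generation template per clip.
    generation = [json.loads(llm(f"GENERATE\n{json.dumps(c)}")) for c in clips]
    # Stage 3: acquire the editing template from the generation templates
    # together with the description text.
    editing = json.loads(llm(f"EDIT\n{description}\n{json.dumps(generation)}"))
    return {"editing": editing, "generation": generation}


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a large language model (illustration only)."""
    task, _, body = prompt.partition("\n")
    if task == "SEGMENT":
        return json.dumps([
            {"id": "c1", "type": "digital_human"},
            {"id": "c2", "type": "vector_animation"},
        ])
    if task == "GENERATE":
        clip = json.loads(body)
        return json.dumps({"clip_id": clip["id"], "type": clip["type"], "params": {}})
    if task == "EDIT":
        # The generation templates are the last line of the prompt body.
        _, _, gen_json = body.rpartition("\n")
        order = [g["clip_id"] for g in json.loads(gen_json)]
        return json.dumps({"clip_order": order, "transition": "cut"})
    raise ValueError(f"unknown task: {task}")
```

Calling `build_target_video_template("Promo for a coffee shop", fake_llm)` returns a dictionary with an `editing` entry and two `generation` entries. In a real system each stage would carry its own prompt design and output validation, which this sketch omits.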

Description

Video generation method, electronic device and storage medium

Technical Field

The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a video generation method, an electronic device, and a storage medium.

Background

Video is an important content carrier for modern marketing and is widely used in scenarios such as brand promotion, product promotion, social media delivery and user interaction. Businesses have a strong need for high-quality, personalized video. Currently, videos are typically produced manually. For example, a producer usually selects a video template by hand, runs professional video production software on professional equipment, and performs post-editing, special effect synthesis and similar steps based on the video template to produce the required video. However, this approach is time-consuming and inefficient, and the result depends on the experience of the production personnel, which leads to long production cycles, a high dependence on experience, and difficulty in producing videos at scale.
The content of the background section is merely information known to the inventors. It does not mean that this information had entered the public domain before the filing date of this disclosure, nor that it constitutes prior art to this disclosure.

Disclosure of Invention

The specification provides a video generation method, an electronic device and a storage medium, which can acquire a corresponding target video template by using a large language model based on description text describing the content of a target video, and render the target video based on the target video template. This can shorten video production time and improve the efficiency and quality of video production.
To achieve the above purpose, the embodiments of the present specification adopt the following technical solutions:
In a first aspect, the specification provides a video generation method applied to a server. The method comprises: acquiring a video generation request from a client, wherein the video generation request comprises material data, the material data comprises at least description text, and the description text is used for describing video content of a video to be generated; and acquiring, according to the description text, a corresponding target video template by using a large language model, so that the client or the server renders based on the target video template and the material data to generate a target video, wherein the target video template comprises an editing template of the target video and a generation template of one or more video clips, the editing template comprises video clip information of a global dimension, and the generation template comprises generation information of the corresponding video clip.
In some embodiments, acquiring the corresponding target video template by using the large language model according to the description text, so that the client or the server renders based on the target video template and the material data to generate the target video, comprises: acquiring the corresponding target video template by using the large language model according to the description text, and sending the target video template to the client; and generating the target video in response to receiving a rendering instruction from the client, wherein the rendering instruction is used for instructing the server to render based on the target video template and the material data corresponding to the target video template, and the rendering instruction is generated by the client based on an operation of a user on the target video template.
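The server-rendered embodiment above amounts to a two-step exchange: the server first returns a template for a generation request, and later renders when the client sends a rendering instruction, possibly carrying a template the user has edited. Below is a minimal in-memory sketch of that exchange. The class `SketchServer`, its method names, the `description_text` key and the string returned in place of a rendered video are all hypothetical; the template acquisition and the rendering are placeholders, not the method of the source.

```python
from typing import Dict, Optional


class SketchServer:
    """In-memory stand-in for the server side; no real LLM or renderer."""

    def __init__(self) -> None:
        self._records: Dict[int, dict] = {}

    def handle_generation_request(self, material: dict) -> dict:
        # The material data must carry at least the description text.
        if "description_text" not in material:
            raise ValueError("material data must include description text")
        request_id = len(self._records) + 1
        # Placeholder for acquiring the template with a large language model.
        template = {
            "editing": {"clip_order": ["c1"]},
            "generation": [{"clip_id": "c1", "prompt": material["description_text"]}],
        }
        self._records[request_id] = {"material": material, "template": template}
        return {"request_id": request_id, "template": template}

    def handle_render_instruction(
        self, request_id: int, edited_template: Optional[dict] = None
    ) -> str:
        record = self._records[request_id]
        # Use the client's edited template if one was sent, else the stored one.
        template = edited_template or record["template"]
        # Placeholder for rendering: return a descriptor instead of a video.
        clips = ",".join(template["editing"]["clip_order"])
        return f"video(request={request_id}, clips={clips})"
```

A client in this sketch would call `handle_generation_request`, let the user adjust the returned template, and then call `handle_render_instruction` with the same request identifier.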
In some embodiments, acquiring the corresponding target video template by using the large language model according to the description text, so that the client or the server renders based on the target video template and the material data to generate the target video, comprises: acquiring the corresponding target video template by using the large language model according to the description text, and sending the target video template to the client, so that the client renders based on the target video template and the material data to generate the target video.
In some embodiments, acquiring the corresponding target video template by using the large language model according to the description text comprises: performing video clip dimension reasoning on the description text by using the large language model to determine one or more video clips corresponding to the target video; acquiring, by using the large language model, generation templates corresponding to the one or more video clips, and acquiring an editing template of the target video, obta