CN-122002097-A - Video content editing method and electronic equipment

CN122002097A

Abstract

Embodiments of the present application disclose a video editing and generation method and an electronic device. The method comprises: receiving original video material submitted by a user and video creation requirement information expressed in natural language; preprocessing the original video material and the video creation requirement respectively; performing scene segmentation and picture content understanding on a sequence of key frame pictures through an AI image understanding model, and generating a video understanding result in text format; reasoning over the video creation requirement and the video understanding result through an AI language model, and generating a video editing scheme in combination with video editing knowledge; and determining the tools corresponding to the atomic tasks in the video editing scheme and calling those tools after constructing the parameter information they require, so as to generate a target video. The embodiments of the present application can improve the efficiency and input-output ratio of video creative production.
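The abstract's pipeline (key frame extraction, image understanding, edit planning, tool calling) can be sketched end to end. This is an illustrative reconstruction, not code from the patent: `understand_frames` and `plan_edit` are stand-ins for the AI image-understanding and language models, and all names are hypothetical.

```python
# Hypothetical sketch of the pipeline described in the abstract.
# Model calls are stubbed; names are illustrative only.

from dataclasses import dataclass, field

@dataclass
class EditPlan:
    structure: str
    atomic_tasks: list = field(default_factory=list)

def extract_keyframes(video_frames, step=30):
    """Disassemble the material into a key frame sequence (every `step` frames)."""
    return video_frames[::step]

def understand_frames(keyframes):
    """Stand-in for the AI image understanding model: one text line per scene."""
    return [f"scene {i}: frame {f}" for i, f in enumerate(keyframes)]

def plan_edit(requirement, understanding):
    """Stand-in for the AI language model: requirement + understanding -> plan."""
    return EditPlan(structure=requirement,
                    atomic_tasks=[("segment", s) for s in understanding])

# Tool registry: each atomic task name maps to a callable tool.
TOOLS = {"segment": lambda param: f"segmented at {param}"}

def run(video_frames, requirement):
    keyframes = extract_keyframes(video_frames)
    understanding = understand_frames(keyframes)
    plan = plan_edit(requirement, understanding)
    # Dispatch each atomic task to its tool with constructed parameters.
    return [TOOLS[name](param) for name, param in plan.atomic_tasks]

results = run(list(range(120)), "30s highlight reel")
```

In a real system the stubs would be replaced by model inference calls and a video decoder; the point is only the control flow from material to plan to tool invocation.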

Inventors

  • LI MOYANG
  • LIU WENWEN
  • YI DANLI
  • WU XIANGFEI
  • CAO YANG
  • LIN SHIYANG
  • LIU YOUCUN
  • WU ZHANGJIANI
  • CHEN ZHIQI
  • CHEN QI

Assignees

  • 杭州阿里巴巴海外互联网产业有限公司 (Hangzhou Alibaba Overseas Internet Industry Co., Ltd.)
  • 阿里巴巴新加坡控股有限公司 (Alibaba Singapore Holding Pte. Ltd.)

Dates

Publication Date
2026-05-08
Application Date
2025-12-19

Claims (18)

  1. A video content editing method, comprising: receiving original video material submitted by a user and video creation requirement information expressed in natural language; preprocessing the original video material and the video creation requirement respectively, so as to disassemble the original video material into a plurality of key frame picture sequences and to convert the video creation requirement expressed in natural language into a machine-readable format; performing scene segmentation and picture content understanding on the key frame picture sequence through an artificial intelligence (AI) image understanding model, and generating a video understanding result in text format; reasoning over the video creation requirement and the video understanding result through an AI language model, and generating a video editing scheme in combination with video editing knowledge, wherein the video editing scheme comprises a composition structure of a target video to be generated and a plurality of atomic tasks to be executed; and determining a tool corresponding to each atomic task according to the video editing scheme, and calling the tool after constructing the parameter information it requires, so as to generate the target video.
  2. The method of claim 1, further comprising: identifying and confirming the user's video creation requirement by means of natural language processing, semantic understanding technology, and interaction with the user.
  3. The method of claim 1, wherein, when preprocessing the original video material, the method further comprises: adding auxiliary information to the key frame pictures, wherein the auxiliary information represents the temporal position of each key frame within the original video material.
  4. The method of claim 3, wherein the auxiliary information comprises position information of the key frames on the time axis of the original video material; the atomic tasks comprise a video paragraph segmentation task, and the parameters required by the video paragraph segmentation task comprise paragraph segmentation point position information expressed as positions on the time axis; when the AI language model generates the video editing scheme, it approximately determines the paragraph segmentation point positions according to the time axis information in the picture content of the key frame pictures; and constructing the required parameter information for the tool comprises: determining a plurality of video frames within a target time range from the original video material according to the approximately determined paragraph segmentation point position information; and determining a scene-switching key frame position by comparing the differences between consecutive video frames within the target time range, and precisely determining the paragraph segmentation point position according to the scene-switching key frame position.
  5. The method of claim 4, wherein, when precisely determining the paragraph segmentation point position, the method further comprises: performing silent segment detection on the segment within the target time range of the original video material, so as to jointly determine the paragraph segmentation point position according to the scene-switching key frame position and the positions of the silent segments.
  6. The method of claim 1, wherein, when preprocessing the original video material, the method further comprises: extracting speech content from the original video material and converting it into text content, so that the AI image understanding model generates the video understanding result in combination with the text obtained from the speech content.
  7. The method of claim 1, wherein performing scene segmentation and picture content understanding on the key frame picture sequence through the AI image understanding model and generating a video understanding result in text format comprises: generating a text description of the original video material as a whole, and generating per-segment video understanding results according to the scene segmentation.
  8. The method of claim 1, wherein generating the video editing scheme comprises: when generating the prompt for the AI language model, dynamically determining the required video editing knowledge according to the video creation requirement and adding that knowledge to the prompt.
  9. The method of claim 1, wherein the video creation requirement comprises a requirement to mix-cut at least two original video materials to generate a larger number of videos; and when generating the video editing scheme, the AI language model is specifically configured to: perform correlation analysis on the video paragraphs segmented from the different video materials, and determine video paragraphs in the different materials that are suitable for cross-material combination, so as to generate a plurality of target videos by combining video paragraphs across materials.
  10. The method of any one of claims 1 to 9, further comprising: after the plurality of atomic tasks to be executed are determined, arranging the plurality of atomic tasks and generating an editing workflow, so that the corresponding tools are called according to the editing workflow.
  11. The method of any one of claims 1 to 9, further comprising: providing execution progress information of the atomic tasks to a client, so that the execution progress information is displayed by the client.
  12. The method of any one of claims 1 to 9, further comprising: generating text summary content according to the video editing scheme corresponding to the target video, and providing the text summary content to a client so that it is displayed by the client.
  13. A video content delivery method, comprising: receiving at least two original video materials submitted by a user and video creation requirement information expressed in natural language, wherein the video creation requirement comprises a requirement to mix-cut the at least two original video materials to generate a larger number of videos; preprocessing the original video materials and the video creation requirement respectively, so as to disassemble each original video material into a plurality of key frame picture sequences and to convert the video creation requirement expressed in natural language into a machine-readable format; performing scene segmentation and picture content understanding on the key frame picture sequences through an artificial intelligence (AI) image understanding model, and generating a video understanding result in text format; reasoning over the video creation requirement and the video understanding result through an AI language model, and generating a video editing scheme in combination with video editing knowledge, wherein the video editing scheme comprises the composition structures of a plurality of target videos to be generated and a plurality of atomic tasks to be executed; and determining a tool corresponding to each atomic task according to the video editing scheme, and calling the tool after constructing the parameter information it requires, so as to generate the plurality of target videos and deliver them to at least one target system.
  14. The method of claim 13, wherein the atomic tasks comprise at least a video paragraph segmentation task and a video splicing task.
  15. A video content editing system, comprising a plurality of agents, the agents comprising: a requirement analysis Agent, configured to interact with the user by means of natural language processing and semantic understanding technology after receiving original video material submitted by the user and video creation requirement information expressed in natural language, identify and confirm the user's video creation requirement, and convert it into a machine-readable format; a video preprocessing Agent, configured to preprocess the original video material and the video creation requirement respectively, and disassemble the original video material into a plurality of key frame picture sequences; a video understanding Agent, configured to perform scene segmentation and picture content understanding on the key frame picture sequence through an artificial intelligence (AI) image understanding model and generate a video understanding result in text format; a creative planning Agent, configured to reason over the video creation requirement and the video understanding result through an AI language model and generate a video editing scheme in combination with video editing knowledge, wherein the video editing scheme comprises a composition structure of a target video to be generated and a plurality of atomic tasks to be executed; and a video editing Agent, configured to determine a tool corresponding to each atomic task according to the video editing scheme, and call the tool after constructing the parameter information it requires, so as to generate the target video.
  16. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method of any one of claims 1 to 14.
  17. An electronic device, comprising: one or more processors; and a memory associated with the one or more processors, the memory storing program instructions that, when read and executed by the one or more processors, perform the steps of the method of any one of claims 1 to 14.
  18. A computer program product comprising a computer program or computer-executable instructions which, when executed by a processor in an electronic device, implement the steps of the method of any one of claims 1 to 14.
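Claim 10's idea of arranging atomic tasks into an editing workflow and dispatching each to a registered tool can be illustrated with a small sketch. This is not code from the patent: the task names, tools, and fixed dependency order are all hypothetical assumptions.

```python
# Illustrative sketch of claim 10: order the atomic tasks of an editing
# scheme into a workflow, then call the tool registered for each task
# with its constructed parameters. All names are hypothetical.

from typing import Callable, Dict, List, Tuple

ToolFn = Callable[[dict], str]

TOOL_REGISTRY: Dict[str, ToolFn] = {
    "split_paragraph": lambda p: f"split at {p['time']}s",
    "splice": lambda p: "spliced " + "+".join(p["segments"]),
    "add_subtitles": lambda p: f"subtitled in {p['lang']}",
}

# Assumed dependency order: splitting precedes splicing and subtitling.
ORDER = {"split_paragraph": 0, "splice": 1, "add_subtitles": 2}

def build_workflow(tasks: List[Tuple[str, dict]]) -> List[Tuple[str, dict]]:
    """Arrange atomic tasks into an executable editing workflow."""
    return sorted(tasks, key=lambda t: ORDER[t[0]])

def execute(workflow: List[Tuple[str, dict]]) -> List[str]:
    """Call the registered tool for each task with its parameter dict."""
    return [TOOL_REGISTRY[name](params) for name, params in workflow]

# Tasks arrive in plan order, not execution order; the workflow fixes that.
plan = [("add_subtitles", {"lang": "en"}),
        ("split_paragraph", {"time": 12.5}),
        ("splice", {"segments": ["s1", "s2"]})]

log = execute(build_workflow(plan))
```

A production system would replace the static `ORDER` table with real dependency resolution between tasks, but the registry-plus-dispatch shape is the same.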

Description

Video content editing method and electronic equipment

Technical Field

The present application relates to the field of video editing technologies, and in particular to a video content editing method and an electronic device.

Background

Operators of commodity information service systems (also referred to as "e-commerce platforms") often need to deliver video advertisements to external platforms in order to promote the system's application client and grow their user base. Because video information attracts user clicks more readily and can guide users to download the related application client, it generally achieves a higher conversion rate than pictures and other forms of advertising; short video content has therefore become an important engine of e-commerce marketing. The explosive growth of short video platforms has brought unprecedented exposure opportunities for short video content, but it has also raised the bar for the production efficiency of video creatives. Cross-border e-commerce platforms in particular must deliver not only to multiple platforms but also across multiple countries and languages, which places even higher demands on video volume and production efficiency. The main difficulties in the current short video production and delivery process include the short life cycle and frequent replacement of online video materials, insufficient supply of high-quality internal content, and the high cost of external procurement, which together make large-scale coverage impossible. These factors limit the efficiency and input-output ratio of video creative production and need to be addressed by technical means.
Disclosure of Invention

The present application provides a video content editing method and an electronic device that can improve the efficiency and input-output ratio of video creative production. The application provides the following scheme.

A video content editing method comprises: receiving original video material submitted by a user and video creation requirement information expressed in natural language; preprocessing the original video material and the video creation requirement respectively, so as to disassemble the original video material into a plurality of key frame picture sequences and to convert the video creation requirement expressed in natural language into a machine-readable format; performing scene segmentation and picture content understanding on the key frame picture sequence through an artificial intelligence (AI) image understanding model, and generating a video understanding result in text format; reasoning over the video creation requirement and the video understanding result through an AI language model, and generating a video editing scheme in combination with video editing knowledge, wherein the video editing scheme comprises a composition structure of a target video to be generated and a plurality of atomic tasks to be executed; and determining a tool corresponding to each atomic task according to the video editing scheme, and calling the tool after constructing the parameter information it requires, so as to generate the target video.

The method may further comprise: identifying and confirming the user's video creation requirement by means of natural language processing, semantic understanding technology, and interaction with the user.

When preprocessing the original video material, the method may further comprise: adding auxiliary information to the key frame pictures, wherein the auxiliary information represents the temporal position of each key frame within the original video material.
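The preprocessing step above can be sketched minimally: sample key frames and attach the auxiliary timing information (position on the time axis) that lets a language model reason about where each frame sits. This is an illustrative assumption of how such sampling might look; in practice a decoder such as ffmpeg would supply real frames, and here only indices and timestamps are computed.

```python
# Minimal sketch of key frame sampling with auxiliary timing information.
# No real video is decoded; frame indices stand in for frame data.

def extract_keyframes_with_timestamps(n_frames: int, fps: float, step: int):
    """Return (frame_index, timestamp_seconds) pairs for sampled key frames.

    The timestamp is the auxiliary information attached to each key frame,
    representing its position on the time axis of the original material.
    """
    return [(i, round(i / fps, 3)) for i in range(0, n_frames, step)]

# A 10-second clip at 25 fps, sampling one key frame per second.
keyframes = extract_keyframes_with_timestamps(n_frames=250, fps=25.0, step=25)
```

Real systems often burn the timestamp into the picture or pass it as metadata alongside the frame, so that the downstream language model can reference positions like "around 00:07" when planning cuts.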
The auxiliary information may comprise position information of the key frames on the time axis of the original video material. The atomic tasks may comprise a video paragraph segmentation task, and the parameters required by the video paragraph segmentation task comprise paragraph segmentation point position information expressed as positions on the time axis. When the AI language model generates the video editing scheme, it approximately determines the paragraph segmentation point positions according to the time axis information in the picture content of the key frame pictures. Constructing the required parameter information for the tool then comprises: determining a plurality of video frames within a target time range from the original video material according to the approximately determined paragraph segmentation point position information; and determining a scene-switching key frame position by comparing the differences between consecutive video frames within the target time range, and precisely determining the paragraph segmentation point position according to the scene-switching key frame position.
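The refinement described above, taking an approximate segmentation point and locating the precise scene switch by comparing consecutive frames within a window, can be shown with a small numeric sketch. This is an illustrative reconstruction, not the patent's implementation: the window size, the pure-Python "frames" (lists of pixel values), and the sum-of-absolute-differences metric are all assumptions.

```python
# Illustrative sketch: refine an approximate cut point by finding the
# largest inter-frame difference inside a window around it.
# Frames are fake (lists of pixel values); the metric and window size
# are arbitrary assumptions.

def frame_diff(a, b):
    """Sum of absolute per-pixel differences between two frames."""
    return sum(abs(x - y) for x, y in zip(a, b))

def refine_cut(frames, approx_idx, window=3):
    """Return the precise cut index: the position with the largest
    difference between consecutive frames within
    [approx_idx - window, approx_idx + window]."""
    lo = max(1, approx_idx - window)
    hi = min(len(frames) - 1, approx_idx + window)
    return max(range(lo, hi + 1),
               key=lambda i: frame_diff(frames[i - 1], frames[i]))

# Synthetic clip: constant-brightness frames; the scene changes at index 6,
# while the language model only guessed index 4 from key frame timestamps.
frames = [[10] * 4] * 6 + [[200] * 4] * 4
precise = refine_cut(frames, approx_idx=4)
```

Per claim 5, a production version would additionally run silence detection over the same target time range and combine both signals when choosing the final segmentation point.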