
CN-122027847-A - Marketing video automatic editing method based on machine learning

CN122027847A

Abstract

The invention discloses a machine learning-based method for automatically editing marketing videos, relates to the technical field of artificial-intelligence video processing, and solves the problems of low marketing-video editing efficiency and content that is disjointed from user preferences. The method extracts video segments and analyzes their visual and audio characteristics to identify key emotional climax segments, constructs a narrative path with an attention mechanism, generates a personalized video from the optimized sequence, inserts transition effects, and renders with the determined parameters to produce a complete video with optimized interaction potential. Automatic intelligent editing is thereby realized, improving content-creation efficiency and audience engagement.

Inventors

  • XU ZHE

Assignees

  • 大力奇迹(杭州)科技有限责任公司

Dates

Publication Date
2026-05-12
Application Date
2026-02-25

Claims (10)

  1. A machine learning-based marketing video automatic editing method, comprising the steps of: S1, extracting an original frame sequence from a video material library, and analyzing visual features and audio signals with a computer vision algorithm to obtain a preliminary set of key emotional climax segments; S2, calculating the association strength among the segments with an attention mechanism model according to the preliminary set, and, if the association strength is higher than a preset threshold, classifying the related segments into the same narrative group and determining candidate structures for coherent narrative paths; S3, acquiring the narrative paths in the candidate structures, optimizing the order of the segments in each path, evaluating logical fluency with a sequence modeling algorithm, and determining the optimal narrative sequence; S4, extracting audio and visual elements from the optimal narrative sequence, and fusing the elements with target-audience preference data using a generative adversarial network to obtain adjusted personalized video segments; S5, constructing an overall video framework from the personalized video segments, and, if the sound continuity at a segment transition point in the framework is lower than a threshold, inserting a transition effect to obtain a smoothly connected video draft; S6, rendering the video draft, analyzing the overall duration and rhythm distribution of the draft, and determining the editing parameters of the final output version; and S7, acquiring the editing parameters, applying them to the video draft to generate a complete marketing video, and evaluating the audience interaction potential of the video through the embedded metadata tags to obtain the optimized final content.
  2. The method according to claim 1, wherein in S1 the process of obtaining the preliminary set of key emotional climax segments comprises: extracting an original frame sequence from the video material library, performing feature extraction of color distribution and brightness contrast on each frame with an image processing tool, and simultaneously capturing waveform amplitude and frequency changes of the corresponding audio signal with an audio processing tool, to obtain a visual feature dataset and an audio feature dataset; jointly comparing, according to the two datasets, the visual features of each frame with the audio features of the corresponding time period, and, if the color saturation or brightness change in the visual features exceeds a preset threshold and the amplitude or frequency change in the audio features also exceeds a preset threshold, judging the frame to be a potential emotional climax segment, thereby obtaining a preliminary set of emotional climax segments; continuously detecting, for the preliminary set, the feature changes of adjacent frames with a time-series comparison tool, classifying adjacent frames into the same emotional climax segment if their visual and audio features change with a consistent trend, and determining the final grouping of emotional climax segments; and time-stamping and extracting the segments of the final grouping with a video editing tool, separating the key emotional climax segments from the original video material to obtain an independent set of emotional climax video segments for subsequent processing (a hedged sketch of this detection step follows the claims).
  3. The method of claim 1, wherein in S2 the process of determining candidate structures for coherent narrative paths comprises: quantifying, according to the segment data in the preliminary set, the semantic relevance among segments with an attention computation tool to generate pairwise association strength values and obtain a preliminary association strength matrix; screening the preliminary matrix with a preset threshold, and, if the association strength between segments is above the threshold, classifying those segments into the same narrative group to determine a preliminary narrative grouping; obtaining the segment content of each narrative group from the preliminary grouping, detecting intra-group consistency with a semantic comparison tool, judging whether segments with low content relevance exist, and, if so, reassigning those segments to obtain an optimized set of narrative groups; and arranging the logical order among the groups with a path construction tool according to the optimized set, generating candidate structures for coherent narrative paths with a timeline sequencing tool, and determining the final path scheme (see the attention-grouping sketch after the claims).
  4. The method of claim 1, wherein in S3 the process of determining the optimal narrative sequence comprises: acquiring a narrative path in at least one candidate structure through a pre-established narrative segment database, preliminarily arranging the order of the segments in the path, and checking the front-to-back logical relations among the segments with a timestamp comparison tool to obtain a preliminarily ordered path sequence; analyzing the semantic association between adjacent segments in the preliminarily ordered path sequence with a content-cohesion detection tool, and, if the association is below a preset threshold, logically filling the content between the segments with a semantic completion tool to determine an optimized narrative path sequence; scoring the overall logical fluency of the optimized path sequence with a text fluency assessment tool, taking the path sequence with the highest score, and identifying a candidate narrative sequence that meets the logical requirements; and analyzing the degree of match between the candidate narrative sequence and a preset narrative target with a comparison tool, outputting the sequence as the optimal narrative sequence if the match reaches a preset threshold, and otherwise fine-tuning the order of the segments with a logic adjustment tool to obtain the final optimal narrative sequence (see the fluency-scoring sketch after the claims).
  5. The method according to claim 1, wherein in S4 the process of obtaining the adjusted personalized video segments comprises: extracting audio elements and visual elements from the optimal narrative sequence, dividing the audio elements into background tracks and dialogue tracks with a pre-established audio separation tool, dividing the visual elements into dynamic and static pictures with an image segmentation tool, acquiring the separated audio and visual segments, and determining the raw material for subsequent processing; preliminarily aligning the audio segments with the visual segments using a content matching tool, and, if the duration difference between them exceeds a preset threshold, compressing or stretching the audio segments with a duration adjustment tool to obtain aligned audio-visual combined material; acquiring a pre-established audience preference feature library for the audio-visual combined material, extracting its emotional and stylistic features with a feature extraction tool, and, if the match between those features and the audience preference feature library is below a preset threshold, adjusting the tone of the visual segments with a style conversion tool and judging whether the adjusted segments conform to the preferences; and integrating the adjusted visual and audio segments with a deep fusion tool, calibrating the emotional intensity of the audio against the pictorial expression of the visual segments with a content integration tool, and obtaining the final personalized video segments that meet the target requirements (see the preference-matching sketch after the claims).
  6. The method of claim 1, wherein in S5 the process of obtaining the smoothly connected video draft comprises: acquiring the start and end time points of each segment from the timing data of the personalized video segments, extracting the corresponding audio feature data from the audio track, detecting the audio continuity at the transition points of adjacent segments, and judging whether the sound continuity is below standard by comparing the difference in audio features against a preset threshold, to obtain a detection result for each transition point; if the detection result shows that the audio continuity at a transition point is below the preset threshold, obtaining at least one matching audio transition effect from a pre-established transition effect library, adapting it to the audio characteristics of the transition point, and determining an insertion scheme for the transition effect; adding the transition effects of the insertion scheme to the corresponding transition points to obtain an adjusted video segment sequence, smoothing the adjusted audio data, and fine-tuning the joints with an audio editing tool to obtain a smoothly connected video combination; and acquiring the timing structure data of the whole video from the smoothly connected combination, verifying audio-video synchronization at the joints between the video combination and the transition effects, and correcting any unsynchronized parts with a video editing tool to determine the final video draft (see the crossfade sketch after the claims).
  7. A machine learning-based marketing video automatic editing system for implementing the method of any of claims 1-6, the system comprising: a segment extraction module for extracting an original frame sequence from the video material library and analyzing visual features and audio signals with a computer vision algorithm to obtain a preliminary set of key emotional climax segments; a narrative group construction module for calculating the association strength among the segments with an attention mechanism model according to the preliminary set, classifying related segments into the same narrative group if the association strength is above a preset threshold, and determining candidate structures for coherent narrative paths; a sequence optimization module for acquiring the narrative paths in the candidate structures, optimizing the order of the segments in each path, evaluating logical fluency with a sequence modeling algorithm, and determining the optimal narrative sequence; a personalized generation module for extracting audio and visual elements from the optimal narrative sequence and fusing them with target-audience preference data using a generative adversarial network to obtain adjusted personalized video segments; a framework construction module for constructing an overall video framework from the personalized video segments and inserting a transition effect if the sound continuity at a segment transition point is below a threshold, to obtain a smoothly connected video draft; a rendering processing module for rendering the video draft, analyzing its overall duration and rhythm distribution, and determining the editing parameters of the final output version; and a video generation module for acquiring the editing parameters, applying them to the video draft to generate a complete marketing video, evaluating the audience interaction potential of the video, and obtaining the optimized final content through the embedded metadata tags.
  8. A computer terminal device, comprising: one or more processors; and a memory coupled to the one or more processors and storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the steps of the method of any of claims 1-6.
  9. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any of claims 1-6.
  10. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any of claims 1-6.
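
The climax-detection step of claims 1 and 2 thresholds frame-to-frame visual change against simultaneous audio change. Below is a minimal sketch of that idea, assuming OpenCV for frame access and a decoded mono waveform as a NumPy array; the threshold values and frame/audio alignment are illustrative assumptions, not parameters from the patent.

```python
"""Sketch of step S1: flag candidate emotional-climax spans where visual
change (saturation/brightness) and audio change (amplitude) jump together."""
import cv2
import numpy as np

def frame_features(video_path):
    """Per-frame mean saturation and brightness from the HSV colour space."""
    cap = cv2.VideoCapture(video_path)
    feats = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        feats.append((hsv[..., 1].mean(), hsv[..., 2].mean()))  # (saturation, value)
    cap.release()
    return np.array(feats)  # shape (n_frames, 2)

def audio_amplitude(waveform, sr, fps, n_frames):
    """Mean absolute amplitude of the audio slice aligned with each video frame."""
    hop = int(sr / fps)
    return np.array([np.abs(waveform[i * hop:(i + 1) * hop]).mean()
                     for i in range(n_frames)])

def candidate_climaxes(visual, audio, vis_thresh=8.0, aud_thresh=0.05):
    """A frame is a candidate when both the visual AND the audio delta jump."""
    dv = np.abs(np.diff(visual, axis=0)).max(axis=1)  # frame-to-frame visual delta
    da = np.abs(np.diff(audio))                       # frame-to-frame audio delta
    hits = np.where((dv > vis_thresh) & (da > aud_thresh))[0] + 1
    # Merge consecutive hits into spans, mirroring the adjacent-frame grouping.
    segments, start = [], None
    for i, f in enumerate(hits):
        if start is None:
            start = f
        if i == len(hits) - 1 or hits[i + 1] != f + 1:
            segments.append((start, f))
            start = None
    return segments  # list of (first_frame, last_frame) candidate climax spans
```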
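Claim 3's association-strength computation can be illustrated with scaled dot-product attention over segment embeddings, followed by threshold-based grouping. The embeddings are assumed to come from some upstream encoder, and the 0.2 threshold is an arbitrary placeholder; this is a sketch of the stated mechanism, not the patent's implementation.

```python
"""Sketch of step S2: pairwise association strength and narrative grouping."""
import numpy as np

def association_matrix(seg_embeddings):
    """Scaled dot-product attention scores between all segment pairs."""
    e = np.asarray(seg_embeddings, dtype=float)  # (n_segments, dim)
    scores = e @ e.T / np.sqrt(e.shape[1])       # raw attention logits
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)  # row-wise softmax

def narrative_groups(strength, threshold=0.2):
    """Union segments into one narrative group when strength > threshold."""
    n = strength.shape[0]
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(n):
        for j in range(i + 1, n):
            if strength[i, j] > threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())  # each list is one candidate narrative group
```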
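For the fluency evaluation of claim 4, a simple stand-in for the sequence modeling algorithm is to score each candidate ordering by summing a pairwise coherence measure over adjacent segments and keep the maximum. The `coherence` lookup here is a hypothetical input; any learned or heuristic adjacency score could fill it.

```python
"""Sketch of step S3: rank candidate narrative orderings by summed coherence."""
from itertools import permutations

def path_fluency(order, coherence):
    """Sum of coherence scores over adjacent segments in this ordering."""
    return sum(coherence[a][b] for a, b in zip(order, order[1:]))

def best_order(segments, coherence):
    """Exhaustively search a small group for its most fluent ordering."""
    return max(permutations(segments), key=lambda o: path_fluency(o, coherence))

def best_narrative_sequence(candidate_paths, coherence):
    """Pick the candidate path whose ordering maximises overall fluency."""
    return max(candidate_paths, key=lambda p: path_fluency(p, coherence))
```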
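The preference check of claim 5 compares a clip's emotional and stylistic features against an audience preference library and adjusts tone on a weak match. The sketch below substitutes a cosine-similarity match and a crude warm-tone colour shift for the patent's GAN-based fusion and style conversion tool; the threshold and shift amount are assumptions.

```python
"""Sketch of the preference check inside step S4 (heuristic stand-in for a GAN)."""
import numpy as np

def style_match(clip_style, preferred_style):
    """Cosine similarity between clip and audience style feature vectors."""
    a = np.asarray(clip_style, dtype=float)
    b = np.asarray(preferred_style, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def adjust_tone(frames_bgr, match, threshold=0.6, shift=12):
    """If the match is below threshold, nudge uint8 BGR frames toward warmer tones."""
    if match >= threshold:
        return frames_bgr
    out = frames_bgr.astype(np.int16)
    out[..., 2] = np.clip(out[..., 2] + shift, 0, 255)  # boost red channel (BGR)
    out[..., 0] = np.clip(out[..., 0] - shift, 0, 255)  # cut blue channel
    return out.astype(np.uint8)
```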
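Claim 6's transition handling checks sound continuity at each cut and inserts a transition effect when it falls below a threshold. A minimal audio-only sketch: compare short RMS windows on either side of the joint and apply a linear crossfade when the jump is too large. Window length and threshold are illustrative.

```python
"""Sketch of step S5: continuity check at a cut, crossfade only when needed."""
import numpy as np

def continuity(left_tail, right_head):
    """RMS difference between closing and opening audio windows; lower is smoother."""
    return abs(np.sqrt(np.mean(left_tail ** 2)) - np.sqrt(np.mean(right_head ** 2)))

def join_with_transition(a, b, sr, max_jump=0.02, fade_s=0.25):
    """Concatenate two mono clips, crossfading only when the joint is rough."""
    n = int(sr * fade_s)
    if continuity(a[-n:], b[:n]) <= max_jump:
        return np.concatenate([a, b])  # cut is already smooth enough
    ramp = np.linspace(0.0, 1.0, n)
    blended = a[-n:] * (1.0 - ramp) + b[:n] * ramp
    return np.concatenate([a[:-n], blended, b[n:]])
```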

Description

Marketing video automatic editing method based on machine learning

Technical Field

The invention belongs to the technical field of artificial-intelligence video processing, and particularly relates to an automatic marketing video editing method based on machine learning.

Background

In the current era of digital marketing, video content has become a central means of brand communication and consumer engagement, and its influence continues to expand. However, there is a significant conflict between the strong demand for video marketing and the capabilities of existing content-creation technology. The traditional video editing workflow depends mainly on manual operation; it is time-consuming, costly, and poorly suited to large-scale, fast-paced content production. Although some automatic editing tools exist on the market, most operate on preset fixed rules, lack an understanding of the deep semantics and emotional value of video content, and cannot be dynamically adjusted or personalized for the preferences of different target audiences. This technical limitation means the generated video content often falls short of users' real expectations and struggles to arouse the audience's interest and willingness to interact.

A further technical challenge is how to precisely locate and extract the most attractive segments from a large volume of complex video material and organize them into a logically coherent and compelling narrative whole. Video material carries picture, sound, and emotional information across multiple dimensions with complex interrelationships; the prior art has difficulty comprehensively and accurately capturing the internal associations among this multimodal information, which often leads to key emotional climax content being missed or misjudged. For example, in a marketing video a key shot that could resonate emotionally with viewers may be ignored, while flat and uninteresting segments are retained. This shortfall in information processing in turn yields automatically edited videos with loose, logically broken narratives, severely harming the viewing experience and the content's distribution effect.

Therefore, the industry urgently needs a technical solution that can deeply fuse and understand video content, intelligently identify key segments, and automatically construct a fluent narrative structure. Such a solution addresses the key problem of efficiently and accurately mining high-value content from raw material and intelligently synthesizing it into marketing videos that both follow narrative logic and precisely reach target audiences, which is of great significance for improving brand competitiveness and user connection. Previously, the main obstacle to this goal was the lack of a comprehensive intelligent method that could jointly analyze visual and audio features and make decisions based on high-level semantic and emotional logic.

Disclosure of Invention

To solve the above technical problems, the invention provides a machine learning-based marketing video automatic editing method, aiming to overcome the shortcomings of the prior art.
In order to achieve the above object, the present invention provides a machine learning-based marketing video automatic editing method, comprising the steps of: S1, extracting an original frame sequence from a video material library, and analyzing visual features and audio signals with a computer vision algorithm to obtain a preliminary set of key emotional climax segments; S2, calculating the association strength among the segments with an attention mechanism model according to the preliminary set, and, if the association strength is higher than a preset threshold, classifying the related segments into the same narrative group and determining candidate structures for coherent narrative paths; S3, acquiring the narrative paths in the candidate structures, optimizing the order of the segments in each path, evaluating logical fluency with a sequence modeling algorithm, and determining the optimal narrative sequence; S4, extracting audio and visual elements from the optimal narrative sequence, and fusing the elements with target-audience preference data using a generative adversarial network to obtain adjusted personalized video segments; S5, constructing an overall video framework from the personalized video segments, and, if the sound continuity at a segment transition point in the framework is lower than a threshold, inserting a transition effect to obtain a smoothly connected video draft; S6, rendering the video draft, analyzing the overall duration and rhythm distribution of the draft, and determining the editing parameters of the final output version; and S7, acquiring the editing parameters, applying them to the video draft to generate a complete marketing video, and evaluating the audience interaction potential of the video through the embedded metadata tags to obtain the optimized final content.
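
Read end to end, steps S1-S7 form a linear pipeline. The skeleton below only fixes that control flow; every stage is injected as a callable, and all names and signatures are assumptions for illustration rather than the invention's actual interfaces.

```python
"""Hedged orchestration skeleton for the S1-S7 pipeline described above."""
from typing import Callable, Dict, Sequence

def auto_edit(material: Sequence, audience: dict, steps: Dict[str, Callable]):
    """Chain the seven stages; `steps` maps stage names to real implementations."""
    climaxes = steps["detect_climaxes"](material)        # S1: key emotional climax segments
    groups   = steps["build_groups"](climaxes)           # S2: attention-based narrative groups
    sequence = steps["optimal_sequence"](groups)         # S3: fluency-ranked ordering
    clips    = steps["personalize"](sequence, audience)  # S4: audience-preference fusion
    draft    = steps["assemble"](clips)                  # S5: transitions at rough cuts
    params   = steps["edit_params"](draft)               # S6: duration and rhythm analysis
    return steps["render"](draft, params)                # S7: final video with metadata tags
```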