
CN-121985190-A - Automatic generation method, system and equipment for auxiliary video material

CN 121985190 A

Abstract

The application relates to the technical field of video and provides a method for automatically generating auxiliary video material. The method comprises: analyzing a main video and extracting its structural features, the structural features comprising visual semantic vectors, a video atmosphere code, and motion trajectory data; inputting the visual semantic vectors and the video atmosphere code into a condition editor to generate a conditional embedding sequence; inputting the motion trajectory data into a motion editor to generate a structure guide map sequence; inputting the conditional embedding sequence and the structure guide map sequence into a video generation model and combining them with random-noise latent images to generate a plurality of candidate auxiliary materials; and performing a consistency evaluation of each candidate auxiliary material against the main video to determine the auxiliary video material.
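Read as a pipeline, the abstract describes five stages. The following is a minimal sketch of how those stages might be wired together; every function name, tensor shape, and stand-in computation here is an illustrative assumption, not the patent's disclosed implementation.

```python
import numpy as np

# Hypothetical skeleton of the claimed pipeline. Each stage is a stand-in
# for the learned component named in the abstract (condition editor,
# motion editor, video generation model, consistency evaluator).
rng = np.random.default_rng(0)

def analyze_main_video(frames):
    """Stage 1: structural features -- per-object visual semantic vectors,
    a latent atmosphere code, and per-object (x, y) trajectories."""
    semantics = rng.normal(size=(4, 512))         # 4 core objects, 512-d each
    atmosphere = rng.normal(size=128)             # latent style vector
    trajectories = rng.uniform(0, 1, (4, 16, 2))  # 16 time steps, normalized x/y
    return semantics, atmosphere, trajectories

def condition_editor(semantics, atmosphere):
    """Stage 2: pair each object embedding with the shared atmosphere code
    to form the conditional embedding sequence."""
    return np.concatenate(
        [semantics, np.tile(atmosphere, (len(semantics), 1))], axis=1)

def motion_editor(trajectories, size=64):
    """Stage 3: rasterize trajectories into per-frame structure guide maps
    (a Gaussian heat-map version is sketched after the claims)."""
    guide = np.zeros((trajectories.shape[1], size, size))
    for obj in trajectories:
        for t, (x, y) in enumerate(obj):
            guide[t, int(y * (size - 1)), int(x * (size - 1))] = 1.0
    return guide

def generate_candidates(cond_seq, guide_maps, n=3):
    """Stage 4: stand-in for the generator -- in the patent, the conditions
    are combined with random-noise latent images to emit candidate clips."""
    return [rng.normal(size=(16, 64, 64, 3)) for _ in range(n)]

def select_best(candidates):
    """Stage 5: stand-in consistency evaluation; a weighted scoring version
    is sketched in the Description below."""
    return max(candidates, key=lambda c: -abs(c.mean()))

semantics, atmosphere, trajectories = analyze_main_video(frames=None)
cond = condition_editor(semantics, atmosphere)
guides = motion_editor(trajectories)
best = select_best(generate_candidates(cond, guides))
print(cond.shape, guides.shape, best.shape)  # (4, 640) (16, 64, 64) (16, 64, 64, 3)
```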

Inventors

  • WANG CHUANPENG
  • LI BO

Assignees

  • 广州三七极梦网络技术有限公司

Dates

Publication Date
2026-05-05
Application Date
2025-12-09

Claims (10)

  1. An automatic generation method for auxiliary video material, characterized by comprising the following steps: analyzing a main video and extracting structural features of the main video, wherein the structural features comprise visual semantic vectors, a video atmosphere code, and motion trajectory data; inputting the visual semantic vectors and the video atmosphere code into a condition editor to generate a conditional embedding sequence; inputting the motion trajectory data into a motion editor to generate a structure guide map sequence; inputting the conditional embedding sequence and the structure guide map sequence into a video generation model and combining them with random-noise latent images to generate a plurality of candidate auxiliary materials; and performing a consistency evaluation of each candidate auxiliary material against the main video to determine the auxiliary video material.
  2. The method of claim 1, wherein analyzing the main video and extracting its structural features, the structural features comprising visual semantic vectors, a video atmosphere code, and motion trajectory data, comprises: extracting a key frame sequence from the main video based on shot boundary detection and content-change saliency analysis; for each key frame in the key frame sequence, identifying core objects through an instance segmentation model and extracting a visual semantic vector for each core object with a pre-trained visual encoder; computing global image statistics of the key frame sequence through a style feature extraction network and generating the video atmosphere code; and, across consecutive frames of the main video, tracking specified objects with a target tracking algorithm, acquiring their motion coordinate sequences, and constructing the motion trajectory data of the specified objects.
  3. The method of claim 2, wherein extracting a key frame sequence from the main video based on shot boundary detection and content-change saliency analysis comprises: detecting shot switching points in the main video with a shot boundary detection algorithm, wherein each pair of adjacent shot switching points delimits a shot unit; for each shot unit, performing content-change saliency analysis to generate a content-change curve; determining, from the local extrema of the content-change curve together with the duration of the shot unit, the number of key frames to extract and the candidate key frames within that shot unit; and clustering the candidate key frames of all shot units, removing candidates with visually redundant content, and generating the key frame sequence.
  4. The method of claim 2, wherein computing global image statistics of the key frame sequence through a style feature extraction network and generating the video atmosphere code comprises: inputting the key frame sequence into a pre-trained style feature extraction network and computing the Gram matrix of each key frame as that key frame's style statistics; average-pooling the style statistics of all key frames to generate a global style descriptor; and compressing and mapping the global style descriptor to the video atmosphere code, which is a vector in a predefined latent space.
  5. The method of claim 2, wherein tracking specified objects across consecutive frames of the main video with a target tracking algorithm, acquiring their motion coordinate sequences, and constructing their motion trajectory data comprises: determining the specified objects from the core objects of each key frame or through user interaction; continuously predicting and updating each specified object's bounding box and unique identity across the video frame sequence of the main video with a multi-object tracking algorithm; and recording the center-point coordinate sequence of each specified object throughout the main video to form the motion trajectory data.
  6. The method of claim 1, wherein inputting the visual semantic vectors and the video atmosphere code into a condition editor to generate a conditional embedding sequence comprises: normalizing and serializing the visual semantic vectors and the video atmosphere code; and mapping the processed visual semantic vectors and video atmosphere code through a visual semantic adapter and an atmosphere code adapter, respectively, to generate the conditional embedding sequence.
  7. The method of claim 1, wherein inputting the motion trajectory data into a motion editor to generate a structure guide map sequence comprises: for the center-point coordinate of each time step in the motion trajectory data, generating a Gaussian heat map centered on that coordinate in the two-dimensional coordinate system of the corresponding frame; stacking the Gaussian heat maps of all time steps to form the structure guide map sequence; and convolutionally fusing the structure guide map sequence with the latent feature maps of the video generation model, guiding the composition and motion of the generated content through spatial constraints (a minimal sketch of this heat-map construction appears after the claims).
  8. The method of claim 1, wherein performing a consistency evaluation of each candidate auxiliary material against the main video to determine the auxiliary video material comprises: computing a temporal-consistency score and a spatial-structure score for each candidate auxiliary material; extracting features from each candidate auxiliary material and from the main video with a multi-modal model and computing a semantic-similarity score; and generating a comprehensive quality score for each candidate auxiliary material through a weighted decision model based on the temporal-consistency, spatial-structure, and semantic-similarity scores, and determining the auxiliary video material from the comprehensive quality scores.
  9. An automatic generation system for auxiliary video material, comprising: a first processing module for analyzing a main video and extracting structural features of the main video, the structural features comprising visual semantic vectors, a video atmosphere code, and motion trajectory data; a second processing module for inputting the visual semantic vectors and the video atmosphere code into a condition editor to generate a conditional embedding sequence; a third processing module for inputting the motion trajectory data into a motion editor to generate a structure guide map sequence; a fourth processing module for inputting the conditional embedding sequence and the structure guide map sequence into a video generation model and combining them with random-noise latent images to generate a plurality of candidate auxiliary materials; and a fifth processing module for performing a consistency evaluation of each candidate auxiliary material against the main video to determine the auxiliary video material.
  10. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 8 when executing the computer program.
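The trajectory-to-guide-map step of claim 7 is concrete enough to sketch directly. The following illustrates rasterizing each center-point coordinate into a Gaussian heat map and stacking one map per time step; the grid size and the Gaussian width sigma are illustrative assumptions, since the patent does not fix them.

```python
import numpy as np

def gaussian_heatmap(cx, cy, height=64, width=64, sigma=3.0):
    """One 2-D Gaussian centered on an object's (cx, cy) pixel coordinate."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def structure_guide_sequence(trajectory, height=64, width=64, sigma=3.0):
    """Stack one heat map per time step of a center-point trajectory,
    as described in claim 7."""
    return np.stack([gaussian_heatmap(cx, cy, height, width, sigma)
                     for cx, cy in trajectory])

# Example: an object drifting diagonally across 16 frames.
trajectory = np.linspace([8, 8], [56, 56], num=16)
guide = structure_guide_sequence(trajectory)
print(guide.shape)  # (16, 64, 64)
```

In the claimed method this sequence is then convolutionally fused with the generator's latent feature maps, so each heat map acts as a soft spatial constraint on where the generated object may appear in the corresponding frame.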

Description

Automatic generation method, system and equipment for auxiliary video material

Technical Field

The application belongs to the technical field of video, and in particular relates to a method, system and device for automatically generating auxiliary video material.

Background

In video production, a sensible pairing of the main video with auxiliary video material is an important means of improving video quality: it strengthens the creative expressiveness of the production and yields high-quality content. In the prior art, auxiliary video material is generally obtained by keyword matching against an existing material library, which lacks any deep understanding of the video content or creative insight, so material selection is constrained and creative expression falls short. With the development of artificial intelligence, auxiliary video material can instead be generated automatically with a large AI model: the main video is analyzed to produce a text description, text prompts are derived from that description, and the prompts are fed to the large model to generate the material. However, material generated this way suffers information loss in the video-to-text conversion; fine visual textures, complex motion patterns, and abstract emotional atmosphere in particular are hard to describe accurately in text. Moreover, text prompts are ambiguous, and the same prompt may yield content in different styles, so consistency between the generated auxiliary material and the main video fundamentally cannot be guaranteed.

Disclosure of Invention

Embodiments of the application provide a method, system and device for automatically generating auxiliary video material, which can solve one of the problems in the prior art. In a first aspect, an embodiment of the application provides a method for automatically generating auxiliary video material, comprising: analyzing a main video and extracting structural features of the main video, wherein the structural features comprise visual semantic vectors, a video atmosphere code, and motion trajectory data; inputting the visual semantic vectors and the video atmosphere code into a condition editor to generate a conditional embedding sequence; inputting the motion trajectory data into a motion editor to generate a structure guide map sequence; inputting the conditional embedding sequence and the structure guide map sequence into a video generation model and combining them with random-noise latent images to generate a plurality of candidate auxiliary materials; and performing a consistency evaluation of each candidate auxiliary material against the main video to determine the auxiliary video material.
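The consistency-evaluation step that closes the first aspect (detailed in claim 8) reduces to a weighted combination of three scores. A minimal sketch follows; the weights and the score values are illustrative assumptions, as the patent does not disclose how the weighted decision model is parameterized.

```python
import numpy as np

# Hypothetical weighting of the three scores named in claim 8; the actual
# weights and score definitions are not disclosed in the patent.
WEIGHTS = {"temporal": 0.4, "spatial": 0.3, "semantic": 0.3}

def comprehensive_quality(temporal, spatial, semantic):
    """Weighted decision model: combine the temporal-consistency,
    spatial-structure, and semantic-similarity scores into one score."""
    return (WEIGHTS["temporal"] * temporal
            + WEIGHTS["spatial"] * spatial
            + WEIGHTS["semantic"] * semantic)

def pick_auxiliary_material(candidates, scores):
    """Return the candidate whose comprehensive quality score is highest."""
    totals = [comprehensive_quality(*s) for s in scores]
    return candidates[int(np.argmax(totals))]

# Example: three candidates with (temporal, spatial, semantic) scores.
scores = [(0.8, 0.7, 0.6), (0.9, 0.5, 0.8), (0.6, 0.9, 0.7)]
print(pick_auxiliary_material(["A", "B", "C"], scores))  # "B"
```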
Further, analyzing the main video and extracting its structural features, the structural features comprising visual semantic vectors, a video atmosphere code, and motion trajectory data, includes: extracting a key frame sequence from the main video based on shot boundary detection and content-change saliency analysis; for each key frame in the key frame sequence, identifying core objects through an instance segmentation model and extracting a visual semantic vector for each core object with a pre-trained visual encoder; computing global image statistics of the key frame sequence through a style feature extraction network and generating the video atmosphere code; and, across consecutive frames of the main video, tracking specified objects with a target tracking algorithm, acquiring their motion coordinate sequences, and constructing the motion trajectory data of the specified objects.

Further, extracting a key frame sequence from the main video based on shot boundary detection and content-change saliency analysis includes: detecting shot switching points in the main video with a shot boundary detection algorithm, wherein each pair of adjacent shot switching points delimits a shot unit; for each shot unit, performing content-change saliency analysis to generate a content-change curve; determining, from the local extrema of the content-change curve together with the duration of the shot unit, the number of key frames to extract and the candidate key frames within that shot unit; and clustering the candidate key frames of all shot units, removing candidates with visually redundant content, and generating the key frame sequence.

Further, computing global image statistics of the key frame sequence through a style feature extraction network and generating the video atmosphere code includes the following steps: inputting the key frame sequence into a pre-trained style feature extraction network and computing the Gram matrix of each key frame as that key frame's style statistics; average-pooling the style statistics of all key frames to generate a global style descriptor; and compressing and mapping the global style descriptor to the video atmosphere code, which is a vector in a predefined latent space.
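The Gram-matrix step is standard style-feature machinery and can be sketched directly. In the sketch below, random arrays stand in for the style network's feature maps, and a fixed random projection stands in for the patent's learned compression mapping; both substitutions are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def gram_matrix(feature_map):
    """Gram matrix of a (C, H, W) feature map: channel-wise correlations,
    normalized by the number of entries -- the per-frame style statistic."""
    c, h, w = feature_map.shape
    f = feature_map.reshape(c, h * w)
    return (f @ f.T) / (c * h * w)

def video_atmosphere_code(keyframe_features, latent_dim=128):
    """Average-pool per-frame Gram matrices into a global style descriptor,
    then map it into a latent vector. A fixed random projection stands in
    for the learned compression mapping described in the patent."""
    grams = np.stack([gram_matrix(f) for f in keyframe_features])
    descriptor = grams.mean(axis=0).ravel()  # global style descriptor
    projection = rng.normal(size=(latent_dim, descriptor.size))
    return projection @ descriptor / np.sqrt(descriptor.size)

# Example: 8 keyframes with 64-channel feature maps (e.g. from a CNN layer).
features = rng.normal(size=(8, 64, 32, 32))
code = video_atmosphere_code(features)
print(code.shape)  # (128,)
```

Because the Gram matrix discards spatial layout and keeps only channel correlations, the resulting code captures texture and tonal "atmosphere" rather than scene content, which matches its role as a style condition for the generator.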