CN-121985192-A - Video special effect editing method and system based on text-to-video large model
Abstract
The invention provides a video special effect editing method and system based on a text-to-video large model. A spatiotemporal sparse condition token sequence is constructed from the source video to reduce computational redundancy while retaining key dynamic information; position-encoding correction and a causal attention mechanism are introduced into the text-to-video large model to isolate interference from noisy tokens; through a dual-stage fine-tuning strategy, the text-to-video large model is first fine-tuned on large-scale video data into a universal video editing model, and is then further fine-tuned on a special-effect editing dataset via an EffectLoRA module to obtain a video special effect editing model with text-controllable effect generation. Finally, effect injection is realized by a model jointly driven by the spatiotemporal sparse condition and the text instruction, so that the generated result satisfies the text description while strictly maintaining the spatiotemporal structural consistency of the original video. The invention features accurate editing, strong generality, and high real-time performance.
Inventors
- MAO QI
- LI YUANHANG
- JIN LIBIAO
- MA SIWEI
Assignees
- Communication University of China (中国传媒大学)
- Peking University (北京大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2025-12-17
Claims (10)
- 1. A video special effect editing method based on a text-to-video large model, characterized by comprising the following three stages: a spatiotemporal sparse condition construction stage, which comprises acquiring a spatial sparse token sequence and a temporal sparse token sequence from a source video, and splicing the spatial sparse token sequence and the temporal sparse token sequence along the token dimension to obtain a spatiotemporal sparse condition token sequence; a dual-stage fine-tuning stage, which comprises acquiring a target video corresponding to the source video from a preset target video database, acquiring a target token sequence of the target video by using a pre-trained video encoder, performing noise addition on the target token sequence to obtain a noisy target token sequence so as to fine-tune a preset text-to-video large model into a universal video editing model for performing universal video editing tasks, and fine-tuning the universal video editing model, based on a preset EffectLoRA module, with video special effect editing data containing paired visual effect examples so as to endow the universal video editing model with the editing pattern of specific visual effects; and an inference stage, which comprises constructing the source video to be edited into a spatiotemporal sparse condition token sequence, obtaining a noisy target token sequence for the source video to be edited, splicing the noisy target token sequence and the spatiotemporal sparse condition token sequence along the token dimension, and feeding them into the universal editing model equipped with EffectLoRA so as to inject visual effects into the source video to be edited.
- 2. The video special effect editing method based on a text-to-video large model according to claim 1, wherein the acquiring a spatial sparse token sequence and a temporal sparse token sequence from a source video comprises: performing spatial downsampling and first-frame extraction on the source video to obtain a low-resolution video and a first frame; and encoding the first frame into a spatial sparse token sequence and encoding the downsampled low-resolution video into a temporal sparse token sequence.
- 3. The video special effect editing method based on a text-to-video large model according to claim 1, wherein the acquiring a spatial sparse token sequence and a temporal sparse token sequence from a source video comprises: encoding the source video with a 3D VAE to obtain a source token sequence; and sparsely sampling the source token sequence along the spatial dimension to obtain a temporal sparse token sequence.
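The construction in claims 2 and 3 can be sketched as follows. This is a minimal illustration, not the patent's implementation: `vae_encode` is a hypothetical stand-in for the pre-trained 3D VAE encoder, and `spatial_stride` is an assumed sampling parameter.

```python
import numpy as np

def build_sparse_condition_tokens(video, vae_encode, spatial_stride=4):
    """Illustrative sketch: build the spatiotemporal sparse condition
    token sequence from a source video of shape (T, C, H, W).
    `vae_encode` is assumed to map a clip to a flat token sequence
    of shape (N, D)."""
    # Spatial sparse sequence: encode only the first frame, preserving
    # fine appearance detail at full resolution.
    spatial_tokens = vae_encode(video[:1])                 # (N_s, D)

    # Temporal sparse sequence: encode the whole clip, then sparsely
    # subsample tokens to keep motion cues while cutting redundancy.
    temporal_tokens = vae_encode(video)[::spatial_stride]  # (N_t, D)

    # Splice the two sequences along the token dimension (claim 1).
    return np.concatenate([spatial_tokens, temporal_tokens], axis=0)
```

Both branches end in one concatenated condition sequence, which is what the later stages consume.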
- 4. The video special effect editing method based on a text-to-video large model according to claim 1, further comprising, in the dual-stage fine-tuning stage, before fine-tuning the preset text-to-video large model into a universal video editing model for performing universal video editing tasks: adding a position-encoding correction and causal attention module to the pre-trained text-to-video large model based on the spatiotemporal sparse condition token sequence; and performing noise addition on the target token sequence so as to fine-tune the text-to-video large model equipped with the position-encoding correction and causal attention module into a universal video editing model for performing universal video editing tasks.
- 5. The video special effect editing method based on a text-to-video large model according to claim 4, wherein adding a position-encoding correction and causal attention module to the pre-trained text-to-video large model based on the spatiotemporal sparse condition token sequence comprises: inserting position-encoding correction into the text-to-video large model to avoid spatiotemporal misalignment between the spatiotemporal sparse condition token sequence and the generated content, wherein for the temporal sparse token sequence a true correspondence with the target regions is established to prevent spatial misalignment, and for the spatial sparse token sequence the position encoding of the first frame of the target token sequence is reused to avoid temporal offset; and changing the bidirectional attention mechanism of the text-to-video large model into a causal attention mechanism by designing an attention mask so as to avoid the influence of the noisy target token sequence on the spatiotemporal sparse condition token sequence.
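The position-encoding correction of claim 5 can be sketched as an assignment of position ids; the function and argument names below are illustrative assumptions, not names from the patent.

```python
import numpy as np

def corrected_position_ids(n_frames, tokens_per_frame, spatial_stride):
    """Illustrative sketch of the position-encoding correction.
    Temporal sparse tokens reuse the ids of the target positions they
    were subsampled from, preventing spatial misalignment; spatial
    sparse tokens reuse the target sequence's first-frame ids,
    preventing temporal offset."""
    target_ids = np.arange(n_frames * tokens_per_frame)
    temporal_ids = target_ids[::spatial_stride]  # ids of sampled target positions
    spatial_ids = np.arange(tokens_per_frame)    # ids of the target's first frame
    return spatial_ids, temporal_ids, target_ids
```

The key point is that condition tokens never receive fresh position ids; they inherit ids that already exist in the target sequence.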
- 6. The video special effect editing method based on a text-to-video large model according to claim 5, wherein the attention mask is designed as follows: M_{i,j} = -∞ if the query at row index i belongs to the condition token sequence and the key at column index j belongs to the noisy target token sequence z_t, and M_{i,j} = 0 otherwise; wherein M_{i,j} represents the mask at row index i and column index j, and z_t represents the noisy target token sequence.
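A plausible construction of such a mask (the patent's exact formula is not reproduced in this text, so this is a reconstruction from the claim-6 definitions) looks like:

```python
import numpy as np

def effect_attention_mask(n_cond, n_target):
    """Additive attention mask: 0 = attend, -inf = blocked.
    Condition-token queries are blocked from attending to noisy
    target-token keys, so the noisy target sequence cannot corrupt
    the condition representation; target tokens may attend everywhere."""
    mask = np.zeros((n_cond + n_target, n_cond + n_target))
    # Rows are queries, columns are keys: block condition rows from
    # noisy-target columns.
    mask[:n_cond, n_cond:] = -np.inf
    return mask
```

Adding this mask to the attention logits before the softmax turns the model's bidirectional attention into the causal pattern described in claim 5.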
- 7. The method according to claim 6, wherein performing noise addition on the target token sequence so as to fine-tune the text-to-video large model equipped with the position-encoding correction and causal attention module into a universal video editing model for performing universal video editing tasks comprises: splicing the target token sequence and the spatiotemporal sparse condition token sequence of the source video along the token dimension and feeding them into the pre-trained text-to-video large model; and performing noise addition on the target token sequence only, with the spatiotemporal sparse condition token sequence kept unchanged, wherein the loss function is: L = E_{t, p_t(z_t | z_1), ε} [ ‖ v_θ(z_t, t, c_text, c_T, c_S) − u_t(z_t | z_1) ‖² ]; wherein p_t(z_t | z_1) represents the conditional probability path at time step t, E represents the expectation, z_1 is the target token sequence, t is the time step, c_text is the text instruction, c_T is the temporal sparse token sequence, c_S is the spatial sparse token sequence, v_θ is the velocity field predicted by the text-to-video large model, u_t is the target vector field, and ε represents the noise.
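The objective of claims 7 and 8 can be sketched in the rectified-flow form common to DiT-based text-to-video models; this form is an assumption, since the patent's own formula is not reproduced in this text.

```python
import numpy as np

def flow_matching_loss(v_pred, z1, eps):
    """Illustrative flow-matching loss. Assuming the noising path
    z_t = (1 - t) * eps + t * z1, the target vector field is
    u_t = z1 - eps. Only the target tokens are noised, so only they
    contribute to the loss; the condition tokens carry no loss term.

    v_pred: model-predicted velocity for the target tokens, (N, D)
    z1:     clean target token sequence, (N, D)
    eps:    Gaussian noise sample, (N, D)"""
    u_t = z1 - eps  # target vector field
    return float(np.mean((v_pred - u_t) ** 2))
```

A perfect prediction of the velocity field drives the loss to zero, which is what the fine-tuning stages optimize.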
- 8. The video special effect editing method based on a text-to-video large model according to claim 4, wherein fine-tuning the universal video editing model for performing universal video editing tasks based on a preset EffectLoRA module comprises: inserting a low-rank LoRA into the universal video editing model for performing universal video editing tasks; splicing the target token sequence and the spatiotemporal sparse condition token sequence of the source video along the token dimension and feeding them into the universal video editing model with the low-rank LoRA inserted; and keeping the spatiotemporal sparse condition token sequence unchanged while performing noise addition on the target token sequence only, wherein the loss function is: L = E_{t, p_t(z_t | z_1), ε} [ ‖ v_θ(z_t, t, c_text, c_T, c_S) − u_t(z_t | z_1) ‖² ]; wherein p_t(z_t | z_1) represents the conditional probability path at time step t, E represents the expectation, z_1 is the target token sequence, t is the time step, c_text is the text instruction, c_T is the temporal sparse token sequence, c_S is the spatial sparse token sequence, v_θ is the velocity field predicted by the text-to-video large model, u_t is the target vector field, and ε represents the noise.
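The low-rank insertion in claim 8 follows the standard LoRA scheme; the patent only names the EffectLoRA module, so the generic layer below is an illustrative stand-in.

```python
import numpy as np

class LoRALinear:
    """Minimal LoRA sketch: a frozen base weight W is augmented with a
    trainable low-rank update B @ A, scaled by alpha / rank. With B
    initialized to zero, the layer initially reproduces the frozen
    model exactly, so fine-tuning starts from the universal editor."""

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                           # frozen base weight
        self.A = rng.normal(0, 0.01, (rank, in_dim))   # trainable down-projection
        self.B = np.zeros((out_dim, rank))             # trainable up-projection
        self.scale = alpha / rank

    def __call__(self, x):
        # Base path plus low-rank correction.
        return x @ self.weight.T + self.scale * (x @ self.A.T @ self.B.T)
```

During special-effect fine-tuning only A and B are updated, which is what keeps this stage cheap relative to the first fine-tuning stage.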
- 9. A video special effect editing system based on a text-to-video large model, applied to an electronic device, characterized by comprising: a spatiotemporal sparse condition token sequence construction unit, configured to acquire a spatial sparse token sequence and a temporal sparse token sequence from the source video, and to splice the spatial sparse token sequence and the temporal sparse token sequence along the token dimension to obtain a spatiotemporal sparse condition token sequence; a dual-stage fine-tuning unit; and an inference unit, configured to construct the source video to be edited into a spatiotemporal sparse condition token sequence, obtain a noisy target token sequence for the source video to be edited, splice the noisy target token sequence and the spatiotemporal sparse condition token sequence along the token dimension, and feed them into the universal editing model equipped with EffectLoRA so as to inject visual effects into the source video to be edited.
- 10. An electronic device, comprising a memory, a processor, and a video special effect editing program based on a text-to-video large model stored on the memory and executable on the processor, wherein the video special effect editing program, when executed by the processor, implements the video special effect editing method based on a text-to-video large model according to any one of claims 1 to 8.
Description
Video special effect editing method and system based on text-to-video large model

Technical Field

The invention relates to the field of video editing within the technical field of artificial intelligence, and in particular to a video special effect editing method and system based on a text-to-video large model.

Background

Visual effects (VFX) aim to create videos or edit existing videos by incorporating attractive visual elements such as flames, cartoon characters, or particle effects. As a core technology for film production, games, and virtual reality, VFX enriches visual narratives, highlights key elements, and creates immersive experiences. However, conventional VFX production pipelines rely on complex animation design, computer-generated imagery, and specialized post-compositing. These pipelines incur high production costs and long cycle times and require extensive manual intervention, thereby impeding personalized or real-time applications. In recent years, advances in text-to-video (T2V) generation have opened up new possibilities for automated VFX creation. However, the prior art has not fully explored video VFX editing, i.e., automatically adding or modifying special effects in existing video. As a unique and higher-level video editing task, video VFX editing is essentially different from video VFX generation: its core aim is to seamlessly and realistically integrate visual effects into the source video while strictly preserving the spatial structure and temporal consistency of the original content. Although recent video editing models have made significant progress on various editing tasks, they still struggle to meet the stringent requirements of video special effect editing. Existing video editing models often allow a degree of background or appearance variation, making it difficult to ensure pixel-level consistency with the source video.
This limitation is unacceptable in video effect editing, because the background must remain completely unchanged. In addition, unlike general video editing methods that leverage large-scale data to improve performance, generating high-quality paired special effect data is very challenging, which limits the scalability of model training. Effective video effect editing must learn the unique patterns of effect injection from these high-quality paired samples to achieve physically consistent fusion of effects with real scenes. These challenges have left automated video VFX editing a largely unresolved problem; a solution enabling automated video effect editing is therefore needed.

Disclosure of Invention

In view of the problems in the prior art, the invention aims to provide a video special effect editing method and system based on a text-to-video large model. Addressing the challenges of video special effect editing, the source video is expressed as a spatiotemporal sparse context condition, and the inherent in-context learning capability of the DiT architecture is exploited, so that video special effects are injected into the source video while its spatiotemporal characteristics are kept unchanged, thereby realizing automated video special effect editing.
In one aspect, the invention provides a video special effect editing method based on a text-to-video large model, comprising the following three stages: a spatiotemporal sparse condition construction stage, which comprises acquiring a spatial sparse token sequence and a temporal sparse token sequence from a source video, and splicing the spatial sparse token sequence and the temporal sparse token sequence along the token dimension to obtain a spatiotemporal sparse condition token sequence; a dual-stage fine-tuning stage, which comprises acquiring a target video corresponding to the source video from a preset target video database, acquiring a target token sequence of the target video by using a pre-trained video encoder, performing noise addition on the target token sequence to obtain a noisy target token sequence so as to fine-tune a preset text-to-video large model into a universal video editing model for performing universal video editing tasks, and fine-tuning the universal video editing model, based on a preset EffectLoRA module, with video special effect editing data containing paired visual effect examples so as to endow the universal video editing model with the editing pattern of specific visual effects; and an inference stage, which comprises constructing the source video to be edited into a spatiotemporal sparse condition token sequence, obtaining a noisy target token sequence for the source video to be edited, splicing the noisy target token sequence and the spatiotemporal sparse condition token sequence along the token dimension, and feeding them into the universal editing model equipped with EffectLoRA so as to inject visual effects into the source video to be edited.