CN-121815043-B - Natural-language-driven video generation method based on schematic construction
Abstract
The application provides a natural-language-driven video generation method based on schematic construction, relating to the technical field of video generation. The method comprises the following steps: receiving a user's natural language intent description and deriving from it a structured event sequence comprising a plurality of event nodes arranged in temporal order, wherein each event node carries structured attribute labels for scene, subject, and behavior, and event node pairs having a logical association are marked with at least one of a temporal relationship or a causal dependency; generating a multi-condition control signal group based on the structured event sequence and the logical relationships among its nodes; and integrating the multi-condition control signal group with a video generation model to drive it to generate a video stream corresponding to the structured event sequence. In the integration, first-class condition signals are encoded into the text prompt embedding of the video generation model during the denoising process, and second-class condition signals are encoded into spatio-temporal attention masks or motion dynamics prior vectors acting on the latent feature space of the video generation model.
Inventors
- WANG YANG
Assignees
- 天津白马星球智能科技有限公司
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2026-03-06
Claims (6)
- 1. A natural-language-driven video generation method based on schematic construction, characterized by comprising the following steps: receiving a natural language intent description from a user; inputting the intent description into a temporal-causal event analysis model and outputting a structured event sequence, wherein the temporal-causal event analysis model is built on a pre-trained language model, the structured event sequence comprises a plurality of event nodes arranged in temporal order, each event node carries at least three structured attribute labels, namely scene, subject, and behavior, and event node pairs having a logical association are marked with at least one of a temporal relationship or a causal dependency; generating a multi-condition control signal group based on the structured event sequence and the logical relationships among its nodes, wherein the multi-condition control signal group comprises first-class condition signals for keeping single-frame picture content consistent with event attributes and second-class condition signals for controlling visual transitions and logical coherence between events; and integrating the multi-condition control signal group with a video generation model to drive it to generate a video stream corresponding to the structured event sequence, wherein the integration encodes the first-class condition signals into the text prompt embedding of the video generation model during the denoising process and encodes the second-class condition signals into spatio-temporal attention masks or motion dynamics prior vectors acting on the latent feature space of the video generation model; generating the multi-condition control signal group comprises: extracting the scene, subject, and behavior attribute labels of each event node in the structured event sequence and combining them into a corresponding text prompt sentence serving as the first-class condition signal of that event node; for two event nodes marked with a causal dependency, computing the semantic similarity of their first-class condition signals in a text embedding space and generating a content anchoring strength parameter based on the semantic similarity; and compiling the spatio-temporal constraint map together with the content anchoring strength parameter into the second-class condition signals; encoding the second-class condition signals as a spatio-temporal attention mask applied to the latent feature space of the video generation model comprises: mapping the spatio-temporal constraint map, augmented with the content anchoring strength parameter, into a three-dimensional attention weight matrix whose three dimensions correspond to the frame batch dimension, the spatial height dimension, and the spatial width dimension of the video; and introducing the attention weight matrix as an additional attention bias in designated cross-attention layers of the U-Net decoder of the video generation model, so as to strengthen the semantic consistency of picture regions corresponding to causally dependent events and suppress interference from irrelevant regions during denoising; encoding the second-class condition signals as a motion dynamics prior vector comprises: extracting action verbs from the behavior attribute labels of consecutive event nodes having a temporal relationship; converting the action verbs into corresponding motion trajectory feature vectors with a pre-trained action dynamics encoder; and smoothly interpolating the motion trajectory feature vectors of the consecutive event nodes to form a continuous motion trajectory prior spanning the event nodes, which is fed as the motion dynamics prior vector to an optical flow prediction module or a motion compensation module of the video generation model; generating the content anchoring strength parameter based on the semantic similarity is realized through a logic influence factor prediction network and comprises: inputting the first-class condition signals of each pair of causally dependent event nodes into a text encoder to obtain their semantic embeddings and computing the semantic similarity between them; meanwhile, concatenating the attribute label sets of the preceding and following event nodes and feeding them into the logic influence factor prediction network to predict a logic influence factor vector, the vector comprising at least a causal necessity strength component; and fusing the semantic similarity and the causal necessity strength component by weighting to generate the content anchoring strength parameter.
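The first-class signal construction and content-anchoring fusion in claim 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `EventNode` schema, the toy embedding vectors, and the fusion weights `w_sim`/`w_causal` are all assumptions (a real system would use a text encoder for the embeddings and a trained network for the causal-necessity component).

```python
from dataclasses import dataclass

@dataclass
class EventNode:
    scene: str     # e.g. "street"
    subject: str   # e.g. "a pedestrian"
    behavior: str  # e.g. "steps on a banana skin"

def first_class_signal(node: EventNode) -> str:
    # Combine the three structured attribute labels into a text prompt sentence.
    return f"{node.subject} {node.behavior} in {node.scene}"

def cosine(u, v):
    # Semantic similarity in the text embedding space.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def content_anchoring_strength(sim: float, causal_necessity: float,
                               w_sim: float = 0.5, w_causal: float = 0.5) -> float:
    # Weighted fusion of semantic similarity and the causal-necessity
    # strength component (the weights here are illustrative).
    return w_sim * sim + w_causal * causal_necessity

cause = EventNode("street", "a pedestrian", "steps on a banana skin")
effect = EventNode("street", "the pedestrian", "tumbles")
prompt_cause = first_class_signal(cause)

# Toy embeddings stand in for the text encoder's output.
sim = cosine([0.9, 0.1, 0.4], [0.8, 0.2, 0.5])
strength = content_anchoring_strength(sim, causal_necessity=0.8)
```

The resulting `strength` would then be compiled, together with the spatio-temporal constraint map, into the second-class condition signals.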
- 2. The schematic-construction-based natural-language-driven video generation method of claim 1, wherein the temporal-causal event analysis model is trained by: constructing a training data set in which each sample comprises a passage of natural language narrative text together with a manually annotated structured event sequence and logic relation graph; and fine-tuning a pre-trained language model, taking the narrative text as input and the joint representation of the structured event sequence and the logic relation graph as the supervision target, to obtain the temporal-causal event analysis model.
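A training sample of the kind claim 2 describes might be structured as below. The field names and serialization are assumptions made for illustration; the patent only specifies that each sample pairs narrative text with an annotated event sequence and logic relation graph, jointly represented as the supervision target for text-to-text fine-tuning.

```python
import json

# One training sample: narrative text paired with a manually annotated
# structured event sequence and logic-relation graph (schema illustrative).
sample = {
    "text": "A pedestrian steps on a banana skin and then tumbles.",
    "events": [
        {"id": "e1", "scene": "street", "subject": "a pedestrian",
         "behavior": "steps on a banana skin"},
        {"id": "e2", "scene": "street", "subject": "the pedestrian",
         "behavior": "tumbles"},
    ],
    # Edges mark temporal order and causal dependency between node pairs.
    "relations": [{"from": "e1", "to": "e2", "types": ["temporal", "causal"]}],
}

def supervision_target(sample: dict) -> str:
    # Joint representation of the event sequence and relation graph,
    # serialized so a pre-trained LM can be fine-tuned text-to-text
    # with sample["text"] as input and this string as the target.
    return json.dumps({"events": sample["events"],
                       "relations": sample["relations"]}, sort_keys=True)

target = supervision_target(sample)
```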
- 3. The schematic-construction-based natural-language-driven video generation method of claim 1, wherein the logic influence factor vector further comprises a scene transition component characterizing the rationality of the scene transition and an action coherence component characterizing the strength of action coherence.
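The three components of the logic influence factor vector (claims 1 and 3) can be grouped in a small container; the class name, field names, and channel ordering below are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class LogicInfluenceFactors:
    causal_necessity: float   # how strongly the effect must follow the cause
    scene_transition: float   # rationality of the scene change
    action_coherence: float   # strength of motion continuity across events

    def as_channels(self):
        # Ordered tuple later used to build the multi-channel
        # spatial modulation map of claim 4 (ordering is arbitrary here).
        return (self.scene_transition, self.action_coherence,
                self.causal_necessity)

factors = LogicInfluenceFactors(causal_necessity=0.9,
                                scene_transition=0.2,
                                action_coherence=0.5)
```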
- 4. The schematic-construction-based natural-language-driven video generation method as recited in claim 3, further comprising the steps of: generating a multi-channel spatial modulation map based on the scene transition component, the action coherence component, and the causal necessity strength component of the logic influence factor vector; and, when the second-class condition signals are encoded into a spatio-temporal attention mask, multiplying the multi-channel spatial modulation map element-wise with the attention weight matrix, so as to refine, at different spatial locations in the spatio-temporal attention mask, the attention bias corresponding to the different logical dimensions.
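The element-wise refinement in claim 4 can be sketched as follows. This pure-Python sketch makes several assumptions: the tensor shape (frames x height x width), the use of a single scalar per frame location, and the rule for combining the three channels (a plain average here; the patent does not specify one).

```python
# Element-wise refinement of the spatio-temporal attention bias by a
# multi-channel spatial modulation map (shapes and combination rule assumed).
T, H, W = 2, 2, 2  # frames, height, width

def zeros(t, h, w):
    return [[[0.0] * w for _ in range(h)] for _ in range(t)]

attention_bias = zeros(T, H, W)
attention_bias[0][0][0] = 1.0  # region anchored by a causal dependency

# Three channels: scene transition, action coherence, causal necessity.
modulation = {"scene": 0.2, "action": 0.5, "causal": 0.9}

def combine(channels):
    # Illustrative rule: average the per-channel modulation values.
    return sum(channels.values()) / len(channels)

m = combine(modulation)
refined = [[[attention_bias[t][h][w] * m for w in range(W)]
            for h in range(H)] for t in range(T)]
```

In a real diffusion pipeline the `refined` tensor would be added as a bias inside designated cross-attention layers, as claim 1 describes.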
- 5. The schematic-construction-based natural-language-driven video generation method according to claim 4, wherein generating the multi-channel spatial modulation map comprises the steps of: performing temporal alignment and smoothing filtering on the scene transition component, the action coherence component, and the causal necessity strength component; the temporal alignment maps each component value onto the time axis of the generated video according to the temporal positions of the preceding and following event nodes in the structured event sequence, and ensures that each component value decays to a preset baseline level in time intervals not covered by any event; the smoothing filtering eliminates abrupt numerical changes in the modulation map along the time dimension caused by switching between event nodes, yielding a multi-channel spatial modulation map that evolves continuously in time.
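The temporal alignment and smoothing of claim 5 might look like the following sketch. The step-function alignment, the baseline value, and the moving-average filter are assumptions; the patent specifies only decay to a preset baseline outside event intervals and the removal of abrupt changes at event-node switches.

```python
def align_component(value, interval, n_frames, baseline=0.1):
    # Temporal alignment: the component holds its value inside the frames
    # covered by its event pair and sits at a preset baseline elsewhere.
    start, end = interval
    return [value if start <= t < end else baseline for t in range(n_frames)]

def smooth(track, k=3):
    # Simple moving average to remove abrupt jumps at event-node switches
    # (the patent does not name a filter; this one is illustrative).
    half = k // 2
    out = []
    for t in range(len(track)):
        window = track[max(0, t - half): t + half + 1]
        out.append(sum(window) / len(window))
    return out

# Causal-necessity component 0.9, active over frames 4..7 of a 12-frame clip.
raw = align_component(0.9, interval=(4, 8), n_frames=12)
smoothed = smooth(raw)
```

After smoothing, the component ramps up and down around the event boundary instead of jumping, which is the "continuously evolving in time" behavior the claim requires.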
- 6. The schematic-construction-based natural-language-driven video generation method of claim 4, further comprising, before multiplying the multi-channel spatial modulation map element-wise with the attention weight matrix, the steps of: determining, based on the causal necessity strength component of the logic influence factor vector, whether the video segment corresponding to the following event node contains a key-frame interval in which the causal effect is conspicuously visualized; and if so, within that key-frame interval, raising the intensity of the spatial channel regions associated with the subject and key objects of the following event node in the multi-channel spatial modulation map according to the magnitude of the causal necessity strength component, thereby generating an enhanced spatial modulation map.
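The key-frame enhancement of claim 6 can be sketched as below. The threshold test for "conspicuously visualized" and the multiplicative gain are assumptions; the patent states only that the boost depends on the magnitude of the causal necessity strength component and applies within the key-frame interval.

```python
def enhance_keyframes(modulation_track, keyframes, causal_necessity,
                      threshold=0.5, gain=1.5):
    # If the causal effect is strongly visualized (component above an
    # illustrative threshold), boost the modulation of the subject /
    # key-object channel inside the key-frame interval only.
    if causal_necessity <= threshold:
        return list(modulation_track)
    start, end = keyframes
    return [v * gain if start <= t < end else v
            for t, v in enumerate(modulation_track)]

# Per-frame modulation for the subject channel of the following event node.
track = [0.2] * 10
enhanced = enhance_keyframes(track, keyframes=(6, 9), causal_necessity=0.8)
```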
Description
Natural-language-driven video generation method based on schematic construction
Technical Field
The invention belongs to the technical field of video generation, and particularly relates to a natural-language-driven video generation method based on schematic construction.
Background
In recent years, with breakthroughs in generative artificial intelligence such as diffusion models, natural-language-driven video generation technology has advanced significantly. The prior art mainly follows two paradigms: end-to-end text-to-video generation models, which directly learn a mapping from text descriptions to video pixels through training on large-scale video-text pairs; and temporal expansion methods based on image generation models, in which key frames are first generated from the text and intermediate frames are produced by interpolation or prediction. However, these mainstream approaches expose fundamental limitations when handling complex narrative intent descriptions containing multiple events and inherent logic. The core problem is that existing models essentially learn the statistical correlation between text labels and visual content, and lack explicit understanding of, and structural control over, the temporal logic and causal dependencies in the narrative intent. As a result, the generated videos frequently exhibit defects such as: logical confusion, where the order in which events occur contradicts the description or causality is inverted (e.g. "tumbling" occurs before "stepping on the banana skin"); subject and scene inconsistency, where the subject is lost, changes abruptly, or drifts in attributes in the middle of the video, or scene switching is abrupt and out of step with the events; and poor action coherence, where motion lacks physically plausible dynamic continuity along the time axis and the video appears to be a stitching of incoherent pictures.
These drawbacks make it difficult for the prior art to generate long video content conforming to human narrative cognition, logical rigor, and visual fluency, severely limiting its application in scenarios requiring strict logical expression, such as film and television aided creation, interactive narrative, and educational simulation.
Disclosure of Invention
In view of the above drawbacks or shortcomings in the prior art, a natural-language-driven video generation method based on schematic construction is provided, comprising the steps of: receiving a natural language intent description from a user; inputting the intent description into a temporal-causal event analysis model and outputting a structured event sequence, wherein the temporal-causal event analysis model is built on a pre-trained language model, the structured event sequence comprises a plurality of event nodes arranged in temporal order, each event node carries at least three structured attribute labels, namely scene, subject, and behavior, and event node pairs having a logical association are marked with at least one of a temporal relationship or a causal dependency; generating a multi-condition control signal group based on the structured event sequence and the logical relationships among its nodes, wherein the multi-condition control signal group comprises first-class condition signals for keeping single-frame picture content consistent with event attributes and second-class condition signals for controlling visual transitions and logical coherence between events; and integrating the multi-condition control signal group with a video generation model to drive it to generate a video stream corresponding to the structured event sequence, the integration being achieved by encoding the first-class condition signals into the text prompt embedding of the video generation model during the denoising process and encoding the second-class condition signals into spatio-temporal attention masks or motion dynamics prior vectors acting on the latent feature space of the video generation model. According to the technical scheme provided by the application, the temporal-causal event analysis model is trained through the following steps: constructing a training data set in which each sample comprises a passage of natural language narrative text together with a manually annotated structured event sequence and logic relation graph; and fine-tuning a pre-trained language model, taking the narrative text as input and the joint representation of the structured event sequence and the logic relation graph as the supervision target, to obtain the temporal-causal event analysis model. According to the technical scheme provided by the application, the generation of the multi-condition control signal group comprises the following steps: E