CN-121999415-A - Long video question-answering method and system based on visual language model and causal reasoning tree


Abstract

A long video question-answering method and system based on a visual language model and a causal reasoning tree. The method comprises: performing temporal segmentation on an input long video; calling the visual language model to generate a text description for each segment; parsing each text description into causal units and constructing leaf nodes; constructing intermediate nodes and a root node to form a hierarchical causal reasoning tree; building a lightweight index over the nodes; receiving a user question; having a question-answering agent execute a search strategy on the tree; and integrating the retrieved information to generate a final answer. The system comprises a video analysis module, a causal tree construction module and a question-answering agent module, which together carry out the steps of the method. The invention adopts a training-free framework built entirely on pre-trained models, which can optionally be fine-tuned to further improve results. It addresses the low information-retrieval accuracy of a single visual language model in long video understanding tasks and improves answer accuracy for questions about both global and local content.
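
To illustrate the temporal segmentation step named above, the following Python sketch shows uniform segmentation only. The function name `uniform_segments` and its parameters are hypothetical; the 60-second default follows the segment-length limit stated in claim 2, and the adaptive, scene-change-based variant is not shown.

```python
from typing import List, Tuple


def uniform_segments(duration_s: float, max_len_s: float = 60.0) -> List[Tuple[float, float]]:
    """Split [0, duration_s] into consecutive (start, end) windows.

    Every window is at most max_len_s seconds long; only the last one
    may be shorter. Hypothetical helper, not taken from the patent.
    """
    if duration_s <= 0 or max_len_s <= 0:
        raise ValueError("duration_s and max_len_s must be positive")
    duration_s = float(duration_s)
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_len_s, duration_s)
        windows.append((start, end))
        start = end  # the next window begins where this one ends
    return windows


# Example: a 150-second video with the default 60-second cap.
segments = uniform_segments(150.0)
```

Each resulting window would then be passed to the visual language model to obtain one text description per segment.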

Inventors

HUANG XINLI
ZHANG XINYI

Assignees

East China Normal University (华东师范大学)

Dates

Publication Date: 20260508
Application Date: 20260209

Claims (10)

1. A long video question-answering method based on a visual language model and a causal reasoning tree, comprising the following steps:
    Step S1: a video analysis module receives a long video and performs temporal segmentation to obtain a plurality of consecutive short video segments;
    Step S2: a visual language model is called to generate a text description for each short video segment;
    Step S3: a causal unit extraction unit calls a large language model to parse each text description into structured causal units;
    Step S4: a hierarchical aggregation unit constructs leaf nodes from each short video segment and its causal units, and constructs intermediate nodes and a root node from the bottom up through semantic clustering, thereby forming a hierarchical causal reasoning tree;
    Step S5: an index construction unit generates a text embedding vector for each node in the causal reasoning tree and constructs a global lightweight vector index;
    Step S6: an intelligent question-answering agent receives a user question and locates key nodes in the causal reasoning tree by querying the vector index;
    Step S7: according to the type of the question, the intelligent question-answering agent executes a corresponding search strategy on the causal reasoning tree and integrates information from the relevant nodes;
    Step S8: based on the integrated information, the intelligent question-answering agent generates and outputs a final natural-language answer.
2. The method of claim 1, wherein the temporal segmentation in step S1 uses uniform segmentation or adaptive segmentation based on scene change detection, and each resulting segment is no longer than 60 seconds.
3. The long video question-answering method based on a visual language model and a causal reasoning tree according to claim 1, wherein the aggregation process in step S4 is implemented by calling a large language model, and the large language model, guided by a prompt, generates a content summary and a coherent causal chain description for a parent node from the descriptions and causal units of its child nodes.
4. The long video question-answering method based on a visual language model and a causal reasoning tree according to claim 1, wherein in step S5 the lightweight vector index uses the output of a text embedding model as its vectors, and the text embedding model is a pre-trained model based on the Transformer architecture.
5. The long video question-answering method based on a visual language model and a causal reasoning tree according to claim 1, wherein the search strategy of step S7 comprises: if the question is of the specific-content-focused type, reading the description, the causal units and the parent-node information of the relevant nodes returned by the vector index, and integrating them to produce the answer.
6. A long video question-answering system based on a visual language model and a causal reasoning tree, the system comprising:
    a video analysis module, configured to receive an input long video, divide it into a plurality of consecutive short video segments, and call a visual language model to generate, for each video segment, a text description covering visual entities, actions and events;
    a causal tree construction module, connected to the video analysis module and configured to receive the text descriptions of all video segments, wherein a causal unit extraction unit calls a large language model to parse each description into one or more causal units, each representing an atomic relation among a subject, an action and an object, and wherein, based on the semantic and temporal consistency of the causal units, nodes are aggregated from the bottom up into higher-level event nodes, finally yielding a hierarchical causal reasoning tree comprising leaf nodes, intermediate nodes and a root node;
    an intelligent question-answering agent module, connected to the causal tree construction module and configured to receive a user question, locate the set of nodes in the causal reasoning tree most relevant to the question by querying a lightweight vector index, adaptively trigger different traversal strategies according to the question type, perform heuristic search and information integration on the causal reasoning tree, and finally generate and output an answer.
7. The long video question-answering system based on a visual language model and a causal reasoning tree according to claim 6, wherein the causal tree construction module specifically comprises:
    a causal unit extraction unit, configured to input the text description of a video segment into a large language model and, through a structured prompt, constrain the model's output to causal units of the form "[subject] causes/affects [object/result] due to [action/state]";
    a hierarchical aggregation unit, configured to cluster leaf nodes representing video segments according to the continuity of the subject in the causal units, the degree of causal association and the temporal proximity of events, forming intermediate nodes that represent sub-scenes or composite events; and
    an index construction unit, configured to use a text embedding model to generate a high-dimensional vector from the summary, causal chain and keyword information of each node, and to construct a vector index of the whole tree using an approximate nearest neighbor search algorithm.
8. The long video question-answering system based on a visual language model and a causal reasoning tree according to claim 6, wherein the intelligent question-answering agent module further comprises:
    a question classification unit, configured to input the user question together with a dedicated prompt into the large language model and classify the question as either a global-summary type or a specific-content-focused type, so that a different causal tree search strategy can be applied to each type;
    an index retrieval unit, configured to compute an embedding vector for the user question and, by querying the lightweight vector index, quickly retrieve the K nodes in the causal reasoning tree most relevant to the question, which serve as the nodes to be examined in subsequent steps;
    a strategy execution unit, configured to execute the tree search strategy corresponding to the result of the question classification unit: for a global-summary question, it accesses the root node directly to obtain the summary; for a specific-content-focused question, it starts from the nodes returned by the index retrieval unit, traces causality back along the tree structure, and integrates information from multiple nodes to form an answer.
9. The long video question-answering system based on a visual language model and a causal reasoning tree according to claim 6, wherein the visual language model and the large language model used by the system are both pre-trained models, and reasoning relies on the knowledge of the pre-trained models and on prompt engineering.
10. The long video question-answering system based on a visual language model and a causal reasoning tree according to claim 7, wherein the causal unit extraction unit further comprises a causal unit for storing the structured information output by the causal unit extraction unit, the information including a subject, an action/state and an object/result.
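
The data structures and search strategies recited in claims 5 to 10 can be pictured with the following minimal Python sketch. It is an illustration under stated assumptions, not the claimed implementation: all names (`CausalUnit`, `TreeNode`, `embed`, `top_k`, `gather_context`) are hypothetical, the bag-of-words `embed` function is a toy stand-in for a Transformer-based text embedding model, the brute-force ranking replaces an approximate nearest neighbor index, and the question type is passed in by hand rather than classified by a large language model.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CausalUnit:
    subject: str
    action: str  # action/state
    result: str  # object/result

    def render(self) -> str:
        # Follows the output format described in claim 7.
        return f"[{self.subject}] causes/affects [{self.result}] due to [{self.action}]"


@dataclass(eq=False)
class TreeNode:
    summary: str
    causal_units: List[CausalUnit] = field(default_factory=list)
    children: List["TreeNode"] = field(default_factory=list)
    parent: Optional["TreeNode"] = field(default=None, repr=False)

    def add_child(self, child: "TreeNode") -> None:
        child.parent = self
        self.children.append(child)


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def all_nodes(root: TreeNode) -> List[TreeNode]:
    nodes, stack = [], [root]
    while stack:
        node = stack.pop()
        nodes.append(node)
        stack.extend(node.children)
    return nodes


def top_k(root: TreeNode, question: str, k: int = 1) -> List[TreeNode]:
    """Brute-force ranking of every node (assumption: replaces an ANN index)."""
    q = embed(question)
    ranked = sorted(all_nodes(root), key=lambda n: cosine(q, embed(n.summary)), reverse=True)
    return ranked[:k]


def gather_context(root: TreeNode, question: str, question_type: str, k: int = 1) -> List[str]:
    """Collect the text an answering model would receive for this question."""
    if question_type == "global":
        return [root.summary]  # global-summary questions read the root directly
    context: List[str] = []
    for node in top_k(root, question, k):
        lines = [node.summary] + [unit.render() for unit in node.causal_units]
        ancestor = node.parent
        while ancestor is not None:  # walk back up toward the root
            lines.append(ancestor.summary)
            ancestor = ancestor.parent
        for line in lines:
            if line not in context:
                context.append(line)
    return context


# Example tree: one root with two leaf segments.
root = TreeNode("A chef prepares dinner and the kitchen fills with smoke")
cooking = TreeNode("The chef leaves a pan on high heat",
                   [CausalUnit("chef", "leaves pan unattended", "oil overheats")])
alarm = TreeNode("The smoke alarm goes off",
                 [CausalUnit("smoke", "rises to detector", "alarm sounds")])
root.add_child(cooking)
root.add_child(alarm)

specific = gather_context(root, "why did the alarm go off", "specific")
overall = gather_context(root, "what happens overall", "global")
```

In this sketch, a specific-content question returns the best-matching node's summary, its causal units and the summaries of its ancestors, while a global question returns only the root summary; the collected strings would then be handed to a language model to compose the final answer.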

Description

Long video question-answering method and system based on visual language model and causal reasoning tree

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to computer vision, natural language processing and video content understanding, and more specifically to a long video question-answering method and system based on a visual language model and a causal reasoning tree.

Background

With the spread of video and surveillance applications, enabling a machine to automatically understand long video content and answer complex user questions has become an important research topic. Existing long video question-answering technology faces two major challenges. First, a long video contains a huge and redundant amount of information, so feeding the whole video directly into end-to-end question answering is computationally inefficient and loses focus. Second, many questions require deep reasoning about the causality behind events, which demands that a model not only recognize objects and actions but also understand the logical associations between them.

Currently, some approaches pre-process video using video caption generation or scene segmentation, but the generated descriptions are typically linear and flat, lack explicit modeling of the hierarchical and causal relationships between events, and have difficulty supporting "why"-type reasoning. Other studies have introduced graph neural networks to model relationships in video, but this typically requires extensive supervised training for specific tasks, is costly, and generalizes poorly. Meanwhile, visual language models and large language models based on large-scale pre-training have shown strong cross-modal understanding and generation capabilities and perform very well on question-answering tasks over short videos of up to 5 minutes.

However, these models cannot be applied effectively to long videos of 30 minutes or more, a limitation that stems from the O(n²) time complexity, in the input sequence length n, of the Transformer, the core module of large language models. How to organize these off-the-shelf, training-free foundation models effectively and solve causal question answering over long videos through a structured mechanism therefore remains underexplored. The prior art lacks a unified training-free framework capable of dynamically organizing video content, establishing causal memory, and supporting efficient and accurate retrieval and reasoning.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing a long video question-answering method based on a visual language model and a causal reasoning tree, which builds a structured understanding of long video content through a hierarchical causal reasoning tree in a completely training-free manner and achieves more accurate answers to complex questions by means of a lightweight index and an intelligent agent.
To achieve the above purpose, the invention adopts the following technical solution:

A long video question-answering method based on a visual language model and a causal reasoning tree, comprising the following steps:
    Step S1: a video analysis module performs temporal segmentation on an input long video to obtain a plurality of consecutive short video segments;
    Step S2: the video analysis module calls a visual language model to generate a text description for each short video segment;
    Step S3: a causal unit extraction unit calls a large language model to parse each text description into structured causal units;
    Step S4: a hierarchical aggregation unit constructs leaf nodes from each short video segment and its causal units, and constructs intermediate nodes and a root node from the bottom up through semantic clustering, thereby forming a hierarchical causal reasoning tree;
    Step S5: an index construction unit generates a text embedding vector for each node in the causal reasoning tree and constructs a global lightweight vector index;
    Step S6: an intelligent question-answering agent receives a user question and locates key nodes in the causal reasoning tree by querying the vector index;
    Step S7: according to the type of the question, the intelligent question-answering agent executes a corresponding search strategy on the causal reasoning tree and integrates information from the relevant nodes;
    Step S8: based on the integrated information, the intelligent question-answering agent generates and outputs a final natural-language answer.

Further, in step S1, the temporal segmentation uses uniform segmentation or adaptive segmentation based on scene change detection, and each resulting segment is no longer than 60 seconds.

Further, in step S4, the aggregation process is implemented by calling a large language model, and the large language model generates a content summary and a coherent causal