CN-122019831-A - Multi-modal video reasoning method and system based on hierarchical multi-agents

CN 122019831 A

Abstract

The invention discloses a multi-modal video reasoning method and system based on a hierarchical multi-agent architecture, belonging to the technical fields of artificial intelligence and multi-modal video understanding. The method receives an input video and a natural language query, and a question-decomposition agent parses the complex query into a plurality of logically coherent sub-questions. The sub-questions are distributed to multi-source answer-generation agents, namely a Web-based answer-generation agent, a temporal-memory-based answer-generation agent, and a video-language answer-generation agent. The temporal-memory-based agent models long-range dependencies of the video through a visual memory bank and a query memory bank, and fuses temporal features with a cross-attention mechanism. A decision agent performs consistency voting and fusion over the multi-source answers to generate the final answer, and the system is trained and optimized with a cross-entropy loss function.

Inventors

  • Dang Jisheng
  • Wan Quan
  • Wang Bimei
  • Peng Hong
  • Wang Shude

Assignees

  • Lanzhou University (兰州大学)

Dates

Publication Date
2026-05-12
Application Date
2026-01-22

Claims (10)

  1. A multi-modal video reasoning method based on hierarchical multi-agents, characterized by comprising the following steps: S1, receiving an input video and a natural language query; S2, decomposing the complex query into a plurality of logically coherent sub-questions through a question-decomposition agent; S3, distributing the sub-questions to multi-source answer-generation agents, wherein the multi-source answer-generation agents comprise a Web-based answer-generation agent, a temporal-memory-based answer-generation agent, and a video-language answer-generation agent; S4, modeling long-range dependencies of the video by the temporal-memory-based agent through a visual memory bank and a query memory bank, and fusing temporal features using a cross-attention mechanism; S5, performing consistency voting and fusion on the multi-source answers through a decision agent to generate a final answer; and S6, training and optimizing the system based on a cross-entropy loss function.
  2. The method of claim 1, wherein the multi-source answer-generation agents comprise: a Web-based answer-generation agent for performing external knowledge retrieval and generating a Web answer; a temporal-memory-based answer-generation agent for modeling long-range temporal dependencies of the video and generating a temporally grounded answer; and a video-language answer-generation agent for reasoning directly over the video content to generate a video-language answer.
  3. The method of claim 2, wherein the temporal-memory-based answer-generation agent comprises a visual memory bank and a query memory bank for storing visual features of the video and historical query features; features in the visual memory bank and the query memory bank are fused through a dual-memory-bank cross-attention mechanism to generate a temporally aware representation relevant to the current query.
  4. The method of claim 1, wherein the decision agent comprises a consistency voting mechanism for assessing the consistency of the multi-source answers; if the consistency of the multi-source answers meets a preset threshold, the answers are unified and polished in wording by a large language model to generate the final answer; if the consistency of the multi-source answers does not meet the preset threshold, an expert model is invoked to perform multi-source evidence fusion, conflict identification and resolution, and generate the final answer.
  5. The method of claim 1, wherein the system training optimization comprises training the system using the following cross-entropy loss function: \( \mathcal{L} = -\sum_{t=1}^{T} \log P(y_t \mid y_{<t}, V) \), where \(T\) is the total number of ground-truth text tokens, \(y_t\) is the \(t\)-th ground-truth token, \(y_{<t}\) is the sequence of tokens preceding the \(t\)-th token, \(V\) is the input video, and \(P(y_t \mid y_{<t}, V)\) denotes the probability of generating token \(y_t\) given the video \(V\) and the preceding tokens; and all parameters are optimized using the AdamW optimizer.
  6. A multi-modal video reasoning system based on hierarchical multi-agents for implementing the method of any one of claims 1 to 5, comprising: an input module for receiving an input video and a natural language query; a question-decomposition agent for decomposing the complex query into a plurality of logically coherent sub-questions; multi-source answer-generation agents, comprising a Web-based answer-generation agent, a temporal-memory-based answer-generation agent, and a video-language answer-generation agent; a decision agent for performing consistency voting and fusion on the multi-source answers to generate a final answer; and a training module for training and optimizing the system based on a cross-entropy loss function.
  7. The system of claim 6, wherein the Web-based answer-generation agent is configured with an external retrieval interface for retrieving external knowledge and generating a Web answer; the temporal-memory-based answer-generation agent is configured with a visual memory bank and a query memory bank for modeling long-range dependencies of the video and generating a temporally grounded answer; and the video-language answer-generation agent is configured to reason directly over the video content to generate a video-language answer.
  8. The system of claim 7, wherein the temporal-memory-based answer-generation agent comprises: a visual memory bank and a query memory bank for storing visual features of the video and historical query features; and a cross-attention mechanism for fusing features in the visual memory bank and the query memory bank to generate a temporally aware representation relevant to the current query.
  9. The system of claim 6, wherein the decision agent comprises: a consistency voting module for performing consistency evaluation on the multi-source answers; and an answer fusion module for generating the final answer according to the consistency evaluation result.
  10. The system of claim 6, wherein the training module comprises: a cross-entropy loss function for calculating the difference between the system output and the ground-truth labels; and an AdamW optimizer for optimizing the system parameters.
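Claims 5 and 10 specify a standard token-level cross-entropy objective, \( \mathcal{L} = -\sum_{t=1}^{T} \log P(y_t \mid y_{<t}, V) \), optimized with AdamW. As an illustrative sketch only (not the patent's implementation — the logits array is a hypothetical stand-in for the model's per-token predictions conditioned on the video and preceding tokens):

```python
import numpy as np

def cross_entropy_loss(token_logits, target_ids):
    """Mean negative log-likelihood of the ground-truth tokens.

    token_logits: (T, V) unnormalized scores, one row per target
                  position (teacher forcing over y_<t and the video).
    target_ids:   (T,) ground-truth token indices.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = token_logits - token_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log-probability of each ground-truth token.
    nll = -log_probs[np.arange(len(target_ids)), target_ids]
    return nll.mean()

# Toy example: 3 target tokens over a 5-token vocabulary.
logits = np.array([[2.0, 0.1, 0.1, 0.1, 0.1],
                   [0.1, 3.0, 0.1, 0.1, 0.1],
                   [0.1, 0.1, 0.1, 2.5, 0.1]])
targets = np.array([0, 1, 3])
print(cross_entropy_loss(logits, targets))
```

In practice this loss would be minimized with an AdamW optimizer over all model parameters, as claim 5 states; the NumPy version above only computes the objective itself.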

Description

Multi-modal video reasoning method and system based on hierarchical multi-agents

Technical Field

The invention relates to the technical fields of artificial intelligence and video question answering, and in particular to a multi-modal video reasoning method and system based on hierarchical multi-agents.

Background

Multi-modal video reasoning has become a foundational capability of artificial intelligence, enabling systems to understand dynamic visual content, model temporal dependencies, and integrate cross-modal semantics. It supports a wide range of applications, including video question answering, event localization, and scene understanding. Compared with static images or text, video is inherently dynamic and multi-modal, encapsulating visual, auditory, and occasionally textual streams while exhibiting complex temporal dependencies. Reasoning over such data requires understanding long-range temporal structure and drawing on external, often domain-specific, knowledge to ground ambiguous content. Existing video-language models mainly have the following limitations: 1) Closed-world training data. Models such as video-language agents, while exhibiting strong temporal grounding performance, are inherently limited by closed-world training data and often cannot answer queries that rely on external knowledge, such as regulatory standards, scientific principles, or real-world events. 2) Inadequate generalization from small-scale datasets. Models trained on small-scale datasets, even with specialized temporal architectures, lack the generalization capability required to adapt across different video domains and contextual scenes. 3) Poor video adaptation of retrieval-augmented generation (RAG). Most RAG methods treat video as a set of static frames, ignoring the temporal dynamics and long-range context necessary for full video understanding.
Retrieval strategies developed for images or text often fail to capture video-specific characteristics such as variable-length temporal segments, evolving scenes, and continuous actions. 4) Single-source retrieval limits heterogeneous input integration. Many RAG frameworks rely on single-source retrieval, which limits their ability to integrate heterogeneous inputs (e.g., web-based knowledge and temporal video memory), both of which are critical to holistic video reasoning. Addressing these shortcomings is therefore critical to improving the accuracy, scalability, and robustness of multi-modal video reasoning systems.

Disclosure of Invention

To address the defects of the prior art, the invention provides a multi-modal video reasoning method and system based on hierarchical multi-agents, which can significantly improve the accuracy, scalability, and robustness of video reasoning. The method comprises the following steps: S1, receiving an input video and a natural language query; S2, decomposing the complex query into a plurality of logically coherent sub-questions through a question-decomposition agent; S3, distributing the sub-questions to multi-source answer-generation agents, wherein the multi-source answer-generation agents comprise a Web-based answer-generation agent, a temporal-memory-based answer-generation agent, and a video-language answer-generation agent; S4, modeling long-range dependencies of the video by the temporal-memory-based agent through a visual memory bank and a query memory bank, and fusing temporal features using a cross-attention mechanism; S5, performing consistency voting and fusion on the multi-source answers through a decision agent to generate a final answer; and S6, training and optimizing the system based on a cross-entropy loss function.
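The S1-S6 control flow above can be sketched as plain-Python orchestration. Every name below (the agent callables, `decompose`, `run_pipeline`) is a hypothetical stand-in for the patent's LLM and video-model components, not its actual implementation:

```python
def decompose(query):
    # S2: a question-decomposition agent would emit logically coherent
    # sub-questions; here we fake it with a trivial "and" split.
    return [q.strip() + "?" for q in query.rstrip("?").split(" and ")]

def run_pipeline(video, query, web_agent, memory_agent, vl_agent, decide):
    sub_questions = decompose(query)                    # S2
    answers = {}
    for sq in sub_questions:                            # S3: fan out
        answers[sq] = {
            "web": web_agent(sq),                       # external knowledge
            "temporal": memory_agent(video, sq),        # S4: memory banks
            "video_language": vl_agent(video, sq),      # direct video reasoning
        }
    return decide(answers)                              # S5: vote + fuse

# Trivial stand-ins so the sketch runs end to end.
web = lambda q: "web:" + q
mem = lambda v, q: "mem:" + q
vl = lambda v, q: "vl:" + q
decide = lambda answers: "; ".join(sorted(answers))
print(run_pipeline("video.mp4", "who scores and when does it happen?",
                   web, mem, vl, decide))
```

The real system would replace `decompose` and `decide` with LLM-backed agents (the latter implementing the consistency-voting logic of claim 4), but the hierarchical fan-out/fan-in structure is the same.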
Further, the multi-source answer-generation agents include: a Web-based answer-generation agent for performing external knowledge retrieval and generating a Web answer; a temporal-memory-based answer-generation agent for modeling long-range temporal dependencies of the video and generating a temporally grounded answer; and a video-language answer-generation agent for reasoning directly over the video content to generate a video-language answer. Further, the temporal-memory-based answer-generation agent comprises a visual memory bank and a query memory bank for storing visual features of the video and historical query features; features in the visual memory bank and the query memory bank are fused through a dual-memory-bank cross-attention mechanism to generate a temporally aware representation relevant to the current query. Further, the decision agent comprises a consistency voting mechanism for performing consistency evaluation on the multi-source answers; if the consistency of the multi-source answers meets a preset threshold, the answers are unified in wording by a large language model to generate the final answer; if it does not meet the preset threshold, an expert model is invoked to perform multi-source evidence fusion, conflict identification and resolution, and generate the final answer.
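The dual-memory-bank fusion described above can be illustrated with scaled dot-product attention, assuming the current query attends over the concatenation of the visual and query memory banks. This is a minimal NumPy sketch under that assumption, with hypothetical shapes, not the patent's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dual_bank_cross_attention(query_feat, visual_bank, query_bank):
    """Fuse both memory banks against the current query.

    query_feat:  (d,)    feature of the current sub-question
    visual_bank: (Nv, d) stored per-segment visual features
    query_bank:  (Nq, d) stored historical query features
    Returns a (d,) temporally aware representation.
    """
    memory = np.vstack([visual_bank, query_bank])   # (Nv+Nq, d)
    d = query_feat.shape[0]
    scores = memory @ query_feat / np.sqrt(d)       # scaled dot-product
    weights = softmax(scores)                       # attention over memory slots
    return weights @ memory                         # weighted read-out

rng = np.random.default_rng(0)
q = rng.normal(size=8)
fused = dual_bank_cross_attention(q, rng.normal(size=(16, 8)),
                                  rng.normal(size=(4, 8)))
print(fused.shape)
```

A trained system would additionally apply learned query/key/value projections before the dot product; the sketch omits them to keep the memory-bank read-out mechanism visible.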