
CN-122021934-A - Streaming video reasoning method and system based on self-adaptive hierarchical event storage

CN 122021934 A

Abstract

The invention discloses a streaming video reasoning method and system based on self-adaptive hierarchical event storage. A three-level memory structure is constructed, comprising a short-term high-fidelity window, a medium-term buffer, and a long-term event forest, and the reasoning process is divided into a coarse stage and a fine stage. The coarse stage first attempts an answer using the short-term window and highly summarized abstracts, and outputs a special signal indicating whether retrieval is needed. Only when the coarse stage judges the available information insufficient is the fine stage triggered: the model generates a specific search query from the context and retrieves the relevant historical slices from the event forest, rather than searching with the raw user question. The short context and the retrieved context are then combined to generate the answer, so that key long-range temporal information is captured accurately at very low computational cost.
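A minimal sketch of the two-stage loop described above, assuming a model that prefixes its coarse output with either an answer marker or a tool-call (search) marker. The marker strings, class interfaces, and function names here are hypothetical placeholders for illustration, not identifiers from the patent.

```python
# Hedged sketch of the coarse -> fine reasoning loop. ANSWER_MARK and
# SEARCH_MARK are assumed marker strings; `mllm` and `event_forest` are
# assumed interfaces (generate / search), not names from the patent.
ANSWER_MARK = "<answer>"
SEARCH_MARK = "<search>"

def answer_question(question, mllm, short_context, summaries, event_forest):
    """Try a cheap coarse answer first; retrieve from the event forest
    only when the model signals that its information is insufficient."""
    # Coarse stage: answer from the short window and summaries only.
    coarse = mllm.generate(question, context=short_context + summaries)
    if coarse.startswith(ANSWER_MARK):
        return coarse[len(ANSWER_MARK):].strip()

    if coarse.startswith(SEARCH_MARK):
        # Fine stage: the model itself writes a task-oriented search
        # query (not the raw user question) and retrieves event nodes.
        query = coarse[len(SEARCH_MARK):].strip()
        retrieved = event_forest.search(query, top_k=3)
        return mllm.generate(question, context=short_context + retrieved)

    return coarse  # no marker detected: fall back to the coarse output
```

Because retrieval only runs on the tool-call path, most frames are answered from the short window alone, which is what keeps the per-frame compute low.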

Inventors

  • LI GUANBIN
  • LI JIAMING
  • LIANG ZHIJIA

Assignees

  • Sun Yat-sen University
  • Shenzhen Research Institute of Sun Yat-sen University

Dates

Publication Date
2026-05-12
Application Date
2026-03-24

Claims (9)

  1. A streaming video reasoning method based on self-adaptive hierarchical event storage, characterized by comprising the following steps: constructing a short window and a medium-resolution window that retain time frames of different lengths, wherein the short-window length is less than the medium-resolution-window length and the short-window sampling frame rate is greater than the medium-resolution-window sampling frame rate; constructing an event forest based on hierarchical event storage; a coarse reasoning stage: for a question posed by the user at the current frame, the MLLM generates a preliminary response based on the instantly accessible memory, with a prompt guiding the large model to follow the two-stage reasoning strategy, where the MLLM is the multimodal large language model and the instantly accessible memory comprises the set of frames retained by the short window, the set of frames retained by the medium-resolution window, the finite set of root-node summaries of the event forest, and the question-answer history abstract; parsing the special markers in the generated preliminary response: if an answer marker is detected, the marked content is extracted and output as the answer, and if a tool-call marker is detected, the fine reasoning stage is executed; the fine reasoning stage: the MLLM generates a task-oriented semantic internal search query from the preliminary response and encodes it as an embedding; the cosine similarity between this embedding and the vectors of all nodes in the event forest is computed, the highest-scoring event nodes are retrieved, and the most relevant question-answer pairs are retrieved from the question-answer history; the MLLM is then instructed to generate the output of the fine reasoning stage, conditioned on an additional prompt that includes the key frames of the retrieved event nodes and the content of the retrieved question-answer pairs.
  2. The method for streaming video reasoning based on adaptive hierarchical event storage of claim 1, wherein, for the current frame, the short-window retained set and the medium-resolution-window retained set each consist of the image frames at the times t falling within the respective window.
  3. The method for streaming video reasoning based on adaptive hierarchical event storage of claim 1, wherein constructing the event forest based on hierarchical event storage specifically comprises the following steps: for each incoming medium-resolution window of frames, uniformly sampling key frames, abstracting a text summary through the MLLM, and computing an embedding vector; and initializing the result as a leaf node at the lowest hierarchy level and adding it to the root-node set, where an event node comprises its key frames, summary, embedding, hierarchy level, and the time range to which the event node refers.
  4. The streaming video reasoning method based on adaptive hierarchical event storage of claim 3, wherein, when the number of root nodes exceeds a preset threshold, a merging mechanism with bounded complexity is executed, specifically: computing, for each pair of time-adjacent root nodes, a merging score that combines the vector cosine similarity of the two nodes with a hyper-parameter penalizing the merging of high-level nodes to avoid excessive abstraction; and selecting the node pair with the highest merging score for merging to obtain a new parent node.
  5. The streaming video reasoning method based on adaptive hierarchical event storage according to claim 4, wherein selecting the node pair with the highest merging score for merging specifically comprises: concatenating the feature tensors of the two nodes and uniformly sampling the result as the feature tensor of the parent node; merging the summaries of the two nodes via the LLM as the summary of the parent node; and updating the hierarchy level of the parent node accordingly.
  6. The streaming video reasoning method based on adaptive hierarchical event storage of claim 1, wherein the question-answer history abstract aggregates historical question-answer information and is dynamically updated.
  7. The method for streaming video reasoning based on adaptive hierarchical event storage of claim 1, wherein, in computing the cosine similarity in the event forest to obtain the highest-scoring nodes, a greedy pruning strategy is adopted: the node with the highest similarity is iteratively selected, and its ancestor nodes and descendant nodes are eliminated from further selection.
  8. A streaming video reasoning system based on self-adaptive hierarchical event storage, characterized by being applied to the streaming video reasoning method based on self-adaptive hierarchical event storage of any one of claims 1-7, and comprising a short window, a medium-resolution window, a multi-resolution event hierarchy, a history question-answer storage module, and an MLLM; the short window is used to retain frames of a shorter length sampled at a higher frame rate, and the medium-resolution window is used to retain frames of a longer length sampled at a lower frame rate; the multi-resolution event hierarchy is used to organize historical events into a dynamically updated event forest; the history question-answer storage module is used to aggregate historical question-answer pairs, and comprises a question-answer history and a question-answer history abstract; the MLLM is a large language model for performing the coarse reasoning stage, special-marker parsing, and the fine reasoning stage for a question posed by the user.
  9. The streaming video reasoning system based on adaptive hierarchical event storage according to claim 8, wherein said MLLM employs a large language model with visual understanding, instruction-following, and tool-calling capabilities, being one of Qwen-VL, Qwen2-VL, Qwen2.5-VL, Qwen3-VL, GPT-4o, and Gemini Pro Vision.
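As a rough illustration of claims 4 and 7, the sketch below implements a merging score that rewards cosine similarity between time-adjacent root nodes while penalizing high hierarchy levels, together with the greedy pruning that excludes ancestors and descendants of each selected node. The dict-based node layout, the penalty weight `lam`, and the exact score form (similarity minus a level penalty) are assumptions, since the claims' formula symbols were lost in translation.

```python
# Hedged sketch of the claim-4 merging score and claim-7 greedy pruning.
# Node fields ("emb", "level", "ancestors", "descendants", "id") and the
# score shape sim - lam * max(level) are illustrative assumptions.
import math

def cosine(a, b):
    """Plain cosine similarity over lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_score(node_i, node_j, lam=0.1):
    # Similar, time-adjacent nodes merge first; higher-level (more
    # abstract) nodes are penalized to avoid excessive abstraction.
    sim = cosine(node_i["emb"], node_j["emb"])
    return sim - lam * max(node_i["level"], node_j["level"])

def greedy_prune_topk(query_emb, nodes, k):
    # Iteratively pick the most similar node, then drop its ancestors
    # and descendants so the k results do not overlap in time coverage.
    picked, pool = [], list(nodes)
    while pool and len(picked) < k:
        best = max(pool, key=lambda n: cosine(query_emb, n["emb"]))
        picked.append(best)
        kin = best["ancestors"] | best["descendants"] | {best["id"]}
        pool = [n for n in pool if n["id"] not in kin]
    return picked
```

The ancestor/descendant exclusion matters because a parent node summarizes the same time span as its children; without it, the top-k results would tend to repeat one event at several abstraction levels.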

Description

Streaming video reasoning method and system based on self-adaptive hierarchical event storage

Technical Field

The invention belongs to the technical field of streaming video understanding and reasoning, and particularly relates to a streaming video reasoning method and system based on self-adaptive hierarchical event storage.

Background

Early streaming video understanding methods (full-context stacking and compression methods) attempted to place all of the historical information into the MLLM context window. To cope with the infinitely growing nature of video streams, these methods typically employ sparse frame sampling or temporal pooling (Temporal Pooling) to compress the historical information. However, simple stacking rapidly exhausts the Token budget, and aggressive compression causes irreversible loss of fine-grained temporal detail. More importantly, when the context is too long, the model's attention mechanism tends to "collapse" (Attention Collapse) amid large amounts of noise, so that it cannot focus on the truly critical information.

Another class of methods, naive RAG based on external memory, utilizes an external memory bank to store past observations and uses retrieval-augmented generation (RAG) to retrieve information. The common practice is to use the user's original query (Raw Query) to compute similarity to the historical memory (e.g., top-k search), and then stitch the retrieved information onto the current frame. This "fixed search and stack" strategy suffers from serious semantic misalignment: the user's question (e.g., "is there a teddy bear in the classroom?") may retrieve visually similar but semantically unrelated segments (e.g., segments that merely contain a classroom but no teddy bear), resulting in erroneous answers. In addition, this approach searches on every frame, without judging the current context, wasting computing resources.
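The "fixed search and stack" baseline criticized above can be sketched as follows. This is an illustrative reconstruction, not code from the patent: the toy embedding representation (plain lists of floats) and the function names are assumptions.

```python
# Illustrative sketch of the naive RAG baseline: embed the raw user
# query, rank stored clips by cosine similarity, and stitch the top-k
# hits onto the current frame, on every single frame.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

def naive_rag_context(raw_query_emb, memory, current_frame, k=3):
    """memory: list of (clip, embedding) pairs. Retrieval runs
    unconditionally, with no check of whether the short-term context
    already suffices and no rewriting of the raw query."""
    ranked = sorted(memory, key=lambda m: cosine(raw_query_emb, m[1]),
                    reverse=True)
    return [clip for clip, _ in ranked[:k]] + [current_frame]
```

Both failure modes named in the background are visible here: the raw query embedding can match visually similar but irrelevant clips, and the retrieval cost is paid on every frame regardless of need.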
The main and most critical technical problem is that, in large-scale long-term video streams, there is a fundamental contradiction between the infinitely growing temporal context and the extremely sparse effective evidence. As the video stream continues to be input, the amount of historical data expands linearly without bound, while the user's question at any moment often relies only on very small and highly localized key segments of the history (i.e., an "oasis"). If the prior art adopts a Full Context strategy, blindly stacking all historical frames as input to a multimodal large model (MLLM), the Token budget is rapidly exhausted and the computation cost rises sharply, and a severe Attention Collapse phenomenon occurs: the model's attention mechanism cannot focus within the noise desert of massive redundant information, so key signals are submerged and hallucinated or wrong answers are produced. Conversely, if an aggressive compression or sparse sampling strategy is adopted to save overhead, irreversible loss of fine-grained spatio-temporal detail results, and decisive evidence necessary for subsequent reasoning is easily erased by accident during compression. Therefore, under the constraint of limited video memory and computing resources, the prior art cannot resolve the antinomy between storing massive redundant historical data and retrieving precise, tiny pieces of evidence, and it is difficult to reconcile acute perception of the current moment with accurate backtracking of long-term historical details.

A secondary problem is the misalignment of semantic retrieval.
Existing retrieval-augmented generation (RAG) methods generally adopt a rigid, fixed retrieval strategy: similarity matching based on embedding vectors is performed directly between the original query text entered by the user and the video segments in the historical memory bank. This mechanism suffers from a serious semantic gap when handling complex streaming video reasoning tasks. Since users' natural-language questions often contain high-level semantics, complex reference relationships, or implicit temporal logic (e.g., asking "where was the person in red just before?"), directly using the query for retrieval often matches visually similar but semantically unrelated segments (e.g., matching only the visual features of "red clothing" rather than the correct points in time or events), so the retrieval results are rich in noise unrelated to the current reasoning task. This retrieval mode, lacking task adaptability, cannot generate accurate search hypotheses from the intermediate state of reasoning, so when the model faces detail-backtracking or multi-step reasoning problems in long-term video, it produces wrong answers because the key decisive evidence cannot be located, seriously restricting the understanding precision of the system in complex interactive scenarios.