US-20260127166-A1 - VideoRAG using Natural Language as Intermediate Representation in Multi-Camera, Closed-Domain Applications
Abstract
A Video Retrieval Augmented Generation (VideoRAG) system for closed-domain applications that uses natural language text as an intermediate representation between video content and query systems. A unified vision-language model (VLM) processes video frames and generates structured JSON text descriptions conforming to domain-specific event schemas, while simultaneously answering natural language queries through retrieval-augmented generation. The natural language intermediate representation provides substantial storage-efficiency improvements over embedding-based approaches, human-interpretable analytics capabilities, and cross-camera entity tracking. The architecture supports closed-domain applications including, but not limited to, retail analytics, healthcare monitoring, and industrial safety operations.
Inventors
- Faria Azim
- Ikra Iftekhar Shuvo
Assignees
- Faria Azim
- Ikra Iftekhar Shuvo
Dates
- Publication Date: 2026-05-07
- Application Date: 2025-12-29
Claims (20)
- 1 . A computer-implemented method for video retrieval augmented generation using natural language intermediate representation, comprising: (a) Scene Sensing Module—Processing sampled video frames to: (i) detect meaningful motion events while filtering environmental motion comprising at least curtain movements and shadow movements, (ii) generate sensory logs comprising crude entity descriptions in structured natural language format, wherein said crude entity descriptions follow a pattern entity_type, and (iii) route disaster alerts comprising at least fire, smoke, or flood detections to an alert pathway; (b) Entity Trail Logger—Employing a reasoning-capable vision-language model to: (i) analyze said sensory logs using thought tokens comprising internal reasoning mechanisms to select relevant video clips for detailed examination, (ii) generate entity trail logs comprising clip-based temporal segments, wherein each entity trail log entry specifies: an entity identifier, a camera identifier, a clip start time, a clip end time, and a natural language description of entity activities within said temporal segment, (iii) identify key frames within each clip-based temporal segment where entities are most identifiable, (iv) track entities across multiple camera views by computing weighted attribute correlation scores between entities appearing in different cameras, wherein said weighted correlation comprises clothing attribute similarity, height similarity, and temporal proximity, (v) store both sensory logs and entity trail logs in a dual-schema storage architecture, wherein sensory logs are stored in chronologically-sorted files for sequential access and entity trail logs are stored in a database indexed using a temporal B-tree index and an entity-reference inverted index generated from said key frames, and (vi) delay indexing of entity trail log entries until entity state remains unchanged for a predetermined stabilization duration, preventing premature indexing of incomplete entity trails; (c) User-facing Agent—Processing natural language queries to: (i) employ thought tokens comprising internal reasoning mechanisms to analyze query intent and determine which log types to search, (ii) retrieve relevant log entries from said sensory logs and said entity trail logs using hybrid sparse-dense text matching algorithms comprising BM25 for sparse lexical matching and sentence embeddings for dense semantic matching, (iii) determine relevant video segments based on retrieved log entries, and (iv) generate natural language responses with mandatory citations to source events and video timestamps, wherein said responses are derived exclusively from said retrieved log entries.
- 2 . The method of claim 1 , wherein said vision-language model processing pipeline achieves real-time or faster-than-real-time processing for video frame analysis while concurrently handling natural language queries.
- 2 a. The method of claim 1 , wherein said vision-language model processing for scene sensing, entity trail logging, and query processing is performed by a single unified vision-language model instance configured with continuous batching to enable concurrent processing of video frames and natural language queries, thereby minimizing graphics processing unit requirements and wherein said continuous batching mechanism dynamically batches requests from video frame processing and natural language query processing pipelines with efficient utilization of available processing resources.
- 2 b. The method of claim 1 , wherein said vision-language model employs parallelization techniques to distribute model computation across a plurality of graphics processing units, enabling efficient deployment on commodity hardware.
- 3 . The method of claim 1 , wherein said storing of natural language descriptions results in a storage consumption rate of approximately 200 to 500 bytes per described event, achieving a compression ratio of at least 10:1 compared to storing dense vector embeddings for an equivalent duration of video surveillance.
- 4 . The method of claim 1 , wherein said adaptive sampling logic increases the sampling rate to approximately 5 frames per second when an optical flow magnitude exceeds a motion threshold and decreases the sampling rate to approximately 1 frame per second when said magnitude falls below said threshold.
- 5 . The method of claim 1 , wherein said hybrid retrieval mechanism comprises: (a) a sparse retrieval stage utilizing a BM25 algorithm to identify initial candidate records; followed by (b) a dense reranking stage utilizing sentence transformer embeddings to reorder said candidate records based on semantic similarity to the user query; wherein said hybrid mechanism achieves higher normalized discounted cumulative gain (NDCG) scores than sparse-only or dense-only retrieval methods.
- 6 . The method of claim 1 , wherein said domain-specific schema defines attributes for person detection including at least a height value or category enumerated as short, medium, or tall, and clothing color/attribute categories utilizing a standardized color palette.
- 7 . The method of claim 1 , wherein said vision-language model in query processing mode is configured with a system prompt explicitly prohibiting the use of parametric knowledge not present in the retrieved log records (sensory logs or entity trail logs), and wherein said method further comprises a validation step to verify that every citation in the generated response corresponds to a valid trail identifier or log timestamp in the retrieved context.
- 8 . The method of claim 1 , further comprising a validation pipeline for the generated natural language descriptions that: (a) verifies conformance to said domain-specific schema; (b) checks for temporal consistency between the description timestamp and video frame timestamp; and (c) triggers a regeneration of the description with error-feedback prompting if validation fails.
- 9 . The method of claim 1 , wherein said predetermined stabilization duration is a predetermined amount of time, wherein entity trail log entries are indexed only after the entity state remains unchanged for said predetermined amount of time.
- 9 a. The method of claim 1 , further comprising an Event Statistics Generator module that: (a) analyzes said sensory logs and said entity trail logs to generate event count statistics over specified time ranges, (b) maintains an event counter log tracking occurrence frequencies of specific event types, and (c) provides a query interface for retrieving aggregated event counts grouped by time windows and locations.
- 9 b. The method of claim 1 , further comprising an Emergency Channel module that: (a) receives disaster alerts from said Scene Sensing Module and suspicious activity alerts from said Entity Trail Logger, (b) classifies alerts into severity levels comprising at least CRITICAL, HIGH, MEDIUM, and LOW priorities, and (c) executes responsive actions based on said severity levels, wherein said responsive actions comprise at least one of: notifying security personnel, contacting law enforcement, controlling door locks, or isolating spatial zones.
- 9 c. The method of claim 9 b, wherein said responsive actions for CRITICAL severity alerts comprise sending live video feeds from relevant cameras to security personnel and triggering automated notifications to emergency services within 30 seconds of alert generation.
- 10 . A video retrieval augmented generation apparatus implementing natural language intermediate representation architecture, comprising: (a) Scene sensing processor configured to: (i) process sampled video frames to detect meaningful motion events while filtering environmental motion, (ii) generate sensory logs comprising crude entity descriptions in structured natural language format, and (iii) route disaster alerts to an alert pathway; (b) Entity trail logger processor with reasoning capabilities, configured to: (i) analyze said sensory logs using thought tokens to select relevant video clips, (ii) generate entity trail logs comprising clip-based temporal segments with entity identifiers, camera identifiers, clip start times, clip end times, and natural language activity descriptions, (iii) identify key frames where entities are most identifiable, (iv) track entities across cameras using weighted attribute correlation, (v) implement a dual-schema storage architecture storing sensory logs in chronologically-sorted files for sequential access and entity trail logs in a database with temporal B-tree indexing and entity-reference inverted index generated from key frames, and (vi) implement a delayed indexing mechanism that indexes entity trail log entries only after entity state remains unchanged for a predetermined stabilization duration; (c) User-facing agent processor with reasoning capabilities, configured to: (i) employ thought tokens to analyze query intent and determine log types to search, (ii) implement a hybrid retrieval subsystem performing sparse lexical matching using BM25 and dense semantic matching using sentence embeddings on said sensory logs and said entity trail logs, (iii) determine relevant video segments based on retrieved logs, and (iv) generate natural language responses with mandatory citations to source events and video timestamps.
- 11 . The apparatus of claim 10 , wherein said vision-language inference processor comprises one or more graphics processing units (GPUs), and wherein said apparatus is configured to process video from one or more camera sources while simultaneously handling natural language queries.
- 12 . The apparatus of claim 10 , wherein each stored text record in said database comprises a unique event identifier, a timestamp, a camera identifier, and a JSON-formatted description string containing the structured natural language description.
- 13 . The apparatus of claim 10 , wherein said database utilizes a B-tree index for temporal range queries and an inverted index for entity identifier lookups, enabling retrieval complexity of O(log n) for temporal queries.
- 14 a. The apparatus of claim 10 , further comprising an event statistics processor configured to: (a) analyze said sensory logs and said entity trail logs to generate event count statistics, (b) maintain an event counter log, and (c) provide aggregated event counts grouped by time windows and locations.
- 14 b. The apparatus of claim 10 , further comprising an emergency channel interface configured to: (a) receive disaster alerts from said scene sensing processor and suspicious activity alerts from said entity trail logger processor, (b) classify alerts into severity levels, and (c) execute responsive actions comprising at least notification of security personnel, contact of law enforcement, or control of door locks.
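The mechanisms recited in the claims above lend themselves to compact illustration. The sketches that follow are illustrative only: library choices, model names, thresholds, parameter values, and example data are assumptions made for demonstration, not material taken from the disclosure.

The first sketch shows the two-stage hybrid retrieval of claims 1(c)(ii) and 5: BM25 sparse lexical matching selects candidate log records, then sentence-embedding similarity reranks them. The rank_bm25 and sentence-transformers libraries, the all-MiniLM-L6-v2 model, and the sample log entries are assumed for illustration.

```python
# Minimal sketch of the two-stage hybrid retrieval of claims 1(c)(ii) and 5:
# BM25 sparse matching selects candidates, then dense sentence embeddings rerank them.
# Library choices (rank_bm25, sentence-transformers) and the log texts are illustrative.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

# Hypothetical entity trail log descriptions (the searchable natural language records).
logs = [
    "person_3 in blue jacket picked up a phone in the electronics aisle at 14:23:47",
    "person_1 in red shirt entered through the main door at 09:05:12",
    "person_3 in blue jacket returned the phone to the shelf at 14:23:59",
    "cart left unattended near checkout lane 2 at 16:40:03",
]

query = "what did the person in the blue jacket do in the electronics aisle?"

# Stage 1: sparse lexical matching (BM25) over whitespace-tokenized logs.
bm25 = BM25Okapi([doc.lower().split() for doc in logs])
candidates = bm25.get_top_n(query.lower().split(), logs, n=3)

# Stage 2: dense semantic reranking of the BM25 candidates with sentence embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
query_emb = model.encode(query, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(query_emb, cand_embs)[0]

reranked = sorted(zip(candidates, scores.tolist()), key=lambda x: x[1], reverse=True)
for text, score in reranked:
    print(f"{score:.3f}  {text}")
```

In the claimed system the reranked records would then be mapped back to their camera identifiers and clip timestamps so that the generated response can cite its source events, as required by claim 1(c)(iv).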
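Next, a minimal sketch of the cross-camera entity association of claims 1(b)(iv) and 10(b)(iv), which computes a weighted correlation over clothing attributes, height category, and temporal proximity. The weights, the Jaccard encoding of clothing attributes, the five-minute temporal window, and the 0.75 decision threshold are illustrative assumptions.

```python
# Minimal sketch of the weighted attribute correlation used for cross-camera entity
# association in claim 1(b)(iv). The weights, attribute encodings, and the 0.75
# matching threshold are illustrative assumptions, not values taken from the disclosure.
from dataclasses import dataclass

@dataclass
class EntityObservation:
    camera_id: str
    clothing: set[str]        # e.g. {"blue", "jacket", "jeans"}
    height_category: str      # "short" | "medium" | "tall"
    timestamp_s: float        # seconds since start of day

def clothing_similarity(a: set[str], b: set[str]) -> float:
    """Jaccard overlap of clothing attribute sets (illustrative choice)."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def height_similarity(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0

def temporal_proximity(t_a: float, t_b: float, window_s: float = 300.0) -> float:
    """1.0 for simultaneous observations, decaying linearly to 0 over the window."""
    return max(0.0, 1.0 - abs(t_a - t_b) / window_s)

def correlation_score(a: EntityObservation, b: EntityObservation,
                      w_clothing=0.5, w_height=0.2, w_time=0.3) -> float:
    return (w_clothing * clothing_similarity(a.clothing, b.clothing)
            + w_height * height_similarity(a.height_category, b.height_category)
            + w_time * temporal_proximity(a.timestamp_s, b.timestamp_s))

cam1 = EntityObservation("cam_01", {"blue", "jacket"}, "tall", 51_827.0)
cam2 = EntityObservation("cam_04", {"blue", "jacket", "backpack"}, "tall", 51_905.0)
score = correlation_score(cam1, cam2)
print(f"correlation={score:.2f}, treated as same entity: {score >= 0.75}")
```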
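A minimal sketch of the delayed-indexing behavior of claims 1(b)(vi) and 9: a trail entry is committed to the search index only after the entity's state has remained unchanged for a stabilization duration. The 30-second duration, the in-memory stand-in for the index, and the entry fields are assumptions.

```python
# Minimal sketch of the delayed-indexing mechanism of claims 1(b)(vi) and 9: an entity
# trail entry is committed to the search index only after its state has remained
# unchanged for a stabilization duration. The 30-second duration and the in-memory
# "index" are illustrative assumptions.
import time

STABILIZATION_S = 30.0

class DelayedIndexer:
    def __init__(self):
        self._pending = {}   # entity_id -> (trail_entry, last_change_time)
        self.index = []      # stands in for the database index of claim 1(b)(v)

    def update(self, entity_id: str, trail_entry: dict, now: float | None = None) -> None:
        """Record the latest trail entry for an entity; restarts its stabilization timer."""
        self._pending[entity_id] = (trail_entry, now if now is not None else time.time())

    def flush_stable(self, now: float | None = None) -> list[dict]:
        """Move entries whose state has been unchanged for STABILIZATION_S into the index."""
        now = now if now is not None else time.time()
        stable = [eid for eid, (_, t) in self._pending.items() if now - t >= STABILIZATION_S]
        flushed = []
        for eid in stable:
            entry, _ = self._pending.pop(eid)
            self.index.append(entry)
            flushed.append(entry)
        return flushed

indexer = DelayedIndexer()
indexer.update("person_3", {"entity": "person_3", "camera": "cam_01",
                            "clip": ("14:23:40", "14:24:10"),
                            "description": "picked up a phone, examined it"}, now=0.0)
# 10 s after the last state change the entity is still "active", so nothing is indexed yet.
assert indexer.flush_stable(now=10.0) == []
# 45 s after the last state change the trail is considered stable and is indexed.
assert len(indexer.flush_stable(now=45.0)) == 1
```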
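Claim 4's adaptive sampling logic can be sketched as a small rate controller that raises the sampling rate to roughly 5 frames per second while the optical-flow magnitude exceeds a motion threshold and falls back to roughly 1 frame per second otherwise. The threshold value and the flow-magnitude inputs are assumptions; a real deployment would derive the magnitude from a dense optical-flow estimate over consecutive frames.

```python
# Minimal sketch of the adaptive sampling logic of claim 4: the sampling rate rises to
# roughly 5 fps while optical-flow magnitude exceeds a motion threshold and falls back
# to roughly 1 fps otherwise. The threshold and the example magnitudes are assumptions.

IDLE_FPS = 1.0
ACTIVE_FPS = 5.0
MOTION_THRESHOLD = 2.0  # illustrative mean flow magnitude, in pixels per frame

def sampling_rate(flow_magnitude: float) -> float:
    """Return the target sampling rate for the next interval."""
    return ACTIVE_FPS if flow_magnitude > MOTION_THRESHOLD else IDLE_FPS

def sampling_interval_s(flow_magnitude: float) -> float:
    """Seconds to wait before grabbing the next frame."""
    return 1.0 / sampling_rate(flow_magnitude)

# Example: a quiet hallway, then a person walks through, then quiet again.
for magnitude in (0.3, 0.4, 4.1, 5.7, 3.2, 0.5):
    print(f"flow={magnitude:4.1f} -> sample at {sampling_rate(magnitude):.0f} fps "
          f"(every {sampling_interval_s(magnitude):.2f} s)")
```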
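The storage figures of claim 3 (approximately 200 to 500 bytes per described event, at least 10:1 versus dense embeddings) can be checked with a back-of-envelope calculation. The indexed frame rate, embedding dimensionality, and event rate below are assumptions chosen only to make the comparison concrete; under these assumptions the ratio comfortably exceeds the claimed 10:1 floor.

```python
# Back-of-envelope comparison behind claim 3's storage figures: frame-level dense
# embeddings versus event-driven natural language records. The frame rate, embedding
# dimensionality, and event rate are illustrative assumptions.

HOURS = 24
EMBED_DIM = 768                  # typical CLIP-style embedding size
BYTES_PER_FLOAT = 4
FRAMES_PER_SECOND_INDEXED = 1    # even a modest 1 frame/s indexed continuously

frames = HOURS * 3600 * FRAMES_PER_SECOND_INDEXED
embedding_bytes = frames * EMBED_DIM * BYTES_PER_FLOAT

EVENTS_PER_HOUR = 40             # event-driven logging during active periods only
BYTES_PER_EVENT = 350            # midpoint of the 200-500 bytes/event range in claim 3

text_bytes = HOURS * EVENTS_PER_HOUR * BYTES_PER_EVENT

print(f"frame embeddings : {embedding_bytes / 1e6:7.1f} MB per camera-day")
print(f"text event logs  : {text_bytes / 1e6:7.1f} MB per camera-day")
print(f"ratio            : {embedding_bytes / text_bytes:7.0f}x")
```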
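A minimal sketch of the dual-schema storage of claims 1(b)(v), 12, and 13, assuming SQLite as the database: sensory logs are appended to a chronologically ordered file for sequential access, while entity trail records are stored as rows carrying a JSON-formatted description, a temporal index (SQLite secondary indexes are B-trees), and an entity-reference table acting as a simple inverted index. File names, column names, and the example record are assumptions.

```python
# Minimal sketch of the dual-schema storage of claims 1(b)(v), 12, and 13: sensory logs
# appended to a chronologically ordered file, entity trail logs stored in a database with
# a temporal index plus an entity-reference lookup table serving as an inverted index.
import json
import sqlite3

# Sensory logs: append-only, chronologically sorted text file for sequential scans.
with open("sensory_log_2026-01-27.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps({"ts": "2026-01-27T14:23:47", "camera": "cam_01",
                        "entity": "person_3", "text": "person in blue jacket near shelf"}) + "\n")

# Entity trail logs: database rows with temporal and entity-reference indexes.
db = sqlite3.connect("trail_logs.db")
db.executescript("""
CREATE TABLE IF NOT EXISTS trail_log (
    event_id   TEXT PRIMARY KEY,
    ts_start   TEXT NOT NULL,
    ts_end     TEXT NOT NULL,
    camera_id  TEXT NOT NULL,
    descr_json TEXT NOT NULL      -- JSON-formatted structured description (claim 12)
);
CREATE INDEX IF NOT EXISTS idx_trail_time ON trail_log (ts_start);   -- temporal B-tree
CREATE TABLE IF NOT EXISTS entity_ref (                              -- inverted index
    entity_id TEXT NOT NULL,
    event_id  TEXT NOT NULL REFERENCES trail_log(event_id)
);
CREATE INDEX IF NOT EXISTS idx_entity ON entity_ref (entity_id);
""")

descr = {"entity": "person_3", "activity": "picked up phone, held 12 s, returned it",
         "key_frame": "cam_01_142351.jpg"}
db.execute("INSERT OR REPLACE INTO trail_log VALUES (?, ?, ?, ?, ?)",
           ("evt_0001", "2026-01-27T14:23:40", "2026-01-27T14:24:10", "cam_01",
            json.dumps(descr)))
db.execute("INSERT INTO entity_ref VALUES (?, ?)", ("person_3", "evt_0001"))
db.commit()

# Temporal range query (index seek) combined with an entity-reference lookup.
rows = db.execute("""
    SELECT t.event_id, t.camera_id, t.descr_json FROM trail_log t
    JOIN entity_ref e ON e.event_id = t.event_id
    WHERE e.entity_id = ? AND t.ts_start BETWEEN ? AND ?""",
    ("person_3", "2026-01-27T14:00:00", "2026-01-27T15:00:00")).fetchall()
print(rows)
```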
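Finally, a minimal sketch of the emergency-channel behavior of claims 9b and 9c: alerts are classified into CRITICAL, HIGH, MEDIUM, and LOW severities and mapped to responsive actions. The classification rule and action lists are illustrative assumptions; a deployed system would drive both from the domain-specific event schema.

```python
# Minimal sketch of the emergency-channel routing of claims 9b and 9c: alerts are
# classified into severity levels and mapped to responsive actions. The mapping below
# and the 30-second notification budget are illustrative assumptions.
from enum import Enum

class Severity(Enum):
    CRITICAL = 4
    HIGH = 3
    MEDIUM = 2
    LOW = 1

ACTIONS = {
    Severity.CRITICAL: ["send live feeds to security", "notify emergency services <= 30 s"],
    Severity.HIGH:     ["notify security personnel", "lock affected doors"],
    Severity.MEDIUM:   ["notify security personnel"],
    Severity.LOW:      ["log for later review"],
}

def classify(alert_type: str) -> Severity:
    """Toy classification rule; a deployed system would use the domain schema."""
    if alert_type in {"fire", "smoke", "flood"}:
        return Severity.CRITICAL
    if alert_type in {"intrusion", "violence"}:
        return Severity.HIGH
    if alert_type in {"loitering"}:
        return Severity.MEDIUM
    return Severity.LOW

for alert in ("smoke", "loitering"):
    sev = classify(alert)
    print(alert, sev.name, ACTIONS[sev])
```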
Description
FIELD OF INVENTION

The present invention relates to video content analysis and retrieval systems, and more particularly to a technical architecture employing natural language text (not embeddings) as an intermediate representation layer between video content and retrieval-augmented generation systems for closed-domain VideoRAG applications.

BACKGROUND

Definitions

Video-RAG addresses correctly responding to natural language queries regarding online videos (continuously streamed from cameras) or offline videos (stored). For example, “What did the person in the blue jacket do in the electronics aisle between 2 and 3 PM?” with responses including timestamps and video segment references.

Closed-Domain Applications refer to VideoRAG systems for predetermined application areas rather than general-purpose analysis. The scope of relevant events is known beforehand (customer behaviors in retail, patient activities in healthcare, safety compliance in industrial settings) rather than arbitrary video content analysis.

Related Prior Art

Video-RAG Systems and Embedding-Based Approaches

Retrieval-Augmented Generation (RAG) retrieves relevant text from a knowledge base to generate accurate, grounded natural language responses. RAG systems have been extended to video through various approaches. The predominant architecture, as disclosed in U.S. Ser. No. 11/954,151B1 (Coram AI, 2024), employs joint embedding spaces in which video content and natural language queries are converted to dense vectors for similarity matching using CLIP-style architectures. This approach processes video frames through vision encoders that generate fixed-dimensional embeddings (typically 512 or 768 dimensions), stores these in vector databases, and performs cosine similarity matching against query embeddings.

While embedding-based systems achieve semantic understanding, they present fundamental technical limitations for long-duration VideoRAG:

1. Storage Scaling for Intermediate Representations: Embedding-based systems suffer from a fundamental architectural limitation. Vector embeddings are generated for video frames regardless of content significance, so the searchable index grows proportionally with video frame count, not event count. Embeddings are generated and stored continuously even during inactivity, such as overnight hours when retail stores are closed or hallways with no foot traffic. This disproportionately large number of vector embeddings causes significant storage and retrieval challenges. In contrast, natural language intermediate representations are event-driven: searchable descriptions are stored only when significant events are detected (person enters, picks up product, completes transaction). Since VideoRAG typically applies where videos contain long inactivity periods interspersed with brief relevant activity, event-based representation storage fundamentally scales better than frame-based embedding storage for closed-domain applications.

2. Loss of Human-Explainability: Dense vector representations lack human-readable semantics. When retrieval systems return video segments based on embedding similarity scores (e.g., cosine similarity > 0.85), operators cannot inspect the intermediate representation to understand why matches occurred or verify correctness without reviewing video footage.

3. Fine-Grained Detail Capture: Embeddings require compression, inherently losing frame-level details. Applications requiring precise temporal tracking (e.g., “person picked up item at 14:23:47, held for 12 seconds, returned to shelf at 14:23:59”) cannot reliably extract such information from fixed-length vector representations.

US patent US20250181641A1 (NEC Laboratories America Inc., 2025) discloses a two-stage incremental system where lightweight models generate text descriptions that are converted to embeddings for indexing, followed by selective heavyweight model processing at query time. It maintains vector embeddings rather than human-readable text logs and lacks persistent entity tracking across cameras.

Chinese patent CN120656102A discloses multimodal vector systems using dual-tower retrieval frameworks with dense embeddings from visual, audio, and ASR text modalities. Chinese patent CN120316308B employs sparse vector representations using dictionary learning (K-SVD) and orthogonal matching pursuit (OMP) algorithms. While sparse representations reduce storage through mathematical sparsity, they remain fundamentally vectorial and lack linguistic interpretability and fine-grained detail.

Text-Based Video Indexing Prior Art

Early text-based video indexing, as disclosed in expired U.S. Pat. No. 5,835,667A (Carnegie Mellon, 1996), used speech transcription to create indexed text transcripts for video search. More recent U.S. Ser. No. 10/999,566B1 (Amazon, 2021) covers neural network generation of textual descriptions from video segments. However, these text-based indexing systems differ architecturally from the present invention in critical ways:

1. Comprehensive vs. Selective Event Capture: