EP-4742060-A1 - METHOD AND SYSTEMS FOR EFFICIENTLY STORING VIDEO DATA
Abstract
A video feed including a plurality of consecutive video frames may be captured. A text-based description is generated for each of the plurality of consecutive video frames using an Artificial Intelligence (AI) based Video-to-Language Model (VLM). The text-based description describes one or more of an object, an activity and/or a scene captured in the respective video frame of the plurality of consecutive video frames. Some of the plurality of consecutive video frames of the video feed are selected to be reference video frames based at least in part on the text-based descriptions. The text-based descriptions are stored to a video surveillance data repository. The reference video frames are also stored to the video surveillance data repository, while those video frames of the plurality of consecutive video frames that are not selected as reference video frames are not stored.
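For illustration only, and not part of the patent text, the following Python sketch shows one way the storage flow summarized above might be organized. The `describe_frame` (VLM caption) and `is_reference` (frame selection) callables, along with the simple dictionary-backed repository, are hypothetical placeholders rather than the disclosed implementation.

```python
# Minimal, assumption-laden sketch of the caption-and-reference-frame storage flow.
from dataclasses import dataclass, field


@dataclass
class Repository:
    """Stand-in for the video surveillance data repository."""
    descriptions: dict = field(default_factory=dict)      # frame_id -> text-based description
    reference_frames: dict = field(default_factory=dict)  # frame_id -> stored pixel data
    correspondence: dict = field(default_factory=dict)    # frame_id -> reference frame_id


def store_feed(frames, describe_frame, is_reference, repo=None):
    """Store a caption for every frame, but pixel data only for reference frames.

    `describe_frame(frame)` stands in for the VLM; `is_reference(descriptions, frame_id)`
    stands in for whatever selection rule is in use.
    """
    repo = repo or Repository()
    last_ref_id = None
    for frame_id, frame in enumerate(frames):
        caption = describe_frame(frame)                 # text-based description from the VLM
        repo.descriptions[frame_id] = caption
        if last_ref_id is None or is_reference(repo.descriptions, frame_id):
            repo.reference_frames[frame_id] = frame     # keep this frame as a reference
            last_ref_id = frame_id
        # every description maps to the most recent reference frame; frames that are
        # not selected as references are simply never written to the repository
        repo.correspondence[frame_id] = last_ref_id
    return repo
```

Because only captions and the occasional reference frame are persisted, the repository footprint scales with scene changes rather than with the raw frame count.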
Inventors
- AHADH, Abdhul
- S K, Pramod Krishna
Assignees
- Honeywell International Inc.
Dates
- Publication Date: 2026-05-13
- Application Date: 2025-10-20
Claims (15)
- A method for storing video surveillance data of a video surveillance system in a video surveillance data repository (18), the method comprising: receiving a video feed (14) captured by a video camera (16) of the video surveillance system, the video feed including a plurality of consecutive video frames; generating a text-based description for each of the plurality of consecutive video frames using an Artificial Intelligence (AI) based Video-to-Language Model (VLM) (22), the text-based description describing one or more of an object, an activity and/or a scene captured in the respective video frame of the plurality of consecutive video frames; selecting less than all of the plurality of consecutive video frames of the video feed to be reference video frames, wherein the selection is based at least in part on the text-based descriptions for the plurality of consecutive video frames; storing the text-based descriptions for each of the plurality of consecutive video frames to the video surveillance data repository; and storing the reference video frames to the video surveillance data repository while not storing those video frames of the plurality of consecutive video frames that are not selected as reference video frames.
- The method of claim 1, further comprising: for each of the generated text-based descriptions, identifying a corresponding one of the reference video frames, wherein at least two of the text-based descriptions correspond to a common reference video frame; and storing a correspondence between the generated text-based descriptions and the reference video frames.
- The method of either of claims 1-2, wherein selecting less than all of the plurality of consecutive video frames of the video feed to be reference video frames comprises: determining when the text-based descriptions of the plurality of consecutive video frames indicate at least a threshold change in one or more of an object, an activity and/or a scene between two consecutive video frames of the plurality of consecutive video frames, and when so, selecting the later one of the two consecutive video frames as one of the reference video frames.
- The method of any of claims 1-2, wherein selecting less than all of the plurality of consecutive video frames of the video feed to be reference video frames comprises: determining when the text-based descriptions of the plurality of consecutive video frames do not indicate at least a threshold change in one or more of an object, an activity and/or a scene between two consecutive video frames of the plurality of consecutive video frames, and when so, not selecting the later one of the two consecutive video frames as one of the reference video frames.
- The method of any of claims 1-4, further comprising: tokenizing and embedding the text-based description of each of the plurality of consecutive video frames; determining a similarity index between the embedded text-based descriptions of pairs of consecutive video frames of the plurality of consecutive video frames; and for each pair of consecutive video frames, when the similarity index between the embedded text-based descriptions of the pair of consecutive video frames falls below a threshold value, selecting the later one of the pair of consecutive video frames as a reference video frame.
- The method of any of claims 1-4, further comprising: tokenizing and embedding the text-based description of each of the plurality of consecutive video frames; determining a similarity index between the embedded text-based descriptions of pairs of consecutive video frames of the plurality of consecutive video frames; and determining a rolling average of the similarity indices for at least part of the plurality of consecutive video frames, and when the rolling average of the similarity indices changes by more than a temporal consistency threshold, selecting one of the plurality of consecutive video frames as a reference video frame.
- The method of any of claims 1-6, further comprising: determining an attention score for each of the text-based descriptions for each of the plurality of consecutive video frames; determining when the attention score changes between pairs of consecutive video frames of the plurality of consecutive video frames by more than an attention change threshold; and for each pair of consecutive video frames, when the attention score changes between the pair of consecutive video frames by more than the attention change threshold, selecting the later one of the pair of consecutive video frames as a reference video frame.
- The method of any of claims 1-4, further comprising: tokenizing and embedding the text-based description of each of the plurality of consecutive video frames; determining a similarity index between the embedded text-based descriptions of pairs of consecutive video frames of the plurality of consecutive video frames; determining an attention score for each of the text-based descriptions for each of the plurality of consecutive video frames, and determining an attention score change between pairs of consecutive video frames of the plurality of consecutive video frames; determining a rolling average of the similarity indices for at least part of the plurality of consecutive video frames, and determining a change in the rolling average of the similarity indices; and determining a reference frame detection trigger parameter based on a weighted combination of the similarity index, the attention score change and the change in the rolling average of the similarity indices, wherein when the reference frame detection trigger parameter meets a reference frame detection trigger threshold, selecting a corresponding one of the plurality of consecutive video frames as one of the reference video frames.
- The method of any of claims 1-8, further comprising: receiving an input query from a user; retrieving from the video surveillance data repository a plurality of matching text-based descriptions that match the input query; retrieving from the video surveillance data repository one or more reference video frames that correspond to the plurality of matching text-based descriptions; and generating a reconstructed video feed using an Artificial Intelligence (AI) based Text-to-Video Model (TVM) (24), the TVM using the one or more reference video frames as a reference input to generate a plurality of reconstructed video frames that reconstruct the plurality of matching text-based descriptions.
- The method of claim 9, wherein the input query comprises one or more of a time based query and a search based query.
- A system for storing video surveillance data, comprising: an input (12) for receiving a video feed (14) captured by a video camera (16); a video surveillance data repository (18); a controller (20) operatively coupled to the input and the video surveillance data repository, the controller configured to: receive the video feed captured by the video camera, the video feed including a plurality of consecutive video frames; generate a text-based description for each of the plurality of consecutive video frames using an Artificial Intelligence (AI) based Video-to-Language Model (VLM) (22), the text-based description describing one or more of an object, an activity and/or a scene captured in the respective video frame of the plurality of consecutive video frames; select less than all of the plurality of consecutive video frames of the video feed to be reference video frames, wherein the selection is based at least in part on the text-based descriptions for the plurality of consecutive video frames; store the text-based descriptions for each of the plurality of consecutive video frames to the video surveillance data repository; and store the reference video frames to the video surveillance data repository while not storing those video frames of the plurality of consecutive video frames that are not selected as reference video frames.
- The system of claim 11, wherein the controller is configured to: for each of the generated text-based descriptions, identify a corresponding one of the reference video frames, wherein at least two of the text-based descriptions correspond to a common reference video frame; and store a correspondence between the generated text-based descriptions and the reference video frames.
- The system of either of claims 11-12, wherein the controller is configured to: select one of the plurality of consecutive video frames of the video feed as a reference video frame when the text-based descriptions of the plurality of consecutive video frames indicate at least a threshold change in context in the text-based descriptions of the plurality of consecutive video frames.
- The system of any of claims 11-13, wherein the controller is configured to: tokenize and embed the text-based description of each of the plurality of consecutive video frames; determine a similarity index between the embedded text-based descriptions of pairs of consecutive video frames of the plurality of consecutive video frames; and determine when the threshold change in context in the text-based descriptions of the plurality of consecutive video frames has occurred based at least in part on whether the similarity index between the embedded text-based descriptions falls below a threshold value.
- A non-transitory computer readable medium storing instructions that when executed by one or more processors cause the one or more processors to: receive a video feed (12) captured by a video camera (16), the video feed including a plurality of video frames; generate a text-based description for each of the plurality of video frames using an Artificial Intelligence (AI) based Video-to-Language Model (VLM) (22), the text-based description describing one or more of an object, an activity and/or a scene captured in the respective video frame of the plurality of video frames; select less than all of the plurality of video frames of the video feed to be reference video frames, wherein the selection is based at least in part on the text-based descriptions for the plurality of video frames; store the text-based descriptions for each of the plurality of video frames to a video surveillance data repository (18); and store the reference video frames to the video surveillance data repository while not storing those video frames of the plurality of video frames that are not selected as reference video frames.
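As a hedged illustration of how the selection criteria recited in claims 5-8 could be combined, the sketch below embeds consecutive descriptions, compares them with a cosine similarity index, tracks a rolling average of those indices, and folds an attention-score change into a weighted trigger. The `embed` and `attention_score` helpers, the use of (1 - similarity) as the dissimilarity term, and every threshold and weight value are assumptions chosen for demonstration, not values taken from the disclosure.

```python
# Hypothetical sketch of the reference-frame selection logic suggested by claims 5-8.
import numpy as np


def cosine_similarity(a, b):
    # similarity index between two embedded text-based descriptions
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def select_reference_frames(captions, embed, attention_score,
                            sim_threshold=0.85, attn_threshold=0.2,
                            rolling_window=10, temporal_threshold=0.1,
                            weights=(0.5, 0.3, 0.2), trigger_threshold=0.5):
    """Return the indices of frames to keep as reference frames (illustrative only)."""
    vectors = [embed(c) for c in captions]          # tokenize + embed each description
    attn = [attention_score(c) for c in captions]   # per-description attention score
    reference_ids = [0]                             # first frame kept as an initial reference
    sims = []
    for i in range(1, len(captions)):
        sim = cosine_similarity(vectors[i - 1], vectors[i])
        sims.append(sim)
        # change in the rolling average of similarity indices (claim 6)
        window = sims[-rolling_window:]
        prev_window = sims[-rolling_window - 1:-1] or window
        rolling_change = abs(np.mean(window) - np.mean(prev_window))
        # attention score change between the pair of consecutive frames (claim 7)
        attn_change = abs(attn[i] - attn[i - 1])
        # weighted combination forming a reference frame detection trigger parameter (claim 8)
        trigger = (weights[0] * (1.0 - sim)
                   + weights[1] * attn_change
                   + weights[2] * rolling_change)
        if (sim < sim_threshold or attn_change > attn_threshold
                or rolling_change > temporal_threshold or trigger >= trigger_threshold):
            reference_ids.append(i)                 # keep the later frame of the pair
    return reference_ids
```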
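Along the same lines, a minimal sketch of the retrieval path described in claims 9 and 10 might look as follows, reusing the dictionary-backed repository from the earlier sketch. The `matches_query` predicate and `text_to_video` (the TVM call) are hypothetical placeholders.

```python
# Hypothetical sketch of query-driven reconstruction (claims 9-10).
def reconstruct_for_query(query, repo, matches_query, text_to_video):
    # retrieve stored text-based descriptions that match the (time- or search-based) query
    matching = {fid: cap for fid, cap in repo.descriptions.items()
                if matches_query(query, cap)}
    # retrieve the reference video frames those descriptions correspond to
    ref_frames = [repo.reference_frames[repo.correspondence[fid]]
                  for fid in matching]
    # the TVM uses each reference frame as a visual anchor while rendering the
    # matching description back into a reconstructed video frame
    return [text_to_video(caption, ref)
            for caption, ref in zip(matching.values(), ref_frames)]
```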
Description
TECHNICAL FIELD

The present disclosure relates generally to video surveillance systems. More particularly, the present disclosure relates to methods and systems for efficiently storing video surveillance data in a video surveillance system.

BACKGROUND

Video surveillance systems generate vast amounts of data requiring extensive storage capacity. This can translate into significant infrastructure and maintenance costs. The importance of this problem is underscored by the growing reliance on surveillance for security and operational monitoring across various sectors, including retail, transportation, healthcare, and public safety. As the number of cameras and the resolution of footage increase, so do the associated storage requirements and costs. Moreover, the retrieval and analysis of specific events within this data can be time-consuming and inefficient, posing additional operational challenges and potentially delaying critical responses. What would be desirable are methods and systems for reducing the storage requirements of a video surveillance system while still enabling recall of the video surveillance video feeds.

SUMMARY

The present disclosure relates generally to video surveillance systems. More particularly, the present disclosure relates to efficiently storing video surveillance data in a video surveillance system. An example may be found in a method that includes receiving a video feed captured by a video camera of a video surveillance system. The video feed includes a plurality of consecutive video frames. A text-based description is generated for each of the plurality of consecutive video frames using an Artificial Intelligence (AI) based Video-to-Language Model (VLM). The text-based description describes one or more of an object, an activity and/or a scene captured in the respective video frame of the plurality of consecutive video frames. Less than all of the plurality of consecutive video frames of the video feed are selected to be reference video frames, where the selection is based at least in part on the text-based descriptions for the plurality of consecutive video frames. The text-based descriptions for each of the plurality of consecutive video frames are stored to a video surveillance data repository. The reference video frames are stored to the video surveillance data repository, while those video frames of the plurality of consecutive video frames that are not selected as reference video frames are not stored. This can save significant storage space on the video surveillance data repository.

Another example may be found in a system for storing video surveillance data. The illustrative system includes an input for receiving a video feed captured by a video camera, a video surveillance data repository, and a controller that is operatively coupled to the input and the video surveillance data repository. The video feed includes a plurality of consecutive video frames. The controller is configured to receive the video feed captured by the video camera. The controller is configured to generate a text-based description for each of the plurality of consecutive video frames using an Artificial Intelligence (AI) based Video-to-Language Model (VLM), where the text-based description describes one or more of an object, an activity and/or a scene captured in the respective video frame of the plurality of consecutive video frames.
The controller is configured to select less than all of the plurality of consecutive video frames of the video feed to be reference video frames, wherein the selection is based at least in part on the text-based descriptions for the plurality of consecutive video frames. The controller may store the text-based descriptions for each of the plurality of consecutive video frames to the video surveillance data repository. The controller is configured to store the reference video frames to the video surveillance data repository while not storing those video frames of the plurality of consecutive video frames that are not selected as reference video frames.

Another example may be found in a non-transitory computer readable medium storing instructions. When the instructions are executed by one or more processors, the one or more processors are caused to receive a video feed captured by a video camera, the video feed including a plurality of video frames. The one or more processors are caused to generate a text-based description for each of the plurality of video frames using an Artificial Intelligence (AI) based Video-to-Language Model (VLM), where the text-based description describes one or more of an object, an activity and/or a scene captured in the respective video frame of the plurality of video frames. The one or more processors are caused to select less than all of the plurality of video frames of the video feed to be reference video frames, wherein the selection is based at least in part on the text-based descriptions for the plurality of video frames. The one or more process