
CN-122027765-A - Method and system for efficiently storing video data


Abstract

A video feed comprising a plurality of consecutive frames may be captured. An Artificial Intelligence (AI) based video-to-language model (VLM) is used to generate a text-based description for each video frame of a plurality of consecutive video frames. The text-based description describes one or more of objects, activities, and/or scenes captured in respective ones of the plurality of consecutive video frames. Fewer than all of the plurality of consecutive video frames of the video feed are selected as reference video frames, based at least in part on the text-based descriptions. The text-based descriptions are stored to a video surveillance data repository. The reference video frames are stored to the video surveillance data repository, while those of the plurality of consecutive video frames that are not selected as reference video frames are not stored.

Inventors

  • A. Ahad
  • Pramod Krishna S.K

Assignees

  • Honeywell International Inc.

Dates

Publication Date
2026-05-12
Application Date
2025-10-15
Priority Date
2024-11-12

Claims (10)

  1. A method for storing video surveillance data of a video surveillance system in a video surveillance data repository (18), the method comprising: receiving a video feed (14) captured by a camera (16) of the video surveillance system, the video feed comprising a plurality of consecutive video frames; generating a text-based description for each video frame of the plurality of consecutive video frames using an Artificial Intelligence (AI) based video-to-language model (VLM) (22), the text-based description describing one or more of objects, activities, and/or scenes captured in the respective video frame of the plurality of consecutive video frames; selecting less than all of the plurality of consecutive video frames of the video feed as reference video frames, wherein the selecting is based at least in part on the text-based description for the plurality of consecutive video frames; storing the text-based description for each video frame of the plurality of consecutive video frames to the video surveillance data repository; and storing the reference video frames to the video surveillance data repository without storing those of the plurality of consecutive video frames that are not selected as reference video frames.
  2. The method of claim 1, further comprising: for each of the generated text-based descriptions, identifying a corresponding one of the reference video frames, wherein at least two of the text-based descriptions correspond to a common reference video frame; and storing the correspondence between the generated text-based descriptions and the reference video frames.
  3. The method of claim 1, wherein selecting less than all of the plurality of consecutive video frames of the video feed as reference video frames comprises at least one of: determining when the text-based description of the plurality of consecutive video frames indicates at least a threshold change in one or more of an object, activity, and/or scene between two consecutive video frames of the plurality of consecutive video frames, and when the threshold change is indicated, selecting a later video frame of the two consecutive video frames as one of the reference video frames; and/or determining when the text-based description of the plurality of consecutive video frames does not indicate at least a threshold change in one or more of an object, activity, and/or scene between two consecutive video frames of the plurality of consecutive video frames, and when the threshold change is not indicated, not selecting the later video frame of the two consecutive video frames as one of the reference video frames.
  4. The method of any one of claims 1 to 3, further comprising: tokenizing and embedding the text-based description of each video frame of the plurality of consecutive video frames; determining a similarity index between the embedded text-based descriptions of pairs of consecutive video frames of the plurality of consecutive video frames; and at least one of: for each pair of consecutive video frames, selecting a later video frame of the pair of consecutive video frames as a reference video frame when the similarity index between the embedded text-based descriptions of the pair of consecutive video frames falls below a threshold value; and/or determining a rolling average of the similarity index for at least a portion of the plurality of consecutive video frames, and selecting one of the plurality of consecutive video frames as a reference video frame when the rolling average of the similarity index changes beyond a temporal consistency threshold.
  5. The method of any one of claims 1 to 3, further comprising: determining an attention score for each of the text-based descriptions for each of the plurality of consecutive video frames; determining when the attention score changes between consecutive pairs of video frames in the plurality of consecutive video frames by more than an attention change threshold value; and for each pair of consecutive video frames, selecting the later video frame of the pair of consecutive video frames as a reference video frame when the attention score changes between the pair of consecutive video frames beyond the attention change threshold.
  6. The method of any one of claims 1 to 3, further comprising: tokenizing and embedding the text-based description of each video frame of the plurality of consecutive video frames; determining a similarity index between the embedded text-based descriptions of pairs of consecutive video frames of the plurality of consecutive video frames; determining an attention score for each of the text-based descriptions for each of the plurality of consecutive video frames, and determining an attention score change between consecutive pairs of video frames in the plurality of consecutive video frames; determining a rolling average of the similarity index for at least a portion of the plurality of consecutive video frames, and determining a change in the rolling average of the similarity index; and determining a reference frame detection trigger parameter based on a weighted combination of the similarity index, the attention score change, and the change in the rolling average of the similarity index, wherein when the reference frame detection trigger parameter meets a reference frame detection trigger threshold, a corresponding video frame of the plurality of consecutive video frames is selected as one of the reference video frames.
  7. The method of any one of claims 1 to 3, further comprising: receiving an input query from a user; retrieving from the video surveillance data repository a plurality of matched text-based descriptions that match the input query; retrieving one or more reference video frames corresponding to the plurality of matched text-based descriptions from the video surveillance data repository; and generating a reconstructed video feed using an Artificial Intelligence (AI) based text-to-video model (TVM) (24) that uses the one or more reference video frames as reference inputs to generate a plurality of reconstructed video frames that reconstruct the plurality of matched text-based descriptions.
  8. The method of claim 7, wherein the input query comprises one or more of a time-based query and a search-based query.
  9. A system for storing video surveillance data, the system comprising: an input (12) for receiving a video feed (14) captured by a camera (16); a video surveillance data repository (18); and a controller (20) operatively coupled to the input and the video surveillance data repository, the controller configured to: receive the video feed captured by the camera, the video feed comprising a plurality of consecutive video frames; generate a text-based description for each video frame of the plurality of consecutive video frames using an Artificial Intelligence (AI) based video-to-language model (VLM) (22), the text-based description describing one or more of objects, activities, and/or scenes captured in the respective video frame of the plurality of consecutive video frames; select less than all of the plurality of consecutive video frames of the video feed as reference video frames, wherein the selecting is based at least in part on the text-based description for the plurality of consecutive video frames; store the text-based description for each video frame of the plurality of consecutive video frames to the video surveillance data repository; and store the reference video frames to the video surveillance data repository without storing those of the plurality of consecutive video frames that are not selected as reference video frames.
  10. The system of claim 9, wherein the controller is further configured to: for each of the generated text-based descriptions, identify a corresponding one of the reference video frames, wherein at least two of the text-based descriptions correspond to a common reference video frame; and store the correspondence between the generated text-based descriptions and the reference video frames.
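Claims 4 through 6 describe selecting reference frames by embedding each frame's text-based description and comparing consecutive embeddings. A minimal sketch of the similarity-threshold variant is shown below, assuming the descriptions have already been tokenized and embedded into plain numeric vectors; the function names and the default threshold are illustrative, not taken from the patent:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def select_reference_frames(embeddings, threshold=0.9):
    """Select frame indices whose embedded description diverges from the
    previous frame's, i.e. where the similarity index falls below the
    threshold (the later frame of each such pair becomes a reference frame).
    The first frame is always kept as a reference frame."""
    refs = [0]
    for i in range(1, len(embeddings)):
        if cosine_similarity(embeddings[i - 1], embeddings[i]) < threshold:
            refs.append(i)
    return refs
```

The rolling-average and attention-score variants of claims 4 to 6 would layer additional signals on top of this per-pair similarity, combining them into a single weighted trigger parameter as claim 6 describes.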

Description

Method and system for efficiently storing video data

Technical Field

The present disclosure relates generally to video surveillance systems. More particularly, the present disclosure relates to a method and system for efficiently storing video surveillance data in a video surveillance system.

Background

Video surveillance systems generate large amounts of data, requiring large amounts of storage capacity. This can translate into significant infrastructure and maintenance costs. The problem is compounded by the increasing reliance on video surveillance for safety and operational monitoring across various sectors, including retail, transportation, healthcare, and public safety. As the number of cameras and the image resolution increase, so do the associated storage requirements and costs. Furthermore, the retrieval and analysis of specific events within the data can be time consuming and inefficient, creating additional operational challenges and potentially delaying critical responses. What is desired is a method and system for reducing the storage requirements of a video surveillance system while still being able to recall a video surveillance video feed.

Disclosure of Invention

The present disclosure relates generally to video surveillance systems. More particularly, the present disclosure relates to efficiently storing video surveillance data in a video surveillance system. One example may reside in a method that includes receiving a video feed captured by a camera of a video surveillance system. The video feed includes a plurality of consecutive video frames. An Artificial Intelligence (AI) based video-to-language model (VLM) is used to generate a text-based description for each video frame of the plurality of consecutive video frames. The text-based description describes one or more of objects, activities, and/or scenes captured in respective ones of the plurality of consecutive video frames.
Less than all of the plurality of consecutive video frames of the video feed are selected as reference video frames, wherein the selection is based at least in part on the text-based descriptions for the plurality of consecutive video frames. The text-based description for each video frame of the plurality of consecutive video frames is stored to a video surveillance data repository. The reference video frames are stored to the video surveillance data repository without storing those of the plurality of consecutive video frames that are not selected as reference video frames. This may save substantial storage space in the video surveillance data repository. Another example may reside in a system for storing video surveillance data. An exemplary system includes an input for receiving a video feed captured by a camera, a video surveillance data repository, and a controller operatively coupled to the input and the video surveillance data repository. The video feed includes a plurality of consecutive video frames. The controller is configured to receive the video feed captured by the camera. The controller is configured to generate a text-based description for each video frame of the plurality of consecutive video frames using an Artificial Intelligence (AI) based video-to-language model (VLM), wherein the text-based description describes one or more of an object, activity, and/or scene captured in the respective video frame of the plurality of consecutive video frames. The controller is configured to select less than all of the plurality of consecutive video frames of the video feed as reference video frames, wherein the selection is based at least in part on the text-based descriptions for the plurality of consecutive video frames. The controller may store the text-based description for each video frame of the plurality of consecutive video frames to the video surveillance data repository.
The controller is configured to store the reference video frames to the video surveillance data repository without storing those of the plurality of consecutive video frames that are not selected as reference video frames. Another example may reside in a non-transitory computer readable medium storing instructions. The instructions, when executed by one or more processors, cause the one or more processors to receive a video feed captured by a camera, the video feed comprising a plurality of video frames. The instructions cause the one or more processors to generate, for each of the plurality of video frames, a text-based description using an Artificial Intelligence (AI) based video-to-language model (VLM), wherein the text-based description describes one or more of an object, activity, and/or scene captured in the respective video frame of the plurality of video frames. The instructions cause the one or more processors to select less than all of the plurality of video frames of the video feed as reference video frames, wherein the selecting is based at least in part on the text-based description for the plurality of video frames. The instructions cause the one or more processors to store the text-based description for each of the plurality of video frames to the video surveillance data repository, and to store the reference video frames to the video surveillance data repository without storing those of the plurality of video frames that are not selected as reference video frames.
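The storage scheme the disclosure describes — a text description for every frame, but pixel data only for the selected reference frames — can be sketched as a small pipeline. Here `describe` stands in for the AI video-to-language model and `select_refs` for any of the claimed selection strategies; all names and the dict-based repository are illustrative assumptions, not from the patent:

```python
def store_feed(frames, describe, select_refs, repo):
    """Store a text-based description for every frame in the repository,
    but keep pixel data only for the selected reference frames; frames
    not selected as reference frames are discarded."""
    descriptions = [describe(frame) for frame in frames]   # one description per frame (VLM stand-in)
    ref_indices = select_refs(descriptions)                # e.g. similarity-based selection
    repo["descriptions"] = descriptions
    repo["reference_frames"] = {i: frames[i] for i in ref_indices}
    return repo
```

With N frames and only a handful of reference frames selected, the repository holds N small text records but far fewer full-resolution images, which is where the claimed storage savings come from; a text-to-video model can later use the stored descriptions and reference frames to reconstruct the feed on demand.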