US-20260127885-A1 - METHODS AND SYSTEM FOR AUTOMATICALLY IDENTIFYING ANOMALIES IN A VIDEO FEED
Abstract
Anomalies may be detected in a video feed that is captured by a video camera of a video surveillance system. At least part of the video feed may be fed to a Generative Multimodal Model (GMM) along with a prompt that prompts the GMM to look for anomalies occurring in at least part of the video feed. The video feed is processed by the GMM using the prompt to identify one or more anomalies occurring in the at least part of the video feed. The one or more anomalies identified by the GMM in at least part of the video feed are reported.
Inventors
- Deepika Sandeep
- Renil Austin MENDEZ
- Vishwanath Gupta
Assignees
- HONEYWELL INTERNATIONAL INC.
Dates
- Publication Date
- 20260507
- Application Date
- 20251031
- Priority Date
- 20241101
Claims (20)
- 1. A method for identifying anomalies occurring in a video feed that is captured by a video camera of a video surveillance system, the method comprising: receiving the video feed captured by the video camera of the video surveillance system; providing at least part of the video feed to a Generative Multimodal Model (GMM); submitting a prompt to the GMM prompting the GMM to look for anomalies occurring in the at least part of the video feed; processing the video feed by the GMM using the prompt to identify one or more anomalies occurring in the at least part of the video feed; and reporting the one or more anomalies identified by the GMM in the at least part of the video feed.
- 2. The method of claim 1, wherein the GMM identifies one or more anomalies occurring in the at least part of the video feed without requiring anomaly specific training of the GMM for each of the one or more anomalies identified by the GMM.
- 3. The method of claim 1, wherein the prompt is an anomaly generic prompt that prompts the GMM to look for any anomaly determined by the GMM.
- 4. The method of claim 1, wherein the prompt is an anomaly specific prompt that prompts the GMM to look for a specific type of anomaly occurring in the at least part of the video feed.
- 5. The method of claim 1, further comprising: submitting a subsequent prompt to the GMM that is based at least in part on a selected anomaly of the one or more anomalies identified by the GMM, wherein the subsequent prompt is configured to prompt the GMM to look for anomalies occurring in the at least part of the video feed that have a same anomaly type as the selected anomaly.
- 6. The method of claim 1, wherein processing the video feed by the GMM using the prompt to identify one or more anomalies occurring in the at least part of the video feed comprises: generating a text-based summarization of the at least part of the video feed using a generative Vision Language Model (VLM) of the GMM; and processing the text-based summarization using a generative Large Language Model (LLM) of the GMM to identify the one or more anomalies occurring in the at least part of the video feed.
- 7. The method of claim 6, wherein the VLM and LLM are separate models.
- 8. The method of claim 6, wherein the VLM and LLM are an integrated model.
- 9. The method of claim 6, wherein generating the text-based summarization of the at least part of the video feed comprises generating the text-based summarization of a video clip that is extracted from the video feed and encompasses less than all of the video feed.
- 10. The method of claim 6, comprising: generating a text-based summarization for each of a plurality of frames of the at least part of the video feed using the generative Vision Language Model (VLM) of the GMM; and processing the text-based summarization for each of the plurality of frames of the at least part of the video feed using the generative Large Language Model (LLM) to identify the one or more anomalies occurring in the at least part of the video feed.
- 11. The method of claim 1, wherein the video feed includes an audio track and a video track, the method comprising: processing the audio track of at least part of the video feed with a transcript model to generate a text-based transcript of the at least part of the video feed; and processing the text-based transcript of the audio track of the at least part of the video feed and the video track of the at least part of the video feed by the GMM using the prompt to identify one or more anomalies occurring in the at least part of the video feed.
- 12. The method of claim 11, where reporting the one or more anomalies identified by the GMM comprises: reporting a summarization of audio anomalies identified by the GMM in the at least part of the video feed; and reporting a summarization of video anomalies identified by the GMM in the at least part of the video feed.
- 13. The method of claim 1, comprising: generating one or more bounding boxes that each corresponds to one of the one or more anomalies identified by the GMM; and overlaying the one or more bounding boxes on the video feed to visually identify each of the one or more anomalies identified by the GMM in the video feed.
- 14. A video surveillance system comprising: a video camera that generates a video feed; a controller operatively coupled to the video camera, the controller configured to: receive the video feed captured by the video camera; provide at least part of the video feed to a Generative Multimodal Model (GMM); submit a prompt to the GMM prompting the GMM to look for anomalies occurring in the at least part of the video feed; process the video feed with the GMM using the prompt to identify one or more anomalies occurring in the at least part of the video feed; and report the one or more anomalies identified by the GMM in the at least part of the video feed.
- 15. The video surveillance system of claim 14, wherein the GMM is configured to identify one or more anomalies occurring in the at least part of the video feed without requiring anomaly specific training of the GMM for each of the one or more anomalies identified by the GMM.
- 16. The video surveillance system of claim 14, wherein processing the video feed by the GMM using the prompt to identify one or more anomalies occurring in the at least part of the video feed comprises: the controller generating a text-based summarization of the at least part of the video feed using a generative Vision Language Model (VLM) of the GMM; and the controller processing the text-based summarization using a generative Large Language Model (LLM) of the GMM to identify the one or more anomalies occurring in the at least part of the video feed.
- 17. The video surveillance system of claim 16, comprising: the controller generating a text-based summarization for each of a plurality of frames of the at least part of the video feed using the generative Vision Language Model (VLM) of the GMM; and the controller processing the text-based summarization for each of the plurality of frames of the at least part of the video feed using the generative Large Language Model (LLM) to identify the one or more anomalies occurring in the at least part of the video feed.
- 18. A method for identifying anomalies occurring in a video feed that is captured by a video camera of a video surveillance system, the method comprising: receiving the video feed captured by the video camera of the video surveillance system; providing at least part of the video feed to a Vision Language Model (VLM); the VLM generating a text-based summarization of the at least part of the video feed; processing the text-based summarization of the at least part of the video feed via a generative Large Language Model (LLM) to identify one or more anomalies occurring in the at least part of the video feed; and reporting the one or more anomalies identified by the LLM.
- 19. The method of claim 18, wherein: the VLM generating a plurality of text-based summarizations one for each of a plurality of sequential video clips of the at least part of the video feed; receiving a user query; and submitting a prompt to the LLM that is based at least in part on the user query, wherein the LLM processes the plurality of text-based summarizations along with the prompt to identify one or more of the plurality of sequential video clips that match the user query.
- 20. The method of claim 18, comprising processing the text-based summarization of the at least part of the video feed via the generative Large Language Model (LLM) resulting in a prediction of an occurrence of a future event before the future event occurs.
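The three kinds of prompt recited in claims 3 to 5 (anomaly generic, anomaly specific, and a subsequent prompt derived from a selected anomaly) might be assembled along the following lines. This is only an illustrative sketch: the prompt wording, the `Anomaly` record, and the function names `build_prompt` and `build_follow_up_prompt` are assumptions introduced here and do not appear in the disclosure.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical wording for an anomaly generic prompt (claim 3).
GENERIC_PROMPT = (
    "Review this surveillance video segment and describe any anomalies you observe."
)


@dataclass(frozen=True)
class Anomaly:
    """Hypothetical record for one anomaly reported by the model."""
    anomaly_type: str
    description: str


def build_prompt(anomaly_type: Optional[str] = None) -> str:
    """Return the generic prompt, or an anomaly specific prompt (claim 4)
    when an anomaly type is supplied."""
    if anomaly_type is None:
        return GENERIC_PROMPT
    return (
        "Review this surveillance video segment and report only anomalies "
        f"of the following type: {anomaly_type}."
    )


def build_follow_up_prompt(selected: Anomaly) -> str:
    """Return a subsequent prompt (claim 5) that targets anomalies of the
    same type as an anomaly the operator selected."""
    return build_prompt(selected.anomaly_type)
```

Under these assumptions, an operator who selects a reported anomaly of type "loitering" would trigger a second pass over the same video segment with a prompt restricted to that type.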
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority pursuant to 35 U.S.C. 119(a) to India patent application No. 202411083703, filed Nov. 1, 2024, which application is incorporated herein by reference in its entirety.
TECHNICAL FIELD
The present disclosure relates generally to video surveillance systems. More particularly, the present disclosure relates to automatically identifying anomalies in a video feed provided by a video surveillance system.
BACKGROUND
Video surveillance systems can include a substantial number of video cameras, each of the video cameras producing video streams. In systems with hundreds or even thousands of video cameras, monitoring all of these video streams can be a daunting task. Having operators view all of the video streams can be an expensive, time-consuming process. What would be desirable are ways to use artificial intelligence to look for anomalies in the video feeds. What would be desirable are ways to automatically find anomalies and present the anomalies to an operator for confirmation without having to first train an AI model for each type of anomaly.
SUMMARY
The present disclosure relates generally to video surveillance systems. More particularly, the present disclosure relates to automatically identifying anomalies in a video feed provided by a video surveillance system.
An example may be found in a method for identifying anomalies occurring in a video feed that is captured by a video camera of a video surveillance system. The method includes receiving the video feed captured by the video camera of the video surveillance system and providing at least part of the video feed to a Generative Multimodal Model (GMM). A prompt is submitted to the GMM prompting the GMM to look for anomalies occurring in the at least part of the video feed. The video feed is processed by the GMM using the prompt to identify one or more anomalies occurring in the at least part of the video feed. The method includes reporting the one or more anomalies identified by the GMM in the at least part of the video feed.
Another example may be found in a video surveillance system. The video surveillance system includes a video camera that generates a video feed and a controller that is operatively coupled to the video camera. The controller is configured to receive the video feed captured by the video camera and to provide at least part of the video feed to a Generative Multimodal Model (GMM). The controller is configured to submit a prompt to the GMM prompting the GMM to look for anomalies occurring in the at least part of the video feed. The controller is configured to process the video feed with the GMM using the prompt to identify one or more anomalies occurring in the at least part of the video feed. The controller is configured to report the one or more anomalies identified by the GMM in the at least part of the video feed.
Another example may be found in a method for identifying anomalies occurring in a video feed that is captured by a video camera of a video surveillance system. The method includes receiving the video feed captured by the video camera of the video surveillance system and providing at least part of the video feed to a Vision Language Model (VLM). The VLM generates a text-based summarization of the at least part of the video feed. The text-based summarization of the at least part of the video feed is processed via a generative Large Language Model (LLM) to identify one or more anomalies occurring in the at least part of the video feed. The method includes reporting the one or more anomalies identified by the LLM.
The preceding summary is provided to facilitate an understanding of some of the innovative features unique to the present disclosure and is not intended to be a full description. A full appreciation of the disclosure can be gained by taking the entire specification, claims, figures, and abstract as a whole.
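The two-stage example in the summary, in which a VLM produces text-based summarizations and an LLM then looks for anomalies in that text, can be sketched roughly as follows. The sketch assumes the models are supplied as plain callables; the type aliases `Summarizer` and `Analyzer`, the `AnomalyReport` record, and the function `detect_anomalies` are hypothetical names chosen for illustration, and no particular model or vendor API is implied.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical model interfaces; any callables with these shapes will do.
Summarizer = Callable[[bytes], str]     # VLM stand-in: encoded frame or clip -> text summary
Analyzer = Callable[[str], List[str]]   # LLM stand-in: prompt text -> anomaly descriptions


@dataclass
class AnomalyReport:
    """Hypothetical container for what gets reported to the operator."""
    summaries: List[str]
    anomalies: List[str]


def detect_anomalies(
    frames: Sequence[bytes],
    summarize: Summarizer,
    analyze: Analyzer,
    prompt: str,
) -> AnomalyReport:
    """Summarize each frame with the VLM, then ask the LLM to find anomalies."""
    # Stage 1: one text-based summarization per frame (or clip).
    summaries = [summarize(frame) for frame in frames]

    # Stage 2: pass the prompt plus the indexed summaries to the LLM.
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(summaries))
    anomalies = analyze(f"{prompt}\n\n{numbered}")

    return AnomalyReport(summaries=summaries, anomalies=anomalies)
```

Passing the models in as callables keeps the sketch neutral on whether the VLM and LLM are separate models or one integrated model; in the integrated case both callables could simply wrap the same underlying model.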
BRIEF DESCRIPTION OF THE FIGURES
The disclosure may be more completely understood in consideration of the following description of various examples in connection with the accompanying drawings, in which:
FIG. 1 is a schematic block diagram showing an illustrative video surveillance system;
FIGS. 2A and 2B are flow diagrams that together show an illustrative method for identifying anomalies occurring in a video feed;
FIG. 3 is a flow diagram showing an illustrative method for identifying anomalies occurring in a video feed;
FIG. 4 is a schematic drawing showing an illustrative architecture for a frame-by-frame analysis algorithm;
FIG. 5 is a schematic drawing showing an illustrative architecture for video anomaly analysis;
FIG. 6 is a schematic drawing showing an illustrative example of video indexing and video log searching for specific video clips; and
FIG. 7 is a schematic drawing showing an illustrative example of a predictive maintenance use case using the architecture shown in FIG. 4.
While the disclosure is amenable to various modifications and