KR-102959954-B1 - Method and Device for Selective Surveillance Based on Event Risk by Scene Categories

KR 102959954 B1

Abstract

The present invention relates to an intelligent selective monitoring technology that uses artificial intelligence in a video monitoring system to automatically classify captured video according to the situation, dynamically determine the risk level of events, and apply differentiated monitoring priorities. In particular, the present invention relates to an intelligent selective monitoring technology that improves monitoring efficiency by processing the captured video of surveillance cameras with a vision-language model (VLM) to classify the scene categories of the captured video through image-embedding clustering, calculating event risk levels by combining the event types detected in real time from the captured video with the scene categories, and applying monitoring priorities and selective alerts on that basis. According to the present invention, intelligent selective monitoring can be implemented based on the spatial characteristics of scenes, moving beyond existing intelligent video analysis methods that generate events according to uniform criteria.

Inventors

  • 최현배
  • 이성진

Assignees

  • Innodep Co., Ltd. (이노뎁 주식회사)

Dates

Publication Date
2026-05-08
Application Date
2026-01-16

Claims (6)

  1. A selective monitoring method based on event risk according to scene categories, performed by a computing device in a video monitoring system to classify captured video generated by multiple surveillance cameras according to the situation, calculate the risk level of events, and apply differentiated monitoring priorities, the method comprising: (a) a step of selecting a representative scene frame for each surveillance camera, thereby collecting a plurality of representative scene frames from the captured video generated by the multiple surveillance cameras; (b) a step of generating, by a vision-language model (VLM), a scene embedding vector in which the spatial characteristics of the place are semantically reflected from each representative scene frame; (c) a step of generating a plurality of scene clusters by clustering the plurality of representative scene frames according to spatial characteristics based on the similarity of the scene embedding vectors; (d) a step of obtaining spatial-characteristic text for each scene cluster through the vision-language model (VLM) and determining a scene category for each scene cluster based thereon; (e) a step of determining a scene category for each surveillance camera corresponding to the scene cluster to which its representative scene frame belongs; (f) a step of establishing a risk weight table for each surveillance camera corresponding to its scene category, from a pre-stored risk weight table per scene category in which different risk weights are mapped to event types in consideration of the spatial characteristics of each scene category; (g) a step of identifying the detection of a real-time event during real-time monitoring, identifying the surveillance camera (10) in which the real-time event was detected, and calculating the risk level of the real-time event by applying the risk weight table associated with the identified surveillance camera (10); and (h) a step of setting the control priority of the real-time event based on the calculated risk level.
  2. The method of claim 1, wherein the step of generating the scene embedding vector comprises: a step of inputting a text prompt requesting semantic information about the place, together with the representative scene frame, into the vision-language model (VLM); and a step of obtaining the multidimensional feature vector output by the vision-language model (VLM) as the scene embedding vector.
  3. The method of claim 1, wherein the step of determining the scene category comprises: a step of inputting representative scene frames of each scene cluster into the vision-language model to obtain natural-language labels describing the location; and a step of determining the scene category of each scene cluster by comparing the natural-language labels with a preset list of standard categories.
  4. The method of claim 1, wherein the step of calculating the risk level comprises: a step of identifying the detection of a real-time event during real-time monitoring; a step of identifying the surveillance camera (10) in which the real-time event was detected; a step of identifying the risk weight for the type of the identified real-time event from the risk weight table associated with the identified surveillance camera (10); a step of identifying a situational correction coefficient based on additional information at the time the real-time event was detected; and a step of calculating the risk level of the real-time event by combining the per-type risk weight and the situational correction coefficient.
  5. A computer program stored on a computer-readable non-volatile storage medium to execute, on a computer, the selective monitoring method based on event risk according to scene categories of any one of claims 1 to 4.
  6. A selective monitoring device based on event risk according to scene categories, comprising: a memory for storing computer-readable instructions; and a processor for executing said instructions, wherein the processor is configured to perform the selective monitoring method based on event risk according to scene categories of any one of claims 1 to 4.
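Steps (a) through (e) of claim 1 — embedding representative frames per camera and clustering them by spatial similarity — can be sketched as follows. The embedding function below is a toy stand-in for a real vision-language model encoder (which would return high-dimensional semantic vectors from actual frames), and the greedy cosine-similarity clustering, frame names, and threshold are all illustrative assumptions, not details given in the patent:

```python
import math

# Hypothetical stand-in for a VLM image encoder; a real system would feed
# the representative scene frame (and optionally a text prompt, per claim 2)
# into the model and receive a high-dimensional semantic embedding.
def embed_scene_frame(frame_label: str) -> list[float]:
    # Toy embeddings: frames of similar places get similar vectors.
    toy = {
        "bank_lobby_cam1":  [0.9, 0.1, 0.0],
        "bank_lobby_cam2":  [0.8, 0.2, 0.0],
        "school_gate_cam3": [0.1, 0.9, 0.1],
        "bus_stop_cam4":    [0.0, 0.2, 0.9],
    }
    return toy[frame_label]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster_frames(frames: list[str], threshold: float = 0.8) -> list[dict]:
    """Greedy similarity clustering of scene embedding vectors (step (c))."""
    clusters = []  # each cluster: {"centroid": vector, "members": [frame, ...]}
    for frame in frames:
        vec = embed_scene_frame(frame)
        for cluster in clusters:
            if cosine(vec, cluster["centroid"]) >= threshold:
                cluster["members"].append(frame)
                break
        else:  # no sufficiently similar cluster found: start a new one
            clusters.append({"centroid": vec, "members": [frame]})
    return clusters

frames = ["bank_lobby_cam1", "bank_lobby_cam2",
          "school_gate_cam3", "bus_stop_cam4"]
clusters = cluster_frames(frames)
print(len(clusters))           # → 3 scene clusters
print(clusters[0]["members"])  # → ['bank_lobby_cam1', 'bank_lobby_cam2']
```

Each resulting cluster would then be labeled with a scene category by prompting the VLM for a natural-language description of the place and matching it against a preset standard-category list (claims (d) and claim 3).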

Description

Method and Device for Selective Surveillance Based on Event Risk by Scene Categories

The present invention relates to an intelligent selective monitoring technology that uses artificial intelligence in a video monitoring system to automatically classify captured video according to the situation, dynamically determine the risk level of events, and apply differentiated monitoring priorities. In particular, the present invention relates to an intelligent selective monitoring technology that improves monitoring efficiency by processing the captured video of surveillance cameras with a vision-language model (VLM) to classify the scene categories of the captured video through image-embedding clustering, calculating event risk levels by combining the event types detected in real time from the captured video with the scene categories, and applying monitoring priorities and selective alerts on that basis.

Recently, artificial intelligence technology has been applied in various image-processing fields, including computer vision. [Fig. 1] is a general configuration diagram of an intelligent video surveillance system, and [Fig. 2] is a general conceptual diagram of an intelligent selective monitoring process.

Referring to [Fig. 1], the intelligent video monitoring system comprises surveillance cameras (10), a client device (50), a video control device (100), a storage device (200), a video analysis device (300), and a selective monitoring device (400). The surveillance cameras (10) are installed at multiple locations and provide captured video of their respective locations to the video control device (100) in real time. The video control device (100) provides the captured video to the controller (the human monitoring operator) for viewing, and the storage device (200) stores the captured video temporarily or long-term for later verification.
The video control device (100) also forwards the captured video to the video analysis device (300) and the selective monitoring device (400) and instructs them to analyze it. The video analysis device (300) analyzes the captured video in real time or retrospectively, and the selective monitoring device (400) analyzes the captured video based on neural network models.

Referring to [Fig. 2], the selective monitoring device (400) inputs the captured video into an object detection model to extract objects of interest. The object detection model outputs bounding boxes and class information (e.g., Human) for the detected objects. The objects of interest are then tracked across the captured video to form groups of objects of interest. Next, precision analysis models perform object attribute recognition, attribute-based search, object behavior recognition, and re-identification of identical objects. Attribute recognition, behavior recognition, and re-identification are each performed by a neural network model suited to its purpose (an attribute recognition model, pose estimation model, behavior recognition model, Re-ID model, etc.).

It is true that the application of intelligent video analysis has made selective monitoring more efficient than in the past. In actual monitoring situations, however, controller fatigue remains high because of false positives and the sheer number of events generated. This is because the selective monitoring device (400) generates events according to standardized criteria: the controller is notified whenever a specific pre-set behavior is detected in the captured video, and therefore faces a heavy burden. Accordingly, the present invention aims to reduce false positives and the number of event occurrences by reflecting, in the analysis of captured video, the context of the situation or scene in which the behavior takes place.
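The uniform-criteria alerting criticized above can be illustrated with a minimal sketch: every detection of a preset behavior raises an identical alert, regardless of where the camera is installed. The event names, camera names, and fixed priority are hypothetical illustrations, not values from the patent:

```python
# Conventional approach: any detection of a preset behavior raises the
# same-level alert, with no regard for the scene the camera observes.
PRESET_EVENTS = {"loitering", "intrusion", "stopping"}

def uniform_alerts(detections: list[tuple[str, str]]) -> list[dict]:
    """Raise one identical-priority alert per preset event (no scene context)."""
    return [
        {"camera": cam, "event": event, "priority": "HIGH"}
        for cam, event in detections
        if event in PRESET_EVENTS
    ]

detections = [
    ("cam_bank", "loitering"),
    ("cam_bus_stop", "loitering"),  # normal behavior, but alerted anyway
    ("cam_park", "intrusion"),      # low-risk area, yet same alert level
]
print(uniform_alerts(detections))   # three HIGH alerts, operator burden grows
```

Because all three detections produce equal-priority alerts, the operator must triage them manually; the invention instead differentiates them by scene category.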
Take the 'loitering' event as an example: loitering in front of a bank or ATM can be considered a potential risk, loitering in front of a school main gate can be treated as cautionary behavior, and loitering at a bus stop can be interpreted as perfectly normal behavior for someone using public transportation. For the 'intrusion' event, intrusion into a controlled area such as a factory entrance can be considered a potential risk, intrusion around the perimeter of a construction site can be treated as cautionary behavior, and intrusion into the area around a park can be evaluated as no particular problem. Similarly, for the 'stopping' event, stopping in the middle of a crosswalk can be considered a potential risk, whereas stopping in front of a crosswalk can be interpreted as normal behavior. Even though the same behavior carries completely different levels of risk depending on the context of the situation or scene, conventional intelligent video surveillance systems fail to distinguish between them and generate event alerts of the same level, which increases operator fatigue and reduces video surveillance efficiency.
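The loitering example maps naturally onto the per-scene-category risk weight table and situational correction coefficient of claim 4. The weights, categories, camera mapping, and priority thresholds below are illustrative assumptions chosen to reproduce the three-way distinction described above, not values disclosed in the patent:

```python
# Risk weights per scene category and event type (illustrative values only).
RISK_WEIGHTS = {
    "bank":     {"loitering": 0.9, "intrusion": 0.8, "stopping": 0.3},
    "school":   {"loitering": 0.5, "intrusion": 0.6, "stopping": 0.2},
    "bus_stop": {"loitering": 0.1, "intrusion": 0.4, "stopping": 0.1},
}

# Camera-to-scene-category mapping, as established by scene clustering (step (e)).
CAMERA_CATEGORY = {"cam1": "bank", "cam3": "school", "cam4": "bus_stop"}

def risk_level(camera: str, event: str, correction: float = 1.0) -> float:
    """Combine the per-type risk weight with a situational correction
    coefficient (e.g., for time of day), per steps (f)-(g) of claim 1."""
    category = CAMERA_CATEGORY[camera]
    return RISK_WEIGHTS[category][event] * correction

def control_priority(risk: float) -> str:
    """Map the calculated risk level to a control priority (step (h));
    thresholds are illustrative."""
    if risk >= 0.7:
        return "potential risk"
    if risk >= 0.4:
        return "cautionary"
    return "normal"

# The same 'loitering' event yields three different priorities by scene:
print(control_priority(risk_level("cam1", "loitering")))  # → potential risk
print(control_priority(risk_level("cam3", "loitering")))  # → cautionary
print(control_priority(risk_level("cam4", "loitering")))  # → normal
```

A nighttime detection might apply a correction coefficient above 1.0, raising an otherwise cautionary event to a potential risk, which is the role the situational correction coefficient plays in claim 4.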