CN-121982607-A - Intelligent monitoring and early warning system and method for scene understanding enhancement


Abstract

The invention discloses a scene-understanding-enhanced intelligent monitoring and early warning system and method. The system comprises a scene semantic understanding module, a scene graph construction and relationship reasoning module, a scene common-sense knowledge base enhancement module, and a context-aware anomaly detection module. The scene semantic understanding module extracts the scene context, target features and behavior features; the scene graph construction and relationship reasoning module organizes these discrete features into a structured scene graph and performs relationship reasoning; the scene common-sense knowledge base enhancement module injects scene common-sense knowledge; and the context-aware anomaly detection module combines the multidimensional information to assess anomalies and issue early warnings. By constructing a scene-behavior-object ternary relation model and fusing a scene common-sense knowledge base, the invention achieves deep semantic understanding and context-aware anomaly detection for surveillance video, significantly reduces the false alarm rate, and improves early warning capability in complex scenes.

Inventors

LI YIPENG
SHAO XINQING
WU HAO
ZHANG PING
ZHOU HONGWEI

Assignees

Jiangsu Hoperun Software Co., Ltd. (江苏润和软件股份有限公司)

Dates

Publication Date: 2026-05-05
Application Date: 2026-01-22

Claims (10)

1. A scene-understanding-enhanced intelligent monitoring and early warning system, characterized by comprising a scene semantic understanding module, a scene graph construction and relationship reasoning module, a scene common-sense knowledge base enhancement module, and a context-aware anomaly detection module, wherein the scene semantic understanding module is used for extracting scene context, target features and behavior features; the scene graph construction and relationship reasoning module is used for organizing the discrete features into a structured scene graph and performing relationship reasoning; the scene common-sense knowledge base enhancement module is used for injecting scene common-sense knowledge; the context-aware anomaly detection module is used for combining multidimensional information to perform anomaly assessment and early warning; and the modules pass information to one another through the scene context vector and progressively enhanced node features, forming an end-to-end intelligent early warning flow.
2. The scene-understanding-enhanced intelligent monitoring and early warning system according to claim 1, wherein the scene semantic understanding module is responsible for extracting multi-level semantic information from the raw surveillance video to provide a structured scene representation for subsequent modules; the scene semantic understanding module comprises three closely related sub-modules for scene classification, object detection and behavior recognition, whose outputs together form the multi-level semantic representation of the scene: the scene context vector provides the global context, the target features describe the static attributes of each target, and the behavior features describe the behavior of each target; the three types of features are fused into a unified node representation in the scene graph construction and relationship reasoning module, realizing the conversion from discrete targets to a structured scene graph.
3. The scene-understanding-enhanced intelligent monitoring and early warning system according to claim 2, wherein the scene graph construction and relationship reasoning module receives the scene context vector, the target feature set and the behavior feature set output by the scene semantic understanding module, organizes this discrete semantic information into a structured scene graph representation, and performs multi-layer relationship reasoning through a graph neural network to obtain the final node features, thereby achieving a deep understanding of the interaction relationships among targets in the scene and of the overall scene semantics.
4. The scene-understanding-enhanced intelligent monitoring and early warning system according to claim 3, wherein the scene common-sense knowledge base enhancement module receives the node features output by the scene graph construction and relationship reasoning module together with the scene context vector, injects normal behavior patterns, abnormal behavior definitions and object interaction rules for different scenes through a pre-constructed scene knowledge graph, and outputs knowledge-enhanced node features for use by the context-aware anomaly detection module.
5. The scene-understanding-enhanced intelligent monitoring and early warning system according to claim 4, wherein the context-aware anomaly detection module receives the knowledge-enhanced node features, the scene context vector and the scene graph structure, combines scene semantics, target relationships and common-sense knowledge to evaluate the degree of anomaly of each target and of the whole scene, and outputs the risk level and early warning information.
6. A scene-understanding-enhanced intelligent monitoring and early warning method based on the system according to any one of claims 1 to 5, characterized by comprising the following steps: S1, scene semantic understanding; S2, scene graph construction and relationship reasoning; S3, scene common-sense knowledge base enhancement; S4, context-aware anomaly detection.
7. The scene-understanding-enhanced intelligent monitoring and early warning method according to claim 6, wherein step S1 specifically comprises:
S11, scene classification and context encoding: the surveillance video is first classified by scene to identify the scene type of the currently monitored area; a pre-trained Vision Transformer (ViT) model extracts the global features of the video frame, and a scene classifier outputs the scene category probability distribution; the scene context vector $c$ is calculated as:
$$c = \mathrm{softmax}\big(W_s \cdot \mathrm{ViT}(I) + b_s\big)$$
where $I$ represents the input video frame; $\mathrm{ViT}(\cdot)$ represents the Vision Transformer feature extraction function; $W_s \in \mathbb{R}^{C \times d_v}$ and $b_s \in \mathbb{R}^{C}$ are respectively the weight matrix and bias vector of the scene classifier; $d_v$ is the ViT output feature dimension; $c \in \mathbb{R}^{C}$ is the scene context vector; $C$ is the total number of scene categories; the scene context vector $c$ not only identifies the scene type but also serves as the global context information for all subsequent modules;
S12, multi-target detection and attribute extraction: an improved YOLO or Faster R-CNN model detects person, vehicle and object targets and extracts the position, category and appearance features of each target; for the $i$-th detected target, $i = 1, \dots, N$, where $N$ is the total number of detected targets, its feature $f_i$ is formed by concatenating the position feature $p_i$, the category feature $e_i$ and the appearance feature $a_i$:
$$f_i = [\,p_i \,\|\, e_i \,\|\, a_i\,]$$
where $p_i = (x_i, y_i, w_i, h_i) \in \mathbb{R}^{4}$ represents the bounding-box center coordinates $(x_i, y_i)$ and size $(w_i, h_i)$ of the target; $e_i \in \mathbb{R}^{d_e}$ is the category embedding feature, obtained by mapping the category ID to a dense vector through an embedding layer; $d_e$ is the category embedding dimension; $a_i \in \mathbb{R}^{d_a}$ is the deep appearance feature extracted from the target region; $d_a$ is the appearance feature dimension; $\|$ represents vector concatenation, so the overall dimension of the target feature is $4 + d_e + d_a$;
S13, spatio-temporal behavior feature extraction: to capture the dynamic behavior of the targets, a 3D convolutional neural network or Video Transformer extracts the spatio-temporal features of consecutive video frames; for the video clip within the time window $[t-T+1,\, t]$, the spatio-temporal behavior feature $m_i$ of the $i$-th target is calculated as:
$$m_i = \mathrm{Conv3D}\big(X_{t-T+1:t},\, p_i\big)$$
where $X_{t-T+1:t}$ represents the sequence of $T$ consecutive video frames within the time window; $\mathrm{Conv3D}(\cdot)$ represents the 3D convolution feature extraction function, which extracts spatio-temporal features around the target position $p_i$; $m_i \in \mathbb{R}^{d_m}$ is the behavior feature vector; $d_m$ is the behavior feature dimension.
8. The scene-understanding-enhanced intelligent monitoring and early warning method according to claim 7, wherein step S2 specifically comprises:
S21, scene graph construction: the scene graph $G = (V, E)$ consists of a node set $V$ and an edge set $E$, where nodes represent entities in the scene and edges represent relationships between entities; for the $N$ detected targets, the target feature $f_i$, the behavior feature $m_i$ and the scene context vector $c$ from the scene semantic understanding module are first fused to generate the initial node features; the initial feature of the $i$-th node is calculated as:
$$z_i^{(0)} = W_f\,[\,f_i \,\|\, m_i \,\|\, c\,] + b_f$$
where $W_f$ and $b_f$ are the weight matrix and bias vector for feature fusion; $d_z$ is the node feature dimension, so that $z_i^{(0)} \in \mathbb{R}^{d_z}$; $\|$ represents vector concatenation; the scene context vector $c$ is fused into every node feature so that each node representation contains scene context information; the superscript $(0)$ indicates that this is the initial node feature before graph neural network reasoning; edges are constructed based on the spatial distance and semantic association between targets; for nodes $i$ and $j$, the edge weight $\omega_{ij}$ is calculated as:
$$\omega_{ij} = \mathrm{sigmoid}\big(q^{\top}[\,z_i^{(0)} \,\|\, z_j^{(0)} \,\|\, r_{ij}\,]\big)$$
where $r_{ij} \in \mathbb{R}^{d_r}$ is the relative position encoding between nodes $i$ and $j$; $d_r$ is the relative position encoding dimension; $q$ is the parameter vector for edge weight calculation; the Sigmoid activation function normalizes the edge weights to the interval $(0, 1)$; when $\omega_{ij}$ exceeds a preset threshold $\tau$, an edge is created between nodes $i$ and $j$, and the edge $(i, j)$ is added to the edge set $E$;
S22, graph attention network reasoning: after the scene graph is constructed, a graph attention network performs multi-layer relationship reasoning, updating the node features to incorporate information from neighbor nodes; the graph attention network has $L$ layers in total, and the node feature update formula of the $l$-th layer is:
$$z_i^{(l)} = \sigma\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l)}\, W^{(l)} z_j^{(l-1)}\Big)$$
where $\mathcal{N}(i)$ represents the set of neighbor nodes of node $i$; $W^{(l)}$ is the weight matrix of the $l$-th layer; $\sigma$ is an activation function; $\alpha_{ij}^{(l)}$ is the attention coefficient of node $i$ toward node $j$, which is computed by an attention mechanism and satisfies the normalization condition $\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l)} = 1$, ensuring that the contribution weights of all neighbor nodes sum to 1; after $L$ layers of graph attention network reasoning, the final node features $\{z_i^{(L)}\}_{i=1}^{N}$ are obtained; each node feature fuses the information of its neighbor nodes, captures the interaction relationships and overall semantics among the targets in the scene, and is subsequently further enriched with scene common-sense knowledge.
9. The scene-understanding-enhanced intelligent monitoring and early warning method according to claim 8, wherein step S3 specifically comprises:
S31, scene knowledge graph construction: the scene knowledge graph $\mathcal{K} = (\mathcal{E}, \mathcal{R}, \mathcal{T})$ consists of an entity set $\mathcal{E}$, a relation set $\mathcal{R}$ and a triple set $\mathcal{T}$; the entities include scene types, behavior types, object types and risk levels, and the relations include scene-normal behavior, scene-abnormal behavior and behavior-risk level;
S32, knowledge embedding and fusion: a knowledge graph embedding method maps the entities and relations in the knowledge graph to a low-dimensional vector space; for a triple $(h, r, t) \in \mathcal{T}$, the embedding vectors satisfy the translation relation:
$$\mathbf{h} + \mathbf{r} \approx \mathbf{t}$$
where $\mathbf{h}, \mathbf{r}, \mathbf{t} \in \mathbb{R}^{d_k}$ are the embedding vectors of the head entity, the relation and the tail entity, respectively; $d_k$ is the knowledge embedding dimension; $\approx$ represents approximate equality in the vector space; during anomaly detection, the relevant common-sense rules are retrieved from the knowledge graph according to the current scene category and the detected behavior type; specifically, for the $i$-th target, the attribute of the behavior under the current scene is queried in the knowledge graph according to the behavior category identified from the behavior features of the target, yielding the corresponding knowledge embedding vector $k_i$, which is then mapped to the node feature space through a linear mapping $W_k$ to obtain $\tilde{k}_i = W_k k_i$; the knowledge embedding vector is fused with the scene graph node feature to enhance the semantic representation of the node:
$$\tilde{z}_i = z_i^{(L)} + \lambda\, \tilde{k}_i$$
where $z_i^{(L)}$ is the node feature output by the graph attention network; $\tilde{k}_i$ is the mapped knowledge embedding vector; $\lambda$ is the knowledge fusion weight coefficient, which controls the injection strength of the knowledge information; $\tilde{z}_i$ is the knowledge-enhanced node feature.
10. The scene-understanding-enhanced intelligent monitoring and early warning method according to claim 8, wherein step S4 specifically comprises:
S41, anomaly score calculation: for the $i$-th target, anomaly indicators in three dimensions are extracted from its knowledge-enhanced node feature $\tilde{z}_i$; the composite anomaly score $s_i$ is obtained by weighted fusion of the visual anomaly degree $s_i^{\mathrm{vis}}$, the behavioral anomaly degree $s_i^{\mathrm{beh}}$ and the knowledge anomaly degree $s_i^{\mathrm{kg}}$:
$$s_i = \beta_1\, s_i^{\mathrm{vis}} + \beta_2\, s_i^{\mathrm{beh}} + \beta_3\, s_i^{\mathrm{kg}}$$
where $s_i$ is the composite anomaly score; $\beta_1, \beta_2, \beta_3$ are weight coefficients satisfying $\beta_1 + \beta_2 + \beta_3 = 1$, used to balance the importance of the three dimensions; the visual anomaly degree $s_i^{\mathrm{vis}}$ is calculated from the degree of match between the appearance features of the target and the scene context, and measures whether the appearance attributes of the target fit the current scene; the behavioral anomaly degree $s_i^{\mathrm{beh}}$ is calculated from the deviation of the behavior features from normal behavior patterns, obtained by comparison with a pre-trained distribution of normal behavior; the knowledge anomaly degree $s_i^{\mathrm{kg}}$ is calculated from the definitions of abnormal behavior in the knowledge graph, and directly reflects the knowledge base's judgment of whether the behavior is abnormal in the current scene; the overall scene anomaly degree $S$ is obtained by aggregating the anomaly degrees of all targets and the anomaly degree of the relationships between targets:
$$S = \max_{i} s_i + \mu \sum_{(i,j) \in E} \omega_{ij}\, \lvert s_i - s_j \rvert$$
where $\mu$ is the weight coefficient of the relation anomaly degree; the first term $\max_i s_i$ captures the most anomalous target in the scene; the second term captures cases in which the anomaly degree differs greatly between targets; $\omega_{ij}$ is the edge weight, used to weight the anomaly difference between different target pairs;
S42, risk grading and early warning: according to the overall scene anomaly score $S$, the system classifies risk into three levels, low, medium and high:
Low risk: $S < \theta_1$, normal monitoring is carried out and no early warning is needed;
Medium risk: $\theta_1 \le S < \theta_2$, a prompt warning is issued and the event is recorded;
High risk: $S \ge \theta_2$, an alarm is triggered immediately and security personnel are notified;
where $\theta_1$ and $\theta_2$ are the risk thresholds, with $\theta_1 < \theta_2$.
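
The scoring and grading described in steps S41 and S42 of claim 10 can be illustrated with a minimal Python sketch. It assumes that the composite score is a weighted sum of the three anomaly degrees and that the scene score is the maximum target score plus an edge-weighted sum of absolute score differences, which is one reading of the description above. The names `AnomalyDegrees`, `composite_score`, `scene_score` and `risk_level`, as well as the default weights and thresholds, are hypothetical and not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class AnomalyDegrees:
    """Per-target anomaly degrees, each assumed to lie in [0, 1]."""
    visual: float
    behavior: float
    knowledge: float


def composite_score(d, weights=(0.3, 0.3, 0.4)):
    """S41: weighted fusion of the three anomaly degrees (weights sum to 1)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    w_vis, w_beh, w_kg = weights
    return w_vis * d.visual + w_beh * d.behavior + w_kg * d.knowledge


def scene_score(scores, edge_weights, mu=0.5):
    """S41: most anomalous target plus edge-weighted score differences.

    edge_weights maps a node pair (i, j) to its scene-graph edge weight.
    """
    if not scores:
        return 0.0
    relation = sum(w * abs(scores[i] - scores[j])
                   for (i, j), w in edge_weights.items())
    return max(scores) + mu * relation


def risk_level(scene, theta1=0.4, theta2=0.7):
    """S42: three-level grading with thresholds theta1 < theta2 (assumed values)."""
    if scene < theta1:
        return "low"
    if scene < theta2:
        return "medium"
    return "high"


# Hypothetical scene: target 0 looks normal, target 1 looks anomalous.
targets = [AnomalyDegrees(0.1, 0.2, 0.0), AnomalyDegrees(0.8, 0.9, 1.0)]
scores = [composite_score(t) for t in targets]
level = risk_level(scene_score(scores, {(0, 1): 0.5}))
print(scores, level)
```

In this example the second target alone pushes the scene score above the upper threshold, so the scene is graded as high risk; the relation term only adds to that score.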

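For step S22 of claim 8, a single graph attention update can be sketched in plain Python as follows. The claim only states that the attention coefficients are computed by an attention mechanism and sum to 1 over the neighbors; the scoring function used here (a LeakyReLU over a learned vector applied to the two transformed node features, as in the standard graph attention network), the ReLU activation, and the added self-loop are assumptions, and all function and variable names are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def gat_layer(z, neighbors, W, a, slope=0.2):
    """One attention update: z_i' = ReLU(sum_j alpha_ij * W z_j).

    z: list of node feature vectors; neighbors: dict node -> list of nodes;
    W: layer weight matrix; a: attention vector of length 2 * d_out.
    """
    h = [matvec(W, zi) for zi in z]          # transformed node features
    d = len(h[0])
    out, alphas = [], []
    for i, hi in enumerate(h):
        # Assumption: every node also attends to itself, so the set is never empty.
        nbrs = sorted(set(neighbors.get(i, [])) | {i})
        scores = []
        for j in nbrs:
            s = dot(a[:d], hi) + dot(a[d:], h[j])
            scores.append(s if s > 0 else slope * s)   # LeakyReLU
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]     # stable softmax
        total = sum(exps)
        alpha = {j: e / total for j, e in zip(nbrs, exps)}
        agg = [sum(alpha[j] * h[j][k] for j in nbrs) for k in range(d)]
        out.append([max(0.0, v) for v in agg])         # ReLU activation
        alphas.append(alpha)
    return out, alphas


# Hypothetical three-node scene graph with edges 0-1 and 1-2.
z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.1, 0.2, 0.3, 0.4]
out, alphas = gat_layer(z, neighbors, W, a)
print(out)
```

Stacking several such layers, each with its own weight matrix and attention vector, would correspond to the multi-layer reasoning described in S22.
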
Description

Intelligent monitoring and early warning system and method for scene understanding enhancement

Technical Field

The invention relates to the field of computer vision and artificial intelligence, and in particular to a scene-understanding-enhanced intelligent monitoring and early warning system and method. By fusing video streams, scene semantics, spatio-temporal relationships and a common-sense knowledge base, the invention achieves deep understanding of, and intelligent early warning for, monitored scenes, and is suitable for application scenarios such as public safety, industrial safety and smart communities.

Background

With the spread of video surveillance technology, surveillance cameras have been widely deployed in public transportation, industrial parks, commercial premises, communities and other settings. Traditional surveillance systems rely mainly on human operators or on rule-based anomaly detection algorithms, and have the following technical shortcomings:

Insufficient semantic understanding: existing surveillance systems can only recognize object categories (such as "person" or "vehicle") and simple behaviors (such as "running" or "falling"), and cannot understand the scene context in which a behavior occurs. For example, a person running is normal on a playground, but in a bank lobby it may indicate a robbery or an emergency; traditional systems cannot distinguish this semantic difference.

High false alarm rate: rule-based anomaly detection relies on manually set thresholds and rules, and adapts poorly to complex and changing real-world scenes. For example, defining "fast movement" as abnormal causes many instances of normal running to be falsely reported, while relaxing the rule may miss genuine abnormal events.

Lack of common-sense reasoning: traditional systems cannot use scene common sense to reason. For example, on a subway platform, a person standing beyond the yellow line for a long time may be at risk of jumping onto the tracks; in a parking lot, a person loitering beside several vehicles may be a precursor to theft. Such judgments require a comprehensive analysis that combines scene knowledge with behavior patterns.

Weak spatio-temporal correlation analysis: existing systems mostly analyze single frames or short video clips and lack the ability to model long-term trajectories, multi-target interactions and the continuity of a person's behavior across cameras, which makes it difficult to identify complex abnormal behaviors such as tailing, gathering and coordinated activity.

Low query and retrieval efficiency: traditional surveillance systems rely mainly on timestamps and camera numbers for video retrieval and cannot support natural-language semantic queries such as "find video of people wearing red clothes who entered B yesterday afternoon", so post-event investigation is inefficient.

Therefore, there is an urgent need for an intelligent monitoring and early warning system that can deeply understand scene semantics, reason with common-sense knowledge, and support spatio-temporal correlation analysis and natural-language queries, so as to raise the intelligence level and practical value of surveillance systems.

Disclosure of Invention

To solve the above problems, the invention provides a scene-understanding-enhanced intelligent monitoring and early warning system and method which, by constructing a scene-behavior-object ternary relation model and fusing a scene common-sense knowledge base, achieve deep semantic understanding and context-aware anomaly detection for surveillance video, significantly reducing the false alarm rate and improving early warning capability in complex scenes.

The specific scheme is as follows:

The scene-understanding-enhanced intelligent monitoring and early warning system comprises a scene semantic understanding module, a scene graph construction and relationship reasoning module, a scene common-sense knowledge base enhancement module, and a context-aware anomaly detection module, wherein the scene semantic understanding module is used for extracting scene context, target features and behavior features; the scene graph construction and relationship reasoning module is used for organizing the discrete features into a structured scene graph and performing relationship reasoning; the scene common-sense knowledge base enhancement module is used for injecting scene common-sense knowledge; the context-aware anomaly detection module is used for combining multidimensional information to perform anomaly assessment and early warning; and the modules pass information to one another through the scene context vector and progressively enhanced node features to form an end-to-end intelligent early