
CN-122024319-A - Abnormal behavior identification method based on large visual model and cognitive Agent

CN122024319A

Abstract

The invention discloses an abnormal behavior identification method based on a large visual model and a cognitive Agent, relating to the technical field of computer vision. The method comprises: collecting RGB images and infrared thermal imaging (IR) images in a monitored scene and extracting a human body key point sequence; performing temporal differencing on the key point sequence to construct a skeleton velocity vector containing the motion trend; inputting the skeleton velocity vector into a Mamba model, adjusting model parameters with a selective state space mechanism, and performing long-sequence temporal modeling of the action sequence; predicting the velocity change of the next frame with a residual prediction framework and reconstructing the human body pose; and computing an anomaly score from the prediction error. When the score exceeds a preset threshold, the cognitive Agent is triggered and calls the visual large model (VLM) to perform semantic analysis of the current scene environment, generating a natural language warning containing the anomaly type and its environmental cause. The invention addresses the forgetting problem of traditional models on long sequences and the jitter of their predicted actions.
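The temporal-differencing step the abstract describes (key point sequence to skeleton velocity vector) can be sketched as follows; the function name and the flat (x, y, vx, vy) feature layout are illustrative assumptions, not taken from the patent:

```python
# Sketch: build per-frame skeleton motion features by frame-to-frame
# differencing of 2D keypoint coordinates, then concatenate each joint's
# current position with its velocity (displacement per frame).

def skeleton_velocity_features(frames):
    """frames: list of frames, each a list of (x, y) keypoints.
    Returns one feature vector [x, y, vx, vy, ...] per frame from t=1 on."""
    features = []
    for prev, curr in zip(frames, frames[1:]):
        feat = []
        for (x0, y0), (x1, y1) in zip(prev, curr):
            vx, vy = x1 - x0, y1 - y0  # velocity = displacement between frames
            feat.extend([x1, y1, vx, vy])
        features.append(feat)
    return features
```

Concatenating absolute positions with velocities in this way matches the spirit of step S23 in the claims, where position and velocity features are spliced into one motion feature vector.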

Inventors

  • DING YIXUAN
  • WU SONG
  • SUN ZHEYI
  • LI ZIWEN
  • YUE ZHIXU
  • CHEN GEN
  • LIANG RUYU

Assignees

  • Southwest University (西南大学)

Dates

Publication Date
2026-05-12
Application Date
2026-01-29

Claims (8)

  1. An abnormal behavior identification method based on a visual large model and a cognitive Agent, characterized by comprising the following steps: S1, collecting RGB images and infrared thermal imaging (IR) images in a monitored scene, and extracting a human body key point sequence using a YOLOv network; S2, performing temporal differencing on the key point sequence to construct a skeleton velocity vector containing the motion trend; S3, inputting the skeleton velocity vector into a Mamba model, adjusting model parameters with a selective state space mechanism (Selective SSM), and performing long-sequence temporal modeling of the action sequence; wherein step S3 comprises the following sub-steps: S31, constructing a Mamba-based sequence modeling layer containing an input-dependent Selective SSM that dynamically generates state space parameters from the action features at the current moment; S32, mapping the input action features to the time-step parameter, the control matrix, and the output matrix through a linear projection layer; S33, converting the continuous motion dynamics parameters into discrete state parameters with a discretization formula, performing recursive computation over the long sequence with a parallel scan algorithm, and capturing dependencies between different frames; S4, predicting the velocity change of the next frame with a residual prediction framework, reconstructing the human body pose, and achieving motion smoothing through physical consistency constraints; S5, computing an anomaly score from the prediction error, triggering the cognitive Agent when the score exceeds a preset threshold, calling the visual large model (VLM) to perform semantic analysis of the current scene environment, and generating a natural language warning containing the anomaly type and its environmental cause; wherein step S5 comprises the following sub-steps: S51, computing the mean square error between the skeleton sequence predicted by the model and the truly observed sequence, and generating a normalized motion anomaly score; S52, when the motion anomaly score exceeds a preset safety threshold, activating the cognitive Agent module, capturing the RGB key frame at the moment the anomaly occurs, and inputting it into the VLM; S53, constructing a structured prompt instructing the VLM to identify environmental risk elements in the scene, and performing logical reasoning in combination with the motion anomaly score to generate a natural language warning containing the anomaly type and its environmental cause.
  2. The abnormal behavior identification method based on a large visual model and a cognitive Agent according to claim 1, wherein step S1 comprises the following sub-steps: S11, constructing a dual-mode acquisition front end, synchronously acquiring visible-light RGB and infrared thermal IR video streams, and eliminating temporal deviation between the modalities through timestamp alignment to form an aligned dual-mode frame sequence; S12, inputting the aligned dual-mode frame sequence into the YOLOv target detection network, and fusing the texture features of RGB with the thermal radiation features of IR through a cross-modal attention mechanism; and S13, performing human body detection and pose estimation on each frame based on the fusion result, and extracting two-dimensional coordinate data of the human body key points to form a raw skeleton position sequence.
  3. The abnormal behavior identification method based on a large visual model and a cognitive Agent according to claim 1, wherein the step of inputting the aligned dual-mode frame sequence into the YOLOv target detection network and fusing the texture features of RGB with the thermal radiation features of IR through a cross-modal attention mechanism comprises the following sub-steps: S121, constructing the YOLOv backbone network on the CSPDarknet cross-stage partial architecture, the backbone consisting of a plurality of serially connected CSP bottleneck modules and downsampling convolution layers, and concatenating the input dual-mode images along the channel dimension to form a composite input tensor; S122, feeding the composite input tensor into the first convolution module of the backbone, performing initial downsampling and channel expansion, and mapping the image data into an initial feature map, thereby completing the conversion from image space to feature space; S123, forming a CBS module by serially connecting a convolution layer, a BN layer, and a SiLU activation function, using the CBS and CSP bottleneck modules to alternate feature extraction with strided-convolution downsampling of the initial feature map, sequentially outputting a shallow texture feature map F1 at 1/8 of the original image resolution and a middle contour feature map F2 at 1/16 resolution, and finally applying a spatial pyramid pooling (SPPF) module to aggregate the deep features by max pooling and output a deep semantic feature map F3 at 1/32 resolution; S124, taking the feature maps F1, F2, and F3 output by the backbone as inputs, constructing a path aggregation network (PANet) with a bidirectional fusion path as the YOLOv neck network: upsampling the deep semantic feature map F3 by nearest-neighbor interpolation and concatenating it with the middle-layer feature map F2 along the channel dimension, further upsampling the fused feature and concatenating it with the shallow feature map F1 to build a top-down semantic propagation path, then downsampling the fused features by convolution and concatenating them a second time with the deep features to build a bottom-up localization enhancement path, and outputting a fused feature pyramid rich in both semantic and positional information.
  4. The abnormal behavior identification method based on a large visual model and a cognitive Agent according to claim 3, wherein performing human body detection and pose estimation on each frame based on the fusion result and extracting two-dimensional coordinate data of the human body key points to form a raw skeleton position sequence comprises the following steps: S131, inputting the fused feature pyramid into the decoupled detection head of YOLOv, the detection head being designed as a decoupled architecture with three parallel branches for classification, bounding-box regression, and keypoint regression; in the keypoint regression branch, a direct regression strategy based on heat-map features is adopted, and for each detected target anchor box the network uses the grid cell of the current feature map as a local reference frame to directly predict the coordinate offsets of the human skeleton keypoints relative to the geometric center of that grid cell, covering the main joints of the head, trunk, and four limbs; S132, representing the probability that the current predicted target belongs to the human category by the confidence score that the classification branch of the detection head outputs for each predicted box; S133, stacking the skeleton coordinate sets of consecutive frames along the time dimension in timestamp order of the video frames to construct the raw skeleton position sequence.
  5. The abnormal behavior identification method based on a large visual model and a cognitive Agent according to claim 4, wherein step S2 comprises the following sub-steps: S21, performing cross-frame identity association on the detected targets to construct a single continuous skeleton track; S22, addressing the first-frame discontinuity problem in motion prediction by computing the keypoint displacement difference between two consecutive frames as the velocity feature; and S23, concatenating the absolute keypoint position coordinates with the velocity features to construct a high-dimensional motion feature vector.
  6. The abnormal behavior identification method based on a large visual model and a cognitive Agent according to claim 5, wherein the step of performing cross-frame identity association on the detected targets to construct a single continuous skeleton track comprises the following sub-steps: S211, establishing a Kalman filter state vector and an error covariance matrix for each detected target; S212, computing the intersection-over-union between the box position predicted by the Kalman filter and the detection box position output by YOLOv in the current frame, and constructing an association cost matrix; S213, performing globally optimal bipartite matching on the association cost matrix with the Hungarian algorithm, and assigning the current-frame detection boxes to the corresponding historical track sequences based on the optimal matching result; S214, for targets that reappear after occlusion or temporary loss, correcting the accumulated error of the Kalman filter with an observation-centered momentum recovery mechanism; S215, directly assigning the corrected velocity vector to the velocity component of the Kalman filter state vector at the current moment to forcibly correct the state estimate, resetting the error covariance matrix by setting the diagonal elements corresponding to the velocity state variables to a preset value representing high uncertainty and zeroing the velocity-related off-diagonal covariance elements, and finally outputting a temporally continuous skeleton coordinate sequence exclusive to the target.
  7. The abnormal behavior identification method based on a large visual model and a cognitive Agent according to claim 1, wherein step S4 comprises the following sub-steps: S41, predicting the velocity residual at the next moment with a sequence-to-sequence (Seq2Seq) residual prediction head; and S42, reconstructing the pose at the next moment by integration, based on the continuity constraint of physical motion and the velocity residual, thereby eliminating jitter in the predicted action.
  8. A system implementing the abnormal behavior identification method based on a visual large model and a cognitive Agent according to any one of claims 1-7, characterized in that the system comprises: a data acquisition module for collecting RGB images and infrared thermal imaging (IR) images in a monitored scene, extracting a human body key point sequence using a YOLOv network, performing temporal differencing on the key point sequence, and constructing a skeleton velocity vector containing the motion trend; a temporal modeling module for inputting the skeleton velocity vector into a Mamba model, adjusting model parameters with a selective state space mechanism (Selective SSM), and performing long-sequence temporal modeling of the action sequence; and a cognitive analysis module for computing an anomaly score from the prediction error, triggering the cognitive Agent when the score exceeds a preset threshold, calling the visual large model (VLM) to perform semantic analysis of the current scene environment, and generating a natural language warning containing the anomaly type and its environmental cause.
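Step S33 of claim 1 discretizes continuous state-space parameters before the scan. A minimal sketch for a single scalar state channel, using the standard zero-order-hold formulas; the function names, parameter values, and the sequential (rather than parallel) scan are illustrative assumptions:

```python
import math

def discretize_zoh(delta, a, b):
    """Zero-order-hold discretization of one scalar SSM channel:
    a_bar = exp(delta * a), b_bar = (a_bar - 1) / a * b."""
    a_bar = math.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

def ssm_scan(xs, delta, a, b, c):
    """Recurrence h_t = a_bar * h_{t-1} + b_bar * x_t, output y_t = c * h_t.
    The patent specifies a parallel scan; a sequential scan gives the same
    outputs and keeps the sketch short."""
    a_bar, b_bar = discretize_zoh(delta, a, b)
    h, ys = 0.0, []
    for x in xs:
        h = a_bar * h + b_bar * x
        ys.append(c * h)
    return ys
```

In the selective mechanism of S31-S32, `delta`, `b`, and `c` would be produced per time step by the linear projection of the input features rather than fixed as here.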
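The grid-relative keypoint regression of step S131 can be illustrated by the decoding direction: offsets predicted relative to a grid cell's center are mapped back to image coordinates by the cell index and the feature-map stride. The cell-center convention and the function signature are assumptions, not taken from the patent:

```python
def decode_keypoints(grid_cell, stride, offsets):
    """Map per-keypoint offsets, predicted relative to the geometric center
    of a feature-map grid cell, back to image coordinates:
    (cell_index + 0.5 + offset) * stride."""
    gx, gy = grid_cell
    return [((gx + 0.5 + dx) * stride, (gy + 0.5 + dy) * stride)
            for dx, dy in offsets]
```

For a cell at index (2, 3) on a stride-8 feature map, a zero offset decodes to the cell center (20.0, 28.0) in image space.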
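Steps S212-S213 of claim 6 build an association cost matrix from box overlap and then match tracks to detections. A small sketch with intersection-over-union as the similarity measure; greedy matching stands in for the Hungarian algorithm named in the claim, and the threshold value is an assumption:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def associate(tracks, detections, iou_min=0.3):
    """Greedy matching on the (1 - IoU) cost matrix: repeatedly take the
    cheapest unused (track, detection) pair above the overlap floor."""
    pairs = sorted((1.0 - iou(t, d), ti, di)
                   for ti, t in enumerate(tracks)
                   for di, d in enumerate(detections))
    used_t, used_d, matches = set(), set(), []
    for cost, ti, di in pairs:
        if cost <= 1.0 - iou_min and ti not in used_t and di not in used_d:
            matches.append((ti, di))
            used_t.add(ti)
            used_d.add(di)
    return matches
```

Greedy matching is not globally optimal; the Hungarian algorithm of S213 guarantees the minimum-cost assignment over the full cost matrix.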
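Steps S41-S42 reconstruct the next pose by integrating a corrected velocity. A one-dimensional sketch, with a simple per-joint step clamp standing in for the patent's physical consistency constraint, whose exact form is not specified here:

```python
def reconstruct_pose(pose, velocity, residual, max_step=None):
    """Residual integration: v' = v + residual, pose' = pose + v'.
    max_step optionally caps each joint's displacement per frame, a crude
    stand-in for a physical-inertia constraint."""
    out_pose, out_vel = [], []
    for p, v, r in zip(pose, velocity, residual):
        v_new = v + r
        if max_step is not None:
            v_new = max(-max_step, min(max_step, v_new))
        out_vel.append(v_new)
        out_pose.append(p + v_new)
    return out_pose, out_vel
```

Predicting the residual on top of the last observed velocity, rather than regressing absolute coordinates, is what lets the scheme avoid the first-frame jump described in the Background section.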
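Steps S51-S52 turn the prediction error into a normalized motion anomaly score and compare it against a safety threshold. A sketch under the assumption that normalization squashes the MSE into [0, 1); the `scale` constant and threshold are illustrative:

```python
def anomaly_score(pred, obs, scale=1.0):
    """Mean square error between predicted and observed skeleton coordinates,
    mapped into [0, 1) via mse / (mse + scale). The patent's exact
    normalization is not given; this is one plausible choice."""
    mse = sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(pred)
    return mse / (mse + scale)

def maybe_trigger_agent(score, threshold=0.5):
    """S52: activate the cognitive Agent only when the score exceeds
    the preset safety threshold."""
    return score > threshold
```

A perfect prediction yields a score of exactly 0, and the score approaches 1 as the prediction error grows without bound.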
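Step S53 constructs a structured prompt that instructs the VLM to list environmental risk elements and reason jointly with the motion anomaly score. The template below is purely illustrative; the patent does not fix any prompt wording or field names:

```python
def build_vlm_prompt(anomaly_score, risk_elements_hint=None):
    """Assemble a structured VLM prompt combining the numeric motion anomaly
    score with instructions to identify environmental risk elements and
    state the anomaly type plus its environmental cause."""
    lines = [
        "You are a safety analysis agent for a monitored scene.",
        f"Motion anomaly score: {anomaly_score:.2f} (0 = normal, 1 = highly abnormal).",
        "Task 1: list environmental risk elements visible in the attached frame.",
        "Task 2: combining the score with those elements, state the anomaly type",
        "and its likely environmental cause as a short natural-language warning.",
    ]
    if risk_elements_hint:
        lines.append("Focus on: " + ", ".join(risk_elements_hint))
    return "\n".join(lines)
```

The prompt plus the RGB key frame captured in S52 would be sent together to the VLM, whose free-text answer is the natural language warning of S5.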

Description

Abnormal behavior identification method based on large visual model and cognitive Agent

Technical Field

The invention relates to the technical field of computer vision, in particular to an abnormal behavior identification method based on a large vision model and a cognitive Agent.

Background

Abnormal behavior identification is a core task of intelligent security monitoring systems, aiming to detect and warn of sudden events in real time from complex video streams. With the advance of smart city construction, the technology has wide application value in public safety, elderly care, and other fields. Although existing deep-learning-based behavior recognition methods have made some progress, two challenges remain for long-period continuous motion monitoring and semantic understanding in complex environments. The first challenge is the efficiency and memory bottleneck of long-sequence temporal modeling: traditional recurrent neural networks suffer from vanishing gradients and forgetting on long sequences, while Transformer-based models have stronger memory capacity but computational complexity that grows quadratically with sequence length, making inference slow and hard to fit the real-time requirements of edge devices. The second challenge is the lack of environmental semantic understanding and physical consistency constraints: prior methods rely on pure skeleton coordinate regression, so the predicted trajectories often exhibit jitter or first-frame jumps that violate the physical inertia of human motion. Meanwhile, current systems lack 'cognition': they can only output anomaly probability scores and cannot reason logically about the surrounding environment.
In another example, CN117690192A discloses a method and device for abnormal behavior identification by multi-view instance-semantic consensus mining, which extracts features with an instance encoder and a semantic encoder and maximizes the consistency of multi-view data in feature space by combining contrastive learning with a semantic distillation mechanism. However, that method mainly targets the spatial alignment and consensus of static image features across viewpoints and lacks modeling of the long-term temporal evolution of human motion, so it can hardly capture the evolution trend and temporal logic of actions.

Disclosure of Invention

Aiming at the defects of the prior art, the abnormal behavior identification method based on the visual large model and the cognitive Agent provided by the invention solves the two major challenges above. To achieve this aim, the technical scheme adopted by the invention is an abnormal behavior identification method based on a large visual model and a cognitive Agent, comprising the following steps: S1, collecting RGB images and infrared thermal imaging (IR) images in a monitored scene, and extracting a human body key point sequence using a YOLOv network; S2, performing temporal differencing on the key point sequence to construct a skeleton velocity vector containing the motion trend; S3, inputting the skeleton velocity vector into a Mamba model, adjusting model parameters with a selective state space mechanism (Selective SSM), and performing long-sequence temporal modeling of the action sequence; S4, predicting the velocity change of the next frame with a residual prediction framework, reconstructing the human body pose, and achieving motion smoothing through physical consistency constraints; S5, computing an anomaly score from the prediction error, triggering the cognitive Agent when the score exceeds a preset threshold, calling the visual large model (VLM) to perform semantic analysis of the current scene environment, and generating a natural language warning containing the anomaly type and its environmental cause. Further, step S1 comprises the following sub-steps: S11, constructing a dual-mode acquisition front end, synchronously acquiring visible-light RGB and infrared thermal IR video streams, and eliminating temporal deviation between the modalities through timestamp alignment to form an aligned dual-mode frame sequence; S12, inputting the aligned dual-mode frame sequence into the YOLOv target detection network, and fusing the texture features of RGB with the thermal radiation features of IR through a cross-modal attention mechanism; and S13, performing human body detection and pose estimation on each frame based on the fusion result, and extracting two-dimensional coordinate data of the human body key points to form a raw skeleton position sequence. Further,