CN-122024131-A - Unsupervised video anomaly detection method, unsupervised video anomaly detection system, computer device and storage medium


Abstract

The invention discloses an unsupervised video anomaly detection method, system, computer device and storage medium. The method comprises: feature extraction and fusion, in which human pose detection and tracking are performed on a video sequence to obtain human pose features, video scene features are acquired, and the human pose features are fused with the scene features to generate fused features; difference information learning, in which the fused features are input to a difference information learning module based on a reversible density transformation flow, probability density modeling is performed, and an action anomaly score is output; consistency information learning, in which the number of frames in which pedestrians are visible is counted for each scene and a scene safety prior value is calculated; and heterogeneous information fusion and anomaly detection, in which the action anomaly score and the scene safety prior value are fused by weighting to generate a corrected anomaly score, and the minimum of the anomaly scores of all targets in the current frame is taken as the final anomaly determination result for that frame. The method is thereby constrained by both action semantics and scene risk, and achieves reliable anomaly determination.

Inventors

WANG XIAO
Yin Awei
LIU WEI
WANG WEI
LI WEIGANG
XU XIN

Assignees

Wuhan University of Science and Technology (武汉科技大学)

Dates

Publication Date: 20260512
Application Date: 20260120

Claims (10)

1. An unsupervised video anomaly detection method, characterized by comprising the following steps: feature extraction and fusion, in which human pose detection and tracking are performed on a video sequence to obtain continuous skeleton sequence features, the continuous skeleton sequence features representing human pose features, video scene features are acquired at the same time, and the human pose features are fused with the scene features to generate fused features; difference information learning, in which the generated fused features are input to a difference information learning module based on a reversible density transformation flow, probability density modeling is performed, and an action anomaly score is output; consistency information learning, in which, based on the scene features, the number of frames in which pedestrians are visible is counted for each scene and a scene safety prior value is calculated; and heterogeneous information fusion and anomaly detection, in which the action anomaly score and the scene safety prior value are fused by weighting to generate a corrected anomaly score, and the minimum of the anomaly scores of all targets in the current frame is taken as the final anomaly determination result for that frame.
2. The unsupervised video anomaly detection method according to claim 1, wherein performing human pose detection and tracking on the video sequence to obtain continuous skeleton sequence features representing the human pose features, acquiring video scene features at the same time, and fusing the human pose features with the scene features to generate fused features specifically comprises: detecting the human skeleton keypoints of each frame in the video sequence with an AlphaPose model, and tracking human skeletons across frames with a PoseFlow model to form the continuous skeleton sequence features, the continuous skeleton sequence features representing the human pose features; using the scene tag of the surveillance video as the scene feature, the scene tag recording the scene category to which the video belongs; and fusing the human pose features of each frame of input data with the scene features to generate the fused features.
3. The unsupervised video anomaly detection method according to claim 1, wherein the fused features are represented as an undirected graph structure comprising a node set and an edge set, the nodes representing human skeleton keypoints and the edges representing the association relations between the human skeleton keypoints; and the fused features are organized into a four-dimensional tensor whose four dimensions are the sample index within a batch, the time step, the human skeleton keypoint index, and a feature dimension holding the two-dimensional coordinates of the human skeleton keypoints together with the scene label corresponding to the input frame.
4. The unsupervised video anomaly detection method according to claim 1, wherein inputting the generated fused features to the difference information learning module based on a reversible density transformation flow, performing probability density modeling, and outputting the action anomaly score specifically comprises: constructing a probability density model with the difference information learning module based on the reversible density transformation flow, the probability density model adopting the Glow architecture and comprising 8 transformation units, each transformation unit comprising an ActNorm layer, an invertible convolution layer and an affine coupling layer, wherein the ActNorm layer normalizes the activations by applying to each channel an affine transformation with independent scale and bias parameters; the invertible convolution layer rearranges the input channels through a learned permutation; the affine coupling layer splits the input tensor into two parts along the channel dimension, one part being kept unchanged as a reference and the other part undergoing an affine transformation whose transformation module is a spatio-temporal convolutional network; and the probability density model evaluates the degree to which an action deviates from the normal pattern by calculating the negative log-likelihood loss of the input sample, thereby obtaining the action anomaly score.
5. The unsupervised video anomaly detection method according to claim 4, wherein the log-likelihood of an input sample is solved using the probability density function shown in formula (1):
$$\log p_\theta(x) = \log p_Z(z) + \sum_{i=1}^{K} \log \left| \det\left( \frac{\partial h_i}{\partial h_{i-1}} \right) \right| \qquad (1)$$
In formula (1), $\log p_\theta(x)$ denotes the log-likelihood of input sample $x$ under the model parameters $\theta$ of the probability density model; $z = f_\theta(x)$ denotes the latent variable obtained by transforming the input sample $x$ through the reversible mapping $f_\theta$; $\det(\cdot)$ denotes the determinant of the Jacobian matrix, which measures the degree to which the reversible transformation scales volume; $|\cdot|$ denotes the absolute value operation, which ensures that the determinant value of the Jacobian matrix is non-negative; $K$ denotes the total number of flow transformation units; and $h_i$ denotes the intermediate hidden variable of the $i$-th flow transformation unit. The negative log-likelihood function shown in formula (2) is used as the loss function $\mathcal{L}$:
$$\mathcal{L} = -\log p_\theta(x) \qquad (2)$$
In formula (2), the value of the loss function $\mathcal{L}$ represents the action anomaly score.
6. The unsupervised video anomaly detection method according to claim 5, wherein counting, based on the scene features, the number of frames in which pedestrians are visible in each scene and calculating the scene safety prior value specifically comprises: first, for each scene $s$, accumulating the visible frames of all segments and obtaining the attention factor $A_s$ by formula (3):
$$A_s = \sum_{c_j \in C_s} \; \sum_{p_i \in P_{c_j}} \left( t^{\mathrm{end}}_{i,j} - t^{\mathrm{start}}_{i,j} \right) \qquad (3)$$
In formula (3), $C_s$ denotes the set of segments of scene $s$; $c_j$ denotes the $j$-th segment of scene $s$; $P_{c_j}$ denotes the set of pedestrians in segment $c_j$; $p_i$ denotes the $i$-th pedestrian instance in the set $P_{c_j}$; and $t^{\mathrm{start}}_{i,j}$ and $t^{\mathrm{end}}_{i,j}$ respectively denote the visible start frame and the visible end frame of pedestrian instance $p_i$ in segment $c_j$. Next, the attention factor $A_s$ is normalized by formula (4) to obtain the scene safety prior value: (4) [formula and target range not recoverable from the source text]. In formula (4), the result denotes the scale-calibrated scene safety prior value corresponding to the scene in which pedestrian instance $p_i$ is located.
7. The unsupervised video anomaly detection method according to claim 6, wherein fusing the action anomaly score and the scene safety prior value by weighting to generate the corrected anomaly score, and taking the minimum of the anomaly scores of all targets in the current frame as the final anomaly determination result for that frame, specifically comprises: according to formula (5), introducing the scene safety prior value obtained by formula (4) as a bias term into the likelihood-based difference anomaly score obtained by formula (2), and combining and normalizing the two according to a preset weight, thereby obtaining the corrected anomaly score: (5) [formula not recoverable from the source text]. The terms of formula (5) are the difference anomaly score, the normalization of the original anomaly scores within each test sequence, and the adjustment weight for the dual-branch information. For the current frame, after the corrected anomaly scores of all $N$ targets have been obtained, the minimum of these anomaly scores is taken as the final anomaly determination score of the frame.
8. An unsupervised video anomaly detection system, comprising: a feature extraction and fusion module configured to perform human pose detection and tracking on a video sequence to obtain continuous skeleton sequence features, the continuous skeleton sequence features representing human pose features, to acquire video scene features at the same time, and to fuse the human pose features with the scene features to generate fused features; a difference information learning module configured to input the generated fused features to a reversible-density-transformation-flow-based model, perform probability density modeling, and output an action anomaly score; a consistency information learning module configured to count, based on the scene features, the number of frames in which pedestrians are visible in each scene and to calculate a scene safety prior value; and a heterogeneous information fusion and anomaly detection module configured to fuse the action anomaly score and the scene safety prior value by weighting to generate a corrected anomaly score, and to take the minimum of the anomaly scores of all targets in the current frame as the final anomaly determination result for that frame.
9. A computer device comprising a processor and a memory storing machine-readable instructions executable by the processor, the processor being configured to execute the machine-readable instructions stored in the memory, wherein the machine-readable instructions, when executed by the processor, perform the steps of the method of any one of claims 1-7.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when run by a computer device, performs the steps of the method according to any of claims 1-7.
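
As an illustration of the tensor layout recited in claim 3, the following minimal sketch builds a batch of fused features from nested Python lists. It assumes that the scene label is appended to the two-dimensional keypoint coordinates as a third channel and that each skeleton has 17 keypoints; the function names `fuse_pose_and_scene` and `shape` are hypothetical, and none of these choices is stated in the claims.

```python
def fuse_pose_and_scene(poses, scene_label):
    """Append the scene label to every keypoint of every frame.

    poses: list over time steps, each a list of (x, y) keypoints.
    Returns a [T][V][3] nested list: x, y, scene label (assumed layout).
    """
    return [[[x, y, float(scene_label)] for (x, y) in frame] for frame in poses]


def shape(nested):
    """Return the dimensions of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# Toy skeleton sequence: 4 time steps, 17 keypoints (assumed keypoint count).
T, V = 4, 17
sample = [[(float(v), float(t)) for v in range(V)] for t in range(T)]

# Batch of two samples taken from scenes with hypothetical labels 3 and 5.
batch = [
    fuse_pose_and_scene(sample, scene_label=3),
    fuse_pose_and_scene(sample, scene_label=5),
]
print(shape(batch))  # (batch, time step, keypoint, channel)
```

Carrying the scene label in the channel dimension keeps the batch, time and keypoint axes unchanged, so a skeleton-only model would only need a wider input channel to consume it.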
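
The negative log-likelihood score of claims 4 and 5 can be illustrated with a deliberately small flow written in plain Python. This is a sketch under stated assumptions, not the claimed model: the base density is assumed to be a standard normal, a fixed channel reversal stands in for the learned invertible convolution, a hand-written toy conditioner stands in for the spatio-temporal convolutional network, and all parameter values are hypothetical.

```python
import math


def actnorm(h, scale, bias):
    """Per-channel affine map; returns (output, log|det J|)."""
    out = [s * v + b for v, s, b in zip(h, scale, bias)]
    return out, sum(math.log(abs(s)) for s in scale)


def reverse_channels(h):
    """Fixed permutation standing in for the learned invertible convolution.
    A permutation has |det J| = 1, so its log-determinant is 0."""
    return h[::-1], 0.0


def affine_coupling(h):
    """Keep the first half as reference; affine-transform the second half.
    The toy conditioner replaces the spatio-temporal conv network."""
    half = len(h) // 2
    ref, rest = h[:half], h[half:]
    log_s = [0.1 * math.tanh(r) for r in ref]  # hypothetical log-scale
    shift = [0.5 * r for r in ref]             # hypothetical shift
    out = ref + [v * math.exp(ls) + t for v, ls, t in zip(rest, log_s, shift)]
    return out, sum(log_s)


def flow_unit(h):
    """One transformation unit: ActNorm -> permutation -> affine coupling."""
    h, ld_norm = actnorm(h, scale=[0.9] * len(h), bias=[0.1] * len(h))
    h, ld_perm = reverse_channels(h)
    h, ld_coup = affine_coupling(h)
    return h, ld_norm + ld_perm + ld_coup


def log_likelihood(x, units):
    """log p(x) = log N(z; 0, I) + sum over units of log|det J|."""
    h, log_det = list(x), 0.0
    for unit in units:
        h, ld = unit(h)
        log_det += ld
    log_prior = sum(-0.5 * (v * v + math.log(2 * math.pi)) for v in h)
    return log_prior + log_det


def action_anomaly_score(x, units):
    """Negative log-likelihood used as the anomaly score."""
    return -log_likelihood(x, units)


units = [flow_unit] * 8  # 8 transformation units, as in claim 4
near = action_anomaly_score([0.0, 0.0, 0.0, 0.0], units)
far = action_anomaly_score([5.0, 5.0, 5.0, 5.0], units)
print(near < far)
```

With these toy parameters the sample far from the origin receives the larger negative log-likelihood. In a trained model the parameters would instead be fitted so that normal skeleton sequences receive high likelihood.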
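
The visible-frame counting of claim 6 can be sketched as follows. The accumulation follows the description of the attention factor (a sum over segments and pedestrian instances of the span between visible start and end frames). Formula (4) is not legible in the source, so the min-max normalization across scenes used here is purely an assumption, as are the scene names and the function names `attention_factor` and `scene_safety_priors`.

```python
def attention_factor(segments):
    """Sum the visible-frame spans of all pedestrians in all segments.

    segments: list of segments; each segment is a list of
    (visible_start_frame, visible_end_frame) pairs, one per pedestrian.
    """
    return sum(end - start for segment in segments for (start, end) in segment)


def scene_safety_priors(scenes):
    """Map each scene to a value in [0, 1] by min-max normalization.

    The normalization form is an assumption; the source formula is missing.
    """
    factors = {name: attention_factor(segs) for name, segs in scenes.items()}
    lo, hi = min(factors.values()), max(factors.values())
    span = hi - lo
    return {name: (f - lo) / span if span else 0.0 for name, f in factors.items()}


# Hypothetical scenes with per-segment pedestrian visibility intervals.
scenes = {
    "plaza": [[(0, 100), (20, 80)], [(0, 50)]],
    "stairwell": [[(10, 30)]],
    "corridor": [[(0, 60)], [(5, 45)]],
}
priors = scene_safety_priors(scenes)
for name in scenes:
    print(name, attention_factor(scenes[name]), round(priors[name], 3))
```

Scenes in which pedestrians are visible for many frames end up with a prior near 1 and rarely occupied scenes with a prior near 0. How that value is interpreted as "safety" depends on formula (4), which this sketch does not reproduce.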
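
The fusion and frame-level decision of claim 7 can be sketched in the same spirit. Formula (5) is not legible in the source, so the additive form used below (normalized score plus weight times scene prior) is an assumption, as are the min-max normalization within the test sequence, the example values, and the function names.

```python
def min_max(values):
    """Normalize a sequence of scores to [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def corrected_scores(raw_scores, prior, weight):
    """Add the scene prior as a weighted bias term to the normalized scores.

    The additive combination is an assumption, not the patent's formula (5).
    """
    return [s + weight * prior for s in min_max(raw_scores)]


def frame_scores(scores, frame_ids):
    """Take the minimum score over all targets that appear in each frame."""
    per_frame = {}
    for score, frame in zip(scores, frame_ids):
        per_frame[frame] = min(score, per_frame.get(frame, score))
    return per_frame


# Four targets from one test sequence: two in frame 0, two in frame 1.
raw = [2.0, 4.0, 6.0, 3.0]
frames = [0, 0, 1, 1]
fused = corrected_scores(raw, prior=0.5, weight=0.2)
print(frame_scores(fused, frames))
```

Because the prior is shared by every target in a scene, under this additive assumption it shifts all scores of that scene by the same amount; its effect shows up when scores from different scenes are compared against a common threshold.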

Description

Unsupervised video anomaly detection method, unsupervised video anomaly detection system, computer device and storage medium

Technical Field

The invention relates to the technical field of video understanding in computer vision, and in particular to an unsupervised video anomaly detection method, an unsupervised video anomaly detection system, a computer device and a storage medium.

Background

Unsupervised video anomaly detection is an important task in video understanding within computer vision. It aims to automatically identify, from video sequences and without any labeled abnormal samples, abnormal events that deviate from the normal pattern or carry risk, such as sudden car accidents or fights. With the rapid growth in the scale of surveillance, manual analysis can no longer meet practical demands, so intelligent detection technology with automatic analysis capability is needed to serve intelligent monitoring and public safety applications.
The task faces several challenges: abnormal events are themselves unpredictable and sparse; what counts as normal or abnormal differs significantly in meaning from scene to scene; and human actions, interactions and background environments undergo complex spatio-temporal changes. These factors make it difficult to build a robust and reliable anomaly detection model, and they underline the research value and practical significance of unsupervised video anomaly detection for intelligent monitoring and public safety.
In recent years, most unsupervised video anomaly detection methods have been built on the idea of proxy tasks. Their core aim is to characterize the normal pattern by constructing a learnable surrogate task and to judge anomalies by the degree to which the proxy task fails. Within this framework, existing methods fall mainly into two types: reconstruction-based methods and prediction-based methods.
Reconstruction-based methods train the model to restore the input frame or skeleton sequence by learning the spatio-temporal characteristics of normal video. When a test sample deviates from the normal pattern, the reconstruction error of the model increases markedly and can serve as the basis for judging an anomaly. Common techniques include autoencoders, convolutional and recurrent structures, spatio-temporal attention mechanisms, and reversible density transformation flows. These methods have a simple structure and train stably, but because the model generalizes strongly, abnormal samples can in some cases still be reconstructed accurately, which weakens the ability to distinguish them.
Prediction-based methods exploit temporal continuity to predict future frames or future actions and identify anomalies from the prediction error. A normal sequence follows a stable spatio-temporal evolution, and an abnormal event often breaks it, so the prediction deviates noticeably from the real result. These methods generally adopt temporal convolutional networks, recurrent networks, optical flow prediction or generative models, and can capture richer dynamic information to a certain extent. However, they place high demands on the accuracy of temporal modeling, and the prediction difficulty rises markedly when the scene is complex or the motion is varied.
Although existing research has made great progress in video anomaly detection, the widely used learning paradigm that depends on proxy tasks remains a suboptimal way of fitting. These methods generally approximate the distribution of normal behavior through indirect objectives such as reconstruction or prediction rather than modeling the real data space directly, which leaves an inherent gap between the optimization objective of the model and the requirements of anomaly detection.
To alleviate this problem, Hirschorn et al. proposed an anomaly detection framework that does not rely on proxy tasks and improves the accuracy of the fit by directly modeling the distribution of normal data. However, the method is modeled entirely on skeleton features, and a skeleton sequence contains only body keypoint information, so it lacks scene semantics, environment structure and cues about the interaction between people and scenes. Video anomaly detection is in essence a highly scene-dependent task. When background information is invisible, the model cannot perceive semantic differences between scenes, misjudgments occur easily across scenes, and ambiguous samples are difficult to distinguish more finely according to the safety level of the scene. Therefore, a method is needed that addresses the discrimination bias caused by the loss of scene semantics, so as to reduce cross-scene misjudgment, improve the ability to discriminate ambiguous samples, and improve the stability and the reliability of