
CN-121838023-B - Video anomaly detection method and system based on semantic and amplitude depth synergy

CN121838023B

Abstract

The invention provides a video anomaly detection method and a system based on semantic and amplitude depth synergy, wherein the method comprises the steps of extracting a visual feature sequence of a video frame, utilizing amplitude and energy of each time step to strengthen feature vectors of corresponding time steps in the visual feature sequence to obtain an enhanced feature sequence, calculating a semantic similarity matrix of the enhanced feature sequence, calculating a sparse affinity matrix according to the amplitude of each time step in the visual feature sequence, fusing to obtain final attention weight, utilizing the final attention weight to conduct weighted aggregation on the value vectors of the enhanced feature sequence, and inputting the weighted aggregation into a classifier to obtain a frame-level anomaly score so as to realize video anomaly detection. The invention realizes the deep synergy of the semantic and the amplitude information forcefully on the model architecture level through one-time non-parameter characteristic engineering, and solves the problem of insensitivity to the characteristic amplitude in the existing weak supervision video anomaly detection method.

Inventors

  • LIAO YISHEN
  • XIE CHEN
  • LUO KAIWEN
  • LV YIQUN
  • DONG XINRU
  • ZHU TAO
  • LI SHIYU
  • LIU YUE
  • CHENG YUHENG
  • TU XINYI

Assignees

  • Jiangxi University of Finance and Economics (江西财经大学)

Dates

Publication Date
2026-05-12
Application Date
2026-03-12

Claims (6)

  1. A video anomaly detection method based on the synergy of semantics and amplitude depth, characterized by comprising the following steps:
Step 1: extracting a visual feature sequence of the video frames.
Step 2: calculating the amplitude and energy of each time step of the visual feature sequence. The amplitude of each time step satisfies the relation:
m_t = ||x_t||_2
wherein x_t denotes the feature vector at time step t in the visual feature sequence, ||·||_2 denotes the L2 norm, and m_t denotes the amplitude of x_t. The energy of each time step satisfies the relation:
e_t = ||x_t||_2^2
wherein ||·||_2^2 denotes the squared L2 norm and e_t denotes the energy of x_t. The feature vector of the corresponding time step in the visual feature sequence is then enhanced using the amplitude and energy of each time step, obtaining an enhanced feature sequence.
Step 3: calculating a semantic similarity matrix of the enhanced feature sequence to obtain a semantic stream, and calculating a sparse affinity matrix from the amplitude of each time step in the visual feature sequence to obtain an amplitude stream. Specifically: the amplitudes of all time steps in the visual feature sequence are extracted and normalized to obtain a normalized amplitude vector; the amplitude affinity between the normalized amplitudes of any two time steps is calculated with an exponential (Gaussian) kernel according to the relation:
A_ij = exp( -(m̂_i - m̂_j)^2 / (2σ^2) )
wherein m̂ denotes the normalized amplitude vector, m̂_i and m̂_j denote the normalized amplitudes at time steps i and j, σ denotes a hyperparameter controlling the kernel width, A denotes the amplitude affinity matrix, i and j denote time-step indices, A_ij denotes the amplitude affinity value between time steps i and j, and exp(·) denotes the exponential function. The amplitude affinity matrix A is sparsified and weighted to obtain the sparse affinity matrix according to the relation:
S = α · (M ⊙ A) + β · c
wherein S denotes the sparse affinity matrix, α and β denote two different learnable scalars, c denotes a fixed scalar, M denotes a binary sparse mask generated by a Top-K operation, and ⊙ denotes the Hadamard product. The sparse affinity matrix S is taken as the amplitude stream. The amplitude stream is then fused, as a structured prior, with the semantic stream to obtain the final attention weights. Specifically, the semantic stream and the amplitude stream are additively fused in logits space to obtain a fused attention logits matrix according to the relation:
L = S_sem + S
wherein L denotes the fused attention logits matrix and S_sem denotes the semantic similarity matrix, which is taken as the semantic stream. A dynamic local mask is introduced, the fused attention logits matrix is masked with it, and a Softmax is computed to obtain the final attention weights according to the relation:
W = Softmax(L + M_local)
wherein W denotes the final attention weights, M_local denotes the dynamic local mask, and Softmax denotes the normalized exponential function.
Step 4: performing weighted aggregation on the value vectors of the enhanced feature sequence using the final attention weights, and inputting the result into a classifier to obtain frame-level anomaly scores, thereby realizing video anomaly detection.
  2. The video anomaly detection method based on semantic and amplitude depth synergy according to claim 1, wherein in step 2, the feature vector of the corresponding time step in the visual feature sequence is enhanced using the amplitude and energy of each time step to obtain the enhanced feature sequence according to the relation:
z_t = Concat(x_t, m_t, e_t)
wherein z_t denotes the enhanced feature vector, the enhanced feature vectors over all time steps form the enhanced feature sequence, and Concat(·) denotes the feature-dimension concatenation operation.
  3. The video anomaly detection method based on semantic and amplitude depth synergy according to claim 2, wherein in step 3, calculating the semantic similarity matrix of the enhanced feature sequence to obtain the semantic stream specifically comprises the following steps: the enhanced feature sequence is linearly projected to generate a query matrix, a key matrix and a value matrix according to the relations:
Q = Z·W_Q,  K = Z·W_K,  V = Z·W_V
wherein Q denotes the query matrix, K denotes the key matrix, V denotes the value matrix, W_Q, W_K and W_V denote the learnable projection matrices corresponding to the query, key and value matrices respectively, and Z denotes the enhanced feature sequence; scaled dot-product attention calculation is performed on the query matrix and the key matrix to obtain the semantic similarity matrix according to the relation:
S_sem = Q·K^T / √d
wherein ^T denotes the transpose operation and d denotes the feature dimension; the semantic similarity matrix is taken as the semantic stream.
  4. The video anomaly detection method based on semantic and amplitude depth synergy according to claim 3, wherein in step 4, performing weighted aggregation on the value vectors of the enhanced feature sequence using the final attention weights and inputting the result into a classifier to obtain frame-level anomaly scores, thereby realizing video anomaly detection, specifically comprises the following steps: the value matrix is weighted with the final attention weights to obtain a context aggregation result according to the relation:
C = W·V
wherein C denotes the context aggregation result; a back-projection operation is performed on the context aggregation result to obtain a back-projection map according to the relation:
Z̃ = Proj(C) / (||Proj(C)||_F + ε)
wherein Z̃ denotes the back-projection map, Proj(·) denotes the back-projection operation, ||·||_F denotes the Frobenius norm, and ε denotes a very small positive number; the back-projection map is residually connected with the enhanced feature sequence to obtain a final context aggregation result according to the relation:
Z' = Z + Z̃
wherein Z' denotes the final context aggregation result; the final context aggregation result is input into an MLP classification head to obtain the anomaly score of each time step, and a nonlinear activation is applied to the anomaly score of each time step to obtain the final anomaly probability according to the relations:
s_t = MLP(z'_t),  p_t = σ(s_t)
wherein s_t denotes the anomaly score at time step t, t denotes the current time step, MLP(·) denotes the MLP classification head, p_t denotes the final anomaly probability, and σ(·) denotes the Sigmoid function.
  5. The video anomaly detection method based on semantic and amplitude depth synergy according to claim 4, wherein steps 1 to 4 are executed by a video anomaly detection model, and the process of constructing the video anomaly detection model comprises the following steps: constructing a basic model from a pre-trained visual model, a nonlinear space-lifting layer, a dual-stream fusion attention layer and a classification layer, wherein the pre-trained visual model is used to acquire the visual feature sequence of the video frames, the nonlinear space-lifting layer is used to acquire the enhanced feature sequence, the dual-stream fusion attention layer is used to acquire the final attention weights, and the classification layer is used to acquire the frame-level anomaly scores; providing a training set comprising training videos and the corresponding video-level ground-truth labels; repeating steps 1 to 4 with a training video as input to obtain frame-level probabilities; aggregating the frame-level probabilities through a maximum pooling operation and then applying a nonlinear activation to obtain the video-level prediction probability; constructing a loss function based on binary cross-entropy from the video-level prediction probability and the corresponding video-level ground-truth label according to the relation:
L_BCE = -[ Y·log(P) + (1-Y)·log(1-P) ]
wherein L_BCE denotes the value of the loss function, Y denotes the video-level ground-truth label, Y ∈ {0, 1}, P denotes the video-level prediction probability, P ∈ (0, 1), and max_t denotes taking the maximum value over the time dimension; and setting a weight-decay strategy and learning parameters, and updating the weights and learning parameters of the dual-stream fusion attention layer and the classification layer to minimize the loss so as to train the basic model, the trained model being the video anomaly detection model.
  6. A video anomaly detection system based on semantic and amplitude depth synergy, wherein the system applies the video anomaly detection method based on semantic and amplitude depth synergy according to any one of claims 1 to 5, the system comprising: a feature extraction module for extracting a visual feature sequence of the video frames; a nonlinear space-lifting module for calculating the amplitude and energy of each time step of the visual feature sequence, and enhancing the feature vector of the corresponding time step in the visual feature sequence using the amplitude and energy of each time step to obtain an enhanced feature sequence; a dual-stream fusion attention module for calculating a semantic similarity matrix of the enhanced feature sequence to obtain a semantic stream, calculating a sparse affinity matrix from the amplitude of each time step in the visual feature sequence to obtain an amplitude stream, and fusing the amplitude stream, as a structured prior, with the semantic stream to obtain the final attention weights; and an anomaly scoring module for performing weighted aggregation on the value vectors of the enhanced feature sequence using the final attention weights and inputting the result into a classifier to obtain frame-level anomaly scores, thereby realizing video anomaly detection.
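The claimed pipeline (amplitude/energy enhancement, semantic stream, sparse amplitude stream, logits-space fusion, local masking, and frame scoring) can be sketched in numpy. This is a minimal illustration, not the patented implementation: all parameter values (σ, K, α, β, c, the local-window size) and the random stand-ins for the learnable projections and classifier are assumptions for demonstration.

```python
import numpy as np

def dual_stream_attention(X, sigma=0.5, k=4, alpha=1.0, beta=0.1, c=1.0, window=8, seed=0):
    """Sketch of the dual-stream (semantic + amplitude) fusion attention.
    X: (T, D) visual feature sequence. All names/values are illustrative."""
    rng = np.random.default_rng(seed)
    T, _ = X.shape

    # Step 2: amplitude m_t = ||x_t||_2 and energy e_t = ||x_t||_2^2,
    # concatenated onto each feature vector (claims 1-2).
    m = np.linalg.norm(X, axis=1, keepdims=True)
    e = m ** 2
    Z = np.concatenate([X, m, e], axis=1)          # enhanced sequence, (T, D+2)

    # Semantic stream: scaled dot-product similarity of projections (claim 3).
    d = Z.shape[1]
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv
    S_sem = Q @ K.T / np.sqrt(d)

    # Amplitude stream: Gaussian affinity of normalized amplitudes,
    # sparsified with a Top-K binary mask (claim 1).
    mn = (m - m.mean()) / (m.std() + 1e-8)
    A = np.exp(-(mn - mn.T) ** 2 / (2 * sigma ** 2))
    topk = np.argsort(-A, axis=1)[:, :k]
    M = np.zeros_like(A)
    np.put_along_axis(M, topk, 1.0, axis=1)
    S_amp = alpha * (M * A) + beta * c             # assumed form of the sparse affinity

    # Fusion in logits space, dynamic local mask, row-wise Softmax (claim 1).
    idx = np.arange(T)
    local = np.where(np.abs(idx[:, None] - idx[None, :]) <= window, 0.0, -1e9)
    logits = S_sem + S_amp + local
    W = np.exp(logits - logits.max(axis=1, keepdims=True))
    W /= W.sum(axis=1, keepdims=True)

    # Step 4: weighted aggregation of value vectors, then frame-level scores
    # from a stand-in linear "classifier" with a sigmoid.
    C = W @ V
    w_cls = rng.standard_normal(d) / np.sqrt(d)
    scores = 1.0 / (1.0 + np.exp(-(C @ w_cls)))
    return W, scores
```

Each attention row sums to 1 and every frame receives an anomaly score in (0, 1); the Top-K mask keeps only the strongest amplitude affinities per frame, which is what makes the amplitude stream a sparse structured prior.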

Description

Video anomaly detection method and system based on semantic and amplitude depth synergy

Technical Field

The invention belongs to the technical field of computer vision, and particularly relates to a video anomaly detection method and system based on semantic and amplitude depth synergy.

Background

Weakly Supervised Video Anomaly Detection (WSVAD) aims to locate specific anomalous frames using only video-level labels (i.e., whether a video contains anomalies), and is a critical and challenging task in the field of video understanding. The prior art can be mainly divided into the following categories. (1) Reconstruction- and prediction-based methods: such methods assume that normal event patterns are simple and easy to learn, while anomalous events, due to their rarity, are difficult for the model to reconstruct or predict and therefore produce large errors. However, the strong generalization ability of deep learning models often allows them to reconstruct anomalies well too, so the error gap between anomalous and normal frames is not sufficiently significant. (2) Methods based on strong prior knowledge: subsequent work introduces stronger priors to enhance discrimination. For example, RTFM and similar methods demonstrate that feature magnitude is a strong signal for detecting anomalies, but they typically direct the model's attention to magnitude from the outside, by designing elaborate loss functions, without altering the perception mechanism inside the model. (3) Methods based on graph attention networks (GATs): these are good at capturing contextual relationships between video frames, but their graph-construction logic depends almost entirely on the directional consistency between features (i.e., semantic similarity) and ignores variations in feature strength (magnitude); they therefore struggle with events that are semantically normal but abnormal due to intensity or contextual inconsistencies.
In summary, the prior art shares a fundamental bottleneck: standard attention and graph-construction mechanisms, whose core is to evaluate semantic similarity, are not sensitive enough to changes in feature amplitude. As a result, models have difficulty capturing critical events that are semantically similar to normal events but abnormal due to context or intensity spikes. A new method that deeply coordinates semantic information with amplitude information at the model architecture level is therefore needed.

Disclosure of the Invention

In view of the above, the main objective of the present invention is to provide a video anomaly detection method and system based on semantic and amplitude depth synergy, so as to solve the technical problem that in the prior art the attention mechanism is sensitive only to semantic similarity and ignores feature amplitude, making context-related anomalies difficult to distinguish.
The invention provides a video anomaly detection method based on semantic and amplitude depth synergy, which comprises the following steps: Step 1, extracting a visual feature sequence of the video frames; Step 2, calculating the amplitude and energy of each time step of the visual feature sequence, and enhancing the feature vector of the corresponding time step in the visual feature sequence using the amplitude and energy of each time step to obtain an enhanced feature sequence; Step 3, calculating a semantic similarity matrix of the enhanced feature sequence to obtain a semantic stream, calculating a sparse affinity matrix from the amplitude of each time step in the visual feature sequence to obtain an amplitude stream, and fusing the amplitude stream, as a structured prior, with the semantic stream to obtain the final attention weights; and Step 4, performing weighted aggregation on the value vectors of the enhanced feature sequence using the final attention weights, and inputting the result into a classifier to obtain frame-level anomaly scores, thereby realizing video anomaly detection.
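The weakly supervised training objective of claim 5 (frame-level scores aggregated by max pooling over time, passed through a nonlinear activation, and compared against the video-level label with binary cross-entropy) can be sketched as follows. This is a minimal numpy sketch under the assumption that the frame scores are pre-activation logits; the function name and epsilon smoothing are illustrative.

```python
import numpy as np

def video_level_bce(frame_scores, y):
    """Sketch of the video-level loss from claim 5.
    frame_scores: (T,) pre-activation anomaly scores (assumed logits).
    y: video-level ground-truth label, 0 (normal) or 1 (anomalous)."""
    s_max = np.max(frame_scores)        # max pooling over the time dimension
    p = 1.0 / (1.0 + np.exp(-s_max))    # nonlinear activation -> video-level probability
    eps = 1e-8                          # numerical safety for the logarithms
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```

The max pooling means a single strongly anomalous frame is enough to drive the video-level prediction toward 1, which is exactly how video-level labels supervise frame-level scores in this multiple-instance setup.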
The invention also provides a video anomaly detection system based on semantic and amplitude depth synergy, wherein the system applies the above video anomaly detection method based on semantic and amplitude depth synergy, and the system comprises: a feature extraction module for extracting a visual feature sequence of the video frames; a nonlinear space-lifting module for calculating the amplitude and energy of each time step of the visual feature sequence, and enhancing the feature vector of the corresponding time step in the visual feature sequence using the amplitude and energy of each time step to obtain an enhanced feature sequence; a dual-stream fusion attention module for calculating a sparse affinity matrix from the amplitude of each time step in the visual feature sequence to obtain an amplitude stream, and fusing the amplitude stream, as a structured prior, with the semantic stream to obtain the final attention weights; and an anomaly scoring module for performing weighted aggregation on the value vectors of the enhanced feature sequence using the final attention weights and inputting the result into a classifier to obtain frame-level anomaly scores, thereby realizing video anomaly detection.