CN-121982603-A - Video anomaly detection method based on multitask learning
Abstract
The invention discloses a video anomaly detection method based on multi-task learning. The model comprises a spatiotemporal encoder, a prototype memory network, and a dual-decoder architecture consisting of a predictive decoder and a reconstruction decoder. The method comprises the steps of: extracting hierarchical features from a training video frame sequence using the spatiotemporal encoder; performing memory enhancement on the spatiotemporal features through the prototype memory network; performing future frame prediction and current frame reconstruction, respectively, through the dual-decoder architecture; calculating a multi-task loss function; and updating the parameters of the spatiotemporal encoder, the prototype memory network, and the dual decoder by back propagation according to the multi-task loss function. A more comprehensive self-supervised learning framework is thereby constructed to enhance the generalization capability of the model.
Inventors
- LIU YANG
- ZENG XINHUA
- YANG HAO
- LIU JING
Assignees
- 杭州慧视诺宝智能科技有限公司
- 复旦大学
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-16
Claims (10)
- 1. A video anomaly detection method based on multi-task learning, characterized in that the anomaly detection model comprises: a spatiotemporal Transformer encoder that extracts hierarchical spatiotemporal features from the input video frame sequence; a prototype memory network, connected to the encoder, that stores normal-pattern prototypes and performs memory enhancement on the encoded features through similarity matching; and a dual-decoder architecture, connected to the encoder and the prototype memory network, comprising a predictive decoder and a reconstruction decoder, wherein the predictive decoder performs a future frame prediction task based on the original temporal features output by the encoder, and the reconstruction decoder performs a reconstruction task of the current frame based on the original temporal features and the memory-enhanced features output by the prototype memory network.
- 2. The video anomaly detection method based on multi-task learning according to claim 1, wherein the spatiotemporal Transformer encoder comprises a spatial encoder that processes single-frame images to extract spatial features and a temporal encoder that processes consecutive frame sequences to extract temporal features.
- 3. The video anomaly detection method based on multi-task learning as recited in claim 2, wherein the spatial encoder includes a patch embedding module that embeds a sequence of N consecutive video frames; the spatial encoder comprises the following processing steps: S1.1.1, patch embedding: each input frame image is divided into non-overlapping image blocks of size P × P for embedding; S1.1.2, a learnable spatial classification token and position encodings are added to each patch sequence, and the resulting initial sequence is input to a module formed by stacking identical spatial Transformer encoding layers; S1.1.3, the spatial feature vectors are processed by the stacked spatial Transformer layers to capture intra-frame spatial dependencies.
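The patch-splitting of step S1.1.1 can be sketched as follows. This is a minimal NumPy illustration; the function name `patchify` and the example sizes are assumptions, not taken from the patent, and the subsequent linear projection into embedding space is omitted:

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patch x patch blocks,
    each flattened to a vector, as described in step S1.1.1 (sketch)."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image must tile evenly"
    blocks = image.reshape(H // patch, patch, W // patch, patch, C)
    blocks = blocks.transpose(0, 2, 1, 3, 4)          # (H/P, W/P, P, P, C)
    return blocks.reshape(-1, patch * patch * C)      # (num_patches, P*P*C)

img = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
tokens = patchify(img, patch=4)
print(tokens.shape)  # (4, 48): four 4x4 patches, each of dimension 4*4*3
```

In a full implementation each flattened block would then be linearly projected and prepended with the learnable CLS token of step S1.1.2.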
- 4. The video anomaly detection method based on multi-task learning according to claim 3, wherein the processing steps of the temporal encoder are as follows: S1.2.1, the spatial CLS tokens of all video frames are collected and arranged in temporal order to construct an initial temporal sequence; S1.2.2, a temporal classification token is added at the beginning of the temporal sequence to aggregate global context information of the entire segment along the time dimension, a position-encoding vector is generated for each position in the temporal sequence, and the initial temporal representation fused with temporal position information is obtained; S1.2.3, the initial temporal sequence is input to a module formed by stacking temporal Transformer encoding layers, which model the inter-frame temporal dynamics.
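The sequence assembly of steps S1.2.1–S1.2.2 amounts to stacking the per-frame CLS tokens, prepending the temporal token, and adding position encodings. A minimal sketch (the name `build_temporal_sequence` and the learned-parameter shapes are assumptions):

```python
import numpy as np

def build_temporal_sequence(cls_tokens: np.ndarray,
                            temporal_token: np.ndarray,
                            pos_embed: np.ndarray) -> np.ndarray:
    """Assemble the temporal encoder input (sketch of S1.2.1-S1.2.2):
    cls_tokens is (N, d) with the spatial CLS token of each frame in time
    order; temporal_token is a learnable (d,) vector prepended to the
    sequence; pos_embed is (N + 1, d) position encodings added element-wise."""
    seq = np.concatenate([temporal_token[None, :], cls_tokens], axis=0)
    return seq + pos_embed
```

The resulting (N + 1, d) array is what step S1.2.3 feeds into the stacked temporal Transformer layers.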
- 5. The video anomaly detection method based on multi-task learning according to claim 3, wherein the prototype memory network maintains a learnable memory pool containing M prototype features, each prototype representing a characteristic normal pattern; the specific steps of memory enhancement by the prototype memory network are as follows: S2.1, memory addressing determines which normal-pattern prototype or prototypes in the memory are most similar to the input features; S2.2, during training, the memory bank is updated by an exponential moving average, where the update rate controls the speed of adaptation and an indicator function selects the prototype with the highest attention weight; S2.3, a memory shrinkage loss, combined with entropy regularization and diversity constraints, encourages sparse memory usage and prevents over-smoothing.
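The hard-assignment EMA update of step S2.2 can be sketched as below. The re-normalization of the updated prototype is an assumption added for illustration (it keeps cosine similarities meaningful); the update rate value is likewise illustrative:

```python
import numpy as np

def ema_update_memory(memory: np.ndarray, feature: np.ndarray,
                      weights: np.ndarray, update_rate: float = 0.1) -> np.ndarray:
    """Update only the most-attended prototype toward the input feature
    with an exponential moving average (sketch of step S2.2). The
    indicator function from the claim reduces to an argmax over the
    attention weights."""
    memory = memory.copy()
    k = int(np.argmax(weights))  # indicator: prototype with highest attention
    memory[k] = (1 - update_rate) * memory[k] + update_rate * feature
    # assumption: re-normalize the touched prototype to unit length
    memory[k] /= np.linalg.norm(memory[k]) + 1e-8
    return memory
```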
- 6. The video anomaly detection method based on multi-task learning according to claim 5, wherein S2.1 comprises the following specific steps: S2.1.1, attention weights between the encoded temporal features and the stored prototypes are calculated based on similarity; S2.1.2, given the encoded temporal features, the attention weights are computed using softmax normalization with temperature scaling; S2.1.3, a memory read is performed based on the addressing result, reading the relevant normal-pattern information from the memory bank, with the retrieved memory representation computed as a weighted combination of all prototypes.
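Steps S2.1.1–S2.1.3 can be sketched in a few lines. The use of cosine similarity and the temperature value are assumptions (the claim names similarity and temperature-scaled softmax but does not fix the similarity measure):

```python
import numpy as np

def address_memory(feature: np.ndarray, memory: np.ndarray,
                   temperature: float = 0.5):
    """Memory addressing and read (sketch of S2.1): similarities between
    the encoded feature and each prototype are softmax-normalized with
    temperature scaling, and the retrieved representation is the
    weighted combination of all prototypes."""
    f = feature / (np.linalg.norm(feature) + 1e-8)
    m = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + 1e-8)
    sim = m @ f                       # (M,) cosine similarities
    logits = sim / temperature
    logits -= logits.max()            # numerical stability
    w = np.exp(logits)
    w /= w.sum()                      # attention weights (S2.1.2)
    read = w @ memory                 # weighted combination (S2.1.3)
    return w, read
```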
- 7. The video anomaly detection method based on multi-task learning as recited in claim 6, wherein the dual-decoder architecture uses the original temporal features and the memory-enhanced features to perform complementary tasks: the predictive decoder directly processes the original temporal feature f through upsampling layers to predict the next frame, while the reconstruction decoder reconstructs the input frame by concatenating the original and memory-enhanced features, allowing the reconstruction task to benefit from the memory-constrained representation while the prediction task retains direct access to the temporal dynamics.
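The feature routing described in claim 7 is simply: raw features to the predictive decoder, raw features concatenated with memory-enhanced features to the reconstruction decoder. A one-function sketch (names are illustrative):

```python
import numpy as np

def decoder_inputs(f: np.ndarray, f_mem: np.ndarray):
    """Route the encoder outputs to the two decoders (sketch of claim 7):
    the predictive decoder receives the raw temporal feature f alone,
    while the reconstruction decoder receives f concatenated with the
    memory-enhanced feature f_mem along the channel axis."""
    pred_in = f
    rec_in = np.concatenate([f, f_mem], axis=-1)
    return pred_in, rec_in
```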
- 8. The video anomaly detection method based on multi-task learning of claim 7, wherein, during the model training phase, the loss functions of the multiple tasks are combined into a total training loss by weighted summation, so as to coordinate the optimization process and balance the contribution of each task to the model parameters.
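The weighted summation of claim 8 over the three losses named in claim 10 (prediction, reconstruction, memory shrinkage) reduces to one line; the weight values below are illustrative hyper-parameters, not values from the patent:

```python
def total_loss(l_pred: float, l_rec: float, l_mem: float,
               w_pred: float = 1.0, w_rec: float = 1.0,
               w_mem: float = 0.01) -> float:
    """Total training loss as a weighted sum of the prediction loss,
    reconstruction loss, and memory shrinkage loss (sketch of claim 8)."""
    return w_pred * l_pred + w_rec * l_rec + w_mem * l_mem
```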
- 9. The video anomaly detection method based on multi-task learning of claim 7, wherein an anomaly detection operation is further performed during decoding: first, the PSNR values of the prediction and reconstruction tasks are calculated; then, the concentration of the attention weights is measured by a memory entropy, and an intermediate anomaly score is obtained by combining the normalized PSNR values with the memory entropy; the final anomaly score is normalized to the range [0, 1].
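The scoring of claim 9 can be sketched as follows. The PSNR normalization range and the mixing weights alpha/beta/gamma are assumptions; the patent specifies only that normalized PSNRs and memory entropy are combined and the result lies in [0, 1]:

```python
import math

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between two equal-length frames,
    given as flat lists of floats."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return 10.0 * math.log10(max_val ** 2 / (mse + 1e-12))

def anomaly_score(pred_psnr, rec_psnr, weights,
                  lo=20.0, hi=40.0, alpha=0.4, beta=0.4, gamma=0.2):
    """Composite anomaly score (sketch of claim 9): min-max-normalized
    PSNRs are inverted (low PSNR -> high anomaly) and combined with the
    memory entropy of the attention weights, which is high when attention
    is diffuse over prototypes (i.e. no normal pattern matches well)."""
    norm = lambda p: min(max((p - lo) / (hi - lo), 0.0), 1.0)
    total = sum(weights)
    w = [x / total for x in weights]
    entropy = -sum(x * math.log(x + 1e-12) for x in w) / math.log(len(w))
    s = alpha * (1 - norm(pred_psnr)) + beta * (1 - norm(rec_psnr)) + gamma * entropy
    return min(max(s, 0.0), 1.0)
```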
- 10. A video anomaly detection method based on multi-task learning according to any one of claims 1 to 8, comprising a model training phase and an anomaly detection phase, the model training phase comprising the steps of: S1, performing hierarchical feature extraction on the training video frame sequence using the spatiotemporal Transformer encoder to obtain spatiotemporal features; S2, performing memory enhancement on the spatiotemporal features through the prototype memory network; S3, performing future frame prediction and current frame reconstruction, respectively, through the dual-decoder architecture; S4, calculating a multi-task loss function comprising a prediction loss, a reconstruction loss, and a memory shrinkage loss; S5, updating the parameters of the spatiotemporal Transformer encoder, the prototype memory network, and the dual decoder by back propagation according to the multi-task loss function; the anomaly detection phase comprising the steps of: 1) acquiring a video frame sequence to be detected; 2) extracting spatiotemporal features from the video frame sequence using the spatiotemporal Transformer encoder; 3) performing memory addressing on the spatiotemporal features through the prototype memory network to obtain attention weights; 4) outputting a future frame and a reconstructed frame, respectively, using the dual decoder; 5) calculating a composite anomaly score based on the future frame, the reconstructed frame, and the attention weights; 6) judging whether an abnormal event exists in the video according to the composite anomaly score.
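The control flow of detection steps 1)–6) can be sketched end to end with stub components. Everything here is an assumption for illustration: the real encoder and decoders are Transformer-based trained networks, whereas the stubs below are a random linear map and identity decoders, kept only so the per-frame scoring loop is visible:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stub "trained model" parameters (assumptions, see lead-in):
W_enc = rng.standard_normal((64, 16)) * 0.1   # encoder: flat 8x8 frame -> feature
memory = rng.standard_normal((8, 16))         # prototype memory pool (M = 8)

def detect(frames: np.ndarray, threshold: float = 0.5,
           temperature: float = 0.5):
    """Detection phase sketch: per frame, extract a feature (step 2),
    address the memory for attention weights (step 3), obtain stub
    prediction/reconstruction outputs (step 4), combine errors and
    memory entropy into a score (step 5), and threshold it (step 6)."""
    scores = []
    for t in range(len(frames) - 1):
        f = frames[t].reshape(-1) @ W_enc                      # step 2
        sim = memory @ f / (np.linalg.norm(memory, axis=1) *   # step 3
                            np.linalg.norm(f) + 1e-8)
        w = np.exp(sim / temperature)
        w /= w.sum()
        pred = frames[t]                                       # step 4 (stubs)
        rec = frames[t]
        pred_err = float(np.mean((pred - frames[t + 1]) ** 2)) # step 5
        rec_err = float(np.mean((rec - frames[t]) ** 2))
        entropy = -np.sum(w * np.log(w + 1e-12)) / np.log(len(w))
        s = np.clip(0.4 * pred_err + 0.4 * rec_err + 0.2 * entropy, 0.0, 1.0)
        scores.append(float(s))
    return scores, [s > threshold for s in scores]             # step 6
```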
Description
Video anomaly detection method based on multitask learning

Technical Field
The invention relates to a video anomaly detection model based on multi-task learning, and also to a memory enhancement method based on multi-task learning for video anomaly detection, belonging to the technical field of image processing.

Background
Video anomaly detection is one of the core technologies in the fields of computer vision and intelligent surveillance. It aims to automatically identify events in video that do not conform to normal behavior patterns, such as traffic accidents, violent behavior, and intrusion into restricted areas. With the wide deployment of surveillance cameras in public places, automatic anomaly detection technology has important application value in public safety, traffic management, industrial monitoring, and other areas. Existing video anomaly detection methods fall mainly into two categories: weakly supervised methods and unsupervised methods. Weakly supervised methods rely on video-level label information; while they perform well in some scenarios, they are limited by the scarcity and high cost of annotated data. Unsupervised methods have therefore become a research hotspot: they are trained using only normal samples and identify abnormal events as deviations from the learned normal pattern. Typical unsupervised methods include reconstruction-based methods, such as autoencoders that learn to reconstruct normal samples so that abnormal samples produce large reconstruction errors; prediction-based methods, which predict future frames with a temporal model so that abnormal samples lead to increased prediction errors; and memory-based methods, such as memory-augmented autoencoders that store normal-pattern prototypes in a prototype memory network to limit the model's ability to reconstruct abnormal samples.
In recent years, the Transformer architecture has been introduced into video anomaly detection tasks due to its powerful sequence modeling capability. For example, the TransAnomaly and HSTforU approaches apply the Vision Transformer or its variants to spatiotemporal feature modeling. However, these methods still have the following problems: 1. the representation capability is overly strong, so that abnormal samples are also represented well and the error rate on normal samples increases; 2. the task is single: most methods adopt only one of reconstruction or prediction and cannot fully exploit the complementarity of spatiotemporal features; 3. the constraint mechanism is insufficient: there is no effective mechanism to limit the model's generalization to abnormal samples; 4. the features are insufficiently utilized: the original features and memory-enhanced features are not effectively combined, which limits the detection of complex anomaly types.

Disclosure of Invention
The invention aims to overcome the above problems in the prior art and provide a video anomaly detection method based on multi-task learning, which improves detection precision and robustness over the original Transformer architecture.
In order to solve the above technical problems, in the video anomaly detection method based on multi-task learning of the present invention, the anomaly detection model includes: a spatiotemporal Transformer encoder that extracts hierarchical spatiotemporal features from the input video frame sequence; a prototype memory network, connected to the encoder, that stores normal-pattern prototypes and performs memory enhancement on the encoded features through similarity matching; and a dual-decoder architecture, connected to the encoder and the prototype memory network, comprising a predictive decoder and a reconstruction decoder; the predictive decoder performs a future frame prediction task based on the original temporal features output by the encoder, and the reconstruction decoder performs a reconstruction task of the current frame based on the original temporal features and the memory-enhanced features output by the prototype memory network. Further, the spatiotemporal Transformer encoder includes a spatial encoder that processes single-frame images to extract spatial features and a temporal encoder that processes consecutive frame sequences to extract temporal features. Further, the spatial encoder includes a patch embedding module that embeds a sequence of N consecutive video frames; the spatial encoder comprises the following processing steps: S1.1.1, patch embedding: each input frame image is divided into non-overlapping image blocks of size P × P for embedding; S1.1.2, a learnable spatial classification token and position coding are added to each patch sequence, namely, the initial seque