CN-121982780-A - Weak supervision group behavior identification method and system based on double-branch space-time motion fusion network
Abstract
A weakly supervised group behavior recognition method and system based on a dual-branch spatio-temporal motion fusion network. A plurality of videos containing group behaviors are collected and preprocessed to obtain a plurality of continuous video frame sequences, from which a training data set is constructed. A dual-branch spatio-temporal motion fusion network is built, comprising a global motion dynamics module, a backbone network, a spatio-temporal fusion module and a prediction head. The network is trained on the training data set, a total loss function is constructed, and the network is adjusted according to the total loss function to obtain a trained dual-branch spatio-temporal motion fusion network, which is then deployed to the device side for group behavior recognition. The invention addresses camera-motion interference in complex scenes and achieves high group behavior recognition accuracy and efficiency.
Inventors
- ZHU XIAOLIN
- ZHANG XIAO
- WAN QIN
- LUO ZHENYUE
- XIE PEIJUN
- WANG WEIXIANG
- TANG CAN
- DENG TAIGUO
Assignees
- 湖南工程学院 (Hunan Institute of Engineering)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-04-03
Claims (10)
- 1. A weakly supervised group behavior recognition method based on a dual-branch spatio-temporal motion fusion network, characterized by comprising the following steps: S1, acquiring a plurality of videos containing group behaviors, preprocessing the videos to obtain a plurality of continuous video frame sequences, and constructing a training data set from the plurality of continuous video frame sequences; S2, constructing a dual-branch spatio-temporal motion fusion network comprising a global motion dynamics module GMDM, a backbone network, a spatio-temporal fusion module STFM and a prediction head, wherein the output ends of the global motion dynamics module GMDM and the backbone network are each connected to the input end of the spatio-temporal fusion module STFM, and the output end of the spatio-temporal fusion module STFM is connected to the prediction head; S3, training the dual-branch spatio-temporal motion fusion network with the training data set, constructing a total loss function, and adjusting the dual-branch spatio-temporal motion fusion network according to the total loss function to obtain a trained dual-branch spatio-temporal motion fusion network; and S4, deploying the trained dual-branch spatio-temporal motion fusion network to the device side and carrying out group behavior recognition to obtain a recognition result.
- 2. The weakly supervised group behavior recognition method based on a dual-branch spatio-temporal motion fusion network according to claim 1, wherein the global motion dynamics module GMDM comprises an adaptive temporal position encoder and a higher-order motion statistics module HoMST which are connected in sequence, the higher-order motion statistics module HoMST being designed on the basis of the Transformer model; the higher-order motion statistics module HoMST comprises a Token input layer, a first normalization layer, a HoMST submodule, a second normalization layer, a feedforward neural network and a Token output layer, wherein the output end of the Token input layer is divided into two branches: one branch is connected in series with the first normalization layer and the HoMST submodule in sequence, and the other branch is connected to the output end of the HoMST submodule through a residual connection; the resulting output is again divided into two branches, namely a first branch and a second branch, the first branch being connected in series with the second normalization layer and the feedforward neural network in sequence, and the second branch being connected to the output end of the feedforward neural network through a residual connection and then to the Token output layer; the HoMST submodule comprises a plurality of Token statistical self-attention mechanisms TSSA and a soft-matrix member feature aggregation module, wherein the Token statistical self-attention mechanisms TSSA are each connected to the soft-matrix member feature aggregation module.
- 3. The weakly supervised group behavior recognition method based on a dual-branch spatio-temporal motion fusion network according to claim 1, wherein the spatio-temporal fusion module STFM comprises two first convolution layers, two channel attention mechanisms CA, a cross-domain fusion block CFB, a correlation enhancement module CE, a channel-dimension splicing module and a feature fusion module FFB; one first convolution layer is connected in series with one channel attention mechanism CA and the cross-domain fusion block CFB in sequence, and the other first convolution layer is connected in series with the other channel attention mechanism CA and the correlation enhancement module CE in sequence; both channel attention mechanisms CA are connected to the cross-domain fusion block CFB and the correlation enhancement module CE; the two channel attention mechanisms CA, the cross-domain fusion block CFB and the correlation enhancement module CE are each connected to the channel-dimension splicing module, and the output end of the channel-dimension splicing module is connected to the feature fusion module FFB; the feature fusion module FFB comprises an inverse depth-separable convolution module, a depth-separable convolution module and a projection convolution module which are connected in series in sequence.
- 4. The weakly supervised group behavior recognition method based on a dual-branch spatio-temporal motion fusion network according to claim 3, wherein step S3 specifically comprises the following steps: S31, inputting the training data set into the global motion dynamics module GMDM of the dual-branch spatio-temporal motion fusion network, first obtaining the feature vectors of the acceleration field, the transition field and the pulse field on the basis of the DeepFlow algorithm, and then splicing them to obtain a unified motion feature map tensor; flattening the unified motion feature map tensor to obtain a Token matrix, and inputting the Token matrix into the adaptive temporal position encoder to obtain a Token matrix embedded with the adaptive temporal position code; S32, inputting the Token matrix embedded with the adaptive temporal position code into the higher-order motion statistics module HoMST to obtain a motion feature map; S33, inputting the training data set into the backbone network to obtain a feature map; S34, inputting the feature map and the motion feature map together into the spatio-temporal fusion module STFM to obtain fused features; S35, inputting the fused features into the prediction head to obtain the predicted class probabilities; S36, constructing the total loss function from the predicted class probabilities, and adjusting the parameters of the dual-branch spatio-temporal motion fusion network according to the total loss function; and S37, judging whether the set iteration stop condition is met; if so, outputting the trained dual-branch spatio-temporal motion fusion network; otherwise, returning to S31 and training again.
- 5. The weakly supervised group behavior recognition method based on a dual-branch spatio-temporal motion fusion network according to claim 4, wherein S31 specifically comprises the following steps: S311, inputting the training data set into the global motion dynamics module GMDM of the dual-branch spatio-temporal motion fusion network, first estimating the feature vector of the velocity field via the DeepFlow algorithm, and then estimating the feature vector of the acceleration field from the difference of adjacent velocity fields; S312, obtaining the feature vector of the transition field by a differential acceleration method, i.e. from the difference of adjacent acceleration fields; S313, obtaining the feature vector of the pulse field from adjacent transition fields; S314, splicing the feature vectors of the acceleration field, the transition field and the pulse field to obtain the unified motion feature map tensor, wherein Tz is the number of frames in the video clip, H and W respectively denote the height and width of a video frame, and ℝ denotes the set of real numbers; S315, flattening the unified motion feature map tensor to obtain the Token matrix, wherein one dimension of the Token matrix is the splicing dimension of the higher-order motion statistics and the other is the number of Token features in the matrix, and then inputting the Token matrix into the adaptive temporal position encoder to obtain the Token matrix embedded with the adaptive temporal position code.
- 6. The weakly supervised group behavior recognition method based on a dual-branch spatio-temporal motion fusion network according to claim 5, wherein in S311 the feature vector of the acceleration field is expressed as the difference between the velocity-field feature vectors of two adjacent time intervals divided by the time difference, each velocity-field feature vector representing the velocity field from one time instant to the next and belonging to a two-dimensional real vector space; in S312 the feature vector of the transition field is expressed in terms of the acceleration-field feature vectors of two adjacent time intervals; and in S313 the feature vector of the pulse field is expressed in terms of the transition-field feature vectors of two adjacent time intervals (see the sketch following the claims).
- 7. The weakly supervised group behavior recognition method based on a dual-branch spatio-temporal motion fusion network according to claim 6, wherein S34 specifically comprises the following steps: S341, inputting the feature map and the motion feature map into the spatio-temporal fusion module STFM, and passing each through one of the two first convolution layers to output two feature representations of uniform dimension; S342, inputting the two feature representations into the two channel attention mechanisms CA respectively, performing global average pooling along the spatio-temporal dimensions to obtain channel-level statistical descriptors, and generating a channel weight vector for each branch via a bottleneck structure composed of two fully connected layers; S343, multiplying each feature representation by its channel weight vector through a channel-level multiplication operation to obtain the recalibrated features; S344, inputting the recalibrated features into the cross-domain fusion block CFB, converting each of them by depth-separable convolutions into a triple of query, key and value vectors q, k and v, thereby obtaining two triples; computing bidirectional attention from the two triples, splicing the two attention outputs along the channel dimension, and reducing the dimension by a 1×1 convolution to obtain a feature fusing cross-domain semantics, wherein D denotes the dimension of the hidden space; S345, inputting the recalibrated features into the correlation enhancement module CE and performing matrix multiplication to obtain a correlation matrix; S346, splicing the recalibrated features, the feature fusing cross-domain semantics and the correlation matrix in the channel dimension to obtain a spliced feature; S347, inputting the spliced feature into the feature fusion module FFB to obtain the fused features.
- 8. The weakly supervised group behavior recognition method based on a dual-branch spatio-temporal motion fusion network according to claim 7, wherein in S342 each channel weight vector is obtained by applying a global average pooling operation to the feature representation, passing the result through a dimension-reduction weight matrix and a dimension-increase weight matrix with reduction ratio r, with a ReLU activation function between them, and finally applying a Sigmoid operation; in S343 the recalibrated features are obtained by a channel-level multiplication of each feature representation with its channel weight vector; in S344 the two triples are generated by depth-separable convolution layers, one layer each for generating the query vector, the key vector and the value vector of each branch; the bidirectional attention in S344 is computed by applying a normalized exponential function (softmax) to the product of the query vector of one branch and the transposed key vector of the other branch, scaled using the key-vector dimension to prevent gradient explosion, and multiplying the result by the corresponding value vector, wherein T denotes the transpose; the correlation matrix in S345 is obtained by matrix multiplication of the two recalibrated features; the spliced feature in S346 is obtained by a splicing operation in the channel dimension; and the fused features in S347 are obtained by passing the spliced feature through the inverse depth-separable convolution, the depth-separable convolution and the projection convolution of the feature fusion module FFB together with an activation function (see the sketches following the claims).
- 9. The weakly supervised group behavior recognition method based on a dual-branch spatio-temporal motion fusion network according to claim 8, wherein the total loss function in S3 is the weighted sum of a cross-entropy loss, an attention-entropy regularization loss and a theoretical constraint loss based on MCR², with weight coefficients balancing the corresponding losses; the cross-entropy loss is the negative average, over the B video samples and the group behavior categories, of the product of the value of the real label of the i-th sample on the c-th class and the logarithm of the corresponding class probability predicted by the dual-branch spatio-temporal motion fusion network; the attention-entropy regularization loss is computed by applying an entropy function to the assignment probability vector of each Token feature over the attention heads, where the probability value assigned by the j-th Token feature to the k-th attention head is stabilized by a numerical stability constant; the theoretical constraint loss based on MCR² is computed over the K attention heads of the Token statistical self-attention mechanism TSSA, where k denotes the k-th attention head, using a p-dimensional identity matrix, a matrix determinant, the orthogonal projection basis of the k-th attention head, the column vectors of the soft allocation matrix, a normalization factor, and a linear-algebraic operator that converts a vector into a diagonal matrix (see the loss sketch following the claims).
- 10. A weakly supervised group behavior recognition system based on a dual-branch spatio-temporal motion fusion network, characterized in that it performs the weakly supervised group behavior recognition method based on a dual-branch spatio-temporal motion fusion network according to any one of claims 1 to 9.
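The claims describe the overall wiring of the network (claims 1-3): a motion branch (GMDM) and an appearance branch (backbone) feed the spatio-temporal fusion module STFM, whose output goes to a prediction head. The following is a minimal PyTorch skeleton of that wiring only; all sub-module internals are placeholders injected from outside, and the tensor shapes and module interfaces are illustrative assumptions, not the patent's concrete implementation.

```python
import torch
import torch.nn as nn

class DualBranchSTMFNet(nn.Module):
    """Skeleton of the dual-branch spatio-temporal motion fusion network.

    gmdm     : global motion dynamics module (motion branch, claims S31-S32)
    backbone : appearance branch (any video backbone, claim S33)
    stfm     : spatio-temporal fusion module (claim S34)
    head     : prediction head over group-behavior classes (claim S35)
    All four sub-modules are placeholders standing in for claims 2-3.
    """

    def __init__(self, gmdm: nn.Module, backbone: nn.Module,
                 stfm: nn.Module, head: nn.Module):
        super().__init__()
        self.gmdm, self.backbone, self.stfm, self.head = gmdm, backbone, stfm, head

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        motion_feat = self.gmdm(frames)               # motion feature map
        appear_feat = self.backbone(frames)           # appearance feature map
        fused = self.stfm(appear_feat, motion_feat)   # fused features
        return self.head(fused)                       # predicted class scores
```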
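A hedged sketch of the higher-order motion statistics of claims 5-6, assuming the velocity (optical-flow) fields between adjacent frames have already been estimated, e.g. with DeepFlow. The normalization of every difference by the frame interval `dt` and the ordering of the concatenated fields are assumptions; the claims only state that each field is obtained from the difference of the adjacent lower-order fields.

```python
import torch

def higher_order_motion_fields(flow: torch.Tensor, dt: float = 1.0) -> torch.Tensor:
    """Build acceleration, transition and pulse fields from velocity fields.

    flow : tensor of shape (T, 2, H, W) - velocity field between adjacent
           frames (2 channels: horizontal / vertical displacement).
    Returns a tensor of shape (T-3, 6, H, W) that splices, per time step,
    the acceleration, transition and pulse fields along the channel axis.
    """
    acceleration = (flow[1:] - flow[:-1]) / dt                  # difference of adjacent velocity fields
    transition = (acceleration[1:] - acceleration[:-1]) / dt    # difference of adjacent acceleration fields
    pulse = (transition[1:] - transition[:-1]) / dt             # difference of adjacent transition fields
    # Align lengths so the three fields refer to the same time steps, then splice.
    n = pulse.shape[0]
    return torch.cat([acceleration[:n], transition[:n], pulse], dim=1)
```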
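A sketch of the channel attention recalibration of S342-S343 (claims 7-8): global average pooling over the spatio-temporal dimensions, a two-fully-connected-layer bottleneck with reduction ratio r, a Sigmoid, and a channel-level multiplication. This follows the squeeze-and-excitation pattern the claim describes; the 5-D input layout and the default r are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention (CA) with a fully connected bottleneck of ratio r."""

    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.fc_reduce = nn.Linear(channels, channels // r)   # dimension-reduction weight matrix
        self.fc_expand = nn.Linear(channels // r, channels)   # dimension-increase weight matrix

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W); channel-level statistical descriptor via global average pooling
        descriptor = x.mean(dim=(2, 3, 4))                    # (B, C)
        weights = torch.sigmoid(self.fc_expand(torch.relu(self.fc_reduce(descriptor))))
        # channel-level multiplication: recalibrate the input features
        return x * weights.view(x.size(0), -1, 1, 1, 1)
```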
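A sketch of the cross-domain fusion block of S344: depth-separable convolutions turn each branch into query/key/value triples, attention is computed in both directions between the appearance and motion branches, and the two attended outputs are spliced along the channel dimension and reduced by a 1×1 convolution. The √D scaling inside the softmax and the 1-D layout (spatio-temporal positions flattened into a token axis) are assumptions made for this illustration.

```python
import math
import torch
import torch.nn as nn

def depth_separable_conv(channels: int) -> nn.Module:
    """Depthwise followed by pointwise 1-D convolution (depth-separable convolution)."""
    return nn.Sequential(
        nn.Conv1d(channels, channels, kernel_size=3, padding=1, groups=channels),
        nn.Conv1d(channels, channels, kernel_size=1),
    )

class CrossDomainFusionBlock(nn.Module):
    """Bidirectional attention between an appearance branch and a motion branch."""

    def __init__(self, channels: int):
        super().__init__()
        # one depth-separable convolution per q / k / v and per branch
        self.q1, self.k1, self.v1 = (depth_separable_conv(channels) for _ in range(3))
        self.q2, self.k2, self.v2 = (depth_separable_conv(channels) for _ in range(3))
        self.reduce = nn.Conv1d(2 * channels, channels, kernel_size=1)  # 1x1 conv after splicing

    @staticmethod
    def attend(q, k, v):
        # q, k, v: (B, C, N); attention over the N flattened spatio-temporal positions
        scale = math.sqrt(k.size(1))
        attn = torch.softmax(q.transpose(1, 2) @ k / scale, dim=-1)  # (B, N, N)
        return v @ attn.transpose(1, 2)                              # (B, C, N)

    def forward(self, feat_a: torch.Tensor, feat_m: torch.Tensor) -> torch.Tensor:
        # feat_a, feat_m: (B, C, N) recalibrated appearance / motion features
        a1 = self.attend(self.q1(feat_a), self.k2(feat_m), self.v2(feat_m))  # appearance attends to motion
        a2 = self.attend(self.q2(feat_m), self.k1(feat_a), self.v1(feat_a))  # motion attends to appearance
        return self.reduce(torch.cat([a1, a2], dim=1))  # splice channels, reduce with 1x1 conv
```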
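A sketch of the two loss terms of claim 9 that are fully specified in the text: the cross-entropy classification loss and the attention-entropy regularization over each Token's head-assignment probabilities, with a small constant for numerical stability. Whether the entropy is penalized or encouraged is not fixed by the claim wording; here it is simply added as a weighted term. The MCR²-based theoretical constraint term is only summarized in the claim and is not reproduced; it enters the total as a third, weighted term supplied from outside, and the weight values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def attention_entropy_loss(assign_probs: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Entropy regularization over head-assignment probabilities.

    assign_probs: (N_tokens, K) - probability that each Token feature is
    assigned to each of the K attention heads (rows sum to 1).
    """
    entropy = -(assign_probs * torch.log(assign_probs + eps)).sum(dim=-1)
    return entropy.mean()

def total_loss(logits: torch.Tensor, labels: torch.Tensor,
               assign_probs: torch.Tensor, mcr2_term: torch.Tensor,
               lambda_ent: float = 0.1, lambda_mcr: float = 0.1) -> torch.Tensor:
    """Weighted sum of cross-entropy, attention-entropy and an MCR^2-style term.

    Only the weighted-sum structure follows claim 9; the weights and the
    exact form of `mcr2_term` are placeholders for illustration.
    """
    ce = F.cross_entropy(logits, labels)   # averaged over the B video samples
    ent = attention_entropy_loss(assign_probs)
    return ce + lambda_ent * ent + lambda_mcr * mcr2_term
```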
Description
Weak supervision group behavior identification method and system based on double-branch space-time motion fusion network

Technical Field

The invention relates to the technical field of image recognition, and in particular to a weakly supervised group behavior recognition method and system based on a dual-branch spatio-temporal motion fusion network.

Background

The core goal of group behavior recognition is to infer the group activity type of the participants from key group cues in the scene. The task not only has wide application in fields such as sports video analysis, surveillance systems and social scene understanding, but has also gradually become a research hot spot as the demand for intelligence and automation grows. Unlike traditional action recognition, which focuses only on individual behaviors, group behavior recognition must also handle the complex spatio-temporal relationships among the multiple actors in a group and requires a full understanding of the whole scene. This means that group behavior recognition faces more challenges, in particular how to efficiently model and capture the interactions inside the group and between the group and the scene.

Existing group behavior recognition methods typically rely on individual-level and bounding-box annotations to model the spatio-temporal interactions between participants. Specifically, individual features within the bounding boxes of an image are usually extracted by ROIAlign, and the spatio-temporal relationships between individuals are then modeled with a Recurrent Neural Network (RNN), a Graph Neural Network (GNN) or a Transformer. Although these approaches solve the problem of spatio-temporal dependency modeling to some extent, significant limitations remain. First, relying on labeled bounding boxes is time-consuming and labor-intensive: the labeling process requires a large amount of manual intervention, and labeling efficiency on large-scale data sets is low. Second, these methods rely too heavily on the output of a detector, which may produce false alarms or missed detections that affect the final behavior recognition accuracy; in complex scenes in particular, errors in bounding-box detection may distort the recognition results.

To address these problems, Kim et al. proposed a detector-free model that locates and encodes part of the context information by means of the attention mechanism of the Transformer model, thereby capturing the key people and objects involved in the group behavior. This approach no longer relies on traditional bounding-box labeling and attempts to dynamically capture the key actors and objects in the group through global context information. However, it explores the motion features of an activity only by computing local correlations between adjacent frames of the video sequence. This local feature extraction ignores the influence on motion features of camera operations such as focusing, zooming and large-amplitude movement aimed at key individuals during shooting. Thus, while the approach works well in some scenarios, it tends to be less effective in complex dynamic environments, especially when camera motion interference is involved.

Disclosure of Invention

The invention provides a weakly supervised group behavior recognition method and system based on a dual-branch spatio-temporal motion fusion network to solve the technical problems mentioned in the Background.
In order to achieve the above purpose, the technical scheme of the invention is realized as follows. The invention provides a weakly supervised group behavior recognition method based on a dual-branch spatio-temporal motion fusion network, which comprises the following steps: S1, acquiring a plurality of videos containing group behaviors, preprocessing the videos to obtain a plurality of continuous video frame sequences, and constructing a training data set from the plurality of continuous video frame sequences; S2, constructing a dual-branch spatio-temporal motion fusion network comprising a global motion dynamics module GMDM, a backbone network, a spatio-temporal fusion module STFM and a prediction head, wherein the output ends of the global motion dynamics module GMDM and the backbone network are each connected to the input end of the spatio-temporal fusion module STFM, and the output end of the spatio-temporal fusion module STFM is connected to the prediction head; S3, training the dual-branch spatio-temporal motion fusion network with the training data set, constructing a total loss function, and adjusting the dual-branch spatio-temporal motion fusion network according to the total loss function to obtain a trained dual-branch spatio-temporal motion fusion network; and S4, deploying the trained dual-branch spatio-temporal motion fusion network to the device side and carrying out group behavior recognition to obtain a recognition result. Further, the global