
CN-115546885-B - Action recognition method and system based on enhanced space-time characteristics


Abstract

The invention discloses an action recognition method and system based on enhanced spatio-temporal features. The method comprises the following steps: determining a data set; acquiring input data; realizing channel-level motion information enhancement; realizing spatio-temporal feature aggregation; performing multi-branch output; and designing the model training details. By sparsely sampling the video, the method removes redundant information and acquires information over different time spans. Channel-level motion information enhancement assigns a weight to each feature channel of the feature map, promoting beneficial information and suppressing interfering information, and laying a foundation for subsequent feature extraction. Spatio-temporal feature aggregation models spatio-temporal context information at low computational cost and fuses the features of adjacent frames in an adaptive manner. Multi-branch output further extracts high-level local features of the video, supplementing the global features extracted by the single output layer of the original backbone network.

Inventors

  • Xu Chao
  • Liu Xiaochao
  • Meng Zhaopeng
  • Hu Jing
  • Xiao Jian

Assignees

  • Tianjin University (天津大学)

Dates

Publication Date
2026-05-12
Application Date
2021-06-10

Claims (7)

  1. An action recognition method based on enhanced spatio-temporal features, characterized by comprising the following steps:
S1, determining a data set;
S2, acquiring input data: dividing a video V into T segments by sparse sampling, randomly sampling one frame from each segment, and cropping each frame to a uniform size, the input obtained from one video ultimately being represented as X ∈ R^(T×C×H×W), wherein T is the size of the time dimension, C is the number of channels, and H and W are respectively the height and width of a cropped frame;
S3, realizing channel-level motion information enhancement, specifically comprising the following steps:
S31, passing the input data X through a 1×1 2D convolutional layer to obtain a feature map F ∈ R^(T×C/r×H×W), wherein r = 16;
S32, splitting F along the T dimension to obtain the feature map F_t corresponding to the t-th frame;
S33, feeding F_(t+1) into a 3×3 2D convolutional layer K and taking the difference between the feature maps of the two adjacent frames as the motion feature at time t, M_t = K(F_(t+1)) − F_t; the motion feature M_T corresponding to time T is copied directly from the feature map F_T of the T-th frame; all M_t are concatenated along the T dimension to obtain the motion feature map M;
S34, compressing the global spatial information of M into a channel descriptor d using a global average pooling layer;
S35, passing d through another 1×1 2D convolutional layer to restore the number of channels to C, and computing the weight w of each channel with a sigmoid activation function, the range of the original weights being extended from (0, 1) to (−1, 1);
S36, multiplying the weights and the input feature map channel by channel to obtain the output of the channel-level motion information enhancement module, Y = w ⊙ X, wherein ⊙ denotes channel-wise multiplication;
S4, realizing spatio-temporal feature aggregation, specifically comprising the following steps:
S41, reshaping the feature map Y to the shape (H×W, C, T);
S42, applying a channel-level 1D convolution with kernel size 3 along the time dimension T, wherein the convolution kernel weights are learned independently for each channel;
S5, performing multi-branch output;
S6, designing the model training details. (Illustrative code sketches of steps S2-S4 follow the claims.)
  2. The method of claim 1, wherein in S1 the Something-Something V1 dataset is selected, which comprises daily actions of interaction with common objects; V1 comprises 108499 video clips and 174 action categories.
  3. The method of claim 1, wherein in S5 the multi-branch output uses 3 classification branch outputs, and the 3 classification branch outputs are trained together to obtain the final classification result.
  4. The method of action recognition based on enhanced spatio-temporal features according to claim 3, characterized in that the 3 classification branch outputs comprise: a first output O1, obtained through the ordinary output layer structure of the residual neural network; a second output O2, obtained by passing the feature map F through the output layer; and a third output O3, obtained by opening a branch after the global maximum pooling layer Max, where every N responses, after being averaged by a cross-channel pooling layer, are naturally divided into C classes; wherein the global maximum pooling layer Max obtains the part with the largest response among all features, N is the number of local features extracted for each class, C is the total number of video classes, and the output layer consists of a 1×1 2D convolutional layer, a global average pooling layer, a fully connected layer and a loss function layer.
  5. The method of action recognition based on enhanced spatio-temporal features of claim 4, wherein the specific steps of obtaining the feature map F are as follows: S51, performing bilinear upsampling on the output feature map of the fourth EST block in the ResNet network, and concatenating it with the output feature map of the third EST block along the channel dimension to obtain the feature map F; S52, using a 1×1 2D convolutional layer as a fine-grained feature extractor, the number of convolution kernels being set to N×C, wherein N is the number of local features extracted for each class and C is the total number of video classes; S53, adopting the global maximum pooling layer Max to obtain the part with the largest response among all features as the local features extracted by the network.
  6. The method of action recognition based on enhanced spatio-temporal features of claim 4, wherein the outputs O1, O2 and O3 all adopt the standard cross-entropy loss function, respectively defined as L1, L2 and L3; to facilitate each output learning separately, the final loss used for training is the direct sum of the three losses, L = L1 + L2 + L3. (A sketch of the third branch and this summed loss follows the claims.)
  7. An action recognition system based on enhanced spatio-temporal features, which performs the action recognition method based on enhanced spatio-temporal features of any one of claims 1-6, comprising: an input module for processing video data, performing sparse sampling on the video data and sampling frames over different time spans; a channel-level motion information enhancement module for computing the motion features of the feature map and assigning a weight to each channel according to the richness of its motion information using an attention network; a spatio-temporal feature aggregation module for modeling the spatio-temporal context after the motion information is enhanced, fusing the features of adjacent frames in an adaptive manner; and a multi-branch output module for extracting high-level local features from the video, supplementing the global features extracted by the single output layer of the original backbone network.
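The following is a minimal PyTorch-style sketch of the sparse sampling in step S2 of claim 1; it is not taken from the patent. The function name, the (num_frames, C, H, W) tensor layout, and the center crop (training would typically use random cropping) are assumptions made for illustration.

```python
import torch

def sparse_sample(video, num_segments=8, crop=224):
    """Divide a decoded clip into num_segments equal segments and
    randomly pick one frame from each (claim 1, step S2).
    video: (num_frames, C, H, W); assumes num_frames >= num_segments."""
    seg_len = video.shape[0] // num_segments
    # one random offset per segment, so frames span the whole video
    idx = torch.arange(num_segments) * seg_len + torch.randint(0, seg_len, (num_segments,))
    frames = video[idx]                                   # (T, C, H, W)
    # crop every frame to a uniform size (center crop for brevity)
    h, w = frames.shape[-2:]
    top, left = (h - crop) // 2, (w - crop) // 2
    return frames[..., top:top + crop, left:left + crop]
```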
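Next, a sketch of the channel-level motion information enhancement of step S3, assuming a batch-free (T, C, H, W) input; module and variable names are hypothetical, and the claim's channel-wise multiplication is implemented literally (some related designs add a residual connection instead).

```python
import torch
import torch.nn as nn

class MotionEnhance(nn.Module):
    """Channel-level motion information enhancement (claim 1, step S3)."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.squeeze = nn.Conv2d(channels, channels // r, kernel_size=1)     # S31
        self.motion = nn.Conv2d(channels // r, channels // r, 3, padding=1)  # S33
        self.expand = nn.Conv2d(channels // r, channels, kernel_size=1)      # S35
        self.pool = nn.AdaptiveAvgPool2d(1)                                  # S34

    def forward(self, x):                          # x: (T, C, H, W)
        f = self.squeeze(x)                        # (T, C/r, H, W)
        # M_t = conv(F_{t+1}) - F_t for t < T; M_T is copied from F_T (S33)
        m = torch.cat([self.motion(f[1:]) - f[:-1], f[-1:]], dim=0)
        d = self.pool(m)                           # channel descriptor (S34)
        w = 2 * torch.sigmoid(self.expand(d)) - 1  # weights stretched to (-1, 1) (S35)
        return x * w                               # channel-wise reweighting (S36)
```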
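A sketch of the spatio-temporal feature aggregation of step S4: the (T, C, H, W) map is reshaped so that T becomes the sequence axis, and a depthwise (groups=C) 1D convolution with kernel size 3 learns per-channel weights for fusing adjacent frames, in contrast to a hand-set shift.

```python
import torch
import torch.nn as nn

class TemporalAggregate(nn.Module):
    """Channel-level temporal convolution over T (claim 1, steps S41-S42)."""
    def __init__(self, channels):
        super().__init__()
        # one kernel-size-3 filter per channel; padding keeps T unchanged
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              padding=1, groups=channels)

    def forward(self, x):                                  # x: (T, C, H, W)
        t, c, h, w = x.shape
        y = x.permute(2, 3, 1, 0).reshape(h * w, c, t)     # S41: (H*W, C, T)
        y = self.conv(y)                                   # S42: per-channel conv over T
        return y.reshape(h, w, c, t).permute(3, 2, 0, 1)   # back to (T, C, H, W)
```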
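Finally, a sketch of the third classification branch of claims 4-5 together with the summed loss of claim 6. The class-major ordering of the N×C response maps is an assumption, as is every identifier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalFeatureBranch(nn.Module):
    """Third output branch (claims 4-5): a 1x1 conv yields N*C fine-grained
    response maps (S52), a global max pool keeps the strongest response of
    each map (S53), and cross-channel (average) pooling folds every N
    consecutive responses into one class score."""
    def __init__(self, in_ch, n_per_class, num_classes):
        super().__init__()
        self.n, self.c = n_per_class, num_classes
        self.extract = nn.Conv2d(in_ch, n_per_class * num_classes, kernel_size=1)

    def forward(self, feat):                         # feat: (B, in_ch, H, W)
        r = self.extract(feat)                       # (B, N*C, H, W)
        r = F.adaptive_max_pool2d(r, 1).flatten(1)   # global max pooling Max
        return r.view(-1, self.c, self.n).mean(dim=2)  # cross-channel pooling -> (B, C)

# Training loss (claim 6): the three branch losses are summed directly, e.g.
#   loss = sum(F.cross_entropy(o, y) for o in (o1, o2, o3))
```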

Description

Action recognition method and system based on enhanced space-time characteristics

Technical Field

The invention belongs to the technical field of computer vision and video classification, and particularly relates to an action recognition method and system based on enhanced spatio-temporal features.

Background

Recognizing actions in video is one of the most important problems in computer vision. Because semantic information can be extracted from video, action recognition has rich application prospects, such as patient monitoring, motion analysis, intelligent video surveillance and human-computer interaction. The semantic information extracted by action recognition can also provide video features for other computer vision tasks, such as action detection and localization.

Because a video is essentially a stack of images along the time dimension, video has a natural link to images, and early action recognition methods were generally modified and extended from image recognition methods, such as shallow high-dimensional encodings of local spatio-temporal features, including HOG3D and SIFT-3D. With the rapid development of deep learning, CNNs have achieved good results in image recognition, but directly applying a 2D CNN to video yields poor action recognition results: a 2D CNN can only classify each frame of the input video separately and aggregate the per-frame results, ignoring temporal information, whereas the motion cues contained between frames along the time dimension are critical to video classification.

To extract motion features along the time dimension, 2D networks usually adopt a two-stream structure in which the two streams extract spatial and temporal features respectively: the spatial branch takes RGB frames as input, and the temporal branch takes optical flow as input, since optical flow is defined by the instantaneous displacement vectors of pixels and therefore contains motion information; the classification scores of the two branches are finally fused to obtain the result. However, once optical flow is used, it must be computed offline and stored, and the network cannot be trained end to end, which is the drawback of the two-stream structure.

To extract temporal features end to end, 3D convolution methods have been proposed. Unlike 2D convolution, 3D convolution takes stacked video frames as input and extracts spatio-temporal features from them directly with 3D convolution kernels. Although extracting spatio-temporal features with 3D convolution is natural and easy to understand, expanding the convolution kernel from 2D to 3D increases the amount of computation and requires more computational resources than 2D convolution. Besides convolution, motion information extraction modules may also be used to obtain motion features.
The shift operation is widely used in motion information extraction modules. Specifically, it moves part of the channels of the feature map along the time dimension, so that information is exchanged between adjacent frames and spatio-temporal features are extracted at low cost (a minimal sketch of such a shift appears at the end of this section). The shift operation can, however, be viewed as a channel-level temporal convolution with fixed kernel parameters: because its parameters are set manually and are not updated adaptively, its flexibility is limited.

Disclosure of Invention

In view of the advantages and disadvantages of the main algorithms in the current action recognition field, the main purpose of the invention is to provide an action recognition method and system based on enhanced spatio-temporal features, which removes redundant information and acquires information over different time spans by sparsely sampling the video; assigns a weight to each feature channel of the feature map through channel-level motion information enhancement, promoting beneficial information and suppressing interfering information and laying a foundation for subsequent feature extraction; models spatio-temporal context information at low computational cost through spatio-temporal feature aggregation, fusing the features of adjacent frames in an adaptive manner; and further extracts high-level local features of the video through multi-branch output, supplementing the global features extracted by the single output layer of the original backbone network.

To achieve the aim of the invention, the following technical scheme is adopted. The invention discloses an action recognition method based on enhanced spatio-temporal features, comprising the following steps: S1, determining a data set
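For contrast with the adaptive channel-level temporal convolution of the invention, here is a minimal sketch of the fixed temporal shift described in the Background; the (T, C, H, W) layout and the fold_div=8 split are assumptions. By construction the shift is a channel-level temporal convolution whose kernels are frozen, which is exactly the inflexibility the text points out.

```python
import torch

def temporal_shift(x, fold_div=8):
    """Shift 1/fold_div of the channels one step back in time and
    another 1/fold_div one step forward; the rest stay in place.
    Equivalent to a channel-level temporal convolution whose kernels
    are fixed to [0, 0, 1], [1, 0, 0] or [0, 1, 0]."""
    t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]                   # future frame -> current
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]   # past frame -> current
    out[:, 2 * fold:] = x[:, 2 * fold:]              # untouched channels
    return out
```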