CN-121982774-A - Action recognition method of a two-stream spatio-temporal graph convolutional network based on an attention mechanism
Abstract
The application relates to an action recognition method for a two-stream spatio-temporal graph convolutional network based on an attention mechanism. The method comprises: detecting multi-person skeleton key points in an image to be detected and converting them into three-dimensional coordinates; screening core skeleton key points and constructing a dynamic spatio-temporal graph from their spatial connection relations and temporal changes; constructing, on the dynamic spatio-temporal graph, a three-dimensional attention mechanism comprising a temporal attention mechanism, a spatial attention mechanism and a channel attention mechanism; building, on top of the three-dimensional attention mechanism, an ST-GCN network with a two-stream framework of parallel spatial and temporal streams to obtain an action classification model and training it; and inputting the dynamic spatio-temporal graph of the image to be detected into the trained action classification model to obtain the final action classification result. The method adapts effectively to multiple scenes, improves feature representation capability, and markedly improves the accuracy and robustness of action recognition in complex scenes.
Inventors
- Niu Ling
- Shen Qiuhui
- Zhu Bian
- Deng Boyan
- Wang Peng
- Li Helong
Assignees
- Zhoukou Normal University (周口师范学院)
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2026-01-28
Claims (10)
- 1. An action recognition method of a two-stream spatio-temporal graph convolutional network based on an attention mechanism, characterized by comprising the following steps: detecting multi-person skeleton key points in an image to be detected based on the Yolov s and BlazePose algorithms, converting the key points into three-dimensional coordinates, screening core skeleton key points, and constructing a dynamic spatio-temporal graph based on the spatial connection relations and temporal changes of the core skeleton key points; based on the dynamic spatio-temporal graph, constructing a three-dimensional attention mechanism comprising a temporal attention mechanism, a spatial attention mechanism and a channel attention mechanism, wherein the three-dimensional attention mechanism strengthens key action features and suppresses irrelevant interference across the different feature dimensions of the dynamic spatio-temporal graph by generating normalized weights, yielding an attention-enhanced feature matrix; based on the three-dimensional attention mechanism, constructing an ST-GCN network with a two-stream framework of parallel spatial and temporal streams, obtaining an action classification model and training it; and inputting the dynamic spatio-temporal graph of the image to be detected into the trained action classification model to obtain the final action classification result.
- 2. The action recognition method of a two-stream spatio-temporal graph convolutional network based on an attention mechanism according to claim 1, wherein the steps of detecting the multi-person skeleton key points based on the Yolov s and BlazePose algorithms, converting them into three-dimensional coordinates, screening the core skeleton key points, and constructing the dynamic spatio-temporal graph comprise: fusing the Yolov s and BlazePose algorithms, optimizing the restart policy of the detection module through bounding-box updating, and extracting the multi-person skeleton key points; converting the two-dimensional pixel coordinates of the multi-person skeleton key points into three-dimensional space coordinates, and completing missing or misdetected key points; and screening the core skeleton key points, and constructing the dynamic spatio-temporal graph based on their spatial connection relations and temporal sequence.
- 3. The action recognition method of a two-stream spatio-temporal graph convolutional network based on an attention mechanism according to claim 2, wherein fusing the Yolov s and BlazePose algorithms, optimizing the restart policy of the detection module through bounding-box updating, and extracting the multi-person skeleton key points comprises: performing person detection on the input video frame with the Yolov s algorithm, and outputting an initial bounding box for each target person in the coordinate format (x1, y1, x2, y2), where (x1, y1) is the top-left pixel coordinate of the bounding box and (x2, y2) is the bottom-right pixel coordinate; using the lightweight BlazePose face detector, shrinking the initial bounding box inward by 10% of its pixels in both the width and height directions to determine the pose candidate region, thereby avoiding edge interference; detecting the pose candidate region of the first frame image with the BlazePose pose tracker, and outputting the initial skeleton key point coordinates; computing the offset vector (Δx, Δy) of each human body between adjacent frames, updating the bounding-box position from that vector, and restarting the face detection module only when the absolute value of the offset vector exceeds 30 pixels, thereby reducing redundant detections; and feeding the updated single-person bounding boxes one by one into MediaPipe Pose to complete multi-person skeleton key point extraction.
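A minimal sketch of the bounding-box handling in claim 3. The function names (`shrink_box`, `update_box`) and the tuple box format are illustrative; the 10% shrink ratio and the 30-pixel restart threshold are taken from the claim:

```python
def shrink_box(x1, y1, x2, y2, ratio=0.10):
    """Shrink a detection box inward by `ratio` of its width and height
    on each side to form the pose candidate region (avoids edge noise)."""
    dw, dh = (x2 - x1) * ratio, (y2 - y1) * ratio
    return x1 + dw, y1 + dh, x2 - dw, y2 - dh

def update_box(box, dx, dy, restart_threshold=30):
    """Translate the tracked box by the inter-frame offset (dx, dy).
    Returns (new_box, restart): restart=True signals that the offset is
    large enough that the person detector should be re-run."""
    x1, y1, x2, y2 = box
    moved = (x1 + dx, y1 + dy, x2 + dx, y2 + dy)
    restart = abs(dx) > restart_threshold or abs(dy) > restart_threshold
    return moved, restart
```

Restarting detection only on large offsets keeps per-frame cost low: most frames reuse the translated box, and the full detector runs only when the tracked person moves abruptly.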
- 4. The action recognition method of a two-stream spatio-temporal graph convolutional network based on an attention mechanism according to claim 2, wherein the three-dimensional space coordinates are obtained by the conversion formula: X = (u − c_x) · Z / f_x, Y = (v − c_y) · Z / f_y, where X is the abscissa of the three-dimensional space coordinate, Y is the ordinate of the three-dimensional space coordinate, Z is the vertical (depth) coordinate of the three-dimensional space coordinate, u is the two-dimensional pixel abscissa of the multi-person skeleton key point, v is the two-dimensional pixel ordinate of the multi-person skeleton key point, c_x and c_y are the abscissa and ordinate of the camera principal point, and f_x and f_y are the focal lengths along the camera's horizontal and vertical axes.
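The formula in claim 4 matches the standard pinhole back-projection given its variable list (principal point, per-axis focal lengths). A sketch under that assumption; the function name and the source of the depth value Z (e.g. a pose estimator's depth estimate) are illustrative:

```python
def pixel_to_camera(u, v, depth, fx, fy, cx, cy):
    """Back-project a 2-D pixel (u, v) with depth Z into 3-D camera
    coordinates using the pinhole model:
        X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy,  Z = depth."""
    X = (u - cx) * depth / fx
    Y = (v - cy) * depth / fy
    return X, Y, depth
```

A pixel at the principal point maps onto the optical axis (X = Y = 0), which is a quick sanity check for the intrinsics.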
- 5. The action recognition method of a two-stream spatio-temporal graph convolutional network based on an attention mechanism according to claim 2, wherein screening the core skeleton key points and constructing the dynamic spatio-temporal graph based on their spatial connection relations and temporal sequence comprises: screening, from the 33 original BlazePose key points, 17 core skeleton key points strongly correlated with action recognition, namely the nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle and right ankle; and taking the core skeleton key points as spatial nodes, constructing spatial connection edges according to the natural connection relations of the human skeleton, and taking the key point coordinate sequence of 16 consecutive frames as the temporal dimension to construct the dynamic spatio-temporal graph.
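The graph construction in claim 5 can be sketched as follows. The joint ordering (COCO-style) and the exact edge set are assumptions for illustration; the claim specifies only that edges follow the natural connections of the human skeleton and that 16 consecutive frames form the temporal dimension:

```python
import numpy as np

# Assumed ordering of the 17 core keypoints (COCO-style):
# 0 nose, 1 l-eye, 2 r-eye, 3 l-ear, 4 r-ear, 5 l-shoulder, 6 r-shoulder,
# 7 l-elbow, 8 r-elbow, 9 l-wrist, 10 r-wrist, 11 l-hip, 12 r-hip,
# 13 l-knee, 14 r-knee, 15 l-ankle, 16 r-ankle
EDGES = [(0, 1), (0, 2), (1, 3), (2, 4), (5, 6), (5, 7), (7, 9),
         (6, 8), (8, 10), (5, 11), (6, 12), (11, 12), (11, 13),
         (13, 15), (12, 14), (14, 16)]

def build_graph(keypoints):
    """keypoints: (T, N, 3) array of N=17 joints over T=16 frames.
    Returns (features, adjacency): node features plus the symmetric
    N x N spatial adjacency with self-loops; consecutive frames are
    linked implicitly via the temporal axis of `features`."""
    T, N, _ = keypoints.shape
    A = np.eye(N)
    for i, j in EDGES:
        A[i, j] = A[j, i] = 1.0
    return keypoints, A
```

The ST-GCN layers then convolve node features against this adjacency in the spatial dimension and along the 16-frame axis in the temporal dimension.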
- 6. The action recognition method of a two-stream spatio-temporal graph convolutional network based on an attention mechanism according to claim 1, wherein constructing the three-dimensional attention mechanism comprising the temporal attention mechanism, the spatial attention mechanism and the channel attention mechanism based on the dynamic spatio-temporal graph comprises: setting a temporal-dimension average pooling layer with a 3×1 pooling kernel and stride 1, building a one-dimensional convolution layer with kernel size 3 and 1 output channel, and attaching a Sigmoid activation layer to generate a temporal attention weight of dimension 1×1 with values in the range 0 to 1; setting a spatial-dimension average pooling layer with a 1×3 pooling kernel and stride 1, building a one-dimensional convolution layer with kernel size 3 and 1 output channel, and attaching a Sigmoid activation layer to generate a spatial attention weight of dimension 1×N with values in the range 0 to 1; and setting a global average pooling layer that compresses the input features into a global feature of dimension 1×C, adaptively determining the 1-D convolution kernel size by the formula k = log2(C) + 1, building a 1-D convolution layer to capture cross-channel local interactions, attaching a Sigmoid activation layer to generate channel attention weights of dimension 1×C with values in the range 0 to 1, and designing a feature-weighted fusion interface to obtain the channel attention mechanism.
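A numpy sketch of the channel branch in claim 6 (global average pool, adaptive 1-D convolution, Sigmoid weighting). The fixed averaging kernel stands in for learned convolution weights, purely for illustration; the adaptive kernel size k = log2(C) + 1 comes from the claim, with rounding to odd assumed so the convolution stays centered:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, kernel=None):
    """Channel attention sketch for features x of shape (C, T, N).
    1. Global average pool each channel to a scalar.
    2. Run a 1-D conv across channels (kernel size k = log2(C) + 1,
       rounded up to odd) to model local cross-channel interaction.
    3. Sigmoid -> per-channel weights in (0, 1), broadcast onto x."""
    C = x.shape[0]
    if kernel is None:
        k = int(np.log2(C)) + 1
        k += (k + 1) % 2                         # keep kernel size odd
        kernel = np.full(k, 1.0 / k)             # illustrative fixed weights
    g = x.mean(axis=(1, 2))                      # (C,) global features
    interact = np.convolve(g, kernel, mode="same")  # cross-channel 1-D conv
    w = sigmoid(interact)                        # (C,) weights in (0, 1)
    return x * w[:, None, None]                  # element-wise weighting
```

The temporal and spatial branches follow the same pool → 1-D conv → Sigmoid → multiply pattern, pooling over the time and joint axes respectively.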
- 7. The action recognition method of a two-stream spatio-temporal graph convolutional network based on an attention mechanism according to claim 6, wherein constructing the ST-GCN network with the two-stream framework of parallel spatial and temporal streams based on the three-dimensional attention mechanism, obtaining the action classification model and training it comprises: building a basic feature-extraction link from a first GCN layer, a first BN layer and a first ReLU activation layer in sequence, embedding the three-dimensional attention mechanism in the order of a temporal attention layer, a spatial attention layer and a channel attention layer, and finally attaching a second BN layer, a second ReLU activation layer and a Dropout layer in sequence, obtaining a module layer; constructing an ST-GCN basic unit consisting, in sequence, of a third BN layer, 9 module layers, a global average pooling layer, a fully connected layer and a Softmax activation layer, wherein the output channels of module layers 1 to 3 are 64, of layers 4 to 6 are 128, and of layers 7 to 9 are 256; constructing an independent front-end GCN layer for each of the spatial stream and the temporal stream, and attaching in sequence the ST-GCN basic unit, an equal-weight fusion layer and an argmax output layer to obtain the action classification model, wherein the equal-weight fusion layer performs equal-weight fusion of the two probability scores output by the ST-GCN basic units, and the argmax output layer outputs the action class with the largest probability score as the action classification result; and training the action classification model on a training data set.
- 8. The action recognition method of a two-stream spatio-temporal graph convolutional network based on an attention mechanism according to claim 7, wherein inputting the dynamic spatio-temporal graph of the image to be detected into the trained action classification model to obtain the final action classification result comprises: inputting the dynamic spatio-temporal graph of the image to be detected into the trained action classification model, and extracting the spatially correlated features and temporally correlated features of the core skeleton key points in the dynamic spatio-temporal graph through the spatial-stream front-end GCN layer and the temporal-stream front-end GCN layer, respectively; passing the spatially correlated features and the temporally correlated features in sequence through the third BN layer, the 9 module layers, the global average pooling layer, the fully connected layer and the Softmax activation layer of the ST-GCN basic unit to obtain a probability score for each; and fusing the probability score of the spatially correlated features and that of the temporally correlated features in the equal-weight fusion layer with a weight of 0.5 each to obtain a fused probability score, and outputting, through the argmax output layer, the action class with the largest fused probability score as the action classification result.
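The fusion step in claim 8 reduces to a few lines; the 0.5/0.5 weights are specified in the claim, and the function name is illustrative:

```python
import numpy as np

def fuse_and_classify(p_spatial, p_temporal):
    """Equal-weight (0.5 / 0.5) fusion of the two streams' probability
    scores, followed by argmax to select the final action class."""
    fused = 0.5 * np.asarray(p_spatial) + 0.5 * np.asarray(p_temporal)
    return int(np.argmax(fused)), fused
```

Fusing at the probability level (late fusion) lets each stream specialize, with disagreement between streams resolved by whichever class both assign the most mass to.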
- 9. The action recognition method of a two-stream spatio-temporal graph convolutional network based on an attention mechanism according to claim 8, wherein the third BN layer is configured to batch-normalize the features output by the spatial-stream and temporal-stream front-end GCN layers, respectively, eliminating differences in feature distribution; each module layer comprises a first GCN layer, a first BN layer, a first ReLU activation layer, a temporal attention layer, a spatial attention layer, a channel attention layer, a second BN layer, a second ReLU activation layer and a Dropout layer, which in sequence perform preliminary spatio-temporal feature extraction, feature normalization and nonlinear activation, key-feature strengthening across the three attention dimensions, secondary normalization and activation, and random-dropout suppression of overfitting, realizing stepwise optimization and strengthening of the features; the global average pooling layer performs global average pooling over the 256 channel features output by the 9 module layers, compressing the T×N spatio-temporal features of each channel into one scalar and outputting a 256×1 global feature vector, eliminating spatio-temporal redundancy; the fully connected layer maps the 256-dimensional global feature vector output by the global average pooling layer to an M-dimensional raw action-class score vector through a weight matrix, completing the dimension reduction and mapping from high-dimensional features to class dimensions; and the Softmax activation layer normalizes the M-dimensional raw scores output by the fully connected layer into probability scores in the interval 0 to 1, with all class probabilities summing to 1.
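The Softmax normalization described at the end of claim 9 is the standard one; a numerically stable sketch:

```python
import numpy as np

def softmax(scores):
    """Normalize raw class scores into probabilities in (0, 1) that
    sum to 1, as the Softmax activation layer in the claim does."""
    z = np.asarray(scores, dtype=float)
    z = z - z.max()          # shift for numerical stability (same result)
    e = np.exp(z)
    return e / e.sum()
```

Subtracting the maximum before exponentiating leaves the output unchanged but prevents overflow for large raw scores.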
- 10. The action recognition method of a two-stream spatio-temporal graph convolutional network based on an attention mechanism according to claim 9, wherein the temporal attention layer performs temporal-dimension average pooling on the dynamic spatio-temporal graph features of the image to be detected, transforms the features through a one-dimensional convolution layer to extract key temporal information, maps the transformed features to the interval 0 to 1 through a Sigmoid activation function to generate the temporal attention weight, multiplies the temporal attention weight element-wise with the dynamic spatio-temporal graph features to strengthen key-frame features and suppress interference from irrelevant frames, then adds a residual connection, summing the dynamic spatio-temporal graph features with the weighted features to generate the temporal-attention-enhanced output features; the spatial attention layer performs spatial-dimension average pooling on the temporal-attention-enhanced output features, extracts spatially correlated features through a one-dimensional convolution layer, generates the spatial attention weight through a Sigmoid activation function, multiplies it element-wise with the temporal-attention-enhanced output features to strengthen core skeleton key point features and suppress interference from secondary skeleton key points, then adds a residual connection, summing the temporal-attention-enhanced output features with the weighted features to obtain the spatial-attention-enhanced features; and the channel attention layer performs global average pooling on the spatial-attention-enhanced features, compressing them along the channel dimension into a global feature of dimension 1×C, computes the one-dimensional convolution kernel size k from the channel number C, extracts cross-channel local interactions through the one-dimensional convolution layer, generates the channel attention weights through a Sigmoid activation function, and multiplies them element-wise with the spatial-attention-enhanced features to complete the feature enhancement.
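The weight-then-residual pattern that claim 10 applies in the temporal and spatial attention layers reduces to one expression; the function name is illustrative:

```python
import numpy as np

def apply_attention_with_residual(x, w):
    """Element-wise weighting of features x by attention weights w in
    (0, 1), followed by the residual connection from the claim:
    out = x + x * w. The residual preserves the original signal even
    where the attention weight is near zero."""
    return x + x * w
```

Because w lies in (0, 1), each feature is scaled between 1x (w = 0) and 2x (w = 1) of its original value, so attention can only emphasize, never erase, the input.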
Description
Action recognition method of a two-stream spatio-temporal graph convolutional network based on an attention mechanism

Technical Field

The application relates to the technical field of computer vision and deep learning, and in particular to an action recognition method of a two-stream spatio-temporal graph convolutional network based on an attention mechanism.

Background

With the rapid development of computer vision technology, human action recognition, as one of its core research directions, has broad application prospects in intelligent security, sports action analysis, medical rehabilitation monitoring, smart-home interaction and other fields. The core requirement of human action recognition is to accurately capture the spatio-temporal variation characteristics of human poses and classify them. Recognition based on skeleton key points is the current mainstream technical path, because it effectively avoids interference from clothing, background and the like, and focuses directly on the core structural information of human motion. In multi-person collaborative scenes in particular, such as public-place behavior monitoring and team sports analysis, higher demands are placed on the real-time performance, robustness and accuracy of action recognition, and a technical scheme that can efficiently process multi-person skeleton information and accurately extract action features is needed. In the prior art, methods based on graph convolutional networks (GCN) are the classical framework for skeleton action recognition: features are extracted by constructing a spatio-temporal graph structure over skeleton nodes, achieving a certain recognition effect.
Meanwhile, skeleton key point detection mostly relies on a single pose estimation algorithm, such as BlazePose or MediaPipe Pose, and feature enhancement is mostly realized with a single-dimension attention mechanism. However, the prior art adapts poorly to multi-person scenes: traditional pose estimation algorithms are designed mainly for a single person, so when applied directly to multi-person scenes they are prone to missed detections and key point confusion, while the computational complexity of the GCN grows linearly with the number of people, limiting scalability. Feature characterization precision is also insufficient: most methods recognize from two-dimensional skeleton key points, suffering recognition errors under viewpoint changes and struggling to accurately reflect the three-dimensional spatial structure of human actions. Feature extraction robustness is likewise insufficient: a single-dimension attention mechanism cannot comprehensively capture the temporal dynamics, spatial associations and channel feature differences of actions, making it difficult to effectively strengthen key action information and suppress background interference. Accordingly, there is a need to improve one or more of the problems in the related art described above. It should be noted that the information disclosed in this background section is only for enhancing understanding of the background of the present disclosure, and may therefore include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention

It is an aim of embodiments of the present disclosure to provide an action recognition method for a two-stream spatio-temporal graph convolutional network based on an attention mechanism, which overcomes, at least to some extent, one or more problems due to limitations and disadvantages of the related art. The application provides an action recognition method of a two-stream spatio-temporal graph convolutional network based on an attention mechanism, comprising the following steps: detecting multi-person skeleton key points in an image to be detected based on the Yolov s and BlazePose algorithms, converting the key points into three-dimensional coordinates, screening core skeleton key points, and constructing a dynamic spatio-temporal graph based on the spatial connection relations and temporal changes of the core skeleton key points; based on the dynamic spatio-temporal graph, constructing a three-dimensional attention mechanism comprising a temporal attention mechanism, a spatial attention mechanism and a channel attention mechanism, wherein the three-dimensional attention mechanism strengthens key action features and suppresses irrelevant interference across the different feature dimensions of the dynamic spatio-temporal graph by generating normalized weights, yielding an attention-enhanced feature matrix; based on the three-dimensional attention mechanism, constructing an ST-GCN network with a two-stream network frame