CN-121982772-A - Skeleton action recognition method based on partition space-time collaborative graph convolution network
Abstract
The invention discloses a skeleton action recognition method based on a partition space-time collaborative graph convolution network, relating to the technical field of computer vision and pattern recognition. The method comprises: first obtaining human skeleton sequence data through a depth sensor or a pose estimation algorithm and preprocessing it; then constructing a partition association graph, dividing part partitions based on the natural structure of the human body and establishing a stable intra-part topology and a cross-part semantic topology; then introducing a part dynamic collaborative convolution (PDC-GC) module, which cooperatively extracts the global semantics and local dynamic characteristics of actions through part-association modeling and dynamic neighborhood convolution; then introducing a space-time subdomain encoding (STSE) module, which captures multi-scale space-time dependencies by dividing four subdomains combined with sparse encoding; and finally constructing a partition space-time collaborative graph convolution network (PC-GCN) that fuses the PDC-GC module, the STSE module, and multi-level features to realize high-precision skeleton action recognition.
Inventors
- HUANG QIAN
- He Zimeng
- LI CHANG
- DONG ZHUANG
- MAO YINGCHI
- SONG CHUNYU
- ZENG JUN
Assignees
- Hohai University (河海大学)
- Nanjing Haixing Power Grid Technology Co., Ltd. (南京海兴电网技术有限公司)
Dates
- Publication Date
- 20260505
- Application Date
- 20260128
Claims (10)
- 1. A skeleton action recognition method based on a partition space-time collaborative graph convolution network, characterized by comprising the following steps: S1, acquiring human skeleton sequence data of different individuals in different scenes through a sensor or a pose estimation algorithm, the data comprising the space-time position information of each joint of the human body; standardizing the acquired skeleton sequences; dividing the processed data set into a training set and a test set according to a preset proportion; and labeling each sequence with its action category; S2, constructing a partition association graph: the human skeleton is defined to contain V joint points whose vertex set contains all nodes; the skeleton is divided into K functional parts, the joint subset of each part being defined so that the union of all subsets equals the full vertex set and the subsets are pairwise disjoint (the i-th and j-th joint subsets share no elements, their intersection being the empty set); intra-part partitions and cross-part partitions are created from the divided parts, and an intra-part topology edge set and a cross-part topology edge set are constructed, both containing node self-connections, intra-part centrifugal edges, and centripetal edges; the intra-part and cross-part topology edge sets are stacked to obtain the final edge set, which is combined with the vertex set to obtain the adjacency matrix of the partition association graph for feature extraction; S3, constructing a part dynamic collaborative convolution module, in which the feature of each part p is obtained by aggregating joint features per part based on the partition association graph; the correlation strength of each part pair is then modeled by a part-difference function Diff(·) and a similarity function Sim(·) and fused by a multi-layer perceptron into an overall part correlation graph R; attention-refined convolution is performed with R to obtain the global feature F, and a dynamic neighborhood graph is constructed in feature space through a K-nearest-neighbor algorithm, whose aggregation yields the local feature Y; S4, constructing a space-time subdomain encoding module, in which the time dimension T is divided into continuous Local(m) frames and fixed-stride Global(n) frames, which are combined with the intra-part partitions and cross-part partitions to form four subdomains, namely intra-part local, cross-part local, intra-part global, and cross-part global; the feature dimensions of each subdomain are reshaped, space-time correlations are extracted with multi-head self-attention under an introduced sparse mask, and finally the deviation features of a single-layer GCN and a temporal convolution layer are combined to obtain the output of the space-time subdomain encoding module; S5, constructing the partitioned space-time collaborative graph convolution network, whose body is formed by stacking 10 space-time graph convolution (STGC) blocks; each STGC block comprises a part dynamic collaborative convolution module, a space-time subdomain encoding module, and a residual connection; the output channel numbers of the STGC blocks are 64, 128, 256, and 256 in sequence; the output features of the 4th, 7th, and 10th STGC blocks are fused by a multi-level feature fusion module; the model is trained with an SGD (stochastic gradient descent) optimizer, using a warm-up strategy for the first 5 epochs, a learning rate adjusted by cosine annealing, cross entropy as the loss function, and added L2 regularization; S6, preprocessing the test set, or a skeleton sequence from a real scene, as in step S1; inputting the preprocessed skeleton sequence into the trained partition space-time collaborative graph convolution network; extracting features through the part dynamic collaborative convolution module and the space-time subdomain encoding module; outputting action category probabilities through the multi-level fusion and classification layer; and taking the category with the maximum probability as the final recognition result.
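The training schedule in step S5 (warm-up over the first 5 epochs, then cosine annealing) can be sketched as a per-epoch learning-rate function. The total epoch count and base learning rate below are illustrative assumptions, not values given in the claim.

```python
import math

def lr_schedule(epoch, total_epochs=65, base_lr=0.1, warmup_epochs=5):
    """Warm-up for the first epochs, then cosine annealing (illustrative values)."""
    if epoch < warmup_epochs:
        # Linear warm-up from a small value toward the base learning rate.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine annealing over the remaining epochs, decaying toward zero.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

With these assumed values, the rate climbs to `base_lr` over five epochs and then decays smoothly, which is the standard warm-up-plus-cosine pattern the claim describes.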
- 2. The method for recognizing skeleton actions based on a partition space-time collaborative graph convolution network of claim 1, wherein in step S1 standardization is performed on the collected skeleton sequences: joint coordinates are mapped to a unified numerical interval to eliminate individual scale differences, sequences of inconsistent length are aligned to a common time window by interpolation or truncation, and a unified input data format is thereby guaranteed.
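The standardization in claim 2 can be sketched as min-max mapping of joint coordinates to [0, 1] plus length alignment by index interpolation or truncation. The list-of-frames layout and function names are assumptions for illustration, not the patent's actual preprocessing code.

```python
def normalize_joints(frames):
    """Map joint coordinates of a sequence to [0, 1] per axis (min-max)."""
    dims = len(frames[0][0])
    lo = [min(c[d] for f in frames for c in f) for d in range(dims)]
    hi = [max(c[d] for f in frames for c in f) for d in range(dims)]
    span = [(h - l) or 1.0 for l, h in zip(lo, hi)]  # avoid division by zero
    return [[[(c[d] - lo[d]) / span[d] for d in range(dims)] for c in f]
            for f in frames]

def align_length(frames, target_len):
    """Align a sequence to target_len frames (> 1) by uniform index
    resampling: up-sampling repeats frames, down-sampling skips frames."""
    n = len(frames)
    if n == target_len:
        return list(frames)
    # Sample target_len positions uniformly over the original index range.
    return [frames[round(i * (n - 1) / (target_len - 1))]
            for i in range(target_len)]
```

Both helpers work on nested Python lists of shape (frames, joints, coords), so the sketch stays dependency-free.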
- 3. The method for recognizing skeleton actions based on a partition space-time collaborative graph convolution network of claim 1, wherein in step S2 the human skeleton is divided into 6 functional parts, namely K = 6, with joint subsets defined for the head, the torso, the left hand, the right hand, the left leg, and the right leg; each part joint subset contains L elements, and if L is set to an equal value, that is, each part partition contains the same number of joint points, then V = K·L.
- 4. The skeleton action recognition method based on a partition space-time collaborative graph convolution network of claim 3, wherein in step S2, K part partitions are defined according to the part division; the elements within each part are ordered extending outward from the central area of the human body, and joints at the same ordinal position in different parts share the same movement trend, so the joints at a given position of each part are grouped into one cross-part partition, where K denotes the number of parts, L denotes the number of nodes in a part, and the indexed node denotes the node at a given position in the k-th part partition.
- 5. The method for identifying skeleton actions based on a partition space-time collaborative graph convolution network of claim 4, wherein in step S2 the nodes within each part are connected bidirectionally according to the skeleton structure to construct the intra-part topology edge set, and the nodes within each cross-part partition are connected to form the cross-part topology edge set; both topology edge sets contain node self-connections, centrifugal edges, and centripetal edges.
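A minimal sketch of the edge-set construction in claims 4 and 5, under the assumption that cross-part edges link joints occupying the same ordinal position in each part; the function name and data layout are hypothetical.

```python
def build_edge_sets(parts, bone_pairs):
    """Build intra-part and cross-part edge sets (illustrative).

    parts:      list of K joint-index lists, each ordered outward from the
                body centre and of equal length L.
    bone_pairs: physical bone connections (parent, child).
    Returns (intra, cross); both include self-loops and both edge
    directions (centrifugal outward, centripetal inward).
    """
    joints = [j for part in parts for j in part]
    self_loops = {(j, j) for j in joints}
    # Intra-part: bidirectional bone edges whose endpoints share a part.
    intra = set(self_loops)
    for a, b in bone_pairs:
        if any(a in p and b in p for p in parts):
            intra.add((a, b))  # centrifugal (outward)
            intra.add((b, a))  # centripetal (inward)
    # Cross-part: fully connect joints at the same position index across parts.
    L = len(parts[0])
    cross = set(self_loops)
    for l in range(L):
        group = [p[l] for p in parts]
        for a in group:
            for b in group:
                if a != b:
                    cross.add((a, b))
    return intra, cross
```

Stacking `intra | cross` would then give the final edge set from which an adjacency matrix is built, as claim 1 describes.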
- 6. The method for recognizing skeleton actions based on a partitioned space-time collaborative graph convolution network of claim 1, wherein in step S3 an input joint feature tensor is given, where T denotes the number of time frames and C the number of channels; the parts are divided according to step S2, and the joint features of each part are aggregated by concatenation to obtain the part feature, where L(p) denotes the joint set of the p-th part, Concat(·) denotes the concatenation function, and the result is the feature of the p-th part obtained by intra-partition joint aggregation; the dependency between part pairs is measured from the two dimensions of part-feature difference and similarity: the global space-time features of the action sequence are first compressed by adaptive average pooling and a linear transformation of the feature dimension, the part-difference function Diff(·) computes the distance between parts in feature space, and the part-similarity function Sim(·) computes the dot product of part pairs via Einstein summation, with the formulas involving an activation function, the space-time adaptive mean pooling, and two feature-dimension compression linear transformations; the differences and similarities are concatenated along the channel dimension and passed through a multi-layer perceptron to obtain the part correlation graph R.
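The Diff(·) and Sim(·) measurements of claim 6 can be illustrated with plain Euclidean distance and dot products over already-pooled part features; the real module additionally applies adaptive pooling, linear compression, and an MLP, which are omitted from this sketch.

```python
def part_correlation(part_feats):
    """Pairwise difference (Euclidean distance) and similarity (dot product)
    between pooled part feature vectors: a sketch of Diff(.) and Sim(.)."""
    K = len(part_feats)
    diff = [[0.0] * K for _ in range(K)]
    sim = [[0.0] * K for _ in range(K)]
    for i in range(K):
        for j in range(K):
            # Diff(.): distance of the two parts in feature space.
            diff[i][j] = sum((a - b) ** 2
                             for a, b in zip(part_feats[i], part_feats[j])) ** 0.5
            # Sim(.): dot product of the part pair.
            sim[i][j] = sum(a * b
                            for a, b in zip(part_feats[i], part_feats[j]))
    return diff, sim
```

In the claimed module the two K×K maps would be concatenated along the channel dimension and fed to an MLP to produce the part correlation graph R.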
- 7. The method for identifying skeleton actions based on a partitioned space-time collaborative graph convolution network according to claim 6, wherein in step S3 the topology matrix of the attention-refined convolution combines a learnable position bias with a learnable attention scaling parameter; the final spatial convolution output F is computed over the root-node, centrifugal-node, and centripetal-node subsets, where Concat(·) denotes channel-dimension concatenation, a part-awareness graph and learnable weights control the contribution of the part topology, the normalized matrix comprises intra-part and cross-part layers, and each subset s of the k-th layer has its own weight.
- 8. The method for identifying skeleton actions based on a partitioned space-time collaborative graph convolution network according to claim 7, wherein in step S3 a dynamic neighborhood graph is constructed in feature space using a K-nearest-neighbor algorithm to extract features: the time dimension is first average-pooled, then the TopK nearest neighbors of each joint are selected through a feature-similarity measurement, and the features of the dynamic neighborhood are aggregated in feature space, where KNN(·) denotes the operation of selecting each joint's TopK neighbors, the aggregation operation employs a learnable weight matrix for feature mapping, and C denotes the number of feature channels.
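A sketch of the K-nearest-neighbor dynamic neighborhood of claim 8, using squared Euclidean distance as the feature-similarity measure and mean aggregation in place of the learnable weight matrix; both substitutions are simplifying assumptions.

```python
def knn_neighbors(feats, topk):
    """For each joint feature vector, select the TopK nearest neighbours
    in feature space by squared Euclidean distance (sketch of KNN(.))."""
    n = len(feats)
    neighbors = []
    for i in range(n):
        dists = [(sum((a - b) ** 2 for a, b in zip(feats[i], feats[j])), j)
                 for j in range(n) if j != i]
        dists.sort()  # closest first
        neighbors.append([j for _, j in dists[:topk]])
    return neighbors

def aggregate_neighborhood(feats, neighbors):
    """Mean-aggregate each joint's dynamic neighbourhood features
    (the claimed module uses a learnable weight matrix instead)."""
    out = []
    for i, nbrs in enumerate(neighbors):
        agg = [sum(feats[j][d] for j in nbrs) / len(nbrs)
               for d in range(len(feats[i]))]
        out.append(agg)
    return out
```

Because the neighborhood is recomputed from features rather than the skeleton graph, distant but similarly-moving joints can be aggregated together, which is the point of the dynamic graph.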
- 9. The method for identifying skeleton actions based on a partition space-time collaborative graph convolution network according to claim 1, wherein step S4 specifically comprises the following steps: S4.1, in the space dimension the part partitions and cross-part partitions of step S2 are used directly, and the time dimension T is divided into two sub-time types, continuous local frames and fixed-stride global frames, where the continuous local frames adopt non-overlapping sliding-window sampling and the fixed-stride global frames adopt fixed-stride sampling; combining the space division with the time division over joint points and time frames yields four subdomains, Sub1 to Sub4: Sub1 is the intra-part local spatiotemporal subdomain based on the intra-part partition and Local(m), Sub2 is the cross-part local spatiotemporal subdomain based on the cross-part partition and Local(m), Sub3 is the intra-part global spatiotemporal subdomain based on the intra-part partition and Global(n), and Sub4 is the cross-part global spatiotemporal subdomain based on the cross-part partition and Global(n); each spatiotemporal subdomain converts the input dimensions accordingly, where T = m·n, V = K·L, and each subdomain retains its own number of input feature channels; the features of each subdomain are computed by multi-head self-attention and the dimensions are then converted back; S4.2, the dimension of the input feature x is first reshaped, then the query Q, key K, and value V tensors are obtained through nonlinear mapping and reshaped, and a sparse mask is introduced in the operation so that the computation range of attention is limited by a local-window constraint, where Q, K, and V denote the tensors obtained by nonlinear mapping of the features of the i-th subdomain input and B denotes a joint-time position bias; S4.3, the results of the four spatiotemporal subdomains are concatenated along the channel dimension to obtain the output, and a single GCN and a single temporal convolution layer are used to obtain deviations in terms of bones and time, which are finally merged into the output.
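The time division of step S4.1 (non-overlapping Local(m) windows and fixed-stride Global(n) sampling, with T = m·n) might be realized as follows; the exact sampling offsets are not specified in the claim, so this is one plausible interpretation.

```python
def split_time(T, m, n):
    """Split T frame indices into n non-overlapping local windows of length m
    (Local(m)) and n fixed-stride global groups sampled at step n (Global(n)).
    Sketch; assumes T = m * n as stated in the claim."""
    assert T == m * n, "sketch assumes T = m * n"
    # Local(m): consecutive, non-overlapping windows.
    local = [list(range(s, s + m)) for s in range(0, T, m)]
    # Global(n): strided sampling, one group per starting offset.
    global_ = [list(range(o, T, n)) for o in range(n)]
    return local, global_
```

Each local window covers a short contiguous motion segment, while each global group spans the whole sequence at a coarse stride, giving the local/global scales the four subdomains combine with the spatial partitions.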
- 10. The method for identifying skeleton actions based on a partition space-time collaborative graph convolution network according to claim 1, wherein in step S5 the multi-level feature fusion module fuses the output features of the 4th, 7th, and 10th STGC blocks by weighted attention summation, unifying the dimensions of features with different channel numbers by 1×1 convolution before fusion.
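The weighted attention summation of claim 10 can be sketched as a softmax-weighted sum over features whose channel dimensions have already been unified (by the 1×1 convolutions the claim mentions); the attention scores here are assumed inputs, not a mechanism specified by the patent.

```python
import math

def fuse_features(feature_list, scores):
    """Weighted-sum fusion of multi-level feature vectors with
    softmax-normalised attention scores (dimensions assumed unified)."""
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]  # softmax over the levels
    dim = len(feature_list[0])
    return [sum(w * f[d] for w, f in zip(weights, feature_list))
            for d in range(dim)]
```

With equal scores this reduces to a plain average of the three levels; learned scores would let the network emphasize whichever depth carries the most discriminative features.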
Description
Skeleton action recognition method based on partition space-time collaborative graph convolution network Technical Field The invention relates to the technical field of computer vision and pattern recognition, in particular to a skeleton action recognition method based on a partition space-time collaborative graph convolution network. Background Human action recognition is one of the core tasks in the field of computer vision; it judges action types by analyzing human motion characteristics and has important application value in real scenes such as smart-home interaction, public safety monitoring, and elderly health monitoring. Skeleton sequence data, as a key modality for action recognition, can effectively represent motion dynamics by capturing the space-time position information of key human joints, and compared with RGB or depth video it is more robust to human scale changes, camera view differences, and complex background interference.
Existing skeleton action recognition methods have gone through several stages of development. Early methods relied on hand-crafted descriptors to model the space-time variation of motion, but hand-crafted features generalize poorly. With the development of deep learning, RNN- or CNN-based methods modeled skeleton data as vector sequences or pseudo-images, but ignored the irregular topology of the human skeleton and struggled to capture motion associations between joints. Graph convolution networks (GCNs) addressed the topology-modeling problem: ST-GCN first represented skeletons as space-time graphs and accumulated local information through graph convolutions, but adopted a fixed topology and ignored long-range joint dependencies; AGCN introduced an adaptive graph structure; and CTR-GCN improved performance through channel-level topology refinement. These methods still have shortcomings: 1. The topology lacks part-level semantic constraints: existing GCN methods do not explicitly model human functional parts (such as the head and limbs) and cannot effectively distinguish the interaction patterns of cross-part nodes (such as the coordination of left and right hands in an 'applause' action), so action-semantic mining is insufficient. 2. Decoupled space-time features cause information loss: most methods adopt a decoupled 'spatial graph convolution + temporal convolution' scheme, splitting the inherent correlation between spatial structure and temporal dynamics, and the temporal convolution depends on a fixed kernel and dilation rate, limiting the ability to capture multi-scale space-time information. 3.
The computational complexity is high: some adaptive-graph or attention methods must compute over all joint pairs, so the number of model parameters and the computation grow rapidly, making deployment difficult in resource-limited scenarios. Disclosure of Invention In order to solve the above technical problems, the invention provides a skeleton action recognition method based on a partition space-time collaborative graph convolution network, which comprises the following steps: S1, acquiring human skeleton sequence data of different individuals in different scenes through a sensor or a pose estimation algorithm, the data comprising the space-time position information of each joint of the human body; standardizing the acquired skeleton sequences; dividing the processed data set into a training set and a test set according to a preset proportion; and labeling each sequence with its action category; S2, constructing a partition association graph: the human skeleton is defined to contain V joint points whose vertex set contains all nodes; the skeleton is divided into K functional parts, the joint subset of each part being defined so that the union of all subsets equals the full vertex set and the subsets are pairwise disjoint (the i-th and j-th joint subsets share no elements, their intersection being the empty set); intra-part partitions and cross-part partitions are created from the divided parts, and an intra-part topology edge set and a cross-part topology edge set are constructed, both containing node self-connections, intra-part centrifugal edges, and centripetal edges; the intra-part and cross-part topology edge sets are stacked to obtain the final edge set, which is combined with the vertex set to obtain the adjacency matrix of the partition association graph for feature extraction; S3, constructing a part dynamic collaborative convolution module, in which the feature of each part p is obtained by aggregating joint features per part based on