CN-122024330-A - Lightweight gesture prediction method and device based on space-time decoupling and dynamic graph filtering

CN 122024330 A

Abstract

The invention discloses a lightweight gesture prediction method and device based on spatio-temporal decoupling and dynamic graph filtering. The method comprises: inputting a skeleton sequence into a spatio-temporal decoupled Transformer network; performing dynamic graph filtering and multi-scale temporal convolution on the skeleton sequence via the dual-path adaptive graph convolution sub-blocks and the multi-scale temporal convolution sub-blocks of the lightweight spatio-temporal blocks to obtain local spatio-temporal features; performing spatio-temporal position encoding and mixed attention processing on the local spatio-temporal features via the global spatio-temporal Transformer to obtain global spatio-temporal features; performing a feature fusion operation on the local and global spatio-temporal features to obtain a target feature vector; and inputting the target feature vector into a fully connected layer and a Softmax function to obtain gesture category prediction probabilities. The method and device improve the accuracy and robustness of gesture prediction and are widely applicable to the technical fields of computer vision and motion recognition.

Inventors

  • Wang Tao
  • Zhan Qianqian
  • Xie Feng
  • Rao Zihao
  • Liu Bo

Assignees

  • Sun Yat-sen University

Dates

Publication Date
2026-05-12
Application Date
2026-04-10

Claims (10)

  1. A lightweight gesture prediction method based on spatio-temporal decoupling and dynamic graph filtering, characterized by comprising the following steps: inputting a skeleton sequence into a spatio-temporal decoupled Transformer network, wherein the spatio-temporal decoupled Transformer network comprises a dynamic local graph aggregator and a global spatio-temporal Transformer, and the dynamic local graph aggregator comprises a plurality of lightweight spatio-temporal blocks; performing dynamic graph filtering and multi-scale temporal convolution on the skeleton sequence based on the dual-path adaptive graph convolution sub-blocks and the multi-scale temporal convolution sub-blocks of the lightweight spatio-temporal blocks to obtain local spatio-temporal features; performing spatio-temporal position encoding and mixed attention processing on the local spatio-temporal features based on the global spatio-temporal Transformer to obtain global spatio-temporal features; performing a feature fusion operation on the local spatio-temporal features and the global spatio-temporal features to obtain a target feature vector; and inputting the target feature vector into a fully connected layer and a Softmax function to obtain gesture category prediction probabilities.
  2. The method according to claim 1, wherein performing dynamic graph filtering and multi-scale temporal convolution on the skeleton sequence based on the dual-path adaptive graph convolution sub-blocks and the multi-scale temporal convolution sub-blocks of the lightweight spatio-temporal blocks to obtain the local spatio-temporal features comprises the following steps: taking the skeleton sequence as an input feature and inputting the input feature into the lightweight spatio-temporal block; performing dynamic graph filtering on the input feature through a plurality of dual-path adaptive graph convolution sub-blocks and outputting a first intermediate feature; inputting the first intermediate feature into the multi-scale temporal convolution sub-block, performing multi-scale temporal convolution on the first intermediate feature, and outputting a second intermediate feature; taking the second intermediate feature as the input feature and returning to the step of inputting the input feature into the lightweight spatio-temporal block, until the multi-scale temporal convolution sub-block of the last lightweight spatio-temporal block outputs the second intermediate feature, which is taken as the local spatio-temporal feature; wherein each lightweight spatio-temporal block comprises the multi-scale temporal convolution sub-block and a plurality of the dual-path adaptive graph convolution sub-blocks.
  3. The method according to claim 2, further comprising the step of: performing temporal downsampling, through strided convolution with a stride of 2, on the second intermediate features output by the 5th and 8th lightweight spatio-temporal blocks, respectively, to obtain temporally downsampled second intermediate features.
  4. The method according to claim 2, wherein performing dynamic graph filtering on the input feature through a plurality of the dual-path adaptive graph convolution sub-blocks to output the first intermediate feature comprises the following steps (an illustrative code sketch follows the claims): performing a depthwise convolution operation on the input feature to obtain a local temporal feature; performing a first point-wise convolution operation on the local temporal feature to obtain a relation feature; performing a second point-wise convolution operation on the local temporal feature to obtain an output feature; averaging the relation feature over the temporal dimension to obtain a relation tensor; constructing a first differential relation over pairs of relation elements in the relation tensor; projecting and compressing the first differential relation onto the channel dimension to obtain a second differential relation, and fusing the second differential relation with the anatomical skeleton matrix to obtain a channel-adaptive spatial adjacency matrix; obtaining a channel-level smoothed feature from the output feature and the channel-adaptive spatial adjacency matrix; averaging the output feature over the channel dimension to obtain an output tensor; constructing a third differential relation over pairs of output elements in the output tensor; obtaining a frame-level dynamic feature from the output feature and the third differential relation; performing weighted fusion of the channel-level smoothed feature and the frame-level dynamic feature to obtain a third intermediate feature; and concatenating the third intermediate features output by the plurality of dual-path adaptive graph convolution sub-blocks to obtain the first intermediate feature.
  5. The method according to claim 2, wherein inputting the first intermediate feature into the multi-scale temporal convolution sub-block and performing multi-scale temporal convolution on the first intermediate feature to output the second intermediate feature comprises the following steps (a code sketch follows the claims): performing a first dilated convolution on the first intermediate feature to obtain a first branch feature; performing a second dilated convolution on the first intermediate feature to obtain a second branch feature; performing a max-pooling operation on the first intermediate feature to obtain a third branch feature; performing a third point-wise convolution operation on the first intermediate feature to obtain a fourth branch feature; and concatenating the first, second, third and fourth branch features to obtain the second intermediate feature.
  6. The method according to claim 1, wherein performing spatio-temporal position encoding and mixed attention processing on the local spatio-temporal features based on the global spatio-temporal Transformer to obtain the global spatio-temporal features comprises the following steps: performing dimension rearrangement and flattening on the local spatio-temporal features to obtain a spatio-temporal token feature matrix; presetting spatio-temporal position encodings according to the skeleton sequence; fusing the spatio-temporal token feature matrix with the spatio-temporal position encodings to obtain joint encoded features; constructing a graph-biased attention from the scaled dot-product attention; performing mixed attention processing on the joint encoded features according to the scaled dot-product attention, the graph-biased attention and a quality-aware gating weight to obtain a mixed output; residually connecting the mixed output with the joint encoded features to obtain an intermediate output, and applying a point-wise MLP to the intermediate output to obtain the global spatio-temporal features; wherein the global spatio-temporal Transformer comprises the spatio-temporal position encodings and the mixed attention, and the mixed attention comprises the scaled dot-product attention and the graph-biased attention.
  7. The method according to claim 6, wherein constructing the graph-biased attention from the scaled dot-product attention comprises the following steps (a code sketch follows the claims): copying the anatomical skeleton matrix onto the block diagonal to obtain a block-diagonal matrix; and extending the scaled dot-product attention with the block-diagonal matrix to obtain the graph-biased attention.
  8. The method according to claim 1, further comprising the steps of (a code sketch follows the claims): performing a degradation-statistics extraction operation on the skeleton sequence to obtain a frame-loss rate, a jitter intensity and a bone-length inconsistency; constructing a quality descriptor from the frame-loss rate, the jitter intensity and the bone-length inconsistency; and constructing the quality-aware gating weight from a Sigmoid activation function, a multi-layer perceptron and the quality descriptor.
  9. The method according to claim 1, wherein performing the feature fusion operation on the local spatio-temporal features and the global spatio-temporal features to obtain the target feature vector comprises the following steps (a code sketch follows the claims): performing global average pooling on the local spatio-temporal features to obtain a first feature to be fused; performing global average pooling on the global spatio-temporal features to obtain a second feature to be fused; performing a linear projection on the second feature to be fused to obtain a third feature to be fused; and performing the feature fusion operation on the first feature to be fused and the third feature to be fused to obtain the target feature vector.
  10. A lightweight gesture prediction device based on spatio-temporal decoupling and dynamic graph filtering, comprising: a data input module for inputting a skeleton sequence into a spatio-temporal decoupled Transformer network, wherein the spatio-temporal decoupled Transformer network comprises a dynamic local graph aggregator and a global spatio-temporal Transformer; a local structure filtering module for performing dynamic graph filtering and multi-scale temporal convolution on the skeleton sequence based on the dual-path adaptive graph convolution sub-blocks and the multi-scale temporal convolution sub-blocks of the lightweight spatio-temporal blocks to obtain local spatio-temporal features; a global spatio-temporal token reasoning module for performing spatio-temporal position encoding and mixed attention processing on the local spatio-temporal features based on the global spatio-temporal Transformer to obtain global spatio-temporal features; an adaptive feature fusion module for performing a feature fusion operation on the local spatio-temporal features and the global spatio-temporal features to obtain a target feature vector; and a gesture prediction module for inputting the target feature vector into a fully connected layer and a Softmax function to obtain gesture category prediction probabilities.
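The dual-path sub-block of claim 4 can be illustrated in code. Below is a minimal PyTorch sketch, not the patented implementation: the differential relations are interpreted here as pairwise feature differences, and the projection widths, the tanh normalization, and the learnable fusion weight `alpha` are assumptions introduced for illustration.

```python
import torch
import torch.nn as nn

class DualPathAdaptiveGraphConv(nn.Module):
    """Illustrative sketch of one dual-path adaptive graph convolution
    sub-block (claim 4); inputs are (N, C, T, V) skeleton tensors.
    Per claim 4, several such blocks run in parallel and their outputs
    (third intermediate features) are concatenated."""
    def __init__(self, in_ch, out_ch, skeleton_adj):
        super().__init__()
        # depthwise temporal convolution -> local temporal features
        self.dw = nn.Conv2d(in_ch, in_ch, kernel_size=(3, 1),
                            padding=(1, 0), groups=in_ch)
        self.pw_rel = nn.Conv2d(in_ch, out_ch, 1)    # first point-wise conv
        self.pw_out = nn.Conv2d(in_ch, out_ch, 1)    # second point-wise conv
        self.register_buffer("A", skeleton_adj)      # anatomical matrix (V, V)
        self.alpha = nn.Parameter(torch.tensor(0.5)) # assumed fusion weight

    def forward(self, x):
        local_t = self.dw(x)                          # local temporal features
        rel = self.pw_rel(local_t).mean(dim=2)        # relation tensor (N, C', V)
        # differential relation over relation-element pairs, fused with
        # the anatomical skeleton matrix -> channel-adaptive adjacency
        diff = rel.unsqueeze(-1) - rel.unsqueeze(-2)  # (N, C', V, V)
        adj = torch.tanh(diff) + self.A
        out = self.pw_out(local_t)                    # output features (N, C', T, V)
        smooth = torch.einsum('nctv,ncvw->nctw', out, adj)  # channel-level smoothing
        o = out.mean(dim=1)                           # output tensor (N, T, V)
        fdiff = torch.tanh(o.unsqueeze(-1) - o.unsqueeze(-2))  # frame-level differences
        dynamic = torch.einsum('nctv,ntvw->nctw', out, fdiff)  # frame-level dynamics
        return self.alpha * smooth + (1 - self.alpha) * dynamic  # weighted fusion
```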
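The four-branch temporal module of claim 5 is simpler. The following PyTorch sketch assumes a temporal kernel size of 9, dilations of 1 and 2, and an output width divisible by four; none of these values are specified in the claims.

```python
import torch
import torch.nn as nn

class MultiScaleTemporalConv(nn.Module):
    """Sketch of the multi-scale temporal convolution sub-block (claim 5):
    two dilated temporal convolutions, a temporal max-pool branch, and a
    point-wise convolution, concatenated on the channel axis."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        branch_ch = out_ch // 4  # assumes out_ch divisible by 4
        def dilated(d):
            pad = ((9 - 1) * d) // 2  # keep T unchanged for kernel size 9
            return nn.Conv2d(in_ch, branch_ch, kernel_size=(9, 1),
                             padding=(pad, 0), dilation=(d, 1))
        self.branch1 = dilated(1)            # first dilated convolution
        self.branch2 = dilated(2)            # second dilated convolution
        self.branch3 = nn.Sequential(        # max-pooling branch
            nn.MaxPool2d(kernel_size=(3, 1), stride=1, padding=(1, 0)),
            nn.Conv2d(in_ch, branch_ch, kernel_size=1))
        self.branch4 = nn.Conv2d(in_ch, branch_ch, kernel_size=1)  # point-wise conv

    def forward(self, x):                    # x: (N, C, T, V)
        return torch.cat([self.branch1(x), self.branch2(x),
                          self.branch3(x), self.branch4(x)], dim=1)
```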
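Claim 7's graph-biased attention can be sketched as an additive bias on the attention logits. This assumes tokens are ordered frame-major (joints of one frame contiguous) and introduces a scalar bias weight `alpha` not present in the claims.

```python
import torch
import torch.nn.functional as F

def graph_biased_attention(q, k, v, skeleton_adj, num_frames, alpha=1.0):
    """Sketch of graph-biased attention (claim 7).

    q, k, v:      (N, heads, T*V, d) projected token features
    skeleton_adj: (V, V) anatomical skeleton matrix
    The adjacency is copied onto the block diagonal of a (T*V, T*V)
    matrix and added to the scaled dot-product logits."""
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d ** 0.5             # scaled dot-product
    bias = torch.block_diag(*[skeleton_adj] * num_frames)   # block-diagonal copy
    logits = logits + alpha * bias                          # graph-biased logits
    return F.softmax(logits, dim=-1) @ v
```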
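The quality-aware gating weight of claim 8 maps three degradation statistics through an MLP and a Sigmoid. In this sketch the hidden width and the way the statistics are computed upstream are assumptions.

```python
import torch
import torch.nn as nn

class QualityGate(nn.Module):
    """Sketch of the quality-aware gating weight (claim 8): a 3-dim
    quality descriptor (frame-loss rate, jitter intensity, bone-length
    inconsistency) is mapped to a gate in (0, 1)."""
    def __init__(self, hidden=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, frame_loss, jitter, bone_inconsistency):
        # quality descriptor: (N, 3) degradation statistics per sequence
        q = torch.stack([frame_loss, jitter, bone_inconsistency], dim=-1)
        return self.mlp(q)  # (N, 1) gating weight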
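Finally, the fusion and classification head of claims 1 and 9 can be sketched as follows; element-wise summation is assumed as the fusion operation, which the claims leave unspecified.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """Sketch of feature fusion and classification (claims 1 and 9):
    global average pooling on both branches, linear projection of the
    global branch, fusion, then a fully connected layer and Softmax."""
    def __init__(self, local_ch, global_ch, num_classes):
        super().__init__()
        self.proj = nn.Linear(global_ch, local_ch)  # linear projection
        self.fc = nn.Linear(local_ch, num_classes)  # fully connected layer

    def forward(self, local_feat, global_feat):
        f1 = local_feat.mean(dim=(2, 3))   # GAP over (T, V): (N, C_local)
        f2 = global_feat.mean(dim=1)       # GAP over tokens: (N, C_global)
        fused = f1 + self.proj(f2)         # target feature vector (assumed sum)
        return F.softmax(self.fc(fused), dim=-1)  # class probabilities
```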

Description

Lightweight gesture prediction method and device based on spatio-temporal decoupling and dynamic graph filtering

Technical Field

The invention relates to the technical field of computer vision and motion recognition, and in particular to a lightweight gesture prediction method and device based on spatio-temporal decoupling and dynamic graph filtering.

Background

As a highly promising soft-sensing modality for human-robot interaction (HRI), skeletal data streams derived from RGB-D optical sensors and real-time keypoint estimators (e.g., MediaPipe, OpenPose) constitute lightweight virtual sensor readings that are not only privacy-preserving but also far more bandwidth-efficient than raw image transmission. However, these virtual sensor outputs inherit the characteristic measurement uncertainties of the underlying RGB-D hardware, including jitter, frame loss (occlusion), drift, and instability caused by body motion and illumination. These reliability and metrology challenges obstruct the deployment of robust gesture sensing systems, particularly on resource-constrained edge platforms where both recognition accuracy and real-time response matter. Furthermore, in delay-sensitive, privacy-preserving and bandwidth-efficient applications, demand is growing for near-sensor edge AI, which performs inference directly on the acquisition device rather than offloading it to a remote server. Edge deployment, however, faces a dilemma: the model must achieve high recognition accuracy, remain robust to noisy and missing skeleton data, and run in real time under a severe computational budget. Existing skeleton-based approaches struggle to meet all three requirements, because graph convolutional network (GCN) approaches lack global temporal modeling capability, while Transformer approaches introduce prohibitive computational complexity on embedded platforms.

Disclosure of Invention

In view of the above, the embodiments of the invention mainly aim to provide a lightweight gesture prediction method and device based on spatio-temporal decoupling and dynamic graph filtering, so as to solve at least one of the problems in the prior art.
To achieve the above object, one aspect of the embodiments of the present invention provides a lightweight gesture prediction method based on spatio-temporal decoupling and dynamic graph filtering, where the method includes: inputting a skeleton sequence into a spatio-temporal decoupled Transformer network, wherein the spatio-temporal decoupled Transformer network comprises a dynamic local graph aggregator and a global spatio-temporal Transformer, and the dynamic local graph aggregator comprises a plurality of lightweight spatio-temporal blocks; performing dynamic graph filtering and multi-scale temporal convolution on the skeleton sequence based on the dual-path adaptive graph convolution sub-blocks and the multi-scale temporal convolution sub-blocks of the lightweight spatio-temporal blocks to obtain local spatio-temporal features; performing spatio-temporal position encoding and mixed attention processing on the local spatio-temporal features based on the global spatio-temporal Transformer to obtain global spatio-temporal features; performing a feature fusion operation on the local spatio-temporal features and the global spatio-temporal features to obtain a target feature vector; and inputting the target feature vector into a fully connected layer and a Softmax function to obtain gesture category prediction probabilities.

In some embodiments, performing dynamic graph filtering and multi-scale temporal convolution on the skeleton sequence based on the dual-path adaptive graph convolution sub-blocks and the multi-scale temporal convolution sub-blocks of the lightweight spatio-temporal blocks to obtain the local spatio-temporal features includes the following steps: taking the skeleton sequence as an input feature and inputting the input feature into the lightweight spatio-temporal block; performing dynamic graph filtering on the input feature through a plurality of dual-path adaptive graph convolution sub-blocks and outputting a first intermediate feature; inputting the first intermediate feature into the multi-scale temporal convolution sub-block, performing multi-scale temporal convolution on the first intermediate feature, and outputting a second intermediate feature; taking the second intermediate feature as the input feature and returning to the step of inputting the input feature into the lightweight spatio-temporal block, until the multi-scale temporal convolution sub-block of the last lightweight spatio-temporal block outputs the second intermediate feature, which is taken as the local spatio-temporal feature; wherein each lightweight spatio-temporal block comprises the multi-scale temporal convolution sub-block and a plurality of the dual-path adaptive graph convolution sub-blocks. A high-level sketch of this pipeline follows.
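The following PyTorch sketch shows how the pieces described above compose end to end. It is a structural illustration under assumed interfaces, not the patented implementation: the block count, channel widths, and the sub-module classes (the sketches given after the claims, or any stand-ins with the same shapes) are assumptions.

```python
import torch.nn as nn

class STDecoupledNet(nn.Module):
    """High-level sketch of the spatio-temporal decoupled Transformer
    network: a dynamic local graph aggregator built from stacked
    lightweight spatio-temporal blocks, followed by a global
    spatio-temporal Transformer and the fusion/classification head."""
    def __init__(self, blocks, global_transformer, head):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)       # lightweight ST blocks
        self.global_transformer = global_transformer
        self.head = head                           # fusion + FC + Softmax

    def forward(self, x):                          # x: (N, C, T, V) skeleton sequence
        for blk in self.blocks:                    # dynamic local graph aggregation
            x = blk(x)                             # graph filtering + temporal conv
        local_feat = x                             # local spatio-temporal features
        tokens = local_feat.flatten(2).transpose(1, 2)  # (N, T*V, C) token matrix
        global_feat = self.global_transformer(tokens)   # mixed-attention features
        return self.head(local_feat, global_feat)       # class probabilities
```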