CN-121999530-A - Human motion prediction method and device based on intention understanding graph convolution network
Abstract
The application discloses a human motion prediction method and device based on an intention understanding graph convolution network. A dynamic intention adjacency matrix based on speed similarity is constructed, so that the semantic associations of joint nodes as they change over time can be captured; this overcomes the limitation that a traditional static adjacency matrix cannot reflect action intention, and enables the model to understand the potential motivation behind an action. An intention understanding module is introduced and combined with a space-time graph convolution network, realizing joint feature modeling at the semantic and motion levels, which effectively improves the accuracy and semantic consistency of ambiguous-action prediction and reduces prediction ambiguity. Intention feature aggregation is carried out with a graph attention network, which adaptively distributes the contribution weights of neighbor nodes and reduces noise interference. The space-time features and intention features are integrated by a weighted fusion strategy, and a prediction sequence is generated by a temporal convolution network, ensuring the continuity, smoothness and stability of the prediction result.
Inventors
- REN ZILIANG
- LI MENGYAO
- WEI WENHONG
- ZHANG FUYONG
- ZHAO HUI
Assignees
- Dongguan University of Technology (东莞理工学院)
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-01-22
Claims (10)
- 1. A human motion prediction method based on an intention understanding graph convolution network, characterized by comprising the following steps: obtaining a skeleton sequence of a measured object; carrying out space-time feature extraction on the skeleton sequence to obtain a space-time high-dimensional feature vector, and constructing an intention similarity matrix for the skeleton sequence to obtain an intention adjacency matrix; carrying out intention feature aggregation on the space-time high-dimensional feature vector and the intention adjacency matrix to obtain an intention perception feature vector; performing feature fusion on the space-time high-dimensional feature vector and the intention perception feature vector to obtain a joint feature vector; and inputting the joint feature vector into a prediction network to obtain a predicted motion time sequence of the measured object.
- 2. The method of claim 1, wherein the skeleton sequence comprises joint information of a plurality of joints of the object under test, the joint information comprising joint spatial structure information and continuous joint temporal dynamics information.
- 3. The method of claim 2, wherein the skeleton sequence is X ∈ ℝ^(C×T×V), wherein C is the number of channels, T is the number of time steps, and V is the number of joint nodes.
- 4. A method according to claim 3, wherein performing space-time feature extraction on the skeleton sequence to obtain a space-time high-dimensional feature vector comprises: constructing a spatial graph convolution network, and processing the skeleton sequence based on the spatial graph convolution network to obtain a spatial feature vector; and carrying out time-dimension feature extraction on the spatial feature vector based on one-dimensional convolution to obtain the space-time high-dimensional feature vector.
- 5. A method according to claim 3, wherein the spatial graph convolution network is expressed by the following formula: f_out(v_i) = Σ_{v_j ∈ B(v_i)} (1/Z_i(v_j)) · f_in(v_j) · W(l_i(v_j)); wherein v_i is the i-th joint node in the spatial graph, v_j is a neighbor node of v_i in the spatial graph, f_out(v_i) is the spatial feature vector output by the spatial graph convolution at joint node v_i, B(v_i) denotes the neighbor set of node v_i, f_in denotes the node feature vectors input to the spatial graph convolution, X is the input motion data, C_in is the number of input channels, W(·) denotes the weight mapping, Z_i(v_j) is the normalization constant, and l_i(v_j) is a labeling function which maps the neighbor node v_j to a discrete label according to its position relative to the central node v_i.
- 6. The method of claim 5, wherein constructing the intention similarity matrix for the skeleton sequence to obtain an intention adjacency matrix comprises: calculating the instantaneous velocity vector and the motion direction vector of each joint at the current time step, wherein the instantaneous velocity vectors of all joints at the current time step form an instantaneous velocity vector set, and the motion direction vectors of all joints at the current time step form a motion direction vector set; performing similarity calculation on all joints based on the instantaneous velocity vector set and the motion direction vector set to obtain a similarity matrix; and normalizing the similarity matrix to obtain the intention adjacency matrix.
- 7. The method of claim 6, wherein carrying out intention feature aggregation on the space-time high-dimensional feature vector and the intention adjacency matrix to obtain an intention perception feature vector comprises: performing a linear transformation on the space-time high-dimensional feature vector in a high-dimensional space to obtain an intention feature vector; performing attention coefficient calculation on the intention adjacency matrix based on the intention feature vector to obtain an attention coefficient matrix; splicing and taking the dot product of the intention adjacency matrix and the attention coefficient matrix, and normalizing the result to obtain a weight matrix; and carrying out weighted aggregation on the space-time high-dimensional feature vector based on the weight matrix to obtain the intention perception feature vector.
- 8. The method of claim 7, wherein the prediction network is a decoder composed of multiple temporal convolution layers.
- 9. The method of claim 8, wherein the prediction network is optimized by a mean square error loss function.
- 10. A human motion prediction apparatus based on an intention understanding graph convolution network, comprising: a data acquisition unit, used for acquiring a skeleton sequence of a measured object; a data processing unit, used for extracting space-time features of the skeleton sequence to obtain a space-time high-dimensional feature vector, and constructing an intention similarity matrix for the skeleton sequence to obtain an intention adjacency matrix; an intention feature aggregation unit, used for carrying out intention feature aggregation on the space-time high-dimensional feature vector and the intention adjacency matrix to obtain an intention perception feature vector; a data fusion unit, used for performing feature fusion on the space-time high-dimensional feature vector and the intention perception feature vector to obtain a joint feature vector; and a motion prediction unit, used for inputting the joint feature vector into a prediction network to obtain a predicted motion time sequence of the measured object.
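The spatial graph convolution referred to in claims 4 and 5 can be illustrated with a minimal NumPy sketch. Note the assumptions: the claimed formula uses a per-neighbor normalization constant Z and a label-based weight mapping W(l(·)); for brevity this sketch substitutes a single shared weight matrix with symmetric degree normalization, a common GCN simplification, and the function and variable names are hypothetical.

```python
import numpy as np

def spatial_graph_conv(x, A, W):
    """One spatial graph convolution step over a single skeleton frame.

    x : (V, C_in)      joint feature vectors for one time step
    A : (V, V)         skeleton adjacency matrix (with self-loops)
    W : (C_in, C_out)  learnable weight matrix (shared across neighbors)
    """
    # Symmetric degree normalization, standing in for the claim's
    # per-neighbor normalization constant Z_i(v_j).
    deg = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-8)))
    A_hat = d_inv_sqrt @ A @ d_inv_sqrt
    # Aggregate neighbor features, then project to the output channels.
    return A_hat @ x @ W
```

Applying this per time step and following it with a one-dimensional convolution over the time axis yields the space-time high-dimensional feature vector of claim 4.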
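Claim 6 builds the intention adjacency matrix from instantaneous velocity and motion direction vectors. The claim does not fully specify the similarity measure or the normalization, so the sketch below assumes cosine similarity of motion directions scaled by a speed-agreement term, followed by row-wise softmax normalization; the function name and these modeling choices are illustrative only.

```python
import numpy as np

def intent_adjacency(frames, eps=1e-8):
    """Build a dynamic intention adjacency matrix for the current time step.

    frames : (2, V, 3) joint positions at the previous and current time step.
    Returns a row-normalized (V, V) similarity matrix.
    """
    # Instantaneous velocity vector of each joint (claim 6).
    vel = frames[1] - frames[0]                        # (V, 3)
    speed = np.linalg.norm(vel, axis=1, keepdims=True)
    # Motion direction vectors (unit-length velocities).
    direction = vel / np.maximum(speed, eps)
    # Pairwise similarity: cosine of directions scaled by speed agreement.
    dir_sim = direction @ direction.T                  # (V, V)
    speed_sim = 1.0 / (1.0 + np.abs(speed - speed.T))  # (V, V)
    s = dir_sim * speed_sim
    # Softmax-normalize each row to obtain the intention adjacency matrix.
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

Because the matrix is recomputed at every time step from velocities, it changes as the motion unfolds, which is what distinguishes it from the static skeleton adjacency.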
Description
Human motion prediction method and device based on intention understanding graph convolution network
Technical Field
The invention relates to the technical field of human body motion analysis, and in particular to a human motion prediction method and device based on an intention understanding graph convolution network.
Background
At present, in research on human motion recognition and motion prediction, traditional machine learning methods include the Hidden Markov Model (HMM), the Conditional Random Field (CRF) and the like. These methods model motion time series through state transition probabilities and achieved a certain effect in the early stage. However, because human actions have high-dimensional nonlinear characteristics and complex space-time dependencies, traditional models have limited performance when processing long-sequence, dynamic and diversified actions. In recent years, deep learning methods have become mainstream, and common models include the Recurrent Neural Network (RNN), the Generative Adversarial Network (GAN), the Variational Autoencoder (VAE), the Graph Attention Network (GAT), the Graph Convolutional Network (GCN), and the like. These methods automatically extract spatial and temporal features in an end-to-end manner, greatly improving the accuracy and robustness of motion recognition and prediction. However, most existing models focus on feature modeling at the geometric and motion levels, ignoring the semantic "intent" information behind the action. Human actions are usually driven by intent, and the follow-up actions corresponding to different intents are completely different. Due to the lack of explicit expression of intent-level features, existing models easily suffer from semantic ambiguity and mode collapse in ambiguous action prediction, produce monotonous prediction results, and lack diversity and interpretability.
In addition, training deep models requires a large amount of annotated data, random sampling efficiency is low, and the computational cost is high, so such methods are not suitable for scenes with high real-time requirements. Therefore, there is a need to design an efficient prediction method combining semantic intent modeling with a graph convolution structure, so that the model can fully utilize historical skeleton information and dynamically perceive the action intent of the human body, thereby improving the accuracy, stability and semantic understanding capability of prediction.
Disclosure of Invention
Accordingly, it is necessary to provide a human motion prediction method and apparatus based on an intention understanding graph convolution network, in order to solve the above problems. In a first aspect, an embodiment of the present application provides a human motion prediction method based on an intention understanding graph convolution network, comprising the steps of: obtaining a skeleton sequence of a measured object; carrying out space-time feature extraction on the skeleton sequence to obtain a space-time high-dimensional feature vector, and constructing an intention similarity matrix for the skeleton sequence to obtain an intention adjacency matrix; carrying out intention feature aggregation on the space-time high-dimensional feature vector and the intention adjacency matrix to obtain an intention perception feature vector; performing feature fusion on the space-time high-dimensional feature vector and the intention perception feature vector to obtain a joint feature vector; and inputting the joint feature vector into a prediction network to obtain a predicted motion time sequence of the measured object. Preferably, the skeleton sequence includes joint information of a plurality of joints of the measured object, and the joint information includes joint spatial structure information and continuous temporal dynamic information.
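The aggregation, fusion and prediction steps of the method described above can be sketched end to end. This is not the claimed implementation: the attention coefficients here are plain dot products weighted by the intention adjacency matrix, the fusion weight alpha is a fixed scalar rather than a learned parameter, and a simple moving average stands in for the temporal convolution network decoder; all names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def predict_motion(h, A_int, W, alpha=0.5):
    """Aggregate intention features, fuse, and decode a motion sequence.

    h     : (T, V, C) space-time high-dimensional features
    A_int : (V, V)    intention adjacency matrix (row-normalized)
    W     : (C, C)    linear transform producing intention feature vectors
    alpha : scalar fusion weight in [0, 1]
    """
    # Intention feature vectors via a linear transformation.
    q = h @ W                                          # (T, V, C)
    # Attention coefficients between joints, weighted by intention adjacency.
    att = softmax(np.einsum('tvc,twc->tvw', q, q) * A_int, axis=-1)
    # Weighted aggregation yields the intention perception features.
    g = np.einsum('tvw,twc->tvc', att, h)
    # Weighted fusion of space-time and intention perception features.
    z = alpha * h + (1.0 - alpha) * g                  # (T, V, C)
    # Toy temporal decoder: a causal moving average standing in for the
    # multi-layer temporal convolution network of the prediction stage.
    k = 3
    pad = np.concatenate([z[:1].repeat(k - 1, axis=0), z], axis=0)
    return np.stack([pad[t:t + k].mean(axis=0) for t in range(z.shape[0])])
```

In a trained model the fusion weight and attention parameters would be learned by minimizing the mean square error between the predicted and ground-truth joint sequences.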
Preferably, the skeleton sequence is X ∈ ℝ^(C×T×V), wherein C is the number of channels, T is the number of time steps, and V is the number of joint nodes. Preferably, extracting the space-time features of the skeleton sequence to obtain a space-time high-dimensional feature vector includes: constructing a spatial graph convolution network, and processing the skeleton sequence based on the spatial graph convolution network to obtain a spatial feature vector; and carrying out time-dimension feature extraction on the spatial feature vector based on one-dimensional convolution to obtain the space-time high-dimensional feature vector. Preferably, the spatial graph convolution network is expressed by the following formula: f_out(v_i) = Σ_{v_j ∈ B(v_i)} (1/Z_i(v_j)) · f_in(v_j) · W(l_i(v_j)); wherein v_i is the i-th joint node in the spatial graph, v_j is a neighbor node of v_i in the spatial graph, f_out(v_i) is the spatial feature vector output by the spatial graph convolution at joint node v_i, B(v_i) denotes the neighbor set of node v_i, and f_in denotes the node feature vectors input to the spatial graph convolution