CN-116012941-B - Model training method, skeleton action recognition method, device and storage medium
Abstract
The application provides a model training method based on skeleton action recognition, a skeleton action recognition method, a skeleton action recognition device and a computer readable storage medium. The model training method comprises: using a global information modeling module of the skeleton action recognition model to perform, on a first local feature, a second local feature and a third local feature respectively, convolution operations over the other feature dimensions, so as to obtain a first global feature corresponding to the first local feature, a second global feature corresponding to the second local feature and a third global feature corresponding to the third local feature; fusing the first global feature, the second global feature and the third global feature to obtain a global feature; and fusing the local features and the global feature to obtain a fusion feature. In this manner, the skeleton action recognition device can comprehensively, effectively and efficiently mine the global information of the time-space domain of the skeleton data through the novel multi-view global information modeling module, thereby improving the training effect and recognition effect of the skeleton action recognition model.
Inventors
- Li Xingming
- Dun Jingyu
Assignees
- Zhejiang Dahua Technology Co., Ltd.
Dates
- Publication Date: 2026-05-08
- Application Date: 2022-12-26
Claims (12)
- 1. A model training method based on skeleton action recognition, characterized in that the model training method comprises: acquiring a skeleton video to be trained; extracting local features of the skeleton video to be trained by using a local information modeling module of a skeleton action recognition model; acquiring a first local feature, a second local feature and a third local feature of the local features corresponding to three feature dimensions; performing convolution operations over the other feature dimensions on the first local feature, the second local feature and the third local feature by using a global information modeling module of the skeleton action recognition model, so as to acquire a first global feature corresponding to the first local feature, a second global feature corresponding to the second local feature and a third global feature corresponding to the third local feature; fusing the first global feature, the second global feature and the third global feature to obtain a global feature, and fusing the local features and the global feature to obtain a fusion feature; inputting the fusion feature into a classifier of the skeleton action recognition model to obtain a prediction category of the skeleton video to be trained; and training the skeleton action recognition model based on the prediction category and a label category of the skeleton video to be trained; wherein the convolution operation over the other feature dimensions is a position-sensitive convolution operation; the position-sensitive convolution operation comprises: determining the own feature dimension and the other feature dimensions of the second local feature; acquiring position codes of skeleton joint points over the own feature dimension and a convolution action dimension; and performing convolution processing by using the position codes over the own feature dimension and the convolution action dimension together with spliced features of the second local feature, to obtain the second global feature corresponding to the second local feature; wherein the performing convolution processing by using the position codes over the own feature dimension and the convolution action dimension together with the spliced features of the second local feature, to obtain the second global feature corresponding to the second local feature, comprises: performing dimension copying on the position codes over the own feature dimension and the convolution action dimension based on the feature dimension of the second local feature, to obtain a copied position code whose feature-dimension length is identical to that of the second local feature; adding the second local feature and the copied position code to obtain a position-sensitive second local feature, and splicing the position-sensitive second local feature with the second local feature along the direction of the convolution action dimension to obtain a spliced local feature; and performing convolution processing on the spliced local feature by using a one-dimensional convolution kernel whose direction is along the convolution action dimension and whose size is the length of the convolution action dimension of the second local feature, to obtain the second global feature corresponding to the second local feature.
- 2. The model training method according to claim 1, wherein the acquiring the position codes of the skeleton joint points over the own feature dimension and the convolution action dimension comprises: acquiring a position index of the convolution action dimension and a length of the own feature dimension; acquiring a position index of the skeleton joint point in its own feature dimension; and acquiring the position code of the skeleton joint point according to the position index of the convolution action dimension, the length of the own feature dimension and the position index of the own feature dimension.
- 3. The model training method according to claim 1, wherein the other feature dimensions comprise a first feature dimension and a second feature dimension of the second local feature, excluding its own feature dimension; and the performing convolution processing by using the position codes over the own feature dimension and the convolution action dimension together with the spliced features of the second local feature, to obtain the second global feature corresponding to the second local feature, comprises: performing convolution processing by using the position codes of the first feature dimension and the own feature dimension together with the spliced features of the second local feature, to obtain a second local sub-feature; performing convolution processing by using the position codes of the second feature dimension and the own feature dimension together with the spliced features of the second local sub-feature, to obtain a second local output feature; and fusing the second local feature and the second local output feature to obtain the second global feature corresponding to the second local feature.
- 4. The model training method according to claim 1, wherein the fusing the local features and the global feature to obtain a fusion feature comprises: acquiring a first local feature of the local features and the global feature by using a fusion modeling module of the skeleton action recognition model; acquiring a first fusion feature output by fusing the first local feature and the global feature; acquiring a second fusion feature obtained by fusing the first local feature and the first fusion feature, and acquiring a first global feature of the second fusion feature; acquiring a third fusion feature output by fusing the global feature and the first fusion feature, and acquiring a second local feature of the third fusion feature; and fusing the first global feature and the second local feature to obtain the fusion feature.
- 5. The model training method according to claim 4, wherein the fusing the first global feature and the second local feature to obtain the fusion feature comprises: performing a 1×1 convolution on the spliced feature of the first global feature and the second local feature to obtain the fusion feature.
- 6. The model training method according to claim 1, wherein the fusing the local features and the global feature to obtain a fusion feature comprises: fusing the local features and the global feature to obtain the fusion feature, and inputting the fusion feature sequentially into a local information modeling module, a fusion modeling module and a local information modeling module to obtain a high-level fusion feature; and the inputting the fusion feature into a classifier of the skeleton action recognition model to obtain a prediction category of the skeleton video to be trained comprises: performing feature fusion on the local features, the fusion feature and the high-level fusion feature, and inputting the feature fusion result into the classifier of the skeleton action recognition model to obtain the prediction category of the skeleton video to be trained.
- 7. The model training method according to claim 6, wherein the inputting the fusion feature sequentially into a local information modeling module, a fusion modeling module and a local information modeling module to obtain the high-level fusion feature comprises: inputting the fusion feature into the local information modeling module, and performing downsampling on the fusion feature through the local information modeling module to obtain a first fusion feature; and inputting the first fusion feature sequentially into the fusion modeling module and the local information modeling module to obtain the high-level fusion feature.
- 8. The model training method according to claim 1, wherein the extracting the local features of the skeleton video to be trained by using the local information modeling module of the skeleton action recognition model comprises: extracting an initial feature of the skeleton video to be trained by using the local information modeling module of the skeleton action recognition model; performing a depth-wise graph convolution operation on the skeleton video to be trained to obtain a skeleton feature map; performing a one-dimensional convolution operation along the time dimension on the skeleton feature map to obtain a time-space domain local feature; adding the initial feature and the time-space domain local feature to obtain a time-space domain fused local feature; inputting the time-space domain fused local feature into a feedforward neural network to obtain a feedforward local feature; and adding the time-space domain fused local feature and the feedforward local feature to obtain the local features.
- 9. The model training method according to claim 8, wherein the acquiring the skeleton video to be trained comprises: acquiring a skeleton video; dividing the skeleton video equally into a plurality of video segments along the time axis; and extracting the same number of skeleton video frames from each video segment to form the skeleton video to be trained.
- 10. A skeleton action recognition method, characterized in that the skeleton action recognition method comprises: acquiring a skeleton video; inputting the skeleton video into a pre-trained skeleton action recognition model, wherein the skeleton action recognition model is trained by the model training method of any one of claims 1 to 9; and obtaining the skeleton action category output by the skeleton action recognition model.
- 11. A skeleton action recognition apparatus, comprising a processor and a memory, the memory having program data stored therein, the processor being configured to execute the program data to implement the model training method of any one of claims 1 to 9, and/or the skeleton action recognition method of claim 10.
- 12. A computer readable storage medium for storing program data which, when executed by a processor, implements the model training method of any one of claims 1 to 9 and/or the skeleton action recognition method of claim 10.
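The position-sensitive convolution of claim 1 can be illustrated with a small numerical sketch. The NumPy snippet below is not the patented implementation: the 2-D feature layout (own feature dimension × convolution action dimension), the uniform kernel weights, and the function name are illustrative assumptions; in practice the kernel would be learned.

```python
import numpy as np

def position_sensitive_conv(x, pos_code):
    """Sketch of claim 1's position-sensitive convolution (assumed shapes).

    x:        local feature of shape (C, N) -- C entries along its own
              feature dimension, N entries along the convolution action
              dimension.
    pos_code: position code of shape (1, N), one code per position along
              the convolution action dimension.
    """
    C, N = x.shape
    # 1. Dimension copying: repeat the position code along the feature
    #    dimension so its length matches that of the local feature.
    pos_copied = np.repeat(pos_code, C, axis=0)            # (C, N)
    # 2. Add to obtain a position-sensitive local feature.
    x_pos = x + pos_copied                                 # (C, N)
    # 3. Splice with the original feature along the convolution action
    #    dimension.
    x_cat = np.concatenate([x_pos, x], axis=1)             # (C, 2N)
    # 4. Convolve with a 1-D kernel oriented along the convolution action
    #    dimension whose size equals that dimension's length N.
    kernel = np.ones(N) / N                                # placeholder weights
    out = np.stack([np.convolve(row, kernel, mode="valid") for row in x_cat])
    return out                                             # (C, N + 1)
```

Note that with a "valid" convolution the output length along the convolution action dimension is N + 1; the patent does not specify padding in this excerpt, so that choice is arbitrary here.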
Description
Model training method, skeleton action recognition method, device and storage medium

Technical Field

The present application relates to the field of computer vision, and in particular to a model training method based on skeleton action recognition, a skeleton action recognition method, a skeleton action recognition device, and a computer readable storage medium.

Background

Graph convolution performs well on skeleton-based action recognition tasks owing to its good topological expressive power on non-Euclidean structured data. However, the operation of graph convolution is limited to local neighbors, which limits its ability to capture global information. This lack of global information makes graph-convolution-based skeleton action recognition methods prone to confusing locally similar actions. To overcome this drawback of graph convolution, researchers have proposed many solutions. To capture longer-range temporal dependencies, long short-term memory (LSTM) networks were introduced into graph convolutional networks. To capture global information of the skeleton time-space domain simultaneously, self-attention-based methods were introduced into graph convolution: for example, an attention matrix computed over the time-space dimensions by a non-local operation is used as the adjacency matrix of the graph convolution, and Transformer-based methods were introduced into the field of skeleton action recognition, using multi-head self-attention to capture global information of the skeleton time-space domain. However, LSTM cannot obtain enough global information due to its serial (step-by-step) processing. And although Transformer-based approaches capture global information well, their computational complexity is quadratic in the number of tokens, which is impractical on devices with limited computational resources.
Disclosure of Invention

The application provides a model training method based on skeleton action recognition, a skeleton action recognition method, a skeleton action recognition device and a computer readable storage medium. The application provides a model training method based on skeleton action recognition, which comprises the following steps: acquiring a skeleton video to be trained; extracting local features of the skeleton video to be trained by using a local information modeling module of a skeleton action recognition model; acquiring a first local feature, a second local feature and a third local feature of the local features corresponding to three feature dimensions; performing convolution operations over the other feature dimensions on the first local feature, the second local feature and the third local feature by using a global information modeling module of the skeleton action recognition model, so as to acquire a first global feature corresponding to the first local feature, a second global feature corresponding to the second local feature and a third global feature corresponding to the third local feature; fusing the first global feature, the second global feature and the third global feature to obtain a global feature, and fusing the local features and the global feature to obtain a fusion feature; inputting the fusion feature into a classifier of the skeleton action recognition model to obtain a prediction category of the skeleton video to be trained; and training the skeleton action recognition model based on the prediction category and a label category of the skeleton video to be trained.
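The multi-view modeling and fusion described above can be sketched numerically. In the NumPy snippet below, the tensor layout (C, T, V) for channels, frames and joints, and the use of a mean over the other dimensions as a stand-in for the learned per-view convolutions, are illustrative assumptions rather than the patented implementation.

```python
import numpy as np

def multi_view_global_fusion(local):
    """Sketch of the three-view global modeling and fusion (assumed shapes).

    local: local feature of shape (C, T, V) -- channels, frames, joints.
    Each view keeps its own feature dimension and aggregates over the
    other two; a mean stands in for the learned convolution.
    """
    # View 1: own dimension = channels; aggregate over time and joints.
    g1 = local.mean(axis=(1, 2), keepdims=True) * np.ones_like(local)
    # View 2: own dimension = time; aggregate over channels and joints.
    g2 = local.mean(axis=(0, 2), keepdims=True) * np.ones_like(local)
    # View 3: own dimension = joints; aggregate over channels and time.
    g3 = local.mean(axis=(0, 1), keepdims=True) * np.ones_like(local)
    # Fuse the three global views, then fuse local and global features.
    global_feat = (g1 + g2 + g3) / 3.0
    fused = local + global_feat
    return fused
```

The additive fusion in the last step is also just one simple choice; the claims leave the fusion operator to the fusion modeling module.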
Wherein the convolution operation over the other feature dimensions is a position-sensitive convolution operation. The position-sensitive convolution operation comprises: determining the own feature dimension and the other feature dimensions of the second local feature; acquiring position codes of skeleton joint points over the own feature dimension and the convolution action dimension; and performing convolution processing by using the position codes over the own feature dimension and the convolution action dimension together with the spliced features of the second local feature, to obtain the second global feature corresponding to the second local feature. The acquiring the position codes of the skeleton joint points over the own feature dimension and the convolution action dimension comprises: acquiring a position index of the convolution action dimension and a length of the own feature dimension; acquiring a position index of the skeleton joint point in its own feature dimension; and acquiring the position code of the skeleton joint point according to the position index of the convolution action dimension, the length of the own feature dimension and the position index of the own feature dimension. The convolution processing is performed by using the position codes over the own feature dimension and the convolution action dimension together with the spliced features of the second local feature
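The position-code construction above names three inputs: the position index along the convolution action dimension, the length of the own feature dimension, and the position index along the own feature dimension. This excerpt does not give the exact formula, so the snippet below assumes a Transformer-style sinusoidal encoding purely for illustration; the function name and formula are hypothetical.

```python
import numpy as np

def joint_position_code(conv_len, feat_len):
    """Hypothetical position code built from the three quantities named in
    claim 2. A sinusoidal form is ASSUMED; the patent excerpt does not fix
    the formula.

    conv_len: length of the convolution action dimension.
    feat_len: length of the own feature dimension.
    Returns a code of shape (conv_len, feat_len).
    """
    pos = np.arange(conv_len)[:, None]   # index along the convolution dim
    i = np.arange(feat_len)[None, :]     # index along the own feature dim
    # Angle combines the conv-dimension index, the feature-dimension index,
    # and the feature-dimension length, as enumerated in claim 2.
    angle = pos / np.power(10000.0, (2 * (i // 2)) / feat_len)
    # Even feature indices get sine, odd indices get cosine.
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))
```

Such a code could then be dimension-copied and added to the local feature before splicing, as claim 1 describes.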