CN-121564806-B - Open-set skeleton action recognition method and device based on outlier prototype learning


Abstract

The application provides an open-set skeleton action recognition method and device based on outlier prototype learning. The method comprises: constructing a neural network model; preprocessing human skeleton data and obtaining initial features through a multi-branch feature extraction network; processing the initial features through a classifier and a hypersphere feature mapper to obtain action category logits and branch features; training the neural network model on a training set, synthesizing virtual outliers in the in-distribution sample feature space that has been optimized through multiple types of losses after a first number of iterations, and optimizing an energy boundary by combining the in-distribution samples with the virtual outliers; after a second number of iterations, selecting the optimal model weights according to the combined open-set recognition and closed-set classification performance on a validation set; and determining the action type and triggering the corresponding robot operation based on the energy score and the action classification result output by the trained neural network model.

Inventors

WANG BAICUN
SONG CI
BAO JINSONG
ZHENG PAI
XIAO XIAO
YANG CHENLONG

Assignees

Zhejiang University (浙江大学)

Dates

Publication Date: 20260505
Application Date: 20260123

Claims (10)

1. An open-set skeleton action recognition method based on outlier prototype learning, characterized by comprising the following steps: constructing a skeleton action recognition neural network model; preprocessing human skeleton data, and obtaining initial features through a multi-branch feature extraction network; processing the initial features through a classifier and a hypersphere feature mapper, respectively, to obtain action category logits and branch features in a unified hypersphere feature space, wherein the action category logits are the probability or likelihood values that the classifier outputs after processing the initial features and that are used to judge which action category a sample belongs to; dividing the human skeleton data into a training set and a validation set, training the skeleton action recognition neural network model based on the training set, optimizing the in-distribution sample feature space through multiple types of losses after a first number of iterations, synthesizing virtual outliers in the optimized in-distribution sample feature space, optimizing an energy boundary by combining the in-distribution samples with the virtual outliers, and, after a second number of iterations, selecting the optimal model weights according to the combined open-set recognition and closed-set classification performance on the validation set; and inputting human skeleton data acquired in real time into the trained skeleton action recognition neural network model, and determining the action type and triggering the corresponding robot operation according to the energy score and the action classification result output by the neural network model.
2. The method of claim 1, wherein synthesizing virtual outliers in the optimized in-distribution sample feature space comprises: in the hypersphere feature space optimized by the multiple types of losses, for each action category, calculating the distances between the hypersphere feature of each in-distribution sample of that category and the hypersphere features of all other samples of the same category, selecting a first preset number of the largest distance values, and determining the samples corresponding to those distance values as candidate boundary points; calculating the average distance between each candidate boundary point and all samples of the same category, and selecting the sample with the largest average distance as the category boundary point; obtaining the hypersphere features of the in-distribution samples of the other action categories, and calculating the difference between the category boundary point and the hypersphere features of the samples of the other categories; amplifying the difference by a preset ratio with the category boundary point as the reference, and adding the amplified difference to the feature of the category boundary point to generate a new feature point; and taking the generated new feature point as a virtual outlier, wherein the virtual outlier lies outside the boundary of the in-distribution sample feature space.
3. The method of claim 1, wherein optimizing the energy boundary by combining the in-distribution samples with the virtual outliers comprises: for the in-distribution samples in the training set and for the synthesized virtual outliers, respectively, taking the logits of all action categories output by the classifier, exponentiating each logit, summing the results, and taking the negative logarithm of the sum to obtain the energy score of each in-distribution sample and each virtual outlier; at a preset period, collecting the energy scores of all in-distribution samples in the training set, sorting the energy scores, selecting the score value at a specific quantile, and determining that score value as the current in-distribution energy boundary; setting a preset out-of-distribution energy boundary, wherein the out-of-distribution energy boundary is higher than the in-distribution energy boundary; calculating a first part, by which the energy score of an in-distribution sample exceeds the in-distribution energy boundary, and a second part, by which the energy score of a virtual outlier falls below the out-of-distribution energy boundary; and taking the positive part of the excess in the first part and of the shortfall in the second part, calculating the average of these positive parts over all in-distribution samples and virtual outliers as the energy score loss function value, and updating the parameters of the skeleton action recognition neural network model through back propagation of the energy score loss function, so that after the update the energy scores of the in-distribution samples gather below the in-distribution energy boundary and the energy scores of the virtual outliers gather above the out-of-distribution energy boundary.
4. The method of claim 1, wherein the multiple types of losses include a cross-entropy loss, an intra-modality prototype contrast loss, and an inter-modality prototype contrast loss, and the calculation of the cross-entropy loss comprises: counting the total number of in-distribution samples and the total number of action categories in the training set; determining the true action category of each in-distribution sample in the training set; obtaining the predicted probabilities, output by the classifier, that each in-distribution sample belongs to each action category; for each in-distribution sample, taking the negative logarithm of the predicted probability corresponding to its true action category; and calculating the average of these negative values over all in-distribution samples to obtain the cross-entropy loss value.
5. The method of claim 4, wherein the calculation of the intra-modality prototype contrast loss comprises: determining the class prototypes of all action categories, wherein each historical class prototype is fused, according to a preset ratio, with the hypersphere features of the in-distribution samples of the same category in the current training set, and the fusion result is normalized; for each in-distribution sample in the training set, extracting the hypersphere feature of the sample, and determining the class prototype of the action category to which it belongs and the class prototypes of all other action categories; calculating a first similarity between the hypersphere feature of the in-distribution sample and the class prototype of its own action category, and second similarities between that hypersphere feature and the class prototype of each other action category; converting the first similarity into exponential form as the numerator, taking the sum of the numerator and the exponential forms of all the second similarities as the denominator, and dividing the numerator by the denominator to obtain the class attribution probability of the in-distribution sample; and taking the negative logarithm of the class attribution probability of each in-distribution sample, and calculating the average of these negative values over all in-distribution samples to obtain the intra-modality prototype contrast loss value.
6. The method of claim 4, wherein the calculation of the inter-modality prototype contrast loss comprises: obtaining the class prototype sets of all branches in the multi-branch feature extraction network, wherein the class prototype set of each branch comprises the class prototypes of all action categories; for every two different branches, calculating the similarity between the class prototype of each action category in the first branch and the class prototypes of all action categories in the second branch; exponentiating the similarity between the class prototype of a target action category in the first branch and the class prototype of the same action category in the second branch to obtain the numerator; taking as the denominator the sum of the exponentiated similarities between the class prototype of the target action category in the first branch and the class prototypes of all action categories in the second branch, and dividing the numerator by the denominator to obtain the cross-branch matching probability of the target action category; taking the negative logarithm of the cross-branch matching probability of each target action category, and calculating the average of these negative values over all target action categories to obtain the prototype contrast loss between the two branches; and calculating the prototype contrast losses of all different branch combinations and adding them together to obtain the total inter-modality prototype contrast loss value.
7. The method of claim 1, wherein the skeleton action recognition neural network model comprises a data preprocessing module, a multi-branch feature extraction network, a classifier, and a hypersphere feature mapper; the data preprocessing module is used for normalizing the received human skeleton data, decomposing the processed data into joint position data, joint velocity data and bone data, and delivering the joint position data, the joint velocity data and the bone data to the corresponding branches of the multi-branch feature extraction network; the multi-branch feature extraction network comprises three parallel feature extraction branches, each branch uses a graph neural network as its backbone, the three branches respectively receive the joint position data, the joint velocity data and the bone data, and the graph neural network is used for extracting the spatial topological relations and motion features of each type of data and outputting the initial features of the corresponding branch; the classifier comprises three sub-classifiers in one-to-one correspondence with the branches of the multi-branch feature extraction network, and each sub-classifier is used for receiving the initial features output by its corresponding branch, performing a category mapping operation on the initial features, and outputting the logits of a sample belonging to the different action categories under that branch; and the input of the hypersphere feature mapper is used for receiving the initial features output by the three branches, converting the three types of initial features into the same hypersphere feature space through a unified feature mapping operation, and outputting the hypersphere features corresponding to each branch.
8. The method of claim 1, wherein the preprocessing of human skeleton data comprises: acquiring a human skeleton data sequence to be processed, and identifying and extracting the three-dimensional coordinates of the root joint from frame 0 of the human skeleton data sequence as the reference coordinates; for a target frame in the human skeleton data sequence, subtracting the three-dimensional coordinates of the frame-0 root joint from the three-dimensional coordinates of each joint in the target frame to obtain the three-dimensional coordinates of each joint in the target frame relative to the frame-0 root joint; and repeating this coordinate transformation for every frame to obtain the normalized human skeleton data.
9. The method according to claim 1, wherein determining the action type and triggering the corresponding robot operation according to the energy score and the action classification result output by the neural network model comprises: obtaining the energy score output by the neural network model and the logits of the multi-branch feature extraction network; comparing the energy score with a predefined decision threshold, and, if the energy score is greater than the decision threshold, determining that the corresponding action is an unknown action and not triggering the robot to perform any operation; and if the energy score is smaller than the decision threshold, determining the action type based on the logits and triggering the robot to execute the operation corresponding to the action type.
10. An open-set skeleton action recognition device based on outlier prototype learning, characterized by comprising a construction module, a processing module, an optimization module and a recognition module; the construction module is used for constructing a skeleton action recognition neural network model, wherein the skeleton action recognition neural network model comprises a data preprocessing module, a multi-branch feature extraction network, a classifier and a hypersphere feature mapper; the processing module is used for preprocessing human skeleton data and obtaining initial features through the multi-branch feature extraction network; the processing module is further used for processing the initial features through the classifier and the hypersphere feature mapper, respectively, to obtain action category logits and branch features in a unified hypersphere feature space, wherein the action category logits are the probability or likelihood values that the classifier outputs after processing the initial features and that are used to judge which action category a sample belongs to; the optimization module is used for dividing the human skeleton data into a training set and a validation set, training the skeleton action recognition neural network model based on the training set, optimizing the in-distribution sample feature space through multiple types of losses after a first number of iterations, synthesizing virtual outliers in the optimized in-distribution sample feature space, optimizing an energy boundary by combining the in-distribution samples with the virtual outliers, and, after a second number of iterations, selecting the optimal model weights according to the combined open-set recognition and closed-set classification performance on the validation set; and the recognition module is used for inputting human skeleton data acquired in real time into the trained skeleton action recognition neural network model, and determining the action type and triggering the corresponding robot operation according to the energy score and the action classification result output by the neural network model.
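The boundary-point selection and outlier synthesis described in claim 2 can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the patented implementation: Euclidean distance is assumed, both samples of each of the largest pairwise distances are treated as candidates, the difference is taken as boundary point minus other-category feature, and the names `class_boundary_index`, `synthesize_outliers`, `top_k` and `scale` are hypothetical. The claim does not say whether the new point is re-normalized onto the hypersphere, so the sketch leaves it unnormalized.

```python
import math
from typing import List, Sequence

Vec = Sequence[float]


def dist(a: Vec, b: Vec) -> float:
    """Euclidean distance (the choice of metric is an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def class_boundary_index(feats: List[Vec], top_k: int) -> int:
    """Index of the category boundary point among same-category features.

    Requires at least two samples in the category.
    """
    n = len(feats)
    # Pairwise distances inside one category, largest first.
    pairs = sorted(
        ((dist(feats[i], feats[j]), i, j) for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    # Assumption: both samples of each of the top_k distances become candidates.
    candidates = sorted({idx for _, i, j in pairs[:top_k] for idx in (i, j)})

    def mean_dist(i: int) -> float:
        return sum(dist(feats[i], feats[j]) for j in range(n) if j != i) / (n - 1)

    # The candidate with the largest average distance is the boundary point.
    return max(candidates, key=mean_dist)


def synthesize_outliers(boundary: Vec, others: List[Vec], scale: float) -> List[List[float]]:
    """One virtual outlier per other-category feature.

    Assumption: difference = boundary - other, so the new point is pushed
    away from the other category by `scale` times that difference.
    """
    return [[b + scale * (b - o) for b, o in zip(boundary, other)] for other in others]
```

With the opposite sign convention (other minus boundary) the synthesized point would move toward the neighbouring category instead; the claim text does not settle which direction is intended.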
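The energy score and the two-boundary loss described in claim 3 can be sketched as follows. This is a minimal illustration, assuming the boundary is read off the sorted in-distribution energies at a quantile; the function names, the `quantile` parameter and the index convention used to pick the quantile are hypothetical and not taken from the patent.

```python
import math
from typing import List, Sequence


def energy_score(logits: Sequence[float]) -> float:
    """Negative log of the summed exponentials of the logits.

    The maximum is subtracted first for numerical stability; the value is
    mathematically identical to -log(sum(exp(logit))).
    """
    m = max(logits)
    return -(m + math.log(sum(math.exp(z - m) for z in logits)))


def in_distribution_boundary(energies: Sequence[float], quantile: float) -> float:
    """Score at the given quantile of the sorted in-distribution energies.

    The index convention below is one possible choice (an assumption).
    """
    ordered = sorted(energies)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]


def energy_margin_loss(
    id_energies: Sequence[float],
    outlier_energies: Sequence[float],
    m_in: float,
    m_out: float,
) -> float:
    """Mean of the positive parts described in claim 3.

    In-distribution samples are penalised for energy above m_in,
    virtual outliers for energy below m_out (with m_out > m_in).
    """
    id_terms: List[float] = [max(0.0, e - m_in) for e in id_energies]
    outlier_terms: List[float] = [max(0.0, m_out - e) for e in outlier_energies]
    terms = id_terms + outlier_terms
    return sum(terms) / len(terms)
```

For example, with in-distribution energies of -5 and -1, a boundary `m_in = -2`, outlier energies of -3 and 0, and `m_out = -1`, only the second in-distribution sample and the first outlier contribute to the loss.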
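The prototype update and the intra-modality prototype contrast loss described in claim 5 can be sketched as follows. This is an illustration under assumptions: similarity is taken to be the dot product of unit-length vectors, the fusion is written for a single feature (a batch mean could be passed instead), no temperature is applied because the claim mentions none, and the names `update_prototype`, `keep_ratio` and `intra_modality_loss` are hypothetical.

```python
import math
from typing import List, Sequence

Vec = Sequence[float]


def l2_normalize(v: Vec) -> List[float]:
    """Scale a vector to unit length so it lies on the hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def update_prototype(old: Vec, feature: Vec, keep_ratio: float) -> List[float]:
    """Fuse the historical prototype with a same-category feature, then normalize."""
    fused = [keep_ratio * p + (1.0 - keep_ratio) * f for p, f in zip(old, feature)]
    return l2_normalize(fused)


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def intra_modality_loss(
    features: List[Vec], labels: List[int], prototypes: List[Vec]
) -> float:
    """Mean negative log class-attribution probability over all samples."""
    total = 0.0
    for z, y in zip(features, labels):
        sims = [dot(z, p) for p in prototypes]
        # log of (own-class term + all other-class terms), computed stably
        m = max(sims)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in sims))
        total += log_denominator - sims[y]
    return total / len(features)
```

A sample that coincides with its own prototype and is orthogonal to the only other prototype yields a loss of log(1 + e^-1), roughly 0.313.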
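The inter-modality prototype contrast loss of claim 6 can be sketched in the same style. The sketch assumes dot-product similarity, a denominator that sums over all categories of the second branch (the same-category term included once), and unordered branch pairs for "all different branch combinations"; the names `pair_loss` and `inter_modality_loss` are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence

Vec = Sequence[float]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def pair_loss(protos_a: List[Vec], protos_b: List[Vec]) -> float:
    """Prototype contrast loss between a first and a second branch.

    protos_a[c] and protos_b[c] are the prototypes of the same category c.
    """
    total = 0.0
    for c, p in enumerate(protos_a):
        sims = [dot(p, q) for q in protos_b]
        m = max(sims)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in sims))
        # negative log of the cross-branch matching probability for category c
        total += log_denominator - sims[c]
    return total / len(protos_a)


def inter_modality_loss(branch_protos: List[List[Vec]]) -> float:
    """Sum of the pairwise losses over all branch combinations.

    Assumption: each unordered pair of branches is counted once.
    """
    return sum(pair_loss(a, b) for a, b in combinations(branch_protos, 2))
```

With the three branches named in claim 7 (joint position, joint velocity, bone), this sums three pairwise terms.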
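The preprocessing of claim 8 amounts to translating every joint by the frame-0 root joint. A minimal sketch, assuming each frame is a list of (x, y, z) tuples and that the root joint sits at a known index (`root_index` is a hypothetical parameter; the patent does not state which joint is the root):

```python
from typing import List, Tuple

Joint = Tuple[float, float, float]
Frame = List[Joint]


def normalize_skeleton(sequence: List[Frame], root_index: int = 0) -> List[Frame]:
    """Express every joint of every frame relative to the frame-0 root joint."""
    if not sequence:
        return []
    rx, ry, rz = sequence[0][root_index]  # reference coordinates from frame 0
    return [
        [(x - rx, y - ry, z - rz) for (x, y, z) in frame]
        for frame in sequence
    ]
```

After this step the frame-0 root joint sits at the origin, while later frames keep any global motion of the body relative to that starting position.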
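The decision rule of claim 9 can be sketched as a small function. Several details are assumptions made for illustration: the three branch logits are fused by summation (the claim does not say how they are combined), an energy score exactly equal to the threshold is treated as a known action, and the function name and the operation names in the usage example are hypothetical.

```python
from typing import List, Optional, Sequence


def decide_operation(
    energy: float,
    branch_logits: Sequence[Sequence[float]],
    threshold: float,
    operations: List[str],
) -> Optional[str]:
    """Return the robot operation to trigger, or None for an unknown action."""
    if energy > threshold:
        # Unknown action: the robot is not triggered.
        return None
    # Assumption: fuse the per-branch logits by summing them per category.
    fused = [sum(per_category) for per_category in zip(*branch_logits)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return operations[best]
```

For instance, `decide_operation(-6.0, [[0.1, 2.0], [0.3, 1.0], [0.2, 1.5]], -3.0, ["hand_over_tool", "hold_part"])` selects the second operation, whereas an energy of -1.0 with the same threshold returns `None`.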

Description

Open-set skeleton action recognition method and device based on outlier prototype learning

Technical Field

The application relates to the technical field of action recognition, and in particular to an open-set skeleton action recognition method and device based on outlier prototype learning.

Background

In human-robot collaborative assembly scenarios, product structures are complex and varied: fully manual assembly suffers from an efficiency bottleneck, while fully automatic assembly is limited by insufficient flexibility and high cost. Combining the flexible adaptability of humans with the operational repeatability of robots to build an efficient human-robot collaborative assembly mode has therefore become an effective way to achieve flexible production and improve manufacturing efficiency. In this process, the robot's perception and understanding of the operator's actions is central to achieving a high level of collaboration.

At present, human action recognition methods in human-robot collaboration systems fall mainly into two types: methods based on video/image input and methods based on skeleton input. The former rely on video frames or image sequences and recognize human actions by means of convolutional neural networks, spatio-temporal feature modeling and the like, but are limited with respect to illumination variation, occlusion, privacy protection and the like. Methods based on skeleton input model the spatio-temporal sequence data of human joints; however, existing research on skeleton-based human action recognition mostly focuses on closed-set scenarios, in which the action categories are the same in the training and testing stages. During actual human-robot collaborative assembly, an operator may perform undefined actions or exhibit abnormal behaviors, which makes the recognition scenario open-set.
Because research on open-set human action recognition based on skeleton input is relatively scarce, existing methods still fall short in open-set detection accuracy and struggle to meet the safety and reliability requirements that complex human-robot collaboration applications place on action recognition. A method is therefore needed to improve the accuracy of open-set skeleton action recognition.

Disclosure of Invention

In view of the above, the present application provides an open-set skeleton action recognition method and device based on outlier prototype learning, for improving the accuracy of open-set skeleton action recognition.

Specifically, the application is realized through the following technical solutions:

A first aspect of the application provides an open-set skeleton action recognition method based on outlier prototype learning, comprising the following steps: constructing a skeleton action recognition neural network model; preprocessing human skeleton data, and obtaining initial features through a multi-branch feature extraction network; processing the initial features through a classifier and a hypersphere feature mapper, respectively, to obtain action category logits and branch features in a unified hypersphere feature space; dividing the human skeleton data into a training set and a validation set, training the skeleton action recognition neural network model based on the training set, optimizing the in-distribution sample feature space through multiple types of losses after a first number of iterations, synthesizing virtual outliers in the optimized in-distribution sample feature space, optimizing an energy boundary by combining the in-distribution samples with the virtual outliers, and, after a second number of iterations, selecting the optimal model weights according to the combined open-set recognition and closed-set classification performance on the validation set; and inputting human skeleton data acquired in real time into the trained skeleton action recognition neural network model, and determining the action type and triggering the corresponding robot operation according to the energy score and the action classification result output by the neural network model.

A second aspect of the application provides an open-set skeleton action recognition device based on outlier prototype learning, comprising a construction module, a processing module, an optimization module and a recognition module; the construction module is used for constructing a skeleton action recognition neural network model, wherein the skeleton action recognition neural network model comprises a data preprocessing module, a multi-branch feature extraction network, a classifier and a hypersphere feature mapper; the processing module is used for preprocessing human skeleton data and obtaining initial features through the multi-branch feature extraction network; the processing module is further used for respectively processing the initial features through the classif