CN-121982767-A - Dynamic gesture preprocessing and authentication method, medium and device
Abstract
The invention provides a dynamic gesture preprocessing and authentication method, a medium, and a device. The method comprises a preprocessing stage and an authentication stage. In the preprocessing stage, hand region segmentation and skeleton information extraction are performed on the input video data, which are then standardized to obtain a standardized image sequence and a corresponding skeleton sequence. In the authentication stage, the standardized image sequence and the corresponding skeleton sequence are input into an appearance-and-motion network for feature extraction; the network comprises an appearance branch and a motion branch, the appearance branch extracts appearance features from the standardized image sequence, the motion branch extracts motion features from the skeleton sequence, and feature-level fusion is performed through an adaptive re-coupling and fusion mechanism to obtain the final identity features. By adopting a two-stream authentication network, the method decouples appearance and motion characteristics, achieves targeted extraction and complementary expression of these features, and thereby improves authentication efficiency and accuracy.
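For orientation, the following is a minimal structural sketch of the two-stage pipeline summarized above, written in PyTorch-style Python; the class and argument names (DualStreamAuthNet, appearance_branch, motion_branch, fusion) are illustrative placeholders and not names used in the patent.

```python
# Minimal structural sketch of the appearance-and-motion network, assuming PyTorch.
# All module/class names here are illustrative placeholders, not from the patent.
import torch
import torch.nn as nn

class DualStreamAuthNet(nn.Module):
    """Two branches plus an adaptive re-coupling/fusion stage, as summarized in the abstract."""
    def __init__(self, appearance_branch: nn.Module, motion_branch: nn.Module, fusion: nn.Module):
        super().__init__()
        self.appearance_branch = appearance_branch  # operates on the standardized image sequence
        self.motion_branch = motion_branch          # operates on the skeleton sequence
        self.fusion = fusion                        # adaptive re-coupling and fusion mechanism

    def forward(self, images: torch.Tensor, skeletons: torch.Tensor) -> torch.Tensor:
        x_a = self.appearance_branch(images)   # appearance features X_A
        x_m = self.motion_branch(skeletons)    # motion features X_M
        return self.fusion(x_a, x_m)           # final identity features Y_F
```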
Inventors
- KANG WENXIONG
- ZHANG YUFENG
- ZENG MING
Assignees
- 华南理工大学 (South China University of Technology)
Dates
- Publication Date: 20260505
- Application Date: 20251225
Claims (8)
- 1. A dynamic gesture preprocessing and authentication method, characterized by comprising a preprocessing stage and an authentication stage; the preprocessing stage comprises performing hand region segmentation and skeleton information extraction on input video data and standardizing the input video data to obtain a standardized image sequence and a corresponding skeleton sequence; the authentication stage comprises inputting the standardized image sequence and the corresponding skeleton sequence into an appearance-and-motion network for feature extraction, wherein the appearance-and-motion network comprises an appearance branch and a motion branch, the appearance branch extracts appearance features X_A from the standardized image sequence, the motion branch extracts motion features X_M from the skeleton sequence, feature-level fusion is finally performed through an adaptive re-coupling and fusion mechanism to obtain the final identity features Y_F, and a dynamic gesture authentication result is obtained according to the identity features Y_F.
- 2. The dynamic gesture preprocessing and authentication method as recited in claim 1, wherein the authentication stage comprises: sampling N_A frames of images from the standardized image sequence of N frames at a sampling rate d to obtain a video image sequence with dimensions B×N_A×C×H×W, wherein B is the number of samples and H×W is the image resolution; inputting the video image sequence into the appearance branch of the appearance-and-motion network, which uses a pre-trained ResNet as its feature extraction backbone to obtain appearance features X_A, wherein C_A denotes the number of channels and h and w denote the height and width of the feature, respectively; embedding the skeleton sequence into a spatio-temporal graph structure to obtain a graph embedding result, wherein the nodes of the graph embedding represent hand key points and the edges are established according to the hand structural connection relations and the inter-frame temporal relations; and inputting the graph embedding result into the motion branch of the appearance-and-motion network, which uses an adaptive graph convolutional network as its feature extraction backbone and outputs motion features X_M, wherein C_M denotes the number of channels, N_M denotes the number of frames with N_M < N, and J denotes the number of hand key points.
- 3. The dynamic gesture preprocessing and authentication method according to claim 2, wherein the adaptive re-coupling and fusion mechanism is implemented by a re-coupling module and an adaptive fusion module; the inputs of the re-coupling module are the appearance features X_A and the motion features X_M, wherein C_A and C_M denote the channel numbers, N_A and N_M denote the frame numbers, and h and w are the spatial dimensions; the feature X'_A is obtained from X_A by the frame-number unification operation U(), wherein RP() denotes repeated padding and DS() denotes downsampling; an absolute position embedding PE is added to the appearance features, and a query vector Q, a key vector K and a value vector V are generated by the convolutions Conv_Q(), Conv_K() and Conv_V(), respectively; after dimension transformation, the attention affinity matrix corresponding to the N_M frames is obtained, wherein a_{n,i,j} represents the correlation between the i-th pixel feature and the j-th hand key point in the n-th frame, and appearance semantic features S are obtained for re-coupling; the appearance semantic features S are reshaped to the same dimensions as the original motion features X_M and are used to enhance X_M through a residual connection, yielding the enhanced motion features; the enhanced motion features are passed through global pooling and a fully connected layer for channel expansion to obtain a motion feature vector Y_M; the appearance features X_A are compressed along the spatial and temporal dimensions by global spatial average pooling and global temporal average pooling to obtain an appearance feature vector Y_A; and the adaptive fusion module concatenates the appearance feature vector Y_A and the motion feature vector Y_M according to weights that are dynamically generated by a fully connected layer FC, with Concat() denoting the concatenation operation, to obtain the final identity features Y_F.
- 4. The dynamic gesture preprocessing and authentication method as recited in claim 3, wherein the loss function is defined in terms of the AMSoftmax loss function L_AMS, with α and β respectively denoting the weights of the corresponding loss terms.
- 5. The dynamic gesture preprocessing and authentication method according to claim 1, wherein the preprocessing stage comprises: performing hand region segmentation and skeleton information extraction on the input video data with an SP-Net model, and outputting a hand segmentation result and a skeleton sequence in which each hand key point is represented by two-dimensional spatial coordinates; and standardizing the original video data with a GE-Stan module by combining the hand segmentation result and the skeleton sequence, so that consistency of the spatio-temporal distribution is ensured without changing the video shape.
- 6. The dynamic gesture preprocessing and authentication method according to claim 5, wherein the standardization of the original video data by the GE-Stan module comprises background removal, illumination calibration, position standardization, angle standardization and scale standardization; the background removal uses the hand segmentation result to remove irrelevant background regions from the original video data, retaining only the hand region; the illumination calibration computes the average brightness value b_c of the current hand region, compares it with the target brightness level b_o, and adjusts the image brightness according to the resulting proportionality coefficient; the position standardization uses the skeleton sequence to align, through a translation operation, the palm-root joint point of the illumination-calibrated image to the bottom-center position of the image frame; the angle standardization takes the palm root as the rotation center and aligns the line connecting the palm root and the middle-finger root with the vertical axis of the image, so that gesture orientations are unified and the distribution deviation caused by gesture rotation is reduced; and the scale standardization eliminates the scale variation caused by different hand-to-camera distances by adjusting the palm center line to a fixed length.
- 7. A readable storage medium, wherein the storage medium has stored thereon a computer program which, when executed by a processor, causes the processor to perform the dynamic gesture preprocessing and authentication method of any one of claims 1-6.
- 8. A computer device comprising a processor and a memory for storing a program executable by the processor, wherein the processor, when executing the program stored in the memory, implements the dynamic gesture preprocessing and authentication method of any one of claims 1-6.
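Claim 3's adaptive fusion module concatenates the pooled appearance vector Y_A and motion vector Y_M using weights generated by a fully connected layer. Below is a minimal sketch in PyTorch; normalizing the two weights with a softmax is an assumption, since the claim only states that the weights are dynamically generated by the FC layer and the exact fusion formula is not reproduced here.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Sketch of weighted concatenation of appearance vector Y_A and motion vector Y_M."""
    def __init__(self, dim_a: int, dim_m: int):
        super().__init__()
        # FC layer that dynamically generates the two fusion weights from both vectors.
        self.fc = nn.Linear(dim_a + dim_m, 2)

    def forward(self, y_a: torch.Tensor, y_m: torch.Tensor) -> torch.Tensor:
        # Two scalar weights per sample; softmax normalization is an assumption.
        w = torch.softmax(self.fc(torch.cat([y_a, y_m], dim=-1)), dim=-1)
        # Concatenate the weighted vectors to form the identity feature Y_F.
        return torch.cat([w[:, :1] * y_a, w[:, 1:] * y_m], dim=-1)
```

Claim 4 trains the network with AMSoftmax (additive-margin softmax) losses weighted by α and β. The sketch below implements the standard AM-Softmax formulation with scale s and margin m; the way the branch losses are combined, for example L = L_AMS(Y_F) + α·L_AMS(Y_A) + β·L_AMS(Y_M), is an assumption, since the claim only states that α and β weight the corresponding loss terms.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxLoss(nn.Module):
    """Standard additive-margin softmax (AM-Softmax) over identity features."""
    def __init__(self, feat_dim: int, num_classes: int, s: float = 30.0, m: float = 0.35):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, feat_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalized features and class weights.
        cos = F.linear(F.normalize(features), F.normalize(self.weight))
        # Subtract the additive margin m from the target-class cosine only, then scale by s.
        one_hot = F.one_hot(labels, cos.size(1)).to(cos.dtype)
        logits = self.s * (cos - self.m * one_hot)
        return F.cross_entropy(logits, labels)
```

Claim 6's GE-Stan standardization can be illustrated with a small NumPy/OpenCV sketch: brightness is rescaled toward the target level b_o (taking the proportionality coefficient as b_o / b_c is an assumption), and a single affine warp rotates the palm-root-to-middle-finger-root line onto the vertical axis, scales that line to a fixed length, and translates the palm root to the bottom center of the frame. Function and parameter names are hypothetical.

```python
# Illustrative GE-Stan-style frame standardization; names and defaults are hypothetical.
import cv2
import numpy as np

def standardize_frame(frame: np.ndarray, palm_root: np.ndarray, middle_root: np.ndarray,
                      target_brightness: float = 128.0, target_palm_len: float = 120.0) -> np.ndarray:
    h, w = frame.shape[:2]

    # Illumination calibration: scale brightness toward the target level b_o
    # (b_c approximated over the whole background-removed frame; coefficient assumed to be b_o / b_c).
    b_c = float(frame.mean())
    frame = np.clip(frame * (target_brightness / max(b_c, 1e-6)), 0, 255).astype(np.uint8)

    # Angle standardization: clockwise offset (degrees) of the palm-root -> middle-finger-root
    # line from the image's "up" direction; rotating by this amount aligns it with the vertical axis.
    v = middle_root.astype(float) - palm_root.astype(float)
    angle_cw = float(np.degrees(np.arctan2(v[0], -v[1])))

    # Scale standardization: scale so the palm line has a fixed length.
    scale = target_palm_len / max(float(np.linalg.norm(v)), 1e-6)

    # Rotation + scaling about the palm root, then translate the palm root to the bottom center.
    M = cv2.getRotationMatrix2D((float(palm_root[0]), float(palm_root[1])), angle_cw, scale)
    M[:, 2] += np.array([w / 2.0, h - 1.0]) - palm_root.astype(float)
    return cv2.warpAffine(frame, M, (w, h))
```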
Description
Dynamic gesture preprocessing and authentication method, medium and device
Technical Field
The present invention relates to the field of biometric identification technologies, and in particular to a dynamic gesture preprocessing and authentication method, medium, and apparatus.
Background
Driven by the rapid development of artificial intelligence algorithms, biometric authentication technology has obvious advantages over traditional methods and great development potential. However, existing biometric authentication technologies still suffer from one or more problems such as reliance on a single type of characteristic, weak tolerance to counterfeit attacks, and unfriendly data acquisition, which results in poor user experience. Some biometric authentication technologies also touch on user privacy, making them difficult to popularize. Dynamic gestures are a biometric trait with great development potential: they carry both physiological characteristics (such as palm shape and finger length) and behavioral characteristics (such as strength and flexibility), so dynamic gesture authentication can, in theory, achieve higher accuracy and security. However, since dynamic gesture authentication is a fine-grained video understanding task, algorithm design is very challenging: the model must find subtle physiological differences among a large number of similar people and weak motion-pattern differences among similar behaviors. In many cases, the variation caused by environmental interference is much larger than the variation between identities, so authentication algorithms are easily affected by the environment. Existing authentication methods are tested in fixed scenes and do not consider the influence of complex real-world environments; when the application environment changes, existing algorithms cannot accurately identify the user, which makes practical deployment difficult.
Dynamic gesture authentication based on video understanding involves spatio-temporal feature extraction. The mainstream spatio-temporal feature extraction networks currently include three-dimensional convolutional neural networks, two-stream convolutional neural networks, and two-dimensional convolutional neural networks. Wu et al. first introduced a two-stream convolutional neural network to gesture authentication and employed optical flow to represent behavioral characteristics (TSCNN). Considering that optical flow extraction is inefficient and its effect is poor, Song et al. adopted PA to replace the optical flow on this basis (ITSCNN) and greatly reduced the equal error rate of authentication. However, since the two-stream network doubles the number of parameters and the computation, its operational efficiency still leaves large room for improvement. In addition, Liu et al. designed DHGA-Net on the basis of a three-dimensional convolutional neural network and achieved a satisfactory authentication effect. However, three-dimensional convolution also suffers from a large number of parameters and low operational efficiency. Meanwhile, since three-dimensional convolution cannot process physiological and behavioral characteristics in a targeted manner, there is still considerable room for improvement in authentication accuracy.
In order to alleviate the heavy computation and low efficiency of the two-stream and three-dimensional networks, Song et al. customized TDS-Net and 3DTDS-Net for gesture authentication. These two methods attach a lightweight symbiotic branch to an efficient two-dimensional convolutional neural network for targeted analysis of behavioral characteristics, and realize complementary fusion of physiological and behavioral characteristics through the BE feature fusion module and the BM feature fusion module, respectively, which alleviates the heavy computation of the three-dimensional and two-stream networks to a certain extent while further improving authentication accuracy. However, in TDS-Net and 3DTDS-Net, gesture behavior feature analysis is severely limited by the features provided by the backbone network and by the limited nonlinear expression capability of the symbiotic branches, so gesture behavioral characteristics are not well exploited. Song et al. subsequently proposed DwTNL-Net and FSTA-Net, both of which use a behavior pseudo-modality plus an attention mechanism to enhance the modeling capability for behavioral characteristics in the time dimension: DwTNL-Net proposes TSClip as the behavior pseudo-modal representation, while FSTA-Net uses BM-Map to highlight behavioral characteristics. In terms of global time-domain information extraction, DwTNL-Net and FSTA-Net integrate loca