
CN-122023720-A - Single-hand reconstruction method based on gesture decoupling

CN 122023720 A

Abstract

The invention discloses a single-hand reconstruction method based on gesture decoupling, which comprises: obtaining an RGB image containing a hand; performing data augmentation on the RGB image to obtain a plurality of augmented images; and inputting the augmented images into a single-hand reconstruction model to obtain a hand reconstruction result. The single-hand reconstruction model comprises a gesture decoupling module, a feature encoding module and a three-dimensional reconstruction module. The gesture decoupling module is used for extracting basic hand features from the augmented images and mapping them into pose-related features and pose-independent features; the feature encoding module is used for computing a pose-related contrastive loss and a pose-independent contrastive loss from the pose-related and pose-independent features, respectively; and the three-dimensional reconstruction module is used for estimating hand three-dimensional joint coordinates and hand three-dimensional mesh vertex coordinates based on the two contrastive losses and outputting the hand reconstruction result.

Inventors

  • FANG YUCHUN
  • XU YUTAO
  • JIN CHENG
  • CAO YITING

Assignees

  • Shanghai University (上海大学)

Dates

Publication Date
2026-05-12
Application Date
2026-01-31

Claims (9)

  1. A single-hand reconstruction method based on gesture decoupling, comprising: acquiring an RGB image containing a hand, and performing data augmentation on the RGB image to obtain a plurality of augmented images; and inputting the augmented images into a single-hand reconstruction model to obtain a hand reconstruction result, wherein the single-hand reconstruction model comprises a gesture decoupling module, a feature encoding module and a three-dimensional reconstruction module; the gesture decoupling module is used for extracting basic hand features from the augmented images and mapping the basic hand features into pose-related features and pose-independent features; the feature encoding module is used for computing a pose-related contrastive loss and a pose-independent contrastive loss from the pose-related features and the pose-independent features, respectively; and the three-dimensional reconstruction module is used for estimating hand three-dimensional joint coordinates and hand three-dimensional mesh vertex coordinates based on the pose-related contrastive loss and the pose-independent contrastive loss and outputting the hand reconstruction result.
  2. The single-hand reconstruction method based on gesture decoupling of claim 1, wherein extracting the basic hand features from the augmented images and mapping them into the pose-related features comprises: inputting the augmented image into a backbone network to obtain a fine-grained feature map; performing deconvolution up-sampling on the fine-grained feature map to obtain a high-resolution feature map; performing average pooling and convolution on the high-resolution feature map along different directions to obtain one-dimensional heat maps for keypoint detection in each direction; extracting hand two-dimensional joint coordinates from the one-dimensional heat maps by soft-argmax, and converting the hand two-dimensional joint coordinates into a Gaussian heat map; and projecting the Gaussian heat map into feature space to obtain the pose-related features (a sketch of this branch appears after the claims).
  3. The single-hand reconstruction method based on gesture decoupling of claim 2, further comprising, before converting the hand two-dimensional joint coordinates into the Gaussian heat map: optimizing the hand two-dimensional joint coordinates with a two-dimensional pose estimation loss L_{2D} = \sum_{j} \lVert \hat{J}_j - J_j \rVert_1, where j indexes the hand joints, \hat{J}_j denotes the predicted coordinates of the j-th joint, J_j denotes the corresponding ground-truth coordinates, and \lVert \cdot \rVert_1 is the L1 norm.
  4. The single-hand reconstruction method based on gesture decoupling of claim 1, wherein extracting the basic hand features from the augmented images and mapping them into the pose-independent features comprises: inputting the augmented image into a backbone network to obtain a fine-grained feature map; performing global average pooling on the fine-grained feature map to obtain a feature vector; and feeding the feature vector into a projection head comprising two fully connected layers and a batch normalization layer, which maps the feature vector into the pose-independent features (a sketch of this branch appears after the claims).
  5. The single-hand reconstruction method based on gesture decoupling of claim 3, wherein computing the pose-related contrastive loss comprises: L_{pose} = L_{i \to j} + L_{j \to i}; L_{i \to j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{m=1}^{M} \exp(\mathrm{sim}(z_i, z_m^-)/\tau)}; where \tau is a temperature hyperparameter, \mathrm{sim}(\cdot, \cdot) is the cosine similarity used to measure the similarity between two feature vectors, z_i and z_j denote the pose-related feature embeddings of two augmented views, L_{i \to j} is the pose-related contrastive term with z_i as the anchor and z_j as the positive sample, L_{j \to i} is the symmetric term with z_j as the anchor and z_i as the positive sample, m is the negative-sample index, M is the total number of negative samples participating in contrastive learning, z_m^- is the m-th negative embedding, and \exp(\cdot) is the exponential function (an implementation sketch of both contrastive losses appears after the claims).
  6. The single-hand reconstruction method based on gesture decoupling of claim 5, wherein computing the pose-independent contrastive loss comprises: L_{app} = \ell_{i \to j} + \ell_{j \to i}, with \ell_{i \to j} = -\log \frac{\exp(\mathrm{sim}(h_i, h_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(h_i, h_k)/\tau)}; where h_i and h_j denote the pose-independent feature embeddings of two augmented views, \ell_{i \to j} is the pose-independent contrastive term with h_i as the anchor and h_j as the positive sample, \ell_{j \to i} is the symmetric term with h_j as the anchor and h_i as the positive sample, \mathbb{1}_{[k \neq i]} is an indicator function taking the value 1 when k \neq i and 0 otherwise, (h_i, h_k) with k \neq i, j are the negative sample pairs, and N is the number of sample pairs participating in contrastive learning.
  7. The single-hand reconstruction method based on gesture decoupling of claim 6, wherein estimating the hand three-dimensional joint coordinates and the hand three-dimensional mesh vertex coordinates based on the pose-related contrastive loss and the pose-independent contrastive loss and outputting the hand reconstruction result comprises: optimizing the feature encoding module with the pose-related contrastive loss and the pose-independent contrastive loss to obtain an optimized feature encoding module; and transferring the optimized feature encoding module to the three-dimensional reconstruction module, estimating the hand three-dimensional joint coordinates and the hand three-dimensional mesh vertex coordinates with the three-dimensional reconstruction module, and outputting the hand reconstruction result.
  8. The single-hand reconstruction method based on gesture decoupling of claim 7, wherein the three-dimensional reconstruction module comprises a pose regression unit, a mesh regression unit, a projection unit and a mapping unit; the pose regression unit is used for regressing 2.5D hand joint coordinates from the output features of the feature encoding module; the mesh regression unit is used for predicting 2.5D hand mesh coordinates from the 2.5D joint coordinates; the projection unit is used for back-projecting the 2.5D hand mesh coordinates into three-dimensional space to obtain the three-dimensional mesh vertex coordinates; and the mapping unit is used for mapping the three-dimensional mesh vertex coordinates into the hand three-dimensional joint coordinates according to a preset joint regression matrix and taking the hand three-dimensional joint coordinates as the hand reconstruction result (a back-projection sketch appears after the claims).
  9. The single-hand reconstruction method based on gesture decoupling of claim 8, further comprising: constructing a total loss function as a weighted sum of the two-dimensional pose estimation loss, the pose-related contrastive loss and the pose-independent contrastive loss, and optimizing the single-hand reconstruction model with the total loss function (a sketch appears after the claims).
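
The sketches below are illustrative only and are not part of the claims; they show one plausible way to realize the claimed steps in Python with PyTorch, under assumptions about tensor shapes, function names and hyperparameters that the claims do not specify.

For claims 2 and 3, a minimal sketch of the pose-related branch: a differentiable soft-argmax over one-dimensional heat maps, rendering of the resulting two-dimensional joints as Gaussian heat maps, and the L1-based two-dimensional pose estimation loss. The heat-map size and sigma are placeholder values.

    # Illustrative sketch (assumed shapes: 1-D heat maps (B, J, L), joints (B, J, 2)).
    import torch
    import torch.nn.functional as F

    def soft_argmax_1d(heatmap_1d):
        """Differentiable expected coordinate of each 1-D heat map (claim 2)."""
        prob = F.softmax(heatmap_1d, dim=-1)                  # normalize per joint
        coords = torch.arange(heatmap_1d.shape[-1],
                              dtype=prob.dtype, device=prob.device)
        return (prob * coords).sum(dim=-1)                    # (B, J)

    def joints_to_gaussian_heatmap(joints_2d, size=64, sigma=2.0):
        """Render predicted (x, y) joints as Gaussian heat maps (claim 2)."""
        b, j, _ = joints_2d.shape
        grid = torch.arange(size, dtype=joints_2d.dtype, device=joints_2d.device)
        yy, xx = torch.meshgrid(grid, grid, indexing="ij")
        x = joints_2d[..., 0].view(b, j, 1, 1)
        y = joints_2d[..., 1].view(b, j, 1, 1)
        return torch.exp(-((xx - x) ** 2 + (yy - y) ** 2) / (2 * sigma ** 2))

    def pose_2d_loss(pred_joints, gt_joints):
        """L1 loss between predicted and ground-truth 2-D joints (claim 3)."""
        return (pred_joints - gt_joints).abs().sum(dim=-1).mean()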
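
For claim 4, a minimal sketch of the pose-independent branch: global average pooling over the backbone's fine-grained feature map followed by a projection head with two fully connected layers and a batch normalization layer. The channel and embedding dimensions are assumptions.

    import torch.nn as nn

    class AppearanceProjectionHead(nn.Module):
        """Maps a (B, C, H, W) feature map to a pose-independent embedding (claim 4)."""
        def __init__(self, in_dim=2048, hidden_dim=512, out_dim=128):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)               # global average pooling
            self.proj = nn.Sequential(
                nn.Linear(in_dim, hidden_dim),                # first fully connected layer
                nn.BatchNorm1d(hidden_dim),                   # batch normalization layer
                nn.Linear(hidden_dim, out_dim),               # second fully connected layer
            )

        def forward(self, feat_map):
            vec = self.pool(feat_map).flatten(1)              # (B, C) feature vector
            return self.proj(vec)                             # pose-independent features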
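
For claims 5 and 6, a sketch of the two contrastive losses as they are commonly implemented: an InfoNCE-style term over an anchor, one positive and M negatives for the pose-related features, and an NT-Xent-style term over N augmented pairs for the pose-independent features. The temperature value and batch layout are assumptions.

    import torch
    import torch.nn.functional as F

    def pose_contrastive_term(anchor, positive, negatives, tau=0.07):
        """One directed term of the pose-related contrastive loss (claim 5).
        anchor, positive: (B, D); negatives: (B, M, D)."""
        pos = F.cosine_similarity(anchor, positive, dim=-1) / tau                 # (B,)
        neg = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1) / tau   # (B, M)
        logits = torch.cat([pos.unsqueeze(1), neg], dim=1)                        # positive in column 0
        labels = torch.zeros(anchor.shape[0], dtype=torch.long, device=anchor.device)
        # note: the positive also appears in the denominator here, a common implementation variant
        return F.cross_entropy(logits, labels)

    def pose_independent_loss(h1, h2, tau=0.07):
        """NT-Xent-style pose-independent contrastive loss over N pairs (claim 6).
        h1, h2: (N, D) embeddings of the two augmented views."""
        z = F.normalize(torch.cat([h1, h2], dim=0), dim=-1)                       # (2N, D)
        sim = z @ z.t() / tau                                                     # cosine similarities
        sim.fill_diagonal_(float("-inf"))                                         # indicator 1[k != i]
        n = h1.shape[0]
        targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
        return F.cross_entropy(sim, targets)

The symmetric pose-related loss of claim 5 would then be pose_contrastive_term(z_i, z_j, neg) + pose_contrastive_term(z_j, z_i, neg).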
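
For claim 8, a sketch of the projection and mapping units: back-projecting 2.5D mesh coordinates (pixel u, v plus root-relative depth) into camera space with a pinhole intrinsic matrix, then mapping the vertices to joints with a fixed joint regression matrix (for example the regressor of a parametric hand model such as MANO; the exact matrix is not specified in the claims).

    import torch

    def backproject_25d(uvd, intrinsics, root_depth):
        """Back-project 2.5D mesh coordinates to 3-D camera space (claim 8, projection unit).
        uvd: (B, V, 3) pixel coordinates plus relative depth; intrinsics: (B, 3, 3); root_depth: (B, 1)."""
        fx = intrinsics[:, 0, 0].unsqueeze(1)
        fy = intrinsics[:, 1, 1].unsqueeze(1)
        cx = intrinsics[:, 0, 2].unsqueeze(1)
        cy = intrinsics[:, 1, 2].unsqueeze(1)
        z = uvd[..., 2] + root_depth                          # absolute depth per vertex
        x = (uvd[..., 0] - cx) * z / fx
        y = (uvd[..., 1] - cy) * z / fy
        return torch.stack([x, y, z], dim=-1)                 # (B, V, 3) mesh vertex coordinates

    def mesh_to_joints(vertices, joint_regressor):
        """Map mesh vertices to 3-D joints with a preset regression matrix (claim 8, mapping unit).
        joint_regressor: (J, V) constant matrix."""
        return torch.einsum("jv,bvc->bjc", joint_regressor, vertices)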
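
For claim 9, the total training objective is a weighted sum of the three losses; the weights are not specified in the claims and are shown here as placeholders.

    def total_loss(l_2d, l_pose, l_app, w_2d=1.0, w_pose=1.0, w_app=1.0):
        """Weighted sum of the 2-D pose loss and the two contrastive losses (claim 9)."""
        return w_2d * l_2d + w_pose * l_pose + w_app * l_app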

Description

Single-hand reconstruction method based on gesture decoupling

Technical Field

The invention belongs to the technical field of image information processing and computer vision, and in particular relates to a single-hand reconstruction method based on gesture decoupling.

Background

With the rapid development of technologies such as virtual reality, augmented reality, human-computer interaction, smart homes and wearable devices, three-dimensional hand reconstruction has become an important research direction in computer vision. The hand is one of the most flexible and information-rich parts of the human body for interaction: its pose, motion and shape can directly express commands, control interfaces, assist in viewing medical images and enhance immersive experiences. Accurately recovering the three-dimensional structure of the hand in real time is therefore important for improving the naturalness of interaction and the operability of such systems.

Traditional three-dimensional hand reconstruction typically relies on specialized hardware such as multi-view imaging systems, depth cameras or wearable data gloves. Although these approaches can provide fairly accurate three-dimensional information, they suffer from high cost, complex equipment and limited applicability, and are difficult to deploy on ordinary consumer-grade devices. In recent years, with the development of deep learning, estimating the three-dimensional structure of the hand from a single RGB image has become a mainstream research direction. However, this direction still faces a number of technical bottlenecks. First, the hand is a highly non-rigid structure with more than twenty degrees of freedom; finger bending, self-occlusion and complex pose changes cause severe deformation and ambiguity in the two-dimensional image, which makes three-dimensional reconstruction difficult. Second, existing deep learning methods generally rely on a large amount of data with three-dimensional annotations for training, while high-quality three-dimensional annotations usually require expensive multi-camera systems and elaborate annotation procedures, leading to data scarcity and high acquisition cost. In practical scenarios, only a small amount of data with 3D labels is available, so methods based mainly on supervised learning struggle to train a three-dimensional reconstruction model with good generalization ability.

To alleviate the scarcity of three-dimensional annotations, self-supervised and contrastive learning methods have gradually been introduced into the three-dimensional hand reconstruction task. Contrastive learning improves the discriminative ability of the model by constructing positive and negative sample pairs, but existing methods usually perform contrastive optimization directly on holistic visual features and have difficulty distinguishing pose-related information from appearance-related information. Because hand images differ widely in illumination, background, skin tone, texture and so on, a model that cannot decouple pose features from appearance factors produces unstable pose representations and generalizes poorly.

Therefore, a method is needed that automatically learns pose-related discriminative representations from unlabeled images and explicitly suppresses interference from pose-independent factors such as illumination, background and texture, so that accurate three-dimensional hand pose and mesh reconstruction can be achieved even when three-dimensional annotation data are insufficient.

Disclosure of the Invention

To solve the above technical problems, the invention provides a single-hand reconstruction method based on gesture decoupling. By explicitly introducing a mechanism that decouples pose-related features from pose-independent features within a contrastive learning framework, the model learns stable, pose-discriminative features from a large number of unlabeled images and can still achieve high-precision three-dimensional reconstruction after fine-tuning on a small amount of labeled data.

To achieve the above object, the invention provides a single-hand reconstruction method based on gesture decoupling, comprising: acquiring an RGB image containing a hand, and performing data augmentation on the RGB image to obtain a plurality of augmented images; and inputting the augmented images into a single-hand reconstruction model to obtain a hand reconstruction result, wherein the single-hand reconstruction model comprises a gesture decoupling module, a feature encoding module