CN-121582972-B - Hand detection method and device, nonvolatile storage medium and electronic equipment
Abstract
The application discloses a hand detection method and device, a nonvolatile storage medium, and electronic equipment. The method comprises: processing an image with a computer vision model to obtain a plurality of prediction frames together with a left-hand confidence score and a right-hand confidence score corresponding to each prediction frame; marking the left-hand prediction frames and the right-hand prediction frames among the prediction frames; copying a target prediction frame to obtain a first copy and a second copy, marking the first copy as a left-hand prediction frame and the second copy as a right-hand prediction frame; performing feature modulation processing on the left-hand prediction frame set to obtain key point coordinates corresponding to the hand regions in the left-hand prediction frames; and performing feature modulation processing on the right-hand prediction frame set to obtain key point coordinates corresponding to the hand regions in the right-hand prediction frames. The application solves the technical problems of detection confusion and key point identity confusion caused by the difficulty of accurately distinguishing left and right hands in hand detection and hand key point positioning.
Inventors
- HE ZHONGJIANG
- ZUO QING
- LI XUELONG
- SUN HAO
Assignees
- 中电信人工智能科技(北京)有限公司 (China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd.)
Dates
- Publication Date
- 20260512
- Application Date
- 20260126
Claims (11)
- 1. A hand detection method, comprising: acquiring an image to be processed; processing the image by using a computer vision model to obtain a plurality of prediction frames for a hand region in the image and a left-hand confidence score and a right-hand confidence score corresponding to each prediction frame, wherein the left-hand confidence score quantifies the probability that the hand region is a left hand, and the right-hand confidence score quantifies the probability that the hand region is a right hand; among the plurality of prediction frames, for a prediction frame whose left-hand confidence score is greater than a first threshold and whose right-hand confidence score is less than a second threshold, assigning an identity mark belonging to the left hand to the prediction frame to obtain a left-hand prediction frame, and for a prediction frame whose right-hand confidence score is greater than the second threshold and whose left-hand confidence score is less than the first threshold, assigning an identity mark belonging to the right hand to the prediction frame to obtain a right-hand prediction frame; copying a target prediction frame to obtain a first copy and a second copy, marking the first copy as a left-hand prediction frame and the second copy as a right-hand prediction frame, wherein the target prediction frame is a prediction frame whose left-hand confidence score is greater than the first threshold and whose right-hand confidence score is greater than the second threshold; performing feature modulation processing on a left-hand prediction frame set comprising a plurality of left-hand prediction frames by using a pose estimation model to obtain key point coordinates corresponding to the hand region in the left-hand prediction frames, which comprises: cropping the image centered on the left-hand prediction frame and scaling the cropped image to a preset size to obtain a first image; processing the first image by using a feature extraction backbone network to obtain a first image feature map with spatial dimensions; converting the marking information in the left-hand prediction frame into a first identity feature vector by using an identity encoder, and adjusting the first identity feature vector into a first identity feature tensor consistent with the spatial dimensions of the first image feature map through a shape change operation; generating a first channel weighting parameter and a first spatial attention parameter based on the first identity feature tensor; weighting the channel dimension of the first image feature map with the first channel weighting parameter and modulating the spatial attention dimension of the first image feature map with the first spatial attention parameter to obtain a first target feature map; performing an up-sampling operation on the first target feature map by using a key point decoder to recover its spatial resolution and obtain a first key point heat map; and determining key point coordinates corresponding to the hand region in the left-hand prediction frame based on the first key point heat map; and performing feature modulation processing on a right-hand prediction frame set comprising a plurality of right-hand prediction frames by using the pose estimation model to obtain key point coordinates corresponding to the hand region in the right-hand prediction frames, wherein the marking information in the left-hand prediction frame set and the right-hand prediction frame set is used to modulate the feature expression of the middle layers of the pose estimation model.
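The identity-assignment and copy rule of claim 1 can be sketched in a few lines. This is an illustrative sketch only, not the claimed implementation; the function name, the 0.5 thresholds, and the `(box, mark)` tuple format are hypothetical.

```python
def assign_hand_identities(boxes, left_scores, right_scores, t_left=0.5, t_right=0.5):
    """Split prediction frames into left- and right-hand sets per the claimed rule.

    A frame confidently left (left score > t_left, right score < t_right) is
    marked left; the symmetric case is marked right; a frame exceeding both
    thresholds (the "target prediction frame") is copied into both sets.
    """
    left_set, right_set = [], []
    for box, ls, rs in zip(boxes, left_scores, right_scores):
        if ls > t_left and rs < t_right:
            left_set.append((box, "left"))
        elif rs > t_right and ls < t_left:
            right_set.append((box, "right"))
        elif ls > t_left and rs > t_right:
            # ambiguous overlapping-hands frame: one copy per identity
            left_set.append((box, "left"))
            right_set.append((box, "right"))
    return left_set, right_set
```

Note that a frame below both thresholds is simply discarded, which matches the claim's silence on that case.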
- 2. The method of claim 1, wherein the computer vision model includes a plurality of parallel classification heads, the classification heads including a left-hand classifier for determining whether the hand region is left-hand and a right-hand classifier for determining whether the hand region is right-hand.
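A minimal numpy sketch of the parallel classification heads of claim 2, modeled here as two independent sigmoid classifiers over a shared per-box feature; the class name, feature dimension, and random linear weights are hypothetical placeholders for the trained heads.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DualHandHeads:
    """Two parallel binary classifiers over a shared box feature.

    Unlike a single softmax head, the two sigmoid outputs are not mutually
    exclusive, so a box over tightly interacting hands can score high for
    both identities at once.
    """
    def __init__(self, feat_dim):
        self.w_left = rng.standard_normal(feat_dim) * 0.01
        self.w_right = rng.standard_normal(feat_dim) * 0.01

    def __call__(self, feat):
        # independent left-hand and right-hand confidence scores
        return sigmoid(feat @ self.w_left), sigmoid(feat @ self.w_right)
```

Because the two outputs are independent, both can exceed their thresholds simultaneously, which is exactly what enables the copy step of claim 1.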
- 3. The method according to claim 1, wherein performing feature modulation processing on a right-hand prediction frame set including a plurality of the right-hand prediction frames by using the pose estimation model to obtain coordinates of key points corresponding to hand regions in the right-hand prediction frames includes: cropping the image centered on the right-hand prediction frame, and scaling the cropped image to a preset size to obtain a second image; processing the second image by using a feature extraction backbone network to obtain a second image feature map with spatial dimensions; converting the marking information in the right-hand prediction frame into a second identity feature vector by using an identity encoder, and adjusting the second identity feature vector into a second identity feature tensor consistent with the spatial dimensions of the second image feature map through a shape change operation; generating a second channel weighting parameter and a second spatial attention parameter based on the second identity feature tensor; weighting the channel dimension of the second image feature map with the second channel weighting parameter, and modulating the spatial attention dimension of the second image feature map with the second spatial attention parameter to obtain a second target feature map; performing an up-sampling operation on the second target feature map by using a key point decoder to recover its spatial resolution and obtain a second key point heat map; and determining the coordinates of key points corresponding to the hand region in the right-hand prediction frame based on the second key point heat map.
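The modulate-and-decode pipeline of claims 1 and 3 can be illustrated with a toy numpy sketch: per-channel weights and a spatial attention map are derived from an identity embedding, applied to the feature map, and a key point is read off an up-sampled heat map. The shapes, the sigmoid parameterization of the weights, and the nearest-neighbor up-sampling are illustrative assumptions, not the patented network.

```python
import numpy as np

def keypoint_from_modulated(feat_map, identity_vec, upsample=4):
    """Modulate a C x H x W feature map by an identity embedding, then decode.

    Channel weights and an H x W spatial attention map are generated from
    slices of the identity vector (a stand-in for the identity encoder plus
    shape-change operation), applied to the feature map, and the key point
    is taken as the argmax of the up-sampled heat map.
    """
    C, H, W = feat_map.shape
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    channel_w = sig(identity_vec[:C])                                   # channel weighting parameter
    spatial_a = sig(np.outer(identity_vec[C:C + H],
                             identity_vec[C + H:C + H + W]))            # spatial attention parameter
    target = feat_map * channel_w[:, None, None] * spatial_a[None]      # modulated target feature map
    heat = target.mean(axis=0)
    heat = np.kron(heat, np.ones((upsample, upsample)))                 # decoder recovers resolution
    y, x = np.unravel_index(np.argmax(heat), heat.shape)                # heat map -> key point coords
    return int(x), int(y)
```

The same routine serves both hands; only the identity embedding (left vs. right marking information) differs, which is how the marking modulates the mid-layer feature expression.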
- 4. The method according to claim 1, wherein performing feature modulation processing on a left-hand prediction frame set including a plurality of the left-hand prediction frames by using a pose estimation model to obtain key point coordinates corresponding to a hand region in the left-hand prediction frame includes: acquiring a first geometric parameter and first marking information of the left-hand prediction frame; encoding the first geometric parameter into a first geometric feature vector, wherein the first geometric feature vector comprises position and scale information of the left-hand prediction frame; encoding the first marking information into a first semantic feature vector, wherein the first semantic feature vector comprises category attributes and confidence information of the left-hand prediction frame; fusing the first geometric feature vector and the first semantic feature vector to obtain a first context representation; processing the first context representation with a parameter generation network to obtain a first feature modulation parameter, wherein the first feature modulation parameter comprises a first channel weighting coefficient and a first spatial transformation matrix; based on the first geometric parameter, extracting a corresponding first region feature map from the full-image feature map generated by a middle layer of the pose estimation model; applying the first channel weighting coefficient to the channel dimension of the first region feature map to recalibrate the importance of different feature channels, and applying the first spatial transformation matrix to the spatial dimension of the first region feature map to weight attention over different spatial positions within the left-hand prediction frame; and analyzing the modulated first region feature map with the layers following that middle layer of the pose estimation model to obtain key point coordinates corresponding to the hand region in the left-hand prediction frame.
- 5. The method according to claim 1, wherein performing feature modulation processing on a right-hand prediction frame set including a plurality of the right-hand prediction frames by using the pose estimation model to obtain coordinates of key points corresponding to hand regions in the right-hand prediction frames includes: acquiring a second geometric parameter and second marking information of the right-hand prediction frame; encoding the second geometric parameter into a second geometric feature vector, wherein the second geometric feature vector comprises position and scale information of the right-hand prediction frame; encoding the second marking information into a second semantic feature vector, wherein the second semantic feature vector comprises category attributes and confidence information of the right-hand prediction frame; fusing the second geometric feature vector and the second semantic feature vector to obtain a second context representation; processing the second context representation with the parameter generation network to obtain a second feature modulation parameter, wherein the second feature modulation parameter comprises a second channel weighting coefficient and a second spatial transformation matrix; based on the second geometric parameter, extracting a corresponding second region feature map from the full-image feature map generated by a middle layer of the pose estimation model; applying the second channel weighting coefficient to the channel dimension of the second region feature map to recalibrate the importance of different feature channels, and applying the second spatial transformation matrix to the spatial dimension of the second region feature map to weight attention over different spatial positions within the right-hand prediction frame; and analyzing the modulated second region feature map with the layers following that middle layer of the pose estimation model to obtain key point coordinates corresponding to the hand region in the right-hand prediction frame.
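The context-representation step of claims 4 and 5 can be sketched as follows: box geometry and the left/right marking are encoded, fused, and fed through a parameter generation network (a single linear layer here) to produce channel weighting coefficients. The function name, the one-hot semantic encoding, the fixed confidence placeholder, and the linear generator are all hypothetical simplifications; the spatial transformation matrix is elided for brevity.

```python
import numpy as np

def build_modulation_params(box, marking, W_gen):
    """Fuse a frame's geometric parameters with its semantic marking into a
    context representation, then generate channel weights from it."""
    x, y, w, h = box
    geom = np.array([x, y, w, h], dtype=float)            # geometric feature vector
    sem = np.array([1.0 if marking == "left" else 0.0,    # category attribute (left)
                    1.0 if marking == "right" else 0.0,   # category attribute (right)
                    1.0])                                 # confidence placeholder
    context = np.concatenate([geom, sem])                 # fused context representation
    # parameter generation network: here a single linear map plus sigmoid
    channel_w = 1.0 / (1.0 + np.exp(-(W_gen @ context)))  # channel weighting coefficients
    return channel_w
```

In the claimed method the resulting coefficients recalibrate the channels of a region feature map cropped from the mid-layer full-image feature map.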
- 6. The method of claim 1, wherein, prior to performing feature modulation processing on a left-hand prediction frame set comprising a plurality of the left-hand prediction frames by using a pose estimation model, the method further comprises: performing a first non-maximum suppression operation on the prediction frames marked as belonging to the left hand to obtain the left-hand prediction frame set; and performing a second non-maximum suppression operation on the prediction frames marked as belonging to the right hand to obtain the right-hand prediction frame set, wherein the first non-maximum suppression operation and the second non-maximum suppression operation are mutually independent operations.
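A minimal sketch of the per-identity suppression in claim 6, using standard greedy NMS; function names and the 0.5 IoU threshold are conventional choices, not specified by the claim.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    iy = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = ix * iy
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS within one identity set.

    Running this separately on the left-hand and right-hand sets means an
    overlapping left/right pair never suppress each other, which is the point
    of keeping the two operations mutually independent.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep
```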
- 7. The method of claim 1, wherein the pose estimation model is trained by: determining a classification loss as the sum of a left-hand classification loss and a right-hand classification loss, wherein the left-hand classification loss and the right-hand classification loss are each calculated independently with a binary cross-entropy loss function; determining a bounding box regression loss from the regression difference between the coordinate parameters of the prediction frame and the coordinate parameters of the ground-truth hand annotation frame; determining a target loss from the difference between the confidence score of the prediction frame and a target label computed from the intersection-over-union of the prediction frame and the ground-truth hand annotation frame; determining a loss function based on the classification loss, the bounding box regression loss, and the target loss; and training an initial pose estimation model on a training set, obtaining the pose estimation model when the loss function meets a preset convergence condition.
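The loss structure of claim 7 can be sketched as below. The L1 box regression, the per-term weights, and the dictionary layout are hypothetical; the claim only specifies independent left/right BCE terms, a regression term, and an IoU-labeled target term combined into one loss.

```python
import numpy as np

def bce(p, t, eps=1e-7):
    """Binary cross-entropy, clipped for numerical stability."""
    p = np.clip(p, eps, 1 - eps)
    return -(t * np.log(p) + (1 - t) * np.log(1 - p)).mean()

def detection_loss(pred, target, w_cls=1.0, w_box=1.0, w_obj=1.0):
    """Sum of: independent left/right BCE classification losses, an L1
    box-regression loss, and a BCE target loss against IoU-derived labels."""
    cls_loss = bce(pred["left"], target["left"]) + bce(pred["right"], target["right"])
    box_loss = np.abs(pred["box"] - target["box"]).mean()   # regression difference
    obj_loss = bce(pred["obj"], target["iou"])              # IoU-based target label
    return w_cls * cls_loss + w_box * box_loss + w_obj * obj_loss
```

Computing the two classification terms independently, rather than through one softmax, is what lets a ground-truth box that is both hands at once supervise both heads positively.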
- 8. A hand detection device, comprising: an acquisition module for acquiring an image to be processed; a processing module for processing the image by using a computer vision model to obtain a plurality of prediction frames for a hand region in the image and a left-hand confidence score and a right-hand confidence score corresponding to each prediction frame, wherein the left-hand confidence score quantifies the probability that the hand region is a left hand, and the right-hand confidence score quantifies the probability that the hand region is a right hand; a first marking module for, among the plurality of prediction frames, assigning an identity mark belonging to the left hand to a prediction frame whose left-hand confidence score is greater than a first threshold and whose right-hand confidence score is less than a second threshold to obtain a left-hand prediction frame, and assigning an identity mark belonging to the right hand to a prediction frame whose right-hand confidence score is greater than the second threshold and whose left-hand confidence score is less than the first threshold to obtain a right-hand prediction frame; a second marking module for copying a target prediction frame to obtain a first copy and a second copy, marking the first copy as a left-hand prediction frame and the second copy as a right-hand prediction frame, wherein the target prediction frame is a prediction frame whose left-hand confidence score is greater than the first threshold and whose right-hand confidence score is greater than the second threshold; a first analysis module for performing feature modulation processing on a left-hand prediction frame set comprising a plurality of left-hand prediction frames by using a pose estimation model to obtain key point coordinates corresponding to the hand region in the left-hand prediction frames, which comprises: cropping the image centered on the left-hand prediction frame and scaling the cropped image to a preset size to obtain a first image; processing the first image by using a feature extraction backbone network to obtain a first image feature map with spatial dimensions; converting the marking information in the left-hand prediction frame into a first identity feature vector by using an identity encoder, and adjusting the first identity feature vector into a first identity feature tensor consistent with the spatial dimensions of the first image feature map through a shape change operation; generating a first channel weighting parameter and a first spatial attention parameter based on the first identity feature tensor; weighting the channel dimension of the first image feature map with the first channel weighting parameter and modulating the spatial attention dimension of the first image feature map with the first spatial attention parameter to obtain a first target feature map; performing an up-sampling operation on the first target feature map by using a key point decoder to recover its spatial resolution and obtain a first key point heat map; and determining key point coordinates corresponding to the hand region in the left-hand prediction frame based on the first key point heat map; and a second analysis module for performing feature modulation processing on a right-hand prediction frame set comprising a plurality of right-hand prediction frames by using the pose estimation model to obtain key point coordinates corresponding to the hand region in the right-hand prediction frames, wherein the marking information in the left-hand prediction frame set and the right-hand prediction frame set is used to modulate the feature expression of the middle layers of the pose estimation model.
- 9. A non-volatile storage medium, characterized in that the non-volatile storage medium comprises a stored program, wherein the program, when run, controls a device in which the non-volatile storage medium is located to perform the hand detection method of any one of claims 1 to 7.
- 10. An electronic device comprising a memory and a processor for executing a program stored in the memory, wherein the program is executed to perform the hand detection method of any one of claims 1 to 7.
- 11. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the hand detection method of any one of claims 1 to 7.
Description
Hand detection method and device, nonvolatile storage medium and electronic equipment

Technical Field

The present application relates to the field of computer vision, and in particular to a hand detection method and apparatus, a nonvolatile storage medium, and an electronic device.

Background

In the field of computer vision, especially in human-computer interaction interface design for Augmented Reality (AR), Virtual Reality (VR), and intelligent devices, vision-based gesture recognition technology occupies a central position. The essential first step is to accurately detect the hand in the image and locate its key points, which is relatively easy in a single-hand scenario. Conventional techniques, however, face serious challenges in complex scenarios, especially where the two hands interact closely. Mainstream hand detection technology mainly adopts top-down or bottom-up strategies. The top-down approach first uses an object detector to identify bounding boxes of the hands, and then estimates key points within each box independently; the bottom-up approach first estimates all key points and then groups them by hand. While the top-down approach generally achieves better detection accuracy, its detector relies on a Softmax classifier whose mutual-exclusion property becomes a limitation when the two hands overlap. When the hands are very close, crossed, or overlapping, it is difficult for the detector to accurately judge the hand boundaries in the overlapping region, so missed or false detections readily occur, particularly in highly overlapping scenes such as praying gestures.
Even if two hand regions are detected, a conventional key point estimation network may fail to correctly distinguish left from right hand due to the similarity of the input images, and may output two mirrored sets of key point coordinates, resulting in an identity-switch problem. Such confusion not only degrades the accuracy of gesture recognition but may also mislead subsequent interaction logic, particularly in scenarios that require accurate recognition of left-hand and right-hand operations. To overcome the difficulty of two-hand detection and key point localization in a single-frame image, some related schemes rely on the temporal information of a video sequence and refine the single-frame detection result through tracking. However, this timing-dependent approach is of no use in single-picture application scenarios, and it increases system complexity and latency, limiting its application on edge computing devices that demand high real-time performance or are sensitive to power consumption. In summary, related vision-based hand detection and key point localization techniques have significant limitations when dealing with complex scenes, especially with closely interacting hands. These drawbacks hinder practical application of the technology and, in AR, VR, and smart device interaction scenarios in particular, directly degrade the user experience.

Disclosure of Invention

The application provides a hand detection method and device, a nonvolatile storage medium, and electronic equipment, which at least solve the technical problems of detection confusion and key point identity confusion caused by the difficulty of accurately distinguishing left and right hands in hand detection and hand key point positioning.
According to one aspect of the application, a hand detection method is provided, comprising: acquiring an image to be processed; processing the image by using a computer vision model to obtain a plurality of prediction frames for a hand region in the image and a left-hand confidence score and a right-hand confidence score corresponding to each prediction frame, wherein the left-hand confidence score quantifies the probability that the hand region is a left hand, and the right-hand confidence score quantifies the probability that the hand region is a right hand; among the plurality of prediction frames, marking a prediction frame whose left-hand confidence score is greater than a first threshold and whose right-hand confidence score is less than a second threshold as a left-hand prediction frame, and marking a prediction frame whose right-hand confidence score is greater than the second threshold and whose left-hand confidence score is less than the first threshold as a right-hand prediction frame; and copying a target prediction frame to obtain a first copy and a second copy, marking the first copy as a left-hand prediction frame and the second copy as a right-hand prediction frame, wherein the target prediction frame is a prediction frame whose left-hand confidence score is greater than the first threshold and whose right-hand confidence score is greater than the second threshold.