CN-120977016-B - Gesture recognition method and device
Abstract
The invention provides a gesture recognition method and device in the technical field of computer vision. The method comprises: acquiring consecutive multi-frame RGB images and depth images containing a human body to be detected; detecting key points of a plurality of targets to be detected on the human body in the RGB images, and calculating the depth values corresponding to those key points from the RGB images and the depth images; determining a current detection mode based on the depth values corresponding to the key points of each target; determining, from the targets to be detected, the current target corresponding to the current detection mode; and recognizing the gesture of that target based on its key points and their corresponding depth values. Because the gestures of different targets are detected in different detection modes, targeted gesture detection of each target is achieved, and combining key-point information with depth information in each mode improves the accuracy, real-time performance, and stability of gesture recognition.
Inventors
- Li Hongjun
- Xie Zhanrui
- Zheng Shini
- Lin Chenyu
Assignees
- 珠海凌烟阁芯片科技有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20251021
Claims (8)
- 1. A gesture recognition method, the method comprising: acquiring consecutive multi-frame RGB images and depth images containing a human body to be detected; detecting key points of a plurality of targets to be detected on the human body in the RGB images, and calculating depth values corresponding to the key points of each target based on the RGB images and the depth images; determining a current detection mode based on the depth values corresponding to the key points of each target to be detected; and determining, from the targets to be detected, the current target corresponding to the current detection mode, and recognizing the gesture of the current target based on its key points and their corresponding depth values to obtain a corresponding gesture recognition result; wherein, after acquiring the consecutive multi-frame RGB images and depth images, the method further comprises: preprocessing the RGB images and the depth images to identify and correct noise and outliers present in them, the preprocessing including range clipping of the depth images and linear normalization of the RGB images; wherein detecting the key points of the plurality of targets to be detected comprises: enhancing the contour and edge features of the human body in the RGB images using depth data from the depth images; and inputting the consecutive multi-frame RGB images into a pre-trained key point detection model (an AI pre-trained gesture detection model) to detect the key points; and wherein determining the current detection mode comprises: determining an average depth value for each target to be detected from the depth values of its key points; and determining the current detection mode from those average depth values, such that, when the targets to be detected include a human hand and a human torso, determining the current detection mode comprises: judging, for each frame of the RGB images, whether the average depth value of the hand meets a first preset distance condition and whether the average depth value of the torso meets a second preset distance condition, to obtain a corresponding first judgment result; and determining, based on the first judgment results over the consecutive frames, the current detection mode as either a hand detection mode for detecting the human hand or a torso detection mode for detecting the human torso.
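The mode-selection step of claim 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the claim leaves both "preset distance conditions" unspecified, so the millimeter thresholds and all names here (`decide_mode`, `HAND_NEAR_MAX_MM`, etc.) are assumptions.

```python
import numpy as np

# Assumed thresholds; the patent does not specify the preset distance conditions.
HAND_NEAR_MAX_MM = 600.0    # first preset distance condition: hand close to camera
TORSO_NEAR_MAX_MM = 2000.0  # second preset distance condition: torso within range

def average_depth(keypoint_depths_mm):
    """Average depth value of one target, computed from its keypoint depths."""
    return float(np.mean(keypoint_depths_mm))

def decide_mode(hand_avgs_mm, torso_avgs_mm):
    """Pick the current detection mode from per-frame average depths.

    hand_avgs_mm / torso_avgs_mm: average depth of the hand / torso
    in each of the consecutive RGB frames.
    """
    hand_near = all(d <= HAND_NEAR_MAX_MM for d in hand_avgs_mm)
    torso_near = all(d <= TORSO_NEAR_MAX_MM for d in torso_avgs_mm)
    if hand_near and not torso_near:
        return "hand"    # hand detection mode
    if torso_near and not hand_near:
        return "torso"   # torso detection mode
    return "undecided"   # neither condition pattern holds over all frames
```

Requiring the condition to hold over all consecutive frames mirrors the claim's use of the first judgment results "corresponding to the continuous multi-frame RGB images".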
- 2. The gesture recognition method of claim 1, wherein training of the pre-trained key point detection model comprises: applying spatial or photometric transformations to training image samples, and scaling or normalizing the pixel coordinates of the key points annotated on each sample in accordance with the sample's scaling, to obtain processed image samples; constructing a training set from the processed image samples, and feeding them in turn to an initial key point detection model built on a Heatmap-regression encoder-decoder architecture to obtain corresponding predicted outputs; generating a two-dimensional Gaussian distribution from the pixel coordinates of the key points annotated on each processed sample to obtain a GT Gaussian Heatmap; computing a loss between the predicted Heatmap and the GT Gaussian Heatmap, and computing the gradient of the loss with respect to the model parameters; and updating the model parameters with the gradient, repeatedly traversing the training set until the current model meets a preset convergence condition, to obtain the pre-trained key point detection model.
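The GT Gaussian Heatmap step of claim 2 can be illustrated with a short NumPy sketch. The sigma value and the use of a mean-squared-error loss are assumptions (one common choice for Heatmap-regression training); the claim specifies neither.

```python
import numpy as np

def gt_gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """Render a GT Gaussian Heatmap: a 2-D Gaussian centered on the
    annotated keypoint pixel (cx, cy), with peak value 1.0."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def mse_heatmap_loss(pred, gt):
    """Mean-squared-error loss between the predicted and GT heatmaps
    (an assumed loss; the claim only says 'calculating a loss')."""
    return float(np.mean((pred - gt) ** 2))
```

The gradient of this loss with respect to the model parameters would then drive the parameter updates described in the claim.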
- 3. The gesture recognition method of claim 1, wherein determining the current detection mode as the hand detection mode or the torso detection mode based on the first judgment results corresponding to the consecutive multi-frame RGB images comprises: determining the current detection mode to be the hand detection mode when the first judgment results indicate that the average depth value of the hand meets the first preset distance condition and the average depth value of the torso does not meet the second preset distance condition; and determining the current detection mode to be the torso detection mode when the first judgment results indicate that the average depth value of the hand does not meet the first preset distance condition and the average depth value of the torso meets the second preset distance condition.
- 4. The gesture recognition method of claim 1, wherein, when the current detection mode is the hand detection mode, determining the current target to be detected and recognizing its gesture comprises: determining, from the targets to be detected, that the current target is the human hand, and determining the joint angles of each finger and the three-dimensional distances between fingers based on the key points of the hand in each frame and their corresponding depth values; judging, from the joint angles and the three-dimensional distances, whether the current hand pose meets preset gesture judgment logic, to obtain a corresponding second judgment result; and determining a static gesture of the hand based on the second judgment result; or determining, from the targets to be detected, that the current target is the human hand, and determining the three-dimensional position change and angular rotation change of the hand center point in three-dimensional space based on the key points of the hand in the consecutive frames and their corresponding depth values; and determining a dynamic gesture of the hand based on the three-dimensional position change and the angular rotation change of the hand center point.
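The joint-angle computation underlying the static-gesture branch of claim 4 can be sketched as follows. The angle at each joint follows from the dot product of the two bone vectors; the `finger_extended` rule and its 160-degree threshold are hypothetical examples of "preset gesture judgment logic", which the claim does not define.

```python
import numpy as np

def joint_angle_deg(a, b, c):
    """Angle (degrees) at joint b formed by 3-D keypoints a-b-c,
    where each point is (x, y, depth)."""
    v1 = np.asarray(a, float) - np.asarray(b, float)
    v2 = np.asarray(c, float) - np.asarray(b, float)
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    # Clip to guard against floating-point values slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0))))

def finger_extended(mcp, pip, tip, straight_min_deg=160.0):
    """Hypothetical judgment rule: a finger counts as extended when the
    angle at its middle (PIP) joint is close to 180 degrees."""
    return joint_angle_deg(mcp, pip, tip) >= straight_min_deg
```

Combining such per-finger flags with inter-finger distances would yield the second judgment result from which a static gesture is determined.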
- 5. The gesture recognition method of claim 1, wherein, when the current detection mode is the torso detection mode, determining the current target to be detected and recognizing its gesture comprises: determining, from the targets to be detected, that the current target is the human torso, and determining the bending angles of the arm and/or knee limbs based on the key points of the torso in each frame and their corresponding depth values; and determining the pose of the arm and/or knee limbs from those bending angles; or determining, from the targets to be detected, that the current target is the human torso, and determining the trend of the height of the body's center of gravity based on the key points of the torso in the consecutive frames and their corresponding depth values; and judging, from that trend, whether a fall has occurred.
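The fall-detection branch of claim 5 might be sketched as below. Everything concrete here is an assumption: the claim says only that a fall is judged from the trend of the center-of-gravity height, so the mean-of-keypoints approximation, the height coordinate convention, the frame window, and the drop threshold are all illustrative.

```python
import numpy as np

def center_of_gravity_height(torso_keypoints_3d):
    """Approximate the body's center-of-gravity height as the mean
    height coordinate of the torso keypoints in one frame (which axis
    is 'height' depends on the camera convention)."""
    return float(np.mean([p[1] for p in torso_keypoints_3d]))

def fall_detected(heights_per_frame, drop_mm=400.0, window=5):
    """Hypothetical fall rule: the center-of-gravity height drops by
    more than `drop_mm` over the last `window` consecutive frames."""
    if len(heights_per_frame) < window:
        return False
    recent = heights_per_frame[-window:]
    return (recent[0] - recent[-1]) > drop_mm
```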
- 6. The gesture recognition method of claim 1, wherein calculating the depth values corresponding to the key points of the targets to be detected based on the RGB images and the depth images comprises: aligning the RGB images and the depth images with a preset image registration algorithm, and mapping the key point coordinates into the corresponding aligned depth images to obtain the depth values of the key points; or determining the depth values of the key points by bilinear interpolation; or converting the aligned depth images into a 3D point cloud, and determining the depth values of the key points by nearest-neighbor search over that point cloud.
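The bilinear-interpolation option of claim 6 can be shown concretely: a detected keypoint usually has sub-pixel coordinates, so its depth is interpolated from the four surrounding pixels of the aligned depth map. A minimal sketch (function name assumed):

```python
import numpy as np

def bilinear_depth(depth, x, y):
    """Sample an aligned depth map at sub-pixel keypoint coordinates
    (x, y) by bilinear interpolation of the four neighboring pixels."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, depth.shape[1] - 1)
    y1 = min(y0 + 1, depth.shape[0] - 1)
    fx, fy = x - x0, y - y0
    # Interpolate along x on the top and bottom rows, then along y.
    top = depth[y0, x0] * (1 - fx) + depth[y0, x1] * fx
    bot = depth[y1, x0] * (1 - fx) + depth[y1, x1] * fx
    return float(top * (1 - fy) + bot * fy)
```

In practice, invalid depth pixels (zeros or sensor dropouts) among the four neighbors would also need to be masked out before interpolating.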
- 7. The gesture recognition method of any one of claims 1 to 6, further comprising, before determining the current detection mode: judging whether the depth value corresponding to each key point is greater than a preset depth threshold; and filtering out any key point whose depth value exceeds that threshold.
- 8. A gesture recognition apparatus, comprising: an image acquisition module, configured to acquire consecutive multi-frame RGB images and depth images containing a human body to be detected; a key point determination module, configured to detect key points of a plurality of targets to be detected on the human body in the RGB images; a depth value calculation module, configured to calculate the depth values corresponding to the key points of each target based on the RGB images and the depth images; a mode determination module, configured to determine a current detection mode based on the depth values corresponding to the key points of each target; and a gesture recognition module, configured to determine, from the targets to be detected, the current target corresponding to the current detection mode, and to recognize the gesture of that target based on its key points and their corresponding depth values to obtain a corresponding gesture recognition result; wherein the apparatus further comprises a preprocessing module, configured to preprocess the RGB images and the depth images so as to identify and correct noise and outliers present in them; the key point determination module is specifically configured to: enhance the contour and edge features of the human body in the RGB images using depth data from the depth images; and input the consecutive multi-frame RGB images into a pre-trained key point detection model (an AI pre-trained gesture detection model) to detect the key points; and the mode determination module is specifically configured to: determine an average depth value for each target to be detected from the depth values of its key points; determine the current detection mode from those average depth values; when the targets to be detected include a human hand and a human torso, judge, for each frame of the RGB images, whether the average depth value of the hand meets a first preset distance condition and whether the average depth value of the torso meets a second preset distance condition, to obtain a corresponding first judgment result; and determine, from the first judgment results over the consecutive frames, the current detection mode as a hand detection mode for detecting the human hand or a torso detection mode for detecting the human torso.
Description
Gesture recognition method and device

Technical Field

The invention relates to the technical field of computer vision, and in particular to a gesture recognition method and device.

Background

Traditional gesture recognition is mostly based on 2D (RGB) images and is easily affected by background clutter and lighting changes, so its accuracy is limited. Some techniques acquire depth information by adding a TOF camera, but depth information alone still struggles to recognize gestures in occluded or low-contrast scenes. In short, traditional gesture recognition suffers from low accuracy under complex backgrounds, multi-source lighting interference, and insufficient depth information. Moreover, conventional techniques recognize the gesture of a single part, such as the face, hand, or torso, and cannot flexibly determine which part to recognize, or how to recognize each part, from the states of multiple parts in images captured at different shooting distances. A solution to these technical problems has therefore long been needed by those skilled in the art.

Disclosure of Invention

The invention aims to solve the above problems of the prior art by providing a gesture recognition method and a gesture recognition device that improve the accuracy of gesture recognition.
The technical solution adopted to solve the above problems is as follows. In a first aspect, a gesture recognition method comprises: acquiring consecutive multi-frame RGB images and depth images containing a human body to be detected; detecting key points of a plurality of targets to be detected on the human body in the RGB images, and calculating the depth values corresponding to the key points of each target based on the RGB images and the depth images; determining a current detection mode based on the depth values corresponding to the key points of each target; and determining, from the targets to be detected, the current target corresponding to the current detection mode, and recognizing the gesture of that target based on its key points and their corresponding depth values to obtain a corresponding gesture recognition result.

Optionally, detecting the key points of the plurality of targets to be detected comprises: enhancing the contour and edge features of the human body in the RGB images using depth data from the depth images; and inputting the consecutive multi-frame RGB images into a pre-trained key point detection model to detect the key points. Training of the pre-trained key point detection model comprises: applying spatial or photometric transformations to training image samples, and scaling or normalizing the pixel coordinates of the annotated key points in accordance with the scaling of each sample, to obtain processed image samples; constructing a training set from the processed samples, and feeding them in turn to an initial key point detection model built on a Heatmap-regression encoder-decoder architecture to obtain corresponding predicted outputs; generating a two-dimensional Gaussian distribution from the pixel coordinates of the annotated key points to obtain a GT Gaussian Heatmap; computing a loss between the predicted Heatmap and the GT Gaussian Heatmap, and computing the gradient of the loss with respect to the model parameters; and updating the model parameters with the gradient, repeatedly traversing the training set until the model meets a preset convergence condition, to obtain the pre-trained key point detection model.

Optionally, determining the current detection mode based on the depth values corresponding to the key points of each target comprises: determining an average depth value for each target from the depth values of its key points; and determining the current detection mode from those average depth values. Optionally, when the plurality of targets to be detected include a human hand and a human torso, determining the current detection mode based on the average depth value corresponding to each target includes: judging whether the average dep