US-12626528-B2 - Method for optimizing human body posture recognition model, device and computer-readable storage medium
Abstract
A method includes: obtaining heat maps including a predetermined number of key points of a human body; performing depth separable convolution on a feature map corresponding to one of the heat maps corresponding to each of the key points and a convolution kernel of a corresponding channel of the human body posture recognition model to determine a key point feature map corresponding to each channel of the human body posture recognition model; performing local feature fusion processing and/or global feature fusion processing on the key point feature map corresponding to each channel to obtain fusion posture feature maps; determining a linear relationship between the channels of the human body posture recognition model based on the fusion posture feature maps; and updating weight coefficients of the corresponding channels of the human body posture recognition model by using the linear relationship between the channels of the human body posture recognition model.
Inventors
- Bin Sun
- Mingguo Zhao
- Youjun Xiong
Assignees
- UBTECH ROBOTICS CORP LTD
Dates
- Publication Date
- 20260512
- Application Date
- 20230626
- Priority Date
- 20201229
Claims (20)
- 1 . A computer-implemented method for optimizing a human body posture recognition model, the method comprising: obtaining heat maps comprising a predetermined number of key points of a human body by using a preset posture estimation algorithm; performing depth separable convolution on a feature map corresponding to one of the heat maps corresponding to each of the key points and a convolution kernel of a corresponding channel of the human body posture recognition model to determine a key point feature map corresponding to each channel of the human body posture recognition model; performing local feature fusion processing and/or global feature fusion processing on the key point feature map corresponding to each channel of the human body posture recognition model to obtain fusion posture feature maps; determining a linear relationship between the channels of the human body posture recognition model based on the fusion posture feature maps; and updating weight coefficients of the corresponding channels of the human body posture recognition model by using the linear relationship between the channels of the human body posture recognition model; wherein performing local feature fusion processing on the key point feature map corresponding to each channel of the human body posture recognition model comprises: using key point feature maps corresponding to channels of the human body posture recognition model as feature maps to be locally fused; dividing the feature maps to be locally fused corresponding to the channels of the human body posture recognition model into multiple feature map groups according to a preset grouping rule; performing local feature fusion processing using the feature maps to be locally fused corresponding to an i-th channel in a g-th feature map group and the feature maps to be locally fused corresponding to each channel in the g-th feature map group except the feature maps to be locally fused corresponding to the i-th channel to obtain local fusion 
feature maps of key points corresponding to the channels of the human body posture recognition model; and determining the fusion posture feature maps based on the local fusion feature maps of key points corresponding to the channels of the human body posture recognition model.
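The depthwise (per-channel) convolution step recited in claim 1 can be sketched as follows. This is an illustrative NumPy sketch only, not the patented implementation: the heat-map shape, the kernel size, and the zero-padding ("same"-size output) are assumptions for demonstration; each channel's heat map is convolved only with that channel's own kernel.

```python
import numpy as np

def depthwise_conv(heatmaps, kernels):
    """Convolve each key-point heat map with its own channel's kernel.

    heatmaps: (C, H, W) array, one heat map per key point/channel.
    kernels:  (C, k, k) array, one kernel per channel (k odd).
    Returns (C, H, W) key-point feature maps (same-size output via zero padding).
    """
    C, H, W = heatmaps.shape
    k = kernels.shape[1]
    pad = k // 2
    padded = np.pad(heatmaps, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(heatmaps)
    for c in range(C):  # each channel uses only its own kernel (no cross-channel mixing)
        for y in range(H):
            for x in range(W):
                out[c, y, x] = np.sum(padded[c, y:y + k, x:x + k] * kernels[c])
    return out
```

In a deep-learning framework this per-channel behavior is what a grouped convolution with groups equal to the channel count provides; the explicit loops above are kept only to make the channel-wise structure visible.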
- 2 . The method of claim 1 , wherein after obtaining the local fusion feature maps of key points corresponding to the channels of the human body posture recognition model, performing global feature fusion processing comprises: performing an average pooling operation on the local fusion feature maps of key points corresponding to the channels of the human body posture recognition model to obtain global fusion feature maps of key points corresponding to the channels of the human body posture recognition model; and determining the fusion posture feature maps based on the global fusion feature maps of key points corresponding to the channels of the human body posture recognition model.
- 3 . The method of claim 1 , wherein performing global feature fusion processing on the key point feature map corresponding to each channel of the human body posture recognition model comprises: performing an average pooling operation on the key point feature maps corresponding to channels of the human body posture recognition model to obtain global fusion feature maps of key points corresponding to the channels of the human body posture recognition model; and determining the fusion posture feature maps based on the global fusion feature maps of key points corresponding to the channels of the human body posture recognition model.
- 4 . The method of claim 3 , wherein after obtaining the global fusion feature maps of key points corresponding to the channels of the human body posture recognition model, performing local feature fusion processing comprises: using global fusion feature maps of key points corresponding to channels of the human body posture recognition model as feature maps to be locally fused; dividing the feature maps to be locally fused corresponding to the channels of the human body posture recognition model into multiple feature map groups according to a preset grouping rule; performing local feature fusion processing using the feature maps to be locally fused corresponding to an i-th channel in a g-th feature map group and the feature maps to be locally fused corresponding to each channel in the g-th feature map group except the feature maps to be locally fused corresponding to the i-th channel to obtain local fusion feature maps of key points corresponding to the channels of the human body posture recognition model; and determining the fusion posture feature maps based on the local fusion feature maps of key points corresponding to the channels of the human body posture recognition model.
- 5 . The method of claim 1 , wherein local feature fusion processing is performed according to the following equation: U_g[i] = U_g^1[i] + f(U_g^1[Ω_g\i]), where U_g[i] represents the local fusion feature map of key points corresponding to an i-th channel in a g-th feature map group, U_g^1[Ω_g\i] represents the feature maps to be locally fused corresponding to each channel in the g-th feature map group except the feature map to be locally fused corresponding to the i-th channel, f(·) represents a convolution operation, U_g^1[i] represents the feature map to be locally fused corresponding to the i-th channel in the g-th feature map group, 1≤i≤N, N represents a total number of key points included in a key point set of the g-th feature map group, 1≤g≤G, and G represents a total number of the feature map groups.
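The local-fusion equation of claim 5 can be sketched directly in code. In this illustrative sketch the claimed convolution f(·) is passed in as a callable and, in the test below, stood in for by a plain channel-wise sum — an assumption, since the patent does not fix a particular kernel here.

```python
import numpy as np

def local_fusion(group, f):
    """U_g[i] = U_g^1[i] + f(U_g^1[Omega_g \\ i]) for every channel i in one group.

    group: (N, H, W) feature maps to be locally fused for one feature map group.
    f:     callable mapping the (N-1, H, W) stack of the *other* channels'
           maps to a single (H, W) map (the claim's convolution operation).
    Returns (N, H, W) local fusion feature maps.
    """
    N = group.shape[0]
    fused = np.empty_like(group)
    for i in range(N):
        others = np.delete(group, i, axis=0)  # Omega_g \ i: every channel but i
        fused[i] = group[i] + f(others)
    return fused
```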
- 6 . The method of claim 1 , wherein the feature map groups comprise: a first group that comprises the key points corresponding to a right ear, a left ear, a right eye, a left eye, a nose and a neck; a second group that comprises the key points corresponding to a right shoulder, a right elbow and a right hand; a third group that comprises the key points corresponding to a left shoulder, a left elbow and a left hand; a fourth group that comprises the key points corresponding to a right hip, a right knee and a right ankle; and a fifth group that comprises the key points corresponding to a left hip, a left knee and a left ankle.
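The five-group partition of claim 6 can be written down as a simple mapping. The group names below are illustrative labels, not terms from the patent; only the membership of each group follows the claim (18 key points total, matching FIG. 8).

```python
# Five-group key-point partition per claim 6 (18-key-point skeleton).
# Group names are illustrative; memberships follow the claim.
KEYPOINT_GROUPS = {
    "head":      ["right_ear", "left_ear", "right_eye", "left_eye", "nose", "neck"],
    "right_arm": ["right_shoulder", "right_elbow", "right_hand"],
    "left_arm":  ["left_shoulder", "left_elbow", "left_hand"],
    "right_leg": ["right_hip", "right_knee", "right_ankle"],
    "left_leg":  ["left_hip", "left_knee", "left_ankle"],
}

# Sanity check: five groups covering all 18 key points exactly once.
assert sum(len(v) for v in KEYPOINT_GROUPS.values()) == 18
```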
- 7 . The method of claim 1 , wherein the linear relationship between the channels of the human body posture recognition model is determined according to the following equation: s = σ(W_2 δ(W_1 z)), where s represents the linear relationship between the channels of the human body posture recognition model, δ represents a ReLU function, σ represents a sigmoid activation function, W_1 ∈ R^(C×C) and W_2 ∈ R^(C×C) represent the weights of two fully connected layers, C represents a total number of channels of the human body posture recognition model, and z represents the fusion posture feature maps.
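The channel-relationship equation of claim 7 and the weight-coefficient update it feeds can be sketched as follows. This is an illustrative NumPy sketch under the assumption that z is a length-C vector (one pooled scalar per channel) and that the resulting s rescales each channel's feature map; the patent's training of W_1 and W_2 is not shown.

```python
import numpy as np

def channel_relationship(z, W1, W2):
    """s = sigmoid(W2 @ relu(W1 @ z)), per the equation of claim 7.

    z:  (C,) fusion posture feature vector (one scalar per channel,
        e.g. after global average pooling).
    W1, W2: (C, C) fully connected layer weight matrices.
    Returns s in (0, 1)^C.
    """
    relu = lambda x: np.maximum(x, 0.0)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    return sigmoid(W2 @ relu(W1 @ z))

def reweight_channels(features, s):
    """Update the channel weight coefficients: scale channel c by s[c]."""
    return features * s[:, None, None]
```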
- 8 . A device comprising: one or more processors; and a memory coupled to the one or more processors, the memory storing programs that, when executed by the one or more processors, cause performance of operations comprising: obtaining heat maps comprising a predetermined number of key points of a human body by using a preset posture estimation algorithm; performing depth separable convolution on a feature map corresponding to one of the heat maps corresponding to each of the key points and a convolution kernel of a corresponding channel of a human body posture recognition model to determine a key point feature map corresponding to each channel of the human body posture recognition model; performing local feature fusion processing and/or global feature fusion processing on the key point feature map corresponding to each channel of the human body posture recognition model to obtain fusion posture feature maps; determining a linear relationship between the channels of the human body posture recognition model based on the fusion posture feature maps; and updating weight coefficients of the corresponding channels of the human body posture recognition model by using the linear relationship between the channels of the human body posture recognition model; wherein performing global feature fusion processing on the key point feature map corresponding to each channel of the human body posture recognition model comprises: performing an average pooling operation on the key point feature maps corresponding to channels of the human body posture recognition model to obtain global fusion feature maps of key points corresponding to the channels of the human body posture recognition model; and determining the fusion posture feature maps based on the global fusion feature maps of key points corresponding to the channels of the human body posture recognition model; and wherein after obtaining the global fusion feature maps of key points corresponding to the channels of the human body 
posture recognition model, performing local feature fusion processing comprises: using global fusion feature maps of key points corresponding to channels of the human body posture recognition model as feature maps to be locally fused; dividing the feature maps to be locally fused corresponding to the channels of the human body posture recognition model into multiple feature map groups according to a preset grouping rule; performing local feature fusion processing using the feature maps to be locally fused corresponding to an i-th channel in a g-th feature map group and the feature maps to be locally fused corresponding to each channel in the g-th feature map group except the feature maps to be locally fused corresponding to the i-th channel to obtain local fusion feature maps of key points corresponding to the channels of the human body posture recognition model; and determining the fusion posture feature maps based on the local fusion feature maps of key points corresponding to the channels of the human body posture recognition model.
- 9 . The device of claim 8 , wherein performing local feature fusion processing on the key point feature map corresponding to each channel of the human body posture recognition model comprises: using key point feature maps corresponding to channels of the human body posture recognition model as feature maps to be locally fused; dividing the feature maps to be locally fused corresponding to the channels of the human body posture recognition model into multiple feature map groups according to a preset grouping rule; performing local feature fusion processing using the feature maps to be locally fused corresponding to an i-th channel in a g-th feature map group and the feature maps to be locally fused corresponding to each channel in the g-th feature map group except the feature maps to be locally fused corresponding to the i-th channel to obtain local fusion feature maps of key points corresponding to the channels of the human body posture recognition model; and determining the fusion posture feature maps based on the local fusion feature maps of key points corresponding to the channels of the human body posture recognition model.
- 10 . The device of claim 9 , wherein after obtaining the local fusion feature maps of key points corresponding to the channels of the human body posture recognition model, performing global feature fusion processing comprises: performing an average pooling operation on the local fusion feature maps of key points corresponding to the channels of the human body posture recognition model to obtain global fusion feature maps of key points corresponding to the channels of the human body posture recognition model; and determining the fusion posture feature maps based on the global fusion feature maps of key points corresponding to the channels of the human body posture recognition model.
- 11 . The device of claim 9 , wherein local feature fusion processing is performed according to the following equation: U_g[i] = U_g^1[i] + f(U_g^1[Ω_g\i]), where U_g[i] represents the local fusion feature map of key points corresponding to an i-th channel in a g-th feature map group, U_g^1[Ω_g\i] represents the feature maps to be locally fused corresponding to each channel in the g-th feature map group except the feature map to be locally fused corresponding to the i-th channel, f(·) represents a convolution operation, U_g^1[i] represents the feature map to be locally fused corresponding to the i-th channel in the g-th feature map group, 1≤i≤N, N represents a total number of key points included in a key point set of the g-th feature map group, 1≤g≤G, and G represents a total number of the feature map groups.
- 12 . The device of claim 9 , wherein the feature map groups comprise: a first group that comprises the key points corresponding to a right ear, a left ear, a right eye, a left eye, a nose and a neck; a second group that comprises the key points corresponding to a right shoulder, a right elbow and a right hand; a third group that comprises the key points corresponding to a left shoulder, a left elbow and a left hand; a fourth group that comprises the key points corresponding to a right hip, a right knee and a right ankle; and a fifth group that comprises the key points corresponding to a left hip, a left knee and a left ankle.
- 13 . The device of claim 8 , wherein the linear relationship between the channels of the human body posture recognition model is determined according to the following equation: s = σ(W_2 δ(W_1 z)), where s represents the linear relationship between the channels of the human body posture recognition model, δ represents a ReLU function, σ represents a sigmoid activation function, W_1 ∈ R^(C×C) and W_2 ∈ R^(C×C) represent the weights of two fully connected layers, C represents a total number of channels of the human body posture recognition model, and z represents the fusion posture feature maps.
- 14 . A non-transitory computer-readable storage medium storing instructions that, when executed by at least one processor of a device, cause the at least one processor to perform a method, the method comprising: obtaining heat maps comprising a predetermined number of key points of a human body by using a preset posture estimation algorithm; performing depth separable convolution on a feature map corresponding to one of the heat maps corresponding to each of the key points and a convolution kernel of a corresponding channel of a human body posture recognition model to determine a key point feature map corresponding to each channel of the human body posture recognition model; performing local feature fusion processing and/or global feature fusion processing on the key point feature map corresponding to each channel of the human body posture recognition model to obtain fusion posture feature maps; determining a linear relationship between the channels of the human body posture recognition model based on the fusion posture feature maps; and updating weight coefficients of the corresponding channels of the human body posture recognition model by using the linear relationship between the channels of the human body posture recognition model; wherein performing local feature fusion processing on the key point feature map corresponding to each channel of the human body posture recognition model comprises: using key point feature maps corresponding to channels of the human body posture recognition model as feature maps to be locally fused; dividing the feature maps to be locally fused corresponding to the channels of the human body posture recognition model into multiple feature map groups according to a preset grouping rule; performing local feature fusion processing using the feature maps to be locally fused corresponding to an i-th channel in a g-th feature map group and the feature maps to be locally fused corresponding to each channel in the g-th feature map group 
except the feature maps to be locally fused corresponding to the i-th channel to obtain local fusion feature maps of key points corresponding to the channels of the human body posture recognition model; and determining the fusion posture feature maps based on the local fusion feature maps of key points corresponding to the channels of the human body posture recognition model.
- 15 . The non-transitory computer-readable storage medium of claim 14 , wherein after obtaining the local fusion feature maps of key points corresponding to the channels of the human body posture recognition model, performing global feature fusion processing comprises: performing an average pooling operation on the local fusion feature maps of key points corresponding to the channels of the human body posture recognition model to obtain global fusion feature maps of key points corresponding to the channels of the human body posture recognition model; and determining the fusion posture feature maps based on the global fusion feature maps of key points corresponding to the channels of the human body posture recognition model.
- 16 . The non-transitory computer-readable storage medium of claim 14 , wherein performing global feature fusion processing on the key point feature map corresponding to each channel of the human body posture recognition model comprises: performing an average pooling operation on the key point feature maps corresponding to channels of the human body posture recognition model to obtain global fusion feature maps of key points corresponding to the channels of the human body posture recognition model; and determining the fusion posture feature maps based on the global fusion feature maps of key points corresponding to the channels of the human body posture recognition model.
- 17 . The non-transitory computer-readable storage medium of claim 16 , wherein after obtaining the global fusion feature maps of key points corresponding to the channels of the human body posture recognition model, performing local feature fusion processing comprises: using global fusion feature maps of key points corresponding to channels of the human body posture recognition model as feature maps to be locally fused; dividing the feature maps to be locally fused corresponding to the channels of the human body posture recognition model into multiple feature map groups according to a preset grouping rule; performing local feature fusion processing using the feature maps to be locally fused corresponding to an i-th channel in a g-th feature map group and the feature maps to be locally fused corresponding to each channel in the g-th feature map group except the feature maps to be locally fused corresponding to the i-th channel to obtain local fusion feature maps of key points corresponding to the channels of the human body posture recognition model; and determining the fusion posture feature maps based on the local fusion feature maps of key points corresponding to the channels of the human body posture recognition model.
- 18 . The non-transitory computer-readable storage medium of claim 14 , wherein local feature fusion processing is performed according to the following equation: U_g[i] = U_g^1[i] + f(U_g^1[Ω_g\i]), where U_g[i] represents the local fusion feature map of key points corresponding to an i-th channel in a g-th feature map group, U_g^1[Ω_g\i] represents the feature maps to be locally fused corresponding to each channel in the g-th feature map group except the feature map to be locally fused corresponding to the i-th channel, f(·) represents a convolution operation, U_g^1[i] represents the feature map to be locally fused corresponding to the i-th channel in the g-th feature map group, 1≤i≤N, N represents a total number of key points included in a key point set of the g-th feature map group, 1≤g≤G, and G represents a total number of the feature map groups.
- 19 . The non-transitory computer-readable storage medium of claim 14 , wherein the feature map groups comprise: a first group that comprises the key points corresponding to a right ear, a left ear, a right eye, a left eye, a nose and a neck; a second group that comprises the key points corresponding to a right shoulder, a right elbow and a right hand; a third group that comprises the key points corresponding to a left shoulder, a left elbow and a left hand; a fourth group that comprises the key points corresponding to a right hip, a right knee and a right ankle; and a fifth group that comprises the key points corresponding to a left hip, a left knee and a left ankle.
- 20 . The non-transitory computer-readable storage medium of claim 14 , wherein the linear relationship between the channels of the human body posture recognition model is determined according to the following equation: s = σ(W_2 δ(W_1 z)), where s represents the linear relationship between the channels of the human body posture recognition model, δ represents a ReLU function, σ represents a sigmoid activation function, W_1 ∈ R^(C×C) and W_2 ∈ R^(C×C) represent the weights of two fully connected layers, C represents a total number of channels of the human body posture recognition model, and z represents the fusion posture feature maps.
Description
CROSS REFERENCE TO RELATED APPLICATIONS
The present application is a continuation application of International Application PCT/CN2021/132113, with an international filing date of Nov. 22, 2021, which claims foreign priority to Chinese Patent Application No. 202011590719.7, filed on Dec. 29, 2020 in the China National Intellectual Property Administration, the contents of all of which are hereby incorporated by reference in their entireties.
TECHNICAL FIELD
The present disclosure generally relates to artificial intelligence, and particularly to a method for optimizing a human body posture recognition model, a device, and a computer-readable storage medium.
BACKGROUND
The main task of human posture estimation is to locate the key points (e.g., elbows, wrists, knees, etc.) of a human body in input images, a task with practical value in visual scenarios such as motion recognition and human-computer interaction. In the field of service robots, a human body posture estimation algorithm allows robots to better understand human actions, which is the basis for robots to understand and analyze various human behaviors. However, some conventional methods directly calculate the error between the heat maps and the ground-truth values without further analyzing the heat maps, which results in low recognition accuracy. Therefore, there is a need for a method for optimizing a human body posture recognition model that overcomes the above-mentioned problem.
BRIEF DESCRIPTION OF DRAWINGS
Many aspects of the present embodiments can be better understood with reference to the following drawings. The components in the drawings are not necessarily drawn to scale, the emphasis instead being placed upon clearly illustrating the principles of the present embodiments. Moreover, in the drawings, all the views are schematic, and like reference numerals designate corresponding parts throughout the several views.
FIG. 1 is a schematic block diagram of a device according to one embodiment.
FIG. 2 is an exemplary flowchart of a method for optimizing a human body posture recognition model according to one embodiment.
FIG. 3 is a schematic diagram showing 18 key points of a human body.
FIG. 4 is a schematic diagram showing 25 key points of a human body.
FIG. 5 is a schematic diagram showing multi-layer feature extraction according to one embodiment.
FIG. 6 is a schematic diagram showing multi-layer feature extraction according to another embodiment.
FIG. 7 is an exemplary flowchart of a method for optimizing a human body posture recognition model according to another embodiment.
FIG. 8 is a schematic diagram showing 18 key points of a human body divided into five groups.
FIG. 9 is an exemplary flowchart of a method for optimizing a human body posture recognition model according to another embodiment.
FIG. 10 is an exemplary flowchart of a method for optimizing a human body posture recognition model according to another embodiment.
FIG. 11 is an exemplary flowchart of a method for optimizing a human body posture recognition model according to another embodiment.
FIG. 12 is a schematic block diagram of a human body posture recognition model optimization device according to one embodiment.
DETAILED DESCRIPTION
The disclosure is illustrated by way of example and not by way of limitation in the figures of the accompanying drawings, in which like reference numerals indicate similar elements. It should be noted that references to “an” or “one” embodiment in this disclosure are not necessarily to the same embodiment, and such references can mean “at least one” embodiment.
Although the features and elements of the present disclosure are described as embodiments in particular combinations, each feature or element can be used alone or in other various combinations within the principles of the present disclosure to the full extent indicated by the broad general meaning of the terms in which the appended claims are expressed.
Human body posture estimation algorithms can be divided into two main categories: top-down approaches and bottom-up approaches. The top-down approach consists of two stages, namely object detection and single-person key point detection. The object detection stage detects all people in an input image. Single-person key point detection estimates the posture of each person in the image, finding the required key points, such as the head, left hand, and right foot, within each cropped person. The bottom-up approach consists of two parts, key point detection and key point matching. Key point detection locates the unidentified key points of all people in the input image by predicting the heat maps corresponding to different key points. Key point matching uses association or matching algorithms (e.g., greedy algorithms, dynamic programming, tag matching, etc.) to connect the different key points of different people together to generate distinct individuals. Both the top-down approach and the bottom