US-12626404-B2 - Training a pose estimation model to determine anatomy keypoints in images


Abstract

Provided are a computer program product, system, and method for training a pose estimation model to determine anatomy keypoints in images. A teacher network, implementing machine learning, processes images representing anatomies to produce heatmaps representing keypoints of the anatomies. An anatomy parsing network, implementing machine learning, processes the images to produce segmentation representations labeling anatomies represented in the images. The segmentation representations from the anatomy parsing network and the heatmaps from the teacher network are concatenated to produce mixed heatmaps. A pose estimation model, implementing machine learning, is trained to process the images to output predicted heatmaps to minimize a loss function of the output predicted heatmaps from the pose estimation model and the mixed heatmaps.
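The concatenation and loss steps summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `mix_heatmaps` and `mse_loss` are hypothetical, the concatenation is assumed to be channel-wise with the heatmap channels placed first, and mean squared error is assumed as the loss because the abstract does not name a specific loss function. Each map is represented as a plain list of rows so the sketch needs no external libraries.

```python
def mix_heatmaps(teacher_heatmaps, segmentation_maps):
    """Concatenate teacher heatmaps and segmentation maps along the channel axis.

    Each argument is a list of channels; each channel is a list of rows (H x W).
    ASSUMPTION: heatmap channels come first; the source does not state an order.
    """
    return list(teacher_heatmaps) + list(segmentation_maps)


def mse_loss(predicted, target):
    """Mean squared error over all channels and pixels.

    ASSUMPTION: MSE stands in for the unspecified loss function.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same channel count")
    total = 0.0
    count = 0
    for pred_map, target_map in zip(predicted, target):
        for pred_row, target_row in zip(pred_map, target_map):
            for pred_value, target_value in zip(pred_row, target_row):
                total += (pred_value - target_value) ** 2
                count += 1
    return total / count
```

In this sketch the pose estimation model would be expected to output as many channels as the mixed heatmaps contain (keypoint channels plus segmentation channels); that correspondence is an assumption made here so the loss is well defined.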

Inventors

Bo Wu
Chuang Gan
Yada Zhu
Pin-Yu Chen

Assignees

INTERNATIONAL BUSINESS MACHINES CORPORATION

Dates

Publication Date: 20260512
Application Date: 20230601

Claims (20)

1. A computer program product for training a pose estimation model to locate anatomy keypoints in an anatomy represented in a digital image, the computer program product comprising a computer readable storage medium having computer readable program code embodied therein that is executable to perform operations, the operations comprising: processing, by a teacher network, implementing machine learning, images representing anatomies to produce heatmaps representing keypoints of the anatomies; processing, by an anatomy parsing network, implementing machine learning, the images to produce segmentation representations labeling anatomies represented in the images; concatenating the segmentation representations from the anatomy parsing network and the heatmaps from the teacher network to produce mixed heatmaps; and training the pose estimation model, implementing machine learning, to process the images to output predicted heatmaps to minimize a loss function of the output predicted heatmaps from the pose estimation model and the mixed heatmaps.
2. The computer program product of claim 1, wherein the operations further comprise: determining whether a probability that a mixed heatmap has keypoints within an anatomical region exceeds a probability threshold, wherein the mixed heatmap is used to train the pose estimation model for an image in response to the probability exceeding the probability threshold; and using a heatmap of the heatmaps for an image to train the pose estimation model in response to the probability not exceeding the probability threshold.
3. The computer program product of claim 1, wherein the pose estimation model comprises a student pose estimation model, wherein the operations further comprise: generating, by a teacher pose estimation model in the teacher network, a first set of heatmaps representing keypoints of an anatomy; and applying a Gaussian transformer to the first set of heatmaps to generate Gaussian heatmaps representing the keypoints of the anatomy, wherein the Gaussian heatmaps comprise the heatmaps concatenated with the segmentation representations to produce the mixed heatmaps.
4. The computer program product of claim 1, wherein a data set of images is used to train a segmentation model in the anatomy parsing network to produce segmentation representations that are concatenated with the heatmaps to produce the mixed heatmaps.
5. The computer program product of claim 1, wherein the anatomy parsing network comprises: a segmentation model that processes the images to produce predicted segmentations of the images; a pre-trained parsing model to produce pseudo-label segmentations of the images; and wherein the anatomy parsing network performs: determining a loss function to minimize a difference of the predicted segmentations from the segmentation model and the pseudo-label segmentations from the pre-trained parsing model; and training the segmentation model to produce predicted segmentations to minimize the loss function.
6. The computer program product of claim 5, wherein the operations further comprise: mapping, by an encoder, the images to image vectors to input into the pose estimation model in a pose estimation network to output predicted heatmaps, to input into the segmentation model of the anatomy parsing network to produce the predicted segmentations, and to input into the pre-trained parsing model to produce the pseudo-label segmentations used to train the segmentation model.
7. The computer program product of claim 5, wherein the images on which the pose estimation model and the segmentation model are trained comprise unlabeled images.
8. The computer program product of claim 5, wherein the images inputted into the pose estimation model and the segmentation model comprise hard augmented images resulting from a hard augmentation of the images, and wherein the images inputted into a teacher pose estimation model in the teacher network to produce the heatmaps comprise easy augmented images resulting from an easy augmentation of the images.
9. A system for training a pose estimation model to locate anatomy keypoints in an anatomy represented in a digital image, comprising: a teacher network, implementing machine learning, to process images representing anatomies to produce heatmaps representing keypoints of the anatomies; an anatomy parsing network, implementing machine learning, to process the images to produce segmentation representations labeling anatomies represented in the images; a heatmap selector to concatenate the segmentation representations from the anatomy parsing network and the heatmaps from the teacher network to produce mixed heatmaps; and a computer readable storage medium having program instructions executed to train the pose estimation model, implementing machine learning, to process the images to output predicted heatmaps to minimize a loss function of the output predicted heatmaps from the pose estimation model and mixed heatmaps.
10. The system of claim 9, wherein the heatmap selector further performs: determining whether a probability that a mixed heatmap has keypoints within an anatomical region exceeds a probability threshold, wherein the mixed heatmap is used to train the pose estimation model for an image in response to the probability exceeding the probability threshold; and using a heatmap of the heatmaps for an image to train the pose estimation model in response to the probability not exceeding the probability threshold.
11. The system of claim 9, wherein the pose estimation model comprises a student pose estimation model, wherein the teacher network further comprises: a teacher pose estimation model to generate a first set of heatmaps representing keypoints of an anatomy; and a Gaussian transformer to transform the first set of heatmaps to Gaussian heatmaps representing the keypoints of the anatomy, wherein the Gaussian heatmaps comprise the heatmaps concatenated with the segmentation representations to produce the mixed heatmaps.
12. The system of claim 9, wherein the anatomy parsing network comprises: a segmentation model in the anatomy parsing network trained with a data set of images to produce segmentation representations that are concatenated with the heatmaps to produce the mixed heatmaps.
13. The system of claim 9, wherein the anatomy parsing network comprises: a segmentation model to process the images to produce predicted segmentations of the images; and a pre-trained parsing model to produce pseudo-label segmentations of the images, wherein a loss function is used to minimize a difference of the predicted segmentations from the segmentation model and the pseudo-label segmentations from the pre-trained parsing model, and wherein the segmentation model is trained to produce predicted segmentations to minimize the loss function.
14. The system of claim 13, further comprising: an encoder to map the images to image vectors to input into the pose estimation model in a pose estimation network to output predicted heatmaps, to input into the segmentation model of the anatomy parsing network to produce the predicted segmentations, and to input into the pre-trained parsing model to produce the pseudo-label segmentations used to train the segmentation model.
15. A method for training a pose estimation model to locate anatomy keypoints in an anatomy represented in a digital image, comprising: processing images representing anatomies to produce heatmaps representing keypoints of the anatomies; processing the images to produce segmentation representations labeling anatomies represented in the images; concatenating the segmentation representations and the heatmaps to produce mixed heatmaps; and training the pose estimation model, implementing machine learning, to process the images to output predicted heatmaps to minimize a loss function of the output predicted heatmaps from the pose estimation model and the mixed heatmaps.
16. The method of claim 15, further comprising: determining whether a probability that a mixed heatmap has keypoints within an anatomical region exceeds a probability threshold, wherein the mixed heatmap is used to train the pose estimation model for an image in response to the probability exceeding the probability threshold; and using a heatmap of the heatmaps for an image to train the pose estimation model in response to the probability not exceeding the probability threshold.
17. The method of claim 15, further comprising: generating a first set of heatmaps representing keypoints of an anatomy; and applying a Gaussian transformer to the first set of heatmaps to generate Gaussian heatmaps representing the keypoints of the anatomy, wherein the Gaussian heatmaps comprise the heatmaps concatenated with the segmentation representations to produce the mixed heatmaps.
18. The method of claim 15, further comprising: using a data set of images to train a segmentation model to produce segmentation representations that are concatenated with the heatmaps to produce the mixed heatmaps.
19. The method of claim 15, further comprising: processing, by a segmentation model, the images to produce predicted segmentations of the images; generating, by a pre-trained parsing model, pseudo-label segmentations of the images; determining a loss function to minimize a difference of the predicted segmentations from the segmentation model and the pseudo-label segmentations from the pre-trained parsing model; and training the segmentation model to produce predicted segmentations to minimize the loss function.
20. The method of claim 19, further comprising: mapping, by an encoder, the images to image vectors to input into the pose estimation model in a pose estimation network to output predicted heatmaps, to input into the segmentation model to produce the predicted segmentations, and to input into the pre-trained parsing model to produce the pseudo-label segmentations used to train the segmentation model.
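
The Gaussian transformation and threshold-based selection recited in the dependent claims can be illustrated with a short sketch. Everything specific here is an assumption made for illustration: the function names, the choice to re-render each raw heatmap as a Gaussian centered on its peak, the default `sigma` and `threshold` values, and the definition of the probability as the fraction of heatmap mass that falls inside a binary anatomical-region mask. The claims do not specify how the probability is computed.

```python
import math


def gaussian_heatmap(raw, sigma=1.0):
    """Re-render a raw heatmap as a 2D Gaussian centered on its peak.

    ASSUMPTION: one plausible reading of a "Gaussian transformer".
    """
    height, width = len(raw), len(raw[0])
    # Locate the peak; max() returns the first maximum in row-major order.
    peak_y, peak_x = max(
        ((y, x) for y in range(height) for x in range(width)),
        key=lambda yx: raw[yx[0]][yx[1]],
    )
    return [
        [
            math.exp(-((x - peak_x) ** 2 + (y - peak_y) ** 2) / (2 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def region_probability(heatmap, mask):
    """Fraction of heatmap mass inside a binary region mask (entries 0 or 1).

    ASSUMPTION: hypothetical stand-in for the claimed probability.
    """
    total = sum(sum(row) for row in heatmap)
    if total == 0:
        return 0.0
    inside = sum(
        value * flag
        for heat_row, mask_row in zip(heatmap, mask)
        for value, flag in zip(heat_row, mask_row)
    )
    return inside / total


def select_target(heatmap, mask, threshold=0.5):
    """Return the mixed target if the probability exceeds the threshold,
    otherwise fall back to the heatmap alone."""
    if region_probability(heatmap, mask) > threshold:
        return [heatmap, mask]  # mixed: heatmap channel plus segmentation channel
    return [heatmap]
```

Note that the two branches of `select_target` return different channel counts; how the training loss accommodates that difference is not stated in the claims and is left open in this sketch.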

Description

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a computer program product, system, and method for training a pose estimation model to determine anatomy keypoints in images.

2. Description of the Related Art

Human pose estimation computer technology seeks to locate the keypoints of the human anatomy (such as eyes, knees, arms, legs, etc.) within an image or a video, which is a fundamental task in the field of computer vision and has various practical applications, including action recognition, human-object interaction, pose tracking, and visual reasoning. Early efforts in deep learning directly regressed keypoint coordinates from the given images. Recent supervised approaches have adopted a heatmap-based framework for better supervision. Despite substantial advancements in supervised learning, substantial labeled data is crucial for its effectiveness. Improving pose estimation performance through the use of larger, high-quality datasets in supervised learning is costly, as collecting labeled data can be both time-consuming and labor-intensive.

To mitigate the need for labeled data, some attempts have been made towards semi-supervised human pose estimation, leveraging both limited labeled images and abundant unlabeled images. However, semi-supervised learning in human pose estimation is challenging due to the limited number of labeled images and the sparse labeling structure (i.e., the number of background pixels in images is dominant). Existing semi-supervised human pose estimation methods focus on effectively using abundant unlabeled images. Pseudo-labeling and consistency regularization are two common paradigms for utilizing unlabeled images. The pseudo-labeling paradigm generates pseudo-labels for unlabeled images using a fixed teacher network pre-trained on limited labeled images, and then uses these pseudo-labels for supervised training. However, the fixed teacher network's performance is limited by the initial labeled data, leading to the generation of incorrect pseudo-labels that cannot be rectified.

There is a need in the art for improved techniques to train a pose estimation model to estimate keypoints in an image of a body.

SUMMARY

Provided are a computer program product, system, and method for training a pose estimation model to determine anatomy keypoints in images. A teacher network, implementing machine learning, processes images representing anatomies to produce heatmaps representing keypoints of the anatomies. An anatomy parsing network, implementing machine learning, processes the images to produce segmentation representations labeling anatomies represented in the images. The segmentation representations from the anatomy parsing network and the heatmaps from the teacher network are concatenated to produce mixed heatmaps. A pose estimation model, implementing machine learning, is trained to process the images to output predicted heatmaps to minimize a loss function of the output predicted heatmaps from the pose estimation model and the mixed heatmaps.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an embodiment of a human pose training system.
FIGS. 2a and 2b illustrate an embodiment of operations to use a teacher network and anatomy parsing network to train a pose estimation model in a pose estimation network.
FIG. 3 illustrates a computing environment in which the components of FIGS. 1 and 2 may be implemented.

DETAILED DESCRIPTION

In current techniques using a teacher network to train a student network, the teacher network's predictions for unlabeled images (pseudo-labels) contain noise. Existing pose estimation methods do not constrain the impact of pseudo-label noise on the student network's learning, so the student network grows more confident in erroneous predictions as it learns from unreliable pseudo-labels.
Further, existing methods over-rely on the sparse and noisy pseudo-labels. The pseudo heatmap reveals a limited high-density keypoint area within an image, indicating that predicted keypoint locations are sparsely distributed. This sparsity makes the pseudo heatmap less error-tolerant: given an incorrect pseudo heatmap, the model cannot learn even the approximate range of the correct keypoint. When learning with unlabeled images, the student network focuses solely on the potentially incorrect pseudo heatmaps, at the risk of overfitting to the wrong guidance.

Described embodiments address the above technical problems by providing regional guidance for semi-supervised human pose estimation. The regional guidance comprises a type of supervision signal that covers regions of an image rather than individual pixels, and extends the consistency regularization paradigm with region-level supervision. Described embodiments introduce dense regional guidance through human parsing, which aims to segment the semantics of different parts of the human body. For instance, humans estimate pose by leveraging sema