EP-4121940-B1 - METHOD AND SYSTEM FOR MATCHING 2D HUMAN POSES FROM MULTIPLE VIEWS
Inventors
- ZHANG, WENXIN
Dates
- Publication Date: 2026-05-06
- Application Date: 2020-03-20
Claims (7)
- A computer-implemented method of identifying humans between two or more camera views from two-dimensional '2D' skeletons of the humans of each view comprising: a) for each skeleton in each of the two or more camera views, performing a pairwise scoring with each of the skeletons in another of the two or more camera views and assigning an affinity score (60) to each pair, wherein the pairwise scoring of a pair of skeletons from a pair of camera views comprises modelling a ray from each camera view to an element of the 2D skeleton associated with the camera view and determining the minimum distance between the two rays, and wherein the pairwise scoring further comprises determining a deviation of attributes of a putative three-dimensional '3D' skeleton formed from the 2D skeletons from a typical human; b) identifying a best match of a skeleton in a first camera view to a skeleton in a second camera view by maximizing the affinity score (70) of the pair; and c) grouping skeletons by identifying a set of skeletons in a first camera view, the set relating to the humans in the first camera view, with a set of skeletons in a second camera view using the best match.
- The method of identifying humans between two or more camera views of claim 1 further comprising assigning an identifier to each skeleton in the grouped skeletons in a frame of the camera views, and assigning the same identifier to each skeleton in the grouped skeletons in a subsequent frame of the camera views that matches.
- The method of claim 1 wherein if the rays are divergent, the pair is not included in the affinity score (60).
- The method of claim 1 wherein the pairwise scoring of a pair of skeletons from a pair of camera views further comprises excluding elements where the minimum distance between the two rays exceeds a threshold.
- The method of any one of claims 1 to 4 further comprising calibrating each camera view by determining the position and angle of the camera, and synchronizing the camera views by aligning frames taken at the same time from the two or more camera views.
- The method of any one of claims 1 to 5 wherein identifying a best match of a skeleton in a first camera view to a skeleton in a second camera view includes not identifying any match.
- A motion capture system for two or more humans comprising: a pair of calibrated cameras generating synchronized video streams, the pair of calibrated cameras including a first camera and a second camera having overlapping fields of view that include the two or more humans; a first 2D pose estimator module associated with the first camera and configured to generate a 2D skeleton for each human in the corresponding field of view for a frame of a first one of the synchronized video streams; a second 2D pose estimator module associated with the second camera and configured to generate a 2D skeleton for each human in the corresponding field of view for a frame of a second one of the synchronized video streams; a scoring module (20) configured to perform a pairwise scoring for each of the 2D skeletons associated with the first camera with each of the 2D skeletons associated with the second camera and to assign an affinity score to each pair, wherein the scoring module is configured to perform the pairwise scoring by modelling a ray from each camera view to an element of the 2D skeleton associated with that camera view and determining the minimum distance between the two rays, and the scoring module is further configured to perform the pairwise scoring by determining a deviation of attributes of a putative three-dimensional '3D' skeleton formed from the 2D skeletons from a typical human; a matching module (30) that is configured to match a 2D skeleton in the first camera view to a 2D skeleton in the second camera view by maximizing the affinity score of the pair, thereby identifying a best match; a grouping module (50) that is configured to group 2D skeletons by identifying a set of 2D skeletons in the first camera view, the set relating to the humans in the first camera view, with a set of 2D skeletons in the second camera view using the best match; a temporal matching module (60) that is configured to assign an identifier to each 2D skeleton group that remains consistent across a sequence of frames of 
the video streams; and a 3D reconstruction module that is configured to combine the grouped 2D skeletons across a sequence of frames for a human to create a 3D skeleton of the human, capturing the position of the human.
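The pairwise scoring recited in claim 1 rests on the minimum distance between two rays back-projected from corresponding skeleton elements, with divergent rays excluded (claim 3) and large distances thresholded out (claim 4). The following is a minimal sketch of that geometric test, not the claimed implementation; the function name and the origin-plus-unit-direction ray representation are assumptions for illustration.

```python
import numpy as np

def ray_ray_min_distance(o1, d1, o2, d2, eps=1e-9):
    """Minimum distance between two rays o + t*d (t >= 0).

    Returns (distance, divergent). The rays are treated as divergent
    when their closest approach would lie behind either camera (t < 0),
    in which case the pair can be excluded from the affinity score.
    """
    o1, d1, o2, d2 = (np.asarray(v, dtype=float) for v in (o1, d1, o2, d2))
    d1 = d1 / np.linalg.norm(d1)
    d2 = d2 / np.linalg.norm(d2)
    w = o1 - o2
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ w, d2 @ w
    denom = a * c - b * b  # ~0 when the rays are parallel
    if denom < eps:
        t1, t2 = 0.0, max(e / c, 0.0)
    else:
        # Standard closest-point parameters for two 3D lines.
        t1 = (b * e - c * d) / denom
        t2 = (a * e - b * d) / denom
    divergent = t1 < 0 or t2 < 0
    t1, t2 = max(t1, 0.0), max(t2, 0.0)
    p1, p2 = o1 + t1 * d1, o2 + t2 * d2
    return float(np.linalg.norm(p1 - p2)), divergent
```

Per-element distances computed this way could then be accumulated into a skeleton-pair affinity, skipping any element whose distance exceeds a threshold.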
Description
FIELD This disclosure relates to identifying and tracking 2D joint skeletons in video segments. More particularly, this disclosure relates to matching 2D skeletal data corresponding to the same person where the 2D data is extracted from frames of video segments taken from multiple viewpoints. BACKGROUND Reconstruction of 3D human poses from synchronized 2D video sequences may be accomplished in two stages. The first stage, 2D human pose estimation, detects keypoints in each frame of each video sequence. The second stage fuses the 2D keypoints, along with the camera calibration parameters, into 3D skeletons. 2D human pose estimators may rely on deep neural networks to detect keypoints, which may correspond to anatomical joints, in each video frame of a video sequence. A group of keypoints belonging to a single person may be connected to form a 2D skeleton. For scenes containing multiple persons, multiple 2D skeletons may be detected in each frame, and each is assigned an index or unique ID. Multi-person pose estimation may be accomplished by performing keypoint detection on multiple regions of interest, or it may be accomplished by detecting all keypoints in a single image frame jointly in "one shot" and then grouping them into individual 2D skeletons. For each person in the scene, 2D skeletons that correspond to the specific person are grouped together and the 3D skeleton is estimated through a data fusion technique. For instance, each 3D joint position may be independently estimated by triangulation of 2 or more keypoints. Alternatively, 3D joint positions may be estimated by Kalman Filters that model the motion of the joints over time. For scenes containing multiple persons, it may be important that 2D skeletons be grouped such that each group corresponds to a single person. Because the 2D skeletons in each view may be extracted independently, their indices or IDs are not correlated across views. 
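The triangulation step mentioned above, in which each 3D joint is independently estimated from two or more 2D keypoints, is commonly realised with a linear (direct linear transform) solve. The sketch below illustrates that standard technique under assumed inputs (pinhole projection matrices P = K[R | t]); it is not taken from the disclosure itself.

```python
import numpy as np

def triangulate_joint(keypoints, projections):
    """Linear (DLT) triangulation of one 3D joint.

    keypoints:   list of (u, v) pixel coordinates, one per view
    projections: list of 3x4 camera projection matrices P = K [R | t]
    Returns the 3D point minimising the algebraic reprojection error.
    """
    rows = []
    for (u, v), P in zip(keypoints, projections):
        P = np.asarray(P, dtype=float)
        # Each view contributes two linear constraints on the
        # homogeneous 3D point X: (u*P3 - P1) X = 0, (v*P3 - P2) X = 0.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # Homogeneous least squares: X is the right singular vector
    # associated with the smallest singular value of A.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```

With noisy keypoints this algebraic solution is typically refined, for example by the Kalman filtering over time that the background also mentions.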
Accordingly, a matching step is typically used to identify the 2D groups that are fused in order to recover the 3D skeletons. GB 2573170 describes the reconstruction of 3D skeletons from views of one or more 3D real world objects, whereby improved 2D or 3D images of the 3D real world objects can be generated from the reconstructed 3D skeletons. SUMMARY The invention is defined by the appended claims. This disclosure relates in an aspect to a method as defined in claim 1. In an aspect, this disclosure relates to a motion capture system as defined in claim 7. BRIEF DESCRIPTION OF THE DRAWINGS In drawings which illustrate by way of example only an embodiment of the disclosure, Figure 1 is an exemplary pictorial representation of 2D skeleton data derived from three video sequences, in accordance with an embodiment. Figure 2 is a block diagram of a system for matching 2D human poses, in accordance with an embodiment. Figure 3 is an exemplary table of affinity scores for a pair of views, and the matching pairs produced by the pairwise matching module, in accordance with an embodiment. Figure 4 is an exemplary graph of pairwise matches, and the connected components or cycles that represent groups that each correspond to a unique person. DETAILED DESCRIPTION This disclosure is directed to a method and system for matching human pose data in the form of 2D skeletons for the purposes of 3D reconstruction. The system may comprise a scoring module 20 that assigns an affinity score to each pair of cross-view 2D skeletons; a matching module 30 that assigns optimal pairwise matches based on the affinity scores; a grouping module 50 that assigns each 2D skeleton to a group such that each group corresponds to a unique person, based on the pairwise matches; and a temporal consistency module 60 that assigns each group an ID that maintains correspondence to the same person over the multi-video sequence. 
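The matching module described above assigns optimal pairwise matches from the affinity scores, and claim 6 allows a skeleton to remain unmatched. One common way to realise such an assignment, shown here as an illustrative sketch rather than the disclosed implementation, is the Hungarian algorithm over the affinity matrix; the function name and `min_score` cut-off are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_skeletons(affinity, min_score=0.0):
    """Optimal one-to-one matching of skeletons across a view pair.

    affinity:  (n_view1, n_view2) matrix of affinity scores, one row
               per skeleton in the first view, one column per skeleton
               in the second view
    min_score: pairs scoring at or below this are left unmatched, so a
               skeleton may have no counterpart in the other view
    Returns a list of (i, j) index pairs for the matched skeletons.
    """
    affinity = np.asarray(affinity, dtype=float)
    # linear_sum_assignment minimises total cost, so negate the
    # affinities to maximise the total affinity instead.
    rows, cols = linear_sum_assignment(-affinity)
    return [(i, j) for i, j in zip(rows, cols) if affinity[i, j] > min_score]
```

The pairwise matches produced for every pair of views could then feed the grouping step, where connected components of the match graph (as in Figure 4) identify the set of 2D skeletons belonging to one person.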
With reference to Figure 1, 2D skeleton data 10 is extracted from two or more video sequences, taken from calibrated cameras. To perform 3D reconstruction, the 2D skeletons may be matched across views. A calibrated camera is preferably a camera for which field of view, angle and location information is known. The two or more video sequences are preferably synchronized so that each of the video sequences covers the same period of time and includes at least some of the same humans/skeletons. In some instances, one or more humans/skeletons may leave the field of view of one or more of the cameras. A 2D human pose estimator may generate 2D skeletons for each human in each of the two or more video sequences. This may be done using known techniques, such as a convolutional neural network (CNN), for example as provided by wrnch.AI. A sequence of 2D skeletons may be provided corresponding to the video sequences for each camera. With reference to Figure 2, the 2D matching system may comprise the following modules: the pairwise scoring module 20, the pairwise matching module 30, the grouping module 50, and the temporal consistency module 60. The