US-12620118-B2 - Skeleton recognition method, non-transitory computer-readable recording medium, and gymnastics scoring assist system

Abstract

A skeleton recognition method includes: extracting a plurality of first features presenting features of two-dimensional joint positions of a subject, based on two-dimensional input images that are input from a plurality of cameras that capture images of the subject; generating, based on the first features, second feature group information containing a plurality of second features corresponding to a given number of joints of the subject, respectively; sensing an abnormal second feature from the second feature group information; and recognizing a 3D skeleton based on a result of integrating the second features that remain after removal of the abnormal second feature from the second feature group information, by using a processor.

Inventors

Tatsuya Suzuki
Yu Ishikawa

Assignees

FUJITSU LIMITED

Dates

Publication Date: 20260505
Application Date: 20230720

Claims (12)

1. A skeleton recognition method comprising: extracting a plurality of first features presenting features of two-dimensional joint positions of a subject, based on two-dimensional input images that are input from a plurality of cameras that capture images of the subject; generating, based on the first features, second feature group information containing a plurality of second features corresponding to a given number of joints of the subject, respectively; sensing an abnormal second feature from the second feature group information; and recognizing a 3D skeleton based on a result of integrating the second features that remain after removal of the abnormal second feature from the second feature group information, by using a processor, wherein the second feature is coordinates and heatmap information in which likelihood of presence of a given joint is associated with the coordinates, and the sensing senses an abnormal feature based on a difference between the heatmap information and information of Gaussian distribution of likelihood that is specified previously.
2. The skeleton recognition method according to claim 1, wherein the generating generates a plurality of sets of second feature group information in time series, and the sensing senses an abnormal second feature based on a first vector in which a given pair of joints that are specified based on previous second feature group information serves as a start point and an end point and a second vector in which a given pair of joints that are specified based on current second feature group information serves as a start point and an end point.
3. The skeleton recognition method according to claim 2, wherein the sensing senses an abnormal second feature based on a relationship between an area that is specified from a given joint based on the second feature group information and positions of joints other than the given joint.
4. The skeleton recognition method according to claim 1, wherein the sensing calculates a plurality of epipolar lines using camera positions as viewpoints based on the heatmap information and sensing an abnormal second feature based on a distance between an intersection of the epipolar lines and a position of a joint.
5. A non-transitory computer-readable recording medium having stored therein a skeleton recognition program that causes a computer to execute a process comprising: extracting a plurality of first features presenting features of two-dimensional joint positions of a subject, based on two-dimensional input images that are input from a plurality of cameras that capture images of a subject; generating, based on the first features, second feature group information containing a plurality of second features corresponding to a given number of joints of the subject, respectively; sensing an abnormal second feature from the second feature group information; and recognizing a 3D skeleton based on a result of integrating the second features that remain after removal of the abnormal second feature from the second feature group information, wherein the second feature is coordinates and heatmap information in which likelihood of presence of a given joint is associated with the coordinates, and the sensing senses an abnormal feature based on a difference between the heatmap information and information of Gaussian distribution of likelihood that is specified previously.
6. The non-transitory computer-readable recording medium according to claim 5, wherein the generating generates a plurality of sets of second feature group information in time series, and the sensing senses an abnormal second feature based on a first vector in which a given pair of joints that are specified based on previous second feature group information serves as a start point and an end point and a second vector in which a given pair of joints that are specified based on current second feature group information serves as a start point and an end point.
7. The non-transitory computer-readable recording medium according to claim 6, wherein the sensing senses an abnormal second feature based on a relationship between an area that is specified from a given joint based on the second feature group information and positions of joints other than the given joint.
8. The non-transitory computer-readable recording medium according to claim 5, wherein the sensing calculates a plurality of epipolar lines using camera positions as viewpoints based on the heatmap information and sensing an abnormal second feature based on a distance between an intersection of the epipolar lines and a position of a joint.
9. A gymnastics scoring assist system including a plurality of cameras that capture images of a subject and a skeleton recognition apparatus comprising: a processor configured to: acquire two-dimensional input images that are input from the cameras; extract a plurality of first features presenting features of two-dimensional joint positions of the subject based on the input images; generate, based on the first features, second feature group information containing a plurality of second features corresponding to a given number of joints of the subject, respectively; sense an abnormal second feature from the second feature group information; and recognize a 3D skeleton based on a result of synthesizing second features that remain after removal of the abnormal second feature from the second feature group information, wherein the second feature is coordinates and heatmap information in which likelihood of presence of a given joint is associated with the coordinates and the processor is further configured to sense an abnormal feature based on a difference between the heatmap information and information of Gaussian distribution of likelihood that is specified previously.
10. The gymnastics scoring assist system according to claim 9, wherein the processor is further configured to generate a plurality of sets of second feature group information in time series and sense an abnormal second feature based on a first vector in which a given pair of joints that are specified based on previous second feature group information serves as a start point and an end point and a second vector in which a given pair of joints that are specified based on current second feature group information serves as a start point and an end point.
11. The gymnastics scoring assist system according to claim 10, wherein the processor is further configured to sense an abnormal second feature based on a relationship between an area that is specified from a given joint based on the second feature group information and positions of joints other than the given joint.
12. The gymnastics scoring assist system according to claim 9, wherein the processor is further configured to calculate a plurality of epipolar lines using camera positions as viewpoints based on the heatmap information and sense an abnormal second feature based on a distance between an intersection of the epipolar lines and a position of a joint.

Description

This application is a continuation of International Application PCT/JP2021/009267, filed on Mar. 9, 2021 and designating the U.S., the entire contents of which are incorporated herein by reference.

FIELD

The present invention relates to a skeleton recognition method and the like.

BACKGROUND

For detection of three-dimensional human motion, a 3D sensing technique that detects 3D skeleton coordinates from a plurality of 3D laser sensors with an accuracy of ±1 cm has been established. The 3D sensing technique is expected to be applied to a gymnastics scoring assist system and to be extended to other sports and other fields. A method using 3D laser sensors is referred to as a laser method.

The laser method emits laser light about two million times per second and, based on the time of flight (ToF) of the laser, calculates depth and other information for each irradiated point, including points on the person who is the subject. The laser method can acquire accurate depth data; however, because the configuration and processing for laser scanning and ToF measurement are complicated, it has the disadvantage that the hardware is complicated and expensive.

3D skeleton recognition is sometimes performed by an image method instead of the laser method. The image method acquires RGB (Red Green Blue) data of each pixel using a CMOS (Complementary Metal Oxide Semiconductor) imager, so an inexpensive RGB camera is usable.

A conventional technique of 3D skeleton recognition using 2D features from a plurality of cameras will be described here. After acquiring 2D features with each camera according to a human-body model that is defined in advance, the conventional technique recognizes a 3D skeleton using a result of integrating the 2D features. For example, 2D skeleton information and heatmap information are used as the 2D features.

FIG. 22 is a diagram illustrating an example of a human-body model. As illustrated in FIG. 22, a human-body model M1 consists of 21 joints.
In the human-body model M1, each joint is represented by a node, and the numbers 0 to 20 are assigned to the nodes. The relationship between the node numbers and the joint names is presented in a table Te1. For example, the joint name corresponding to the node 0 is “SPINE BASE”. Description of the joint names corresponding to the nodes 1 to 20 will be omitted.

Conventional techniques include a method using triangulation and a method using machine learning. The method using triangulation includes triangulation using two cameras and triangulation using three or more cameras. For convenience, triangulation using two cameras is referred to as Conventional Technique 1, triangulation using three or more cameras as Conventional Technique 2, and the method using machine learning as Conventional Technique 3.

FIG. 23 is a diagram for explaining triangulation using two cameras. In Conventional Technique 1, triangulation is defined as a method of measuring a three-dimensional position of a subject P from a relationship of a triangle, using two cameras Ca1A and Ca1B. A camera image of the camera Ca1A is Im2A and a camera image of the camera Ca1B is Im2B. A 2D joint position of the subject P in the camera image Im2A is $p_l(x_l, y_l)$. A 2D joint position of the subject P in the camera image Im2B is $p_r(x_r, y_r)$. The distance between the cameras is $b$ and the focal length is $f$.

In Conventional Technique 1, the 2D joint positions $p_l(x_l, y_l)$ and $p_r(x_r, y_r)$ are the features, and a three-dimensional joint position $(X, Y, Z)$ is calculated by Equations (1), (2), and (3). The origin of $(X, Y, Z)$ is at an optical center of the two cameras Ca1A and Ca1B.

$$X = \frac{b\,(x_l + x_r)}{2\,(x_l - x_r)} \tag{1}$$

$$Y = \frac{b\,(y_l + y_r)}{2\,(x_l - x_r)} \tag{2}$$

$$Z = \frac{b f}{x_l - x_r} \tag{3}$$

According to Conventional Technique 1 described using FIG. 23, when incorrect 2D features are used to calculate a 3D skeleton, the accuracy of the 3D skeleton decreases.

FIG. 24 is a diagram for describing triangulation using three cameras.
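The two-camera relations in Equations (1) to (3) can be illustrated with a short sketch. This is not code from the patent: the function name `triangulate_stereo` and its parameter names are hypothetical, and the sketch assumes a rectified pair of parallel cameras with identical focal length, with image coordinates measured in pixels relative to each image's principal point. Under that assumption the equations place the origin midway between the two optical centers.

```python
def triangulate_stereo(p_l, p_r, baseline, focal_length):
    """Compute (X, Y, Z) from two 2D joint positions per Equations (1)-(3).

    p_l, p_r     : (x, y) joint positions in the two camera images, in pixels,
                   relative to the principal point (assumption).
    baseline     : distance b between the cameras (output uses the same unit).
    focal_length : focal length f, in pixels (assumption).
    """
    x_l, y_l = p_l
    x_r, y_r = p_r
    disparity = x_l - x_r
    if disparity == 0:
        # Equations (1)-(3) divide by (x_l - x_r); zero disparity has no finite solution.
        raise ValueError("zero disparity: depth is undefined")
    X = baseline * (x_l + x_r) / (2.0 * disparity)  # Equation (1)
    Y = baseline * (y_l + y_r) / (2.0 * disparity)  # Equation (2)
    Z = baseline * focal_length / disparity         # Equation (3)
    return X, Y, Z


if __name__ == "__main__":
    # Illustrative values only: b = 0.2 m, f = 1000 px.
    good = triangulate_stereo((110.0, 50.0), (90.0, 50.0), 0.2, 1000.0)
    # Same joint, but x_r is off by 5 px (an incorrect 2D feature).
    bad = triangulate_stereo((110.0, 50.0), (95.0, 50.0), 0.2, 1000.0)
    print("correct 2D features:  ", good)
    print("incorrect 2D feature: ", bad)
```

With these illustrative values, a 5-pixel error in one 2D joint position moves the computed depth from 10 to about 13.3 in the baseline's unit, which shows how strongly the result depends on each 2D feature being correct.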
In triangulation using three cameras, the triangulation described using FIG. 23 is extended to three or more cameras, and the best combination of cameras is estimated by an algorithm referred to as RANSAC (Random Sample Consensus). As illustrated in FIG. 24, the apparatus of Conventional Technique 2 acquires 2D joint positions of a subject using all cameras 1-1, 1-2, 1-3, and 1-4 (step S1). The apparatus of Conventional Technique 2 chooses a combination of two cameras from all the cameras 1-1 to 1-4 and calculates 3D joint positions by the triangulation described using FIG. 23 (step S2). The apparatus of Conventional Technique 2 re-projects the 3D skeleton onto all the cameras 1-1 to 1-4 and counts the number of cameras whose difference from the 2D joint positions is at or under a threshold (step S3). The apparatus of Conventional Technique 2 repeatedly executes the processing of steps S2 and S3 and employs, as the best combination of cameras, the combination of two cameras with which the number of cameras whose differences from the 2D joint positions are at or under the threshold is the