
US-12620193-B2 - Computer-readable recording medium storing image processing program, image processing apparatus, and image processing method

US12620193B2

Abstract

A method for automatically inferring a 2D scale range threshold to eliminate 2D object bounding boxes outside the target non-planar zone in a corresponding monocular image. The method leverages given camera parameters, 3D coordinates of the target non-planar zone vertices, and the real 3D scale of the target object. The process comprises three primary steps: 1) Deriving the real 3D object scale range from pre-acquired data, 2) Estimating the corresponding 2D bounding box scale range in the image using camera parameters, 3D coordinates of the target non-planar zone, and the real 3D scale range obtained in the first step, 3) Eliminating bounding boxes that fall outside the bounding box scale range.
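The three steps above can be sketched in code. This is a minimal illustration under a pinhole camera model, not the patent's actual implementation; all function names, the camera-matrix layout, and the use of box height as the 2D scale measure are assumptions made for the sketch.

```python
import numpy as np

def project(K, R, t, X):
    """Project a 3D world point X into pixel coordinates (pinhole model)."""
    x = K @ (R @ X + t)            # 3x3 intrinsics, 3x3 rotation, 3-vector translation
    return x[:2] / x[2]

def bbox_scale_range(K, R, t, zone_vertices, s_min, s_max):
    """Steps 1-2: map the real 3D scale range (s_min, s_max), in meters, to an
    admissible 2D bounding-box height range in pixels.

    The depth of each vertex of the target non-planar zone along the optical
    axis gives the nearest and farthest distances at which the target object
    can appear; the apparent pixel height of an object of size s at depth d
    is approximately f * s / d.
    """
    f = K[1, 1]                                    # focal length in pixels
    depths = [float((R @ X + t)[2]) for X in zone_vertices]
    d_min, d_max = min(depths), max(depths)        # nearest / farthest zone points
    h_min = f * s_min / d_max                      # smallest object, farthest away
    h_max = f * s_max / d_min                      # largest object, closest
    return h_min, h_max

def filter_boxes(boxes, h_min, h_max):
    """Step 3: keep only boxes (x1, y1, x2, y2) whose height is in range."""
    return [b for b in boxes if h_min <= (b[3] - b[1]) <= h_max]
```

For example, with a 1000-pixel focal length, zone depths between 5 m and 10 m, and a real scale range of 1.0-2.5 m, the admissible box height range is 100-500 pixels, so detections far outside the zone are rejected on size alone.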

Inventors

  • Fan Yang

Assignees

  • FUJITSU LIMITED

Dates

Publication Date
2026-05-05
Application Date
2024-03-15

Claims (15)

  1. A non-transitory computer-readable recording medium storing an image processing program causing a computer to execute processes of: extracting a bounding box of a region related to each of a plurality of objects from an image which is captured by an imaging apparatus; determining a threshold value based on a range of a space in which a target object moves and a position of the imaging apparatus; comparing a size of the region related to each of the plurality of objects with the threshold value; and extracting a bounding box of a region related to the target object from regions related to the plurality of objects based on a comparison result and performing skeleton recognition.
  2. The non-transitory computer-readable recording medium according to claim 1, wherein a process of determining the threshold value includes determining the threshold value based on the range of the space in which the target object moves, the position of the imaging apparatus, and a size of the target object.
  3. The non-transitory computer-readable recording medium according to claim 2, wherein a shape of the target object changes in accordance with a movement of the target object, and the processes further include determining the size of the target object based on statistical information regarding a frequency of occurrence of each shape of the plurality of objects.
  4. The non-transitory computer-readable recording medium according to claim 1, wherein the threshold value is a first threshold value indicating a lower limit of a size of the region related to the target object, the processes further include determining a second threshold value indicating an upper limit of the size of the region related to the target object based on the range of the space in which the target object moves and the position of the imaging apparatus, and a process of extracting the region related to the target object includes: comparing a size of the region related to each of the plurality of objects with the first threshold value; comparing the size of the region related to each of the plurality of objects with the second threshold value; and extracting, as the region related to the target object, a region having a size larger than the first threshold value and smaller than the second threshold value from among the regions related to each of the plurality of objects.
  5. The non-transitory computer-readable recording medium according to claim 1, wherein each of the plurality of objects is a person, and the target object is a person who performs a performance in the space in which the target object moves.
  6. An image processing apparatus comprising: a memory; and a processor coupled to the memory and configured to: extract a bounding box of a region related to each of a plurality of objects from an image which is captured by an imaging apparatus; determine a threshold value based on a range of a space in which a target object moves and a position of the imaging apparatus; compare a size of the region related to each of the plurality of objects with the threshold value; extract a bounding box of a region related to the target object from regions related to the plurality of objects based on a comparison result; and perform skeleton recognition.
  7. The image processing apparatus according to claim 6, wherein the processor is configured to determine the threshold value based on the range of the space in which the target object moves, the position of the imaging apparatus, and a size of the target object.
  8. The image processing apparatus according to claim 7, wherein a shape of the target object changes in accordance with a movement of the target object, and the processor determines the size of the target object based on statistical information regarding a frequency of occurrence of each shape of the plurality of objects.
  9. The image processing apparatus according to claim 6, wherein the threshold value is a first threshold value indicating a lower limit of a size of the region related to the target object, and the processor: determines a second threshold value indicating an upper limit of the size of the region related to the target object based on the range of the space in which the target object moves and the position of the imaging apparatus; compares a size of the region related to each of the plurality of objects with the first threshold value; compares the size of the region related to each of the plurality of objects with the second threshold value; and extracts, as the region related to the target object, a region having a size larger than the first threshold value and smaller than the second threshold value from among the regions related to each of the plurality of objects.
  10. The image processing apparatus according to claim 6, wherein each of the plurality of objects is a person, and the target object is a person who performs a performance in the space in which the target object moves.
  11. An image processing method which is executed by a computer, the method comprising: extracting a bounding box of a region related to each of a plurality of objects from an image which is captured by an imaging apparatus; determining a threshold value based on a range of a space in which a target object moves and a position of the imaging apparatus; comparing a size of the region related to each of the plurality of objects with the threshold value; and extracting a bounding box of a region related to the target object from regions related to the plurality of objects based on a comparison result and performing skeleton recognition.
  12. The image processing method according to claim 11, wherein a process of determining the threshold value includes determining the threshold value based on the range of the space in which the target object moves, the position of the imaging apparatus, and a size of the target object.
  13. The image processing method according to claim 12, wherein a shape of the target object changes in accordance with a movement of the target object, and the method further includes determining the size of the target object based on statistical information regarding a frequency of occurrence of each shape of the plurality of objects.
  14. The image processing method according to claim 11, wherein the threshold value is a first threshold value indicating a lower limit of a size of the region related to the target object, the method further includes determining a second threshold value indicating an upper limit of the size of the region related to the target object based on the range of the space in which the target object moves and the position of the imaging apparatus, and a process of extracting the region related to the target object includes: comparing a size of the region related to each of the plurality of objects with the first threshold value; comparing the size of the region related to each of the plurality of objects with the second threshold value; and extracting, as the region related to the target object, a region having a size larger than the first threshold value and smaller than the second threshold value from among the regions related to each of the plurality of objects.
  15. The image processing method according to claim 11, wherein each of the plurality of objects is a person, and the target object is a person who performs a performance in the space in which the target object moves.
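Claims 3, 8, and 13 determine the target object's size from statistical information about the frequency of occurrence of its shapes. A minimal sketch of one way to do this follows; the use of per-pose bounding extents and the specific percentile cutoffs are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def scale_range_from_poses(poses, lo_pct=1.0, hi_pct=99.0):
    """Derive a real 3D scale range from pre-obtained 3D poses.

    Each pose is an (N, 3) array of joint coordinates in meters. The per-pose
    scale is taken as the diagonal extent of the pose's axis-aligned bounding
    volume; the returned range covers the bulk of the observed distribution,
    trimming rare extreme shapes via percentile cutoffs.
    """
    extents = []
    for joints in poses:
        span = joints.max(axis=0) - joints.min(axis=0)   # extent per axis
        extents.append(float(np.linalg.norm(span)))      # diagonal extent
    return (float(np.percentile(extents, lo_pct)),
            float(np.percentile(extents, hi_pct)))
```

The resulting (minimum, maximum) pair plays the role of the real 3D scale range from which the 2D bounding-box threshold values are then derived.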

Description

CROSS-REFERENCE TO RELATED APPLICATION

This application is a continuation application of International Application PCT/JP2021/035727, filed on Sep. 28, 2021 and designating the U.S., the entire contents of which are incorporated herein by reference.

FIELD

The present disclosure relates to an image processing technique.

BACKGROUND

In relation to an image process, a technique is known in which a region surrounding an object such as a person in an image is extracted from the image captured by a camera. The region surrounding the object may be referred to as a bounding box.

Related art is disclosed in Japanese Laid-Open Patent Publication No. 2010-45501, Japanese Laid-Open Patent Publication No. 2011-209794, and U.S. Pat. No. 6,678,413. Related art is also disclosed in Shoichi Masui et al., "Practical application of gymnastic grading support system by 3D sensing and technique recognition technology", Journal of Information Technology Society of Japan, "Digital Practice Corner", Vol. 1, No. 1, October 2020; Z. Tang et al., "MOANA: An Online Learned Adaptive Appearance Model for Robust Multiple Object Tracking in 3D", IEEE Access, Vol. 7, 2019, pages 31934-31945; L. Citraro et al., "Real-Time Camera Pose Estimation for Sports Fields", arXiv.org, arXiv:2003.14109v1, March 2020, 12 pages; N. Homayounfar et al., "Sports Field Localization via Deep Structured Models", 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pages 4012-4020; and F. Yang et al., "Using Panoramic Videos for Multi-person Localization and Tracking in a 3D Panoramic Coordinate", arXiv.org, arXiv:1911.10535v5, March 2020, 5 pages.
SUMMARY

According to an aspect of the embodiments, a non-transitory computer-readable recording medium stores an image processing program causing a computer to execute processes of: deriving the real 3D object scale range from pre-acquired data; estimating the corresponding 2D bounding box scale range in the image using camera parameters, 3D coordinates of the target non-planar zone, and the real 3D scale range obtained in the first step; and eliminating bounding boxes that fall outside the bounding box scale range.

The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a flowchart of a method of a comparative example 1;
FIG. 2 is a flowchart of a method of a comparative example 2;
FIG. 3 is a functional block diagram of an image processing apparatus;
FIG. 4 is a flowchart of an image process;
FIG. 5 is a block diagram of a multi-viewpoint image processing system;
FIG. 6 is a functional block diagram of an image processing apparatus in the multi-viewpoint image processing system;
FIGS. 7A to 7C are diagrams illustrating a target 3D space for a gymnastic performance;
FIG. 8 is a diagram illustrating a bounding box for the target gymnast;
FIGS. 9A and 9B are diagrams illustrating the target space, the target bounding box, and the non-target bounding boxes;
FIGS. 10A to 10E are diagrams illustrating the real 3D gymnast scales with various poses;
FIGS. 11A and 11B are diagrams illustrating the scale range that is estimated from the statistical distribution of pre-obtained 3D gymnast poses;
FIGS. 12A and 12B are diagrams illustrating that D_max and D_min can be estimated by employing the camera parameters and the 3D coordinates of the target non-planar zone;
FIGS. 13A and 13B are diagrams illustrating that most of the non-target bounding boxes that are out of the target 3D space are filtered out;
FIGS. 14A to 14D are diagrams illustrating that most of the non-target bounding boxes that are out of the target 3D space are filtered out;
FIGS. 15A to 15D are diagrams illustrating that most of the non-target bounding boxes that are out of the target 3D space are filtered out;
FIG. 16 is a flowchart of an image process in the multi-viewpoint image processing system;
FIGS. 17A and 17B are diagrams illustrating that the proposed method can be applied in other tasks to exclude persons that are out of the target 3D space;
FIGS. 18A and 18B are diagrams illustrating that the proposed method can be applied in other tasks to exclude objects (e.g., a basketball) that are out of the target 3D space; and
FIG. 19 is a hardware configuration diagram of an information processing apparatus.

DESCRIPTION OF EMBODIMENTS

When a bounding box is set as a region for performing skeleton recognition of a human body, a gymnastic grading support system constituted by the skeleton recognition and a technique recognition has been attracting attention as an application field thereof. As a technique related to a sport, an adaptive appearance model for tracking a three-dimensional robust object, a camera pose estimation technique for a sport stadium, an