US-12620121-B2 - Method of recognizing position and attitude of object, and non-transitory computer-readable storage medium


Abstract

A method of the present disclosure includes (a) generating an input image by imaging a scene containing the M objects by a camera, (b) obtaining a feature map showing feature amounts relating to the N keypoints from the input image using a learned machine learning model with the input image as input and the feature map as output, (c) obtaining three-dimensional coordinates of the N keypoints belonging to each of the M objects using the feature map, and (d) determining positions and attitudes of one or more objects of the M objects using the three-dimensional coordinates of the N keypoints belonging to each of the M objects, wherein (c) includes (c1) obtaining M×N keypoints having undetermined correspondence relationships with the M objects and determining the three-dimensional coordinates of the M×N keypoints, and (c2) grouping the M×N keypoints to the N keypoints belonging to each of the M objects.
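
Step (d) can be pictured with a small sketch. This excerpt does not say how the position and attitude are computed from the three-dimensional keypoint coordinates, so the construction below is only an assumption made for illustration: the position is taken as the centroid of three keypoints, and the attitude as an orthonormal frame built from them. The function name `pose_from_keypoints` and the helper names are hypothetical.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _normalize(v):
    n = math.sqrt(v[0] ** 2 + v[1] ** 2 + v[2] ** 2)
    if n == 0.0:
        raise ValueError("degenerate keypoint configuration")
    return (v[0] / n, v[1] / n, v[2] / n)

def pose_from_keypoints(k1, k2, k3):
    """Illustrative only (not the patented procedure).

    Assumes three non-collinear 3D keypoints of one object.
    Returns (position, (x_axis, y_axis, z_axis)).
    """
    x_axis = _normalize(_sub(k2, k1))                  # first axis along k1 -> k2
    z_axis = _normalize(_cross(x_axis, _sub(k3, k1)))  # normal of the keypoint plane
    y_axis = _cross(z_axis, x_axis)                    # completes a right-handed frame
    position = tuple(sum(c) / 3.0 for c in zip(k1, k2, k3))  # centroid
    return position, (x_axis, y_axis, z_axis)
```

With more than three keypoints, a least-squares rigid alignment against known model keypoints would be a common alternative; that is likewise not stated in this excerpt.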

Inventors

Masaki Hayashi
Hirokazu Kasahara
Guoyi Fu
Zhongzhen LUO

Assignees

SEIKO EPSON CORPORATION

Dates

Publication Date: 20260505
Application Date: 20230828
Priority Date: 20220829

Claims (2)

1. A method of recognizing a position and an attitude of an object using first to Nth N keypoints set for the object, M being an integer of 1 or more and N being an integer of 2 or more, comprising:
  (a) generating an input image by imaging a scene containing the M objects by a camera;
  (b) obtaining a feature map showing feature amounts relating to the N keypoints from the input image using a learned machine learning model with the input image as input and the feature map as output;
  (c) obtaining three-dimensional coordinates of the N keypoints belonging to each of the M objects using the feature map; and
  (d) determining positions and attitudes of one or more objects of the M objects using the three-dimensional coordinates of the N keypoints belonging to each of the M objects,
  wherein (c) includes:
    (c1) obtaining M×N keypoints having undetermined correspondence relationships with the M objects and determining the three-dimensional coordinates of the M×N keypoints; and
    (c2) grouping the M×N keypoints to the N keypoints belonging to each of the M objects,
  the feature map used at (c2) contains N directional vector maps as maps in which vectors indicating directions from a plurality of pixels belonging to a same object to an object keypoint are assigned to the plurality of pixels with each of the N keypoints as the object keypoint, and (c2) includes:
    (c2-1) selecting one ith keypoint from M ith keypoints and selecting one jth keypoint from M jth keypoints;
    (c2-2) calculating a first degree of conformance indicating a degree of coincidence of directions of a first vector obtained from a jth directional vector map and indicating a direction from a pixel position of the ith keypoint toward the jth keypoint and a second vector indicating a direction from a pixel position expressed by the three-dimensional coordinates of the ith keypoint to a pixel position expressed by the three-dimensional coordinates of the jth keypoint, i and j being integers from 1 to N different from each other; and
    (c2-3) repeating (c2-1) and (c2-2) and performing the grouping of the M×N keypoints according to the first degree of conformance,
  wherein (c2-2) further includes:
    (2a) calculating a second degree of conformance indicating a degree of coincidence of directions of a third vector obtained from an ith directional vector map and indicating a direction from a pixel position of the jth keypoint toward the ith keypoint and a fourth vector indicating a direction from a pixel position expressed by the three-dimensional coordinates of the jth keypoint to a pixel position expressed by the three-dimensional coordinates of the ith keypoint; and
    (2b) calculating an integrated degree of conformance by integration of the first degree of conformance and the second degree of conformance,
  and (c2-3) further executes the grouping according to the integrated degree of conformance,
  wherein the feature map used at (c2) further contains a field map showing whether pixels belong to a same object, and (c2-3) further includes:
    (3a) estimating that the ith keypoint and the jth keypoint do not belong to a same object when the integrated degree of conformance is lower than a threshold;
    (3b) estimating whether the ith keypoint and the jth keypoint belong to a same object using the field map when the integrated degree of conformance is equal to or higher than the threshold;
    (3c) adjusting the integrated degree of conformance to a first value when estimated that the ith keypoint and the jth keypoint do not belong to a same object and adjusting the integrated degree of conformance to a second value higher than the first value when estimated that the ith keypoint and the jth keypoint belong to a same object;
    (3d) selecting one arbitrary keypoint set including N keypoints from the first keypoint to the Nth keypoint from the M×N keypoints;
    (3e) calculating a set degree of conformance for the keypoint set by adding the integrated degrees of conformance for N(N−1)/2 keypoint pairs respectively formed by two arbitrary keypoints contained in the keypoint set;
    (3f) repeating (3d), (3e) and obtaining the set degrees of conformance for a plurality of the keypoint sets; and
    (3g) settling the grouping relating to the keypoint set in descending order of the set degree of conformance.
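
For illustration only, and not as part of the claim language, steps (c2-2), (2a) and (2b) can be sketched as follows. The claim does not fix how the "degree of coincidence of directions" is measured or how the two degrees are integrated; the sketch assumes cosine similarity for the former and a plain average for the latter. It also assumes that each keypoint has already been projected to a pixel position `(x, y)` and that each directional vector map is a nested list indexed as `dir_map[y][x]`. All names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two 2D vectors (0.0 if either is zero)."""
    nu = math.hypot(u[0], u[1])
    nv = math.hypot(v[0], v[1])
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return (u[0] * v[0] + u[1] * v[1]) / (nu * nv)

def integrated_conformance(px_i, px_j, dir_map_i, dir_map_j):
    """Assumed scoring for one candidate pair (ith keypoint, jth keypoint).

    px_i, px_j : (x, y) pixel positions of the two candidates.
    dir_map_k  : dir_map_k[y][x] is the 2D vector stored at that pixel,
                 pointing toward keypoint k of the same object.
    """
    xi, yi = px_i
    xj, yj = px_j

    # (c2-2) first vector from the jth map at the ith keypoint,
    # second vector from the geometry of the two candidates.
    first_vec = dir_map_j[yi][xi]
    second_vec = (xj - xi, yj - yi)
    first_degree = cosine(first_vec, second_vec)

    # (2a) same comparison in the opposite direction.
    third_vec = dir_map_i[yj][xj]
    fourth_vec = (xi - xj, yi - yj)
    second_degree = cosine(third_vec, fourth_vec)

    # (2b) integration, assumed here to be the arithmetic mean.
    return 0.5 * (first_degree + second_degree)
```

Under these assumptions the score lies in [-1, 1], and a pair whose map vectors both agree with the geometric directions scores close to 1.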
2. A non-transitory computer-readable storage medium storing a computer program for controlling a processor to execute processing of recognizing a position and an attitude of an object using first to Nth N keypoints set for the object, M being an integer of 1 or more and N being an integer of 2 or more, the computer program for controlling the processor to execute:
  (a) processing of generating an input image by imaging a scene containing M objects by a camera;
  (b) processing of obtaining a feature map showing feature amounts relating to the N keypoints from the input image using a learned machine learning model with the input image as input and the feature map as output;
  (c) processing of obtaining three-dimensional coordinates of the N keypoints belonging to each of the M objects using the feature map; and
  (d) processing of determining positions and attitudes of one or more objects of the M objects using the three-dimensional coordinates of the N keypoints belonging to each of the M objects,
  wherein (c) includes:
    (c1) processing of obtaining M×N keypoints having undetermined correspondence relationships with the M objects and determining the three-dimensional coordinates of the M×N keypoints; and
    (c2) processing of grouping the M×N keypoints to the N keypoints belonging to each of the M objects,
  the feature map used at (c2) contains N directional vector maps as maps in which vectors indicating directions from a plurality of pixels belonging to a same object to an object keypoint are assigned to the plurality of pixels with each of the N keypoints as the object keypoint, and (c2) includes:
    (c2-1) processing of selecting one ith keypoint from M ith keypoints and selecting one jth keypoint from M jth keypoints;
    (c2-2) processing of calculating a first degree of conformance indicating a degree of coincidence of directions of a first vector obtained from a jth directional vector map and indicating a direction from a pixel position of the ith keypoint toward the jth keypoint and a second vector indicating a direction from a pixel position expressed by the three-dimensional coordinates of the ith keypoint to a pixel position expressed by the three-dimensional coordinates of the jth keypoint, i and j being integers from 1 to N different from each other; and
    (c2-3) processing of repeating (c2-1) and (c2-2) and performing the grouping of the M×N keypoints according to the first degree of conformance,
  wherein (c2-2) further includes:
    (2a) calculating a second degree of conformance indicating a degree of coincidence of directions of a third vector obtained from an ith directional vector map and indicating a direction from a pixel position of the jth keypoint toward the ith keypoint and a fourth vector indicating a direction from a pixel position expressed by the three-dimensional coordinates of the jth keypoint to a pixel position expressed by the three-dimensional coordinates of the ith keypoint; and
    (2b) calculating an integrated degree of conformance by integration of the first degree of conformance and the second degree of conformance,
  and (c2-3) further executes the grouping according to the integrated degree of conformance,
  wherein the feature map used at (c2) further contains a field map showing whether pixels belong to a same object, and (c2-3) further includes:
    (3a) estimating that the ith keypoint and the jth keypoint do not belong to a same object when the integrated degree of conformance is lower than a threshold;
    (3b) estimating whether the ith keypoint and the jth keypoint belong to a same object using the field map when the integrated degree of conformance is equal to or higher than the threshold;
    (3c) adjusting the integrated degree of conformance to a first value when estimated that the ith keypoint and the jth keypoint do not belong to a same object and adjusting the integrated degree of conformance to a second value higher than the first value when estimated that the ith keypoint and the jth keypoint belong to a same object;
    (3d) selecting one arbitrary keypoint set including N keypoints from the first keypoint to the Nth keypoint from the M×N keypoints;
    (3e) calculating a set degree of conformance for the keypoint set by adding the integrated degrees of conformance for N(N−1)/2 keypoint pairs respectively formed by two arbitrary keypoints contained in the keypoint set;
    (3f) repeating (3d), (3e) and obtaining the set degrees of conformance for a plurality of the keypoint sets; and
    (3g) settling the grouping relating to the keypoint set in descending order of the set degree of conformance.
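
For illustration only, and not as part of the claim language, steps (3a) to (3g) can be sketched as below. Several details are assumptions, since the claims leave them open: the threshold of 0.5, the first and second values of 0.0 and 1.0, exhaustive enumeration of all candidate sets (the claims only require a plurality of sets), and a greedy rule that accepts a set only if none of its keypoints is already used. The field-map lookup is reduced to a boolean argument, and all names are hypothetical.

```python
from itertools import combinations, product

def adjust(score, same_in_field, threshold=0.5, first_value=0.0, second_value=1.0):
    """(3a)-(3c): snap an integrated degree of conformance to one of two values.

    same_in_field is the (assumed precomputed) answer from the field map.
    """
    if score < threshold:          # (3a) below threshold: not the same object
        return first_value
    # (3b) at or above threshold: defer to the field map, then (3c) adjust
    return second_value if same_in_field else first_value

def group_keypoints(candidates, adjusted):
    """(3d)-(3g): score candidate sets and settle them greedily.

    candidates : list of N lists; candidates[k] holds the ids of the M
                 candidates for the (k+1)th keypoint.
    adjusted   : dict mapping frozenset({id_a, id_b}) to the adjusted
                 integrated degree of conformance of that pair.
    """
    scored = []
    for kp_set in product(*candidates):                  # (3d), (3f)
        set_degree = sum(adjusted[frozenset(pair)]       # (3e): N(N-1)/2 pairs
                         for pair in combinations(kp_set, 2))
        scored.append((set_degree, kp_set))

    scored.sort(key=lambda item: item[0], reverse=True)  # (3g) descending order

    used, groups = set(), []
    for _, kp_set in scored:
        if used.isdisjoint(kp_set):   # assumed rule: each keypoint used once
            groups.append(kp_set)
            used.update(kp_set)
    return groups
```

Exhaustive enumeration visits M^N sets, which is only practical for small M and N; a real implementation would presumably prune pairs already adjusted to the first value.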

Description

The present application is based on, and claims priority from JP Application Serial Number 2022-135489, filed Aug. 29, 2022, the disclosure of which is hereby incorporated by reference herein in its entirety.

BACKGROUND

1. Technical Field

The present disclosure relates to a method of recognizing a position and an attitude of an object, and a non-transitory computer-readable storage medium.

2. Related Art

"Feedback Control for Category-Level Robotic Manipulation" (IEEE Robotics and Automation Letters, arXiv:2102.06279v1) and "An Affordance Keypoint Detection Network for Robot Manipulation" (IEEE Robotics and Automation Letters, Volume 6, Issue 2, April 2021) disclose techniques of recognizing a position and an attitude of an object by using a neural network to estimate, from an image of the object, a plurality of characteristic keypoints preset for the object.

With the techniques disclosed in these two papers, the position and the attitude of the object may be recognized with some robustness to changes in the image. However, to train the neural network so that it recognizes the object robustly even when conditions such as the shape of the object and its surrounding environment vary, it is necessary to prepare vast amounts of learning data covering combinations of those conditions. Further, manually attached correct labels vary from person to person, so errors are introduced into the correct labels and recognition accuracy degrades. Furthermore, the related art assumes a single object and therefore cannot recognize the positions and attitudes of a plurality of objects. Accordingly, it is desired to solve at least part of these problems.
SUMMARY

According to a first aspect of the present disclosure, a method of learning a machine learning model used for recognition of a position and an attitude of an object imaged by a camera using a plurality of keypoints set for the object is provided. The method includes (a) generating a plurality of learning object models in which at least part of a shape and a surface property of the object is changed using basic shape data of the object, (b) generating a plurality of scenes in which part or all of the plurality of learning object models are placed in an environment in which the object is to be placed by simulations and generating a plurality of simulation images which are to be obtained by imaging of the respective plurality of scenes by the camera, (c) generating a correct feature map showing correct values of feature amounts relating to the plurality of keypoints to correspond to each of the plurality of simulation images, and (d) learning the machine learning model for estimation of a feature map from an input image captured by the camera using the plurality of simulation images and a plurality of the correct feature maps as teacher data.

According to a second aspect of the present disclosure, a non-transitory computer-readable storage medium storing a computer program for controlling a processor to execute processing of learning a machine learning model used for recognition of a position and an attitude of an object imaged by a camera using a plurality of keypoints set for the object is provided. The computer program is for controlling the processor to execute (a) processing of generating a plurality of learning object models in which at least part of a shape and a surface property of the object is changed using basic shape data of the object, (b) processing of generating a plurality of scenes in which part or all of the plurality of learning object models are placed in an environment in which the object is to be placed by simulations and generating a plurality of simulation images which are to be obtained by imaging of the respective plurality of scenes by the camera, (c) processing of generating a correct feature map showing correct values of feature amounts relating to the plurality of keypoints to correspond to each of the plurality of simulation images, and (d) processing of learning the machine learning model for estimation of a feature map from an input image captured by the camera using the plurality of simulation images and a plurality of the correct feature maps as teacher data.

According to a third aspect of the present disclosure, a method of recognizing a position and an attitude of an object using first to Nth N keypoints set for the object, M being an integer of 1 or more and N being an integer of 2 or more, is provided. The method includes (a) generating an input image by imaging a scene containing the M objects by a camera, (b) obtaining a feature map showing feature amounts relating to the N keypoints from the input image using a learned machine learning model with the input image as input and the feature map as output, (c) obtaining three-dimensional coordina