US-12620109-B2 - Learning reliable keypoints in situ with introspective self-supervision

Abstract

An apparatus to facilitate learning reliable keypoints in situ with introspective self-supervision is disclosed. The apparatus includes one or more processors to provide a view-overlapped keyframe pair from a pose graph that is generated by a visual simultaneous localization and mapping (VSLAM) process executed by the one or more processors; determine a keypoint match from the view-overlapped keyframe pair based on a keypoint detection and matching process, the keypoint match corresponding to a keypoint; calculate an inverse reliability score based on matched pixels corresponding to the keypoint match in the view-overlapped keyframe pair; identify a supervision signal associated with the keypoint match, the supervision signal comprising a keypoint reliability score of the keypoint based on a final pose output of the VSLAM process; and train a keypoint detection neural network using the keypoint match, the inverse reliability score, and the keypoint reliability score.

Inventors

Xuesong Shi
Sangeeta Manepalli
Rita Chattopadhyay
Peng Wang
Yimin Zhang

Assignees

INTEL CORPORATION

Dates

Publication Date: 2026-05-05
Application Date: 2021-09-23

Claims (17)

1 . An apparatus comprising: one or more processors to: provide a view-overlapped keyframe pair from a pose graph that is generated by a visual simultaneous localization and mapping (VSLAM) process executed by the one or more processors, wherein the view-overlapped keyframe pair comprises a matched pixel set (p, p′); determine a keypoint match from the view-overlapped keyframe pair based on a keypoint detection and matching process, the keypoint match corresponding to a keypoint; calculate an inverse reliability score based on matched pixels corresponding to the keypoint match in the view-overlapped keyframe pair, wherein the inverse reliability score comprises a pixel distance between p′ and an epipolar line for p; identify a supervision signal associated with the keypoint match, the supervision signal comprising a keypoint reliability score of the keypoint based on a final pose output of the VSLAM process; and train a keypoint detection neural network using the keypoint match, the inverse reliability score, and the keypoint reliability score.
2 . The apparatus of claim 1 , wherein the view-overlapped keyframe pair comprises a pair of image frames captured by a camera, and wherein the keypoint match corresponds to the keypoint that is present in each of the image frames in the view-overlapped keyframe pair.
3 . The apparatus of claim 2 , wherein the keypoint comprises a landmark in a scene of the view-overlapped keyframe pair.
4 . The apparatus of claim 1 , wherein the keypoint detection neural network comprises a convolutional neural network (CNN).
5 . The apparatus of claim 1 , wherein the keypoint reliability score is based on a comparison of coordinates of the keypoint in the final pose output generated by the VSLAM process to saved coordinates for a scene of the view-overlapped keyframe pair.
6 . The apparatus of claim 1 , wherein the one or more processors are further to: regress the inverse reliability score into a regressed inverse reliability score; train a separate head of the keypoint detection neural network with the regressed inverse reliability score; and combine the regressed inverse reliability score with the keypoint reliability score to obtain a final keypoint reliability score.
7 . The apparatus of claim 1 , wherein the apparatus comprises a robot utilizing the VSLAM process for localization of the robot.
8 . The apparatus of claim 1 , wherein the one or more processors comprise one or more of a graphics processor, an application processor, and another processor, wherein the one or more processors are co-located on a common semiconductor package.
9 . A non-transitory computer-readable storage medium having stored thereon executable computer program instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising: providing a view-overlapped keyframe pair from a pose graph that is generated by a visual simultaneous localization and mapping (VSLAM) process executed by the one or more processors, wherein the view-overlapped keyframe pair comprises a matched pixel set (p, p′); determining a keypoint match from the view-overlapped keyframe pair based on a keypoint detection and matching process, the keypoint match corresponding to a keypoint; calculating an inverse reliability score based on matched pixels corresponding to the keypoint match in the view-overlapped keyframe pair, wherein the inverse reliability score comprises a pixel distance between p′ and an epipolar line for p; identifying a supervision signal associated with the keypoint match, the supervision signal comprising a keypoint reliability score of the keypoint based on a final pose output of the VSLAM process; and training a keypoint detection neural network using the keypoint match, the inverse reliability score, and the keypoint reliability score.
10 . The non-transitory computer-readable storage medium of claim 9 , wherein the view-overlapped keyframe pair comprises a pair of image frames captured by a camera, and wherein the keypoint match corresponds to the keypoint that is present in each of the image frames in the view-overlapped keyframe pair.
11 . The non-transitory computer-readable storage medium of claim 9 , wherein the keypoint detection neural network comprises a convolutional neural network (CNN).
12 . The non-transitory computer-readable storage medium of claim 9 , wherein the keypoint reliability score is based on a comparison of coordinates of the keypoint in the final pose output generated by the VSLAM process to saved coordinates for a scene of the view-overlapped keyframe pair.
13 . The non-transitory computer-readable storage medium of claim 9 , wherein the operations further comprise: regressing the inverse reliability score into a regressed inverse reliability score; training a separate head of the keypoint detection neural network with the regressed inverse reliability score; and combining the regressed inverse reliability score with the keypoint reliability score to obtain a final keypoint reliability score to utilize during an inference stage of the keypoint detection neural network.
14 . A method for facilitating learning reliable keypoints in situ with introspective self-supervision, the method comprising: providing a view-overlapped keyframe pair from a pose graph that is generated by a visual simultaneous localization and mapping (VSLAM) process executed by one or more processors, wherein the view-overlapped keyframe pair comprises a matched pixel set (p, p′); determining a keypoint match from the view-overlapped keyframe pair based on a keypoint detection and matching process, the keypoint match corresponding to a keypoint; calculating an inverse reliability score based on matched pixels corresponding to the keypoint match in the view-overlapped keyframe pair, wherein the inverse reliability score comprises a pixel distance between p′ and an epipolar line for p; identifying a supervision signal associated with the keypoint match, the supervision signal comprising a keypoint reliability score of the keypoint based on a final pose output of the VSLAM process; and training a keypoint detection neural network using the keypoint match, the inverse reliability score, and the keypoint reliability score.
15 . The method of claim 14 , wherein the view-overlapped keyframe pair comprises a pair of image frames captured by a camera, and wherein the keypoint match corresponds to the keypoint that is present in each of the image frames in the view-overlapped keyframe pair.
16 . The method of claim 14 , wherein the keypoint detection neural network comprises a convolutional neural network (CNN).
17 . The method of claim 14 , wherein the keypoint reliability score is based on a comparison of coordinates of the keypoint in the final pose output generated by the VSLAM process to saved coordinates for a scene of the view-overlapped keyframe pair.
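
Claims 1, 9, and 14 define the inverse reliability score as a pixel distance between p′ and an epipolar line for p. The following is a minimal sketch of that distance, not the claimed implementation. It assumes a 3×3 fundamental matrix F relating the two keyframes is already available (for example, derived from the relative pose in the pose graph and the camera intrinsics; the source does not specify how it is obtained). The function names and the example matrix are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence, Tuple

Matrix3 = Sequence[Sequence[float]]   # 3x3 fundamental matrix (assumed given)
Pixel = Tuple[float, float]           # (column, row) pixel coordinates


def epipolar_line(F: Matrix3, p: Pixel) -> Tuple[float, float, float]:
    """Return the epipolar line for p in the second keyframe.

    The line is l' = F * [x, y, 1]^T, expressed as coefficients (a, b, c)
    of a*u + b*v + c = 0 in the second image.
    """
    x, y = p
    a, b, c = (row[0] * x + row[1] * y + row[2] for row in F)
    return a, b, c


def inverse_reliability(F: Matrix3, p: Pixel, p_prime: Pixel) -> float:
    """Pixel distance between p' and the epipolar line for p.

    A larger distance means the match (p, p') agrees less with the
    two-view geometry, hence "inverse" reliability.
    """
    a, b, c = epipolar_line(F, p)
    norm = math.hypot(a, b)
    if norm == 0.0:
        raise ValueError("degenerate epipolar line (a = b = 0)")
    u, v = p_prime
    return abs(a * u + b * v + c) / norm


if __name__ == "__main__":
    # Hypothetical F for a pure sideways camera translation with identity
    # intrinsics: the epipolar line for p = (x, y) is the image row v = y.
    F_example = [
        [0.0, 0.0, 0.0],
        [0.0, 0.0, -1.0],
        [0.0, 1.0, 0.0],
    ]
    # p' lies 3 pixels below the row through p.
    print(inverse_reliability(F_example, (10.0, 20.0), (35.0, 23.0)))  # 3.0
```

In this sketch, a match whose p′ lies exactly on the epipolar line scores 0, and the score grows with the geometric inconsistency of the match.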

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a 35 U.S.C. § 371 National Stage Patent Application of International Patent Application PCT/CN2021/119869, filed on Sep. 23, 2021.

FIELD

Embodiments relate generally to data processing and more particularly to learning reliable keypoints in situ with introspective self-supervision.

BACKGROUND OF THE DESCRIPTION

Neural networks and other types of machine learning models are useful tools that have demonstrated their value in solving complex problems regarding pattern recognition, natural language processing, automatic speech recognition, etc. Neural networks operate using artificial neurons arranged into one or more layers that process data from an input layer to an output layer, applying weighting values to the data during the processing of the data. Such weighting values are determined during a training process and applied during an inference process.

One example application for machine learning models is in the technology of autonomous robot localization. One technique used in robot localization is visual simultaneous localization and mapping (VSLAM). When implementing VSLAM, the robustness and accuracy of the robot's navigation are based on the number and quality of matchable keypoints from each image. Recent research shows that deep learning (DL)-based keypoint detection outperforms traditional keypoint detection techniques for VSLAM. Another advantage of DL-based detection is that it can be tuned to the specific environment in which the robots are working. As such, creating supervision for the training with unlabeled images, especially when the robot is deployed, would contribute to the performance of DL-based keypoint detection used in autonomous robot localization.
BRIEF DESCRIPTION OF THE DRAWINGS

So that the manner in which the above recited features of the present embodiments can be understood in detail, a more particular description of the embodiments, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate typical embodiments and are therefore not to be considered limiting of its scope. The figures are not to scale. In general, the same reference numbers are used throughout the drawing(s) and accompanying written description to refer to the same or like parts.

FIG. 1 is a block diagram of an example computing system that may be used to provide learning reliable keypoints in situ with introspective self-supervision, according to implementations of the disclosure.
FIG. 2 illustrates a machine learning software stack, according to an embodiment.
FIGS. 3A-3B illustrate layers of example deep neural networks.
FIG. 4 illustrates an example recurrent neural network.
FIG. 5 illustrates training and deployment of a deep neural network.
FIG. 6 is a block diagram depicting an example autonomous navigational system for learning reliable keypoints in situ with introspective self-supervision, according to implementations of the disclosure.
FIG. 7 is a block diagram depicting a training system executed on a cloud or edge server for learning reliable keypoints in situ with introspective self-supervision, in accordance with implementations of the disclosure.
FIG. 8 is a block diagram illustrating a schematic of a keyframe image pair used for inverse reliability score determination, in accordance with implementations of the disclosure.
FIG. 9 is a flowchart representative of machine-readable instructions which may be executed to implement learning reliable keypoints in situ with introspective self-supervision, in accordance with implementations of the disclosure.
FIG. 10 is a flow diagram illustrating an embodiment of a method for training a neural network for learning reliable keypoints in situ with introspective self-supervision, in accordance with implementations herein.
FIG. 11 is a flow diagram illustrating an embodiment of a method for inference using a neural network trained for learning reliable keypoints in situ with introspective self-supervision, in accordance with implementations herein.
FIG. 12 is a schematic diagram of an illustrative electronic computing device to enable learning reliable keypoints in situ with introspective self-supervision, according to some embodiments.

DETAILED DESCRIPTION

Implementations of the disclosure describe learning reliable keypoints in situ with introspective self-supervision. In computer engineering, computing architecture is a set of rules and methods that describe the functionality, organization, and implementation of computer systems. Today's computing systems are expected to deliver near zero-wait responsiveness and superb performance while taking on large workloads for execution. Therefore, computing architectures have continually changed (e.g., improved) to accommodate demanding workloads and increased performance expectations. Examples of large workloads include neural net