US-12626489-B2 - System and method using pyramidal and uniqueness matching priors for identifying correspondences between images
Abstract
A computer-implemented method includes: obtaining a pair of images depicting a same scene, the pair of images including a first image with a first pixel grid and a second image with a second pixel grid, the first pixel grid different than the second pixel grid; by a neural network module having a first set of parameters: generating a first feature map based on the first image; and generating a second feature map based on the second image; determining a first correlation volume based on the first and second feature maps; iteratively determining a second correlation volume based on the first correlation volume; determining a loss for the first and second feature maps based on the second correlation volume; generating a second set of the parameters based on minimizing a loss function using the loss; and updating the neural network module to include the second set of parameters.
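The first correlation volume in the abstract can be pictured concretely. Below is a minimal sketch in PyTorch (an assumed framework; the patent does not prescribe one) that correlates every descriptor of one feature map with every descriptor of the other; the cosine-style normalization is an assumption, since the claims only require that each correlation be derived from the two pixels' descriptors.

```python
import torch
import torch.nn.functional as F

def first_correlation_volume(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    """Correlate each pixel of the first grid with each pixel of the second.

    f1: (B, D, H1, W1) feature map of the first image (one D-dim
        descriptor per pixel); f2: (B, D, H2, W2) for the second image.
    Returns a (B, H1, W1, H2, W2) correlation volume.
    """
    f1 = F.normalize(f1, dim=1)  # assumed: cosine similarity as correlation
    f2 = F.normalize(f2, dim=1)
    return torch.einsum('bdhw,bdkl->bhwkl', f1, f2)
```

Note that the volume grows with the product of both grids' sizes, which is why claims 13-16 below subsample the first grid before correlating.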
Inventors
- Jérome REVAUD
- Vincent Leroy
- Philippe Weinzaepfel
- Boris Chidlovskii
Assignees
- NAVER CORPORATION
Dates
- Publication Date: 2026-05-12
- Application Date: 2023-01-30
- Priority Date: 2022-03-28
Claims (20)
- 1 . A computer-implemented method comprising: obtaining a pair of images depicting a same scene, the pair of images including a first image with a first pixel grid and a second image with a second pixel grid, wherein the first pixel grid is different than the second pixel grid; by a neural network module having a first set of parameters: generating a first feature map based on the first image; and generating a second feature map based on the second image, the first feature map including a first grid of image descriptors and the second feature map including a second grid of image descriptors, wherein each image descriptor in the first grid corresponds to a respective pixel within the first pixel grid and each local image descriptor in the second grid corresponds to a respective pixel within the second pixel grid; determining a first correlation volume based on the first and second feature maps, wherein the first correlation volume includes correlations of (a) pixels of the first pixel grid with (b) pixels of the second pixel grid, wherein each correlation between a pixel of the first pixel grid and a pixel of the second pixel grid is determined based on the image descriptors corresponding to the correlated pixels; iteratively determining a second correlation volume based on the first correlation volume; determining a loss for the first and second feature maps based on the second correlation volume; generating a second set of the parameters for the neural network module based on minimizing a loss function using the loss; and updating the neural network module to include the second set of parameters thereby generating a trained neural network module.
- 2 . The method of claim 1 , further comprising: by the trained neural network module: generating a third feature map based on a third image, the third feature map including a third grid of image descriptors; and generating a fourth feature map based on a fourth image, the fourth feature map including a fourth grid of image descriptors; and based on the third and fourth grids, identifying a first portion of the third image that corresponds to a second portion of the fourth image.
- 3 . The method of claim 1 wherein the second image is a synthetic version of the first image generated via data augmentation, and wherein iteratively determining the second correlation volume includes determining the second correlation volume using iterative pyramid construction.
- 4 . The method of claim 3 further comprising: determining a second loss based on the first grid of image descriptors, the second grid of image descriptors, and ground-truth correspondences between the first and second images, wherein generating the second set of the parameters for the neural network module includes generating the second set of parameters for the neural network module based on minimizing a loss function using the loss and the second loss.
- 5 . The method of claim 4 wherein generating the second set of the parameters for the neural network module includes generating the second set of parameters for the neural network module based on minimizing a loss function based on a sum of the loss and the second loss.
- 6 . The method of claim 5 wherein the sum is a weighted sum.
- 7 . The method of claim 1 , wherein determining the second correlation volume includes: generating a first-level correlation volume based on first-level correlations between first-level patches of the first pixel grid and first-level patches of the second pixel grid; and for N between 1 and L−1, iteratively aggregating N-level correlations of an N-th level correlation volume into (N+1)-level correlations between (N+1)-level patches of the first pixel grid and (N+1)-level patches of the second pixel grid.
- 8 . The method of claim 7 wherein the (N+1)-level patches include neighboring N-level patches of the respective pixel grid and the aggregated N-level correlations correspond to the neighboring N-level patches of the correlated (N+1)-level patches.
- 9 . The method of claim 7 , wherein generating the first-level correlation volume includes determining a first-level correlation between a first-level patch of the first pixel grid and a first-level patch of the second pixel grid as an averaged sum of correlations between corresponding pixels in the first-level patch of the first pixel grid and the first-level patch of the second pixel grid.
- 10 . The method of claim 7 , wherein each (N+1)-level patch includes 2×2 N-level patches of the respective pixel grid.
- 11 . The method of claim 7 , wherein determining the N-th level correlation volume includes performing a rectification transformation for each N-level correlation of the N-th level correlation volume.
- 12 . The method of claim 1 , wherein the first correlation volume has a first dimension corresponding to a first dimension of the first feature map and a second dimension corresponding to a second dimension of the first feature map.
- 13 . The method of claim 12 wherein determining the first correlation volume includes subsampling the first feature map by a predetermined factor in the first and second dimensions and generating a subsampled feature map having a third dimension that is less than the first dimension and a fourth dimension that is less than the second dimension.
- 14 . The method of claim 13 , wherein subsampling the first feature map includes: dividing the first pixel grid into non-overlapping patches, each patch including a plurality of pixels; and for each patch, determining one descriptor based on the image descriptors corresponding to the pixels of that patch, wherein the one descriptor represents all pixels of that patch in the subsampled feature map.
- 15 . The method of claim 14 , wherein determining the first correlation volume includes, for each patch of the first pixel grid, determining correlations of the patch with each pixel of the second pixel grid, each correlation being determined based on the one descriptor representing the respective patch in the subsampled feature map and the image descriptor of the second feature map corresponding to the correlated pixel of the second pixel grid.
- 16 . The method of claim 14 , wherein each patch has a size of 4×4 pixels, the first dimension is 4× the third dimension, and the second dimension is 4× the fourth dimension.
- 17 . The method of claim 1 further comprising: using the trained neural network module and the second set of parameters, extracting correspondences between portions of a second image pair; based on the extracted correspondences, determining whether the portions of the second image pair include the same scene; and outputting an indicator of whether the portions of the second image pair include the same scene.
- 18 . A system comprising: a neural network module configured to, using trainable parameters: generate a first feature map based on a first image of an image pair; and generate a second feature map based on a second image of the image pair, the first feature map including a first grid of image descriptors and the second feature map including a second grid of image descriptors, at least a portion of the first image including a scene and at least a portion of the second image including the scene; a correlation module configured to determine a loss based on the first and second feature maps; and a training module configured to train the trainable parameters based on minimizing the loss.
- 19 . The system of claim 18 wherein the training module is configured to train the trainable parameters without labels indicative of correspondences between the portions of the first and second images.
- 20 . The system of claim 18 further comprising a matching module configured to, after the training: extract correspondences between feature maps generated by the neural network module based on received images, respectively; based on the correspondences, determine whether the received images include the same scene; and output an indicator of whether the received images include the same scene.
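Claims 7-16 describe two mechanisms in procedural terms: subsampling the first grid into patch descriptors before correlating (claims 13-16), and iteratively aggregating the correlation volume up a 2×2 pyramid (claims 7-11). A minimal PyTorch sketch of both follows; the average-pooled patch descriptor, the equally-placed-children averaging, and the ReLU rectification are assumptions standing in for the claims' more general "one descriptor based on the pixels of the patch", aggregation, and "rectification transformation".

```python
import torch
import torch.nn.functional as F

def subsampled_first_volume(f1: torch.Tensor, f2: torch.Tensor,
                            patch: int = 4) -> torch.Tensor:
    """Claims 13-16 (sketch): divide the first grid into non-overlapping
    4x4 patches, represent each patch by one descriptor (here: the mean
    of its pixels' descriptors), and correlate that descriptor with every
    pixel of the second grid.
    f1: (B, D, H1, W1), f2: (B, D, H2, W2) -> (B, H1/4, W1/4, H2, W2)."""
    f1p = F.avg_pool2d(f1, patch)  # one descriptor per 4x4 patch
    return torch.einsum('bdhw,bdkl->bhwkl', f1p, f2)

def aggregate(corr: torch.Tensor) -> torch.Tensor:
    """One pyramid step (claims 7-10, sketch): an (N+1)-level patch is a
    2x2 block of N-level patches, and its correlation with an (N+1)-level
    patch of the other grid is the averaged sum of the correlations of
    the equally placed children.
    corr: (B, H, W, K, L) -> (B, H/2, W/2, K/2, L/2); even dims assumed."""
    children = [corr[:, a::2, b::2, a::2, b::2]
                for a in (0, 1) for b in (0, 1)]
    return torch.stack(children).mean(dim=0)

def second_correlation_volume(corr: torch.Tensor, levels: int) -> torch.Tensor:
    """Claim 1's iterative determination of the second correlation volume,
    rectifying each level (claim 11); the ReLU is only an assumed form of
    the rectification transformation."""
    for _ in range(levels - 1):
        corr = torch.relu(aggregate(corr))
    return corr
```

Each level halves both grids, so local consistency is enforced by construction: an (N+1)-level correlation can only be large if the correlations of its children are jointly large.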
Description
CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of European Application No. EP22305384.4, filed on Mar. 28, 2022. The entire disclosure of the application referenced above is incorporated herein by reference.

FIELD

The present disclosure relates to neural networks for image analysis and, more particularly, to systems, methods, and computer-readable media for unsupervised learning of neural networks that compute local image descriptors for determining whether a pair of images depicts the same scene.

BACKGROUND

The background description provided here is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.

One of the persistent challenges of computer-based image analysis is the identification of corresponding pixels or portions of two images. This problem is central to numerous computer vision tasks, such as large-scale visual localization, object detection, pose estimation, Structure-from-Motion (SfM), three-dimensional (3D) reconstruction, and Simultaneous Localization and Mapping (SLAM). All of these tasks involve identifying corresponding portions of two images depicting at least partially the same visual content.

Two different images representing the same visual content can differ in a wide range of parameters, such as the viewing angle of the depicted motif, the position of the motif within the image frame, the camera, lens, and sensor type used to capture the image, lighting and weather conditions, focal length, and sharpness, to name just a few. While a human can easily identify image parts that show the same feature of a depicted object or person, even for images that differ markedly from each other, this task is complex for computers due to differences in geometry, color, and contrast between the images.

To identify correspondences between images, local image descriptors (also called pixel descriptors) may be extracted from an image. A local image descriptor characterizes a neighborhood of a pixel of the image and provides a computer-processable data structure that enables a computing system to compare local environments of pixels by determining correlations between pixels based on their local image descriptors. Local image descriptors can be extracted from an image sparsely (e.g., only for selected keypoints of the image) or densely (e.g., for each pixel of the image). In various implementations, the way the process of extracting (i.e., computing) local image descriptors from an image is performed determines the quality of the identified correspondences.

Existing learning-based approaches for extracting local image descriptors may significantly outperform standard handcrafted methods. Learning-based approaches may rely on training procedures that require annotated training data sets including a large number of image pairs for which pixel-level correspondences (e.g., dense ground-truth correspondences) are known. These correspondences may be obtained by considering a large collection of images for a given landmark and building a Structure-from-Motion reconstruction. This pipeline may fail, however, creating a bottleneck in the kind of ground-truth data that can be generated.
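For concreteness, dense extraction as described above can be pictured as a fully convolutional network mapping an image to one descriptor per pixel. The sketch below is a hypothetical stand-in; the architecture and dimensions are assumptions, not taken from the disclosure.

```python
import torch.nn as nn

class DenseDescriptorNet(nn.Module):
    """Hypothetical encoder: one dim-dimensional local descriptor per pixel."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, dim, 3, padding=1),
        )

    def forward(self, x):
        # x: (B, 3, H, W) -> (B, dim, H, W); padding preserves the pixel grid.
        return self.net(x)
```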
This may limit the potential of available image data to only those image pairs for which ground-truth labels can be efficiently derived. Since limited training data sets have a direct negative impact on training results, it would be beneficial to overcome these restrictions and fully exploit the potential of available image pairs as training data.

SUMMARY

To overcome the above deficiencies, a computer-implemented method for learning local descriptors without supervision is presented. The method jointly enforces two matching priors: local consistency and uniqueness of the matching.

The former is based on the observation that two neighboring pixels of one image will likely match two pixels forming a similar neighboring pair in the other image, up to a small deformation. This generally holds true at any scale. In disclosed examples, this prior is efficiently enforced through a pyramidal structure: a pyramidal non-parametric module extracts higher-level correspondences, enforcing the local consistency matching prior by design.

The uniqueness prior is based on the observation that one pixel of the first image can correspond to at most one pixel in the second image. This property is enforced on high-level correspondences via a uniqueness matching loss and is naturally propagated to low-level pixel correspondences thanks to the pyramidal hierarchical structure. As a result, the combination of a local consistency prior and a
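The exact form of the uniqueness matching loss is not spelled out above. One simple loss that penalizes violations of the at-most-one-match prior, offered purely as an assumed illustration in PyTorch, is sketched below.

```python
import torch

def uniqueness_loss(corr_top: torch.Tensor) -> torch.Tensor:
    """Assumed sketch of a uniqueness matching loss.

    corr_top: (B, H, W, K, L) top-level correlation volume. Each source
    patch spreads one unit of matching mass over the target grid via a
    softmax; any target patch collecting more than one unit of mass in
    total violates the at-most-one-match prior and is penalized.
    """
    b, h, w, k, l = corr_top.shape
    p = corr_top.reshape(b, h * w, k * l).softmax(dim=-1)  # row-stochastic
    mass = p.sum(dim=1)                                    # (B, K*L)
    return torch.relu(mass - 1.0).sum(dim=-1).mean()
```

Because every top-level correlation is built from lower-level correlations, gradients of this loss flow back through the pyramid to the per-pixel descriptors, which is how the prior propagates to low-level correspondences; in the variant of claims 4-6, it would be combined with a ground-truth correspondence loss as a (weighted) sum.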