US-12626376-B2 - Apparatus and method with image matching

US12626376B2

Abstract

A processor-implemented method includes: extracting feature maps respectively corresponding to a plurality of channels based on a convolutional network with respect to two images; generating a matching point map from the feature maps; refining the matching point map by using attention between matching points comprised in the matching point map; and extracting a matching point between the two images from the refined matching point map.

Inventors

  • Seung Wook Kim
  • Minsu Cho
  • Juhong MIN

Assignees

  • SAMSUNG ELECTRONICS CO., LTD.
  • POSTECH Research and Business Development Foundation

Dates

Publication Date
2026-05-12
Application Date
2023-05-12
Priority Date
2022-11-11

Claims (20)

  1. A processor-implemented method, the method comprising: extracting feature maps respectively corresponding to a plurality of channels based on a convolutional network with respect to two images; generating a matching point map from the feature maps; refining the matching point map by using attention between matching points comprised in the matching point map through an attention structure based on addition between layers comprised in a transformer neural network; and extracting a matching point between the two images from the refined matching point map.
  2. The method of claim 1, wherein the refining the matching point map through the attention structure based on the addition between the layers comprised in the transformer neural network comprises: determining a first global vector of one dimension from a query layer; determining a first relation layer for all key vectors of a key layer by performing addition-based attention on the first global vector and all the key vectors of the key layer; determining a second global vector of one dimension from an intermediate layer; and determining a second relation layer for all value vectors of a value layer by performing addition-based attention on the second global vector and all the value vectors of the value layer.
  3. The method of claim 2, further comprising performing addition of all query vectors of the query layer and all vectors of the second relation layer.
  4. The method of claim 1, wherein the generating the matching point map comprises generating the matching point map of four dimension for the feature maps by using the feature maps of two dimension.
  5. The method of claim 1, wherein the refining the matching point map comprises: converting the matching point map to a vector of one dimension; and calculating a similarity by performing attention on the matching points.
  6. The method of claim 1, wherein the refining the matching point map comprises increasing a size of the refined matching point map by upsampling the refined matching point map.
  7. The method of claim 1, wherein the extracting the feature maps comprises extracting the feature maps corresponding to a result of a bottleneck layer of the convolutional network.
  8. The method of claim 1, wherein the extracting the matching point comprises extracting the matching point between the two images by calculating a dense flow field by using the refined matching point map.
  9. The method of claim 1, further comprising training a transformer neural network used for the refining of the matching point map by using a loss function for the extracted matching point and a labeled matching point between the two images.
  10. A non-transitory computer-readable storage medium storing instructions that, when executed by one or more processors, configure the one or more processors to perform the method of claim 1.
  11. The method of claim 1, wherein the refining the matching point map through the attention structure is based on the addition between two different layers comprised in the transformer neural network.
  12. A processor-implemented method, the method comprising: extracting feature maps respectively corresponding to a plurality of channels based on a convolutional network with respect to two training images; generating a matching point map from the feature maps; refining the matching point map through an attention structure based on addition between layers comprised in a transformer neural network; extracting a matching point between the two training images from the refined matching point map; and training the transformer neural network for refining the matching point map by using a loss function for the extracted matching point and a labeled matching point between the two training images.
  13. The method of claim 12, wherein the refining the matching point map through the attention structure based on the addition between the layers comprised in the transformer neural network comprises: determining a first global vector of one dimension from a query layer; determining a first relation layer for all key vectors of a key layer by performing addition-based attention on the first global vector and all the key vectors of the key layer; determining a second global vector of one dimension from an intermediate layer; and determining a second relation layer for all value vectors of a value layer by performing addition-based attention on the second global vector and all the value vectors of the value layer.
  14. The method of claim 13, further comprising performing addition of all query vectors of the query layer and all vectors of the second relation layer.
  15. The method of claim 12, wherein the generating the matching point map from the feature maps comprises generating the matching point map of four dimension for the feature maps by using the feature maps of two dimension.
  16. The method of claim 12, wherein the refining the matching point map by using the attention between the matching points comprised in the matching point map further comprises: converting the matching point map to a vector of one dimension; and calculating a similarity by performing attention on the matching points.
  17. The method of claim 12, wherein the refining the matching point map through the attention structure is based on the addition between two different layers comprised in the transformer neural network.
  18. An apparatus comprising: one or more processors configured to: extract feature maps respectively corresponding to a plurality of channels based on a convolutional network with respect to two images; generate a matching point map from the feature maps; refine the matching point map by using attention between matching points comprised in the matching point map through an attention structure based on addition between layers comprised in a transformer neural network; and extract a matching point between the two images from the refined matching point map.
  19. The apparatus of claim 18, wherein, for the refining the matching point map through the attention structure based on the addition between the layers comprised in the transformer neural network, the one or more processors are configured to: determine a first global vector of one dimension from a query layer; determine a first relation layer for all key vectors of a key layer by performing addition-based attention on the first global vector and all the key vectors of the key layer; determine a second global vector of one dimension from an intermediate layer; and determine a second relation layer for all value vectors of a value layer by performing addition-based attention on the second global vector and all the value vectors of the value layer.
  20. The apparatus of claim 19, wherein the one or more processors are configured to perform addition of all query vectors of the query layer and all vectors of the second relation layer.
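Claims 4 and 15 recite generating a four-dimensional matching point map from two-dimensional feature maps. One common construction consistent with this language (though not confirmed as the patented implementation) is a dense correlation volume over all pairs of spatial locations; the NumPy sketch below assumes that reading, with illustrative function names and shapes.

```python
import numpy as np

def correlation_volume(feat_a, feat_b):
    """Build a 4-D matching point map from two (C, H, W) feature maps:
    corr[i, j, k, l] is the dot product between the feature at location
    (i, j) of image A and the feature at location (k, l) of image B."""
    c, h, w = feat_a.shape
    a = feat_a.reshape(c, -1)          # (C, H*W): one column per location
    b = feat_b.reshape(c, -1)
    corr = a.T @ b                     # (H*W, H*W): all location pairs
    return corr.reshape(h, w, h, w)    # 4-D matching point map
```

Every entry of the 4-D map scores one candidate correspondence, which is what allows the later attention stage to operate over matching points rather than over raw pixels.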

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit under 35 USC § 119(a) of Korean Patent Application No. 10-2022-0150904, filed on Nov. 11, 2022 in the Korean Intellectual Property Office, the entire disclosure of which is incorporated herein by reference for all purposes.

BACKGROUND

1. Field

The following description relates to an apparatus and method with image matching.

2. Description of Related Art

Image matching technology identifies the matching parts of two images. When the two images capture the same scene or the same object under different conditions (e.g., illuminance, viewing angle), the task may be referred to as wide-baseline matching. When the two images capture different instances of objects of the same class, the task may be referred to as semantic matching. A cosine similarity between extracted features is commonly used to find matching points of the two images: image matching aims to obtain a high cosine similarity between matching points and a low cosine similarity between unmatched points. Accordingly, a convolutional neural network (CNN) trained on the ImageNet dataset may be used to extract the features of an image in the image matching field.

SUMMARY

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
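The cosine-similarity criterion described in the related-art discussion can be sketched as follows; the function name and shapes are illustrative, not from the patent.

```python
import numpy as np

def cosine_similarity_matrix(feat_a, feat_b):
    """Pairwise cosine similarity between feature vectors extracted at
    N_a locations of one image and N_b locations of the other.
    feat_a: (N_a, C), feat_b: (N_b, C)."""
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    return a @ b.T  # (N_a, N_b); matching points should score near 1
```

A matcher built on this criterion selects, for each location in one image, the location in the other image with the highest similarity score.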
In one or more general aspects, a processor-implemented method includes: extracting feature maps respectively corresponding to a plurality of channels based on a convolutional network with respect to two images; generating a matching point map from the feature maps; refining the matching point map by using attention between matching points comprised in the matching point map; and extracting a matching point between the two images from the refined matching point map.

The refining of the matching point map may include refining the matching point map through an attention structure based on addition between layers comprised in a transformer neural network. This refining may include: determining a first global vector of one dimension from a query layer; determining a first relation layer for all key vectors of a key layer by performing addition-based attention on the first global vector and all the key vectors of the key layer; determining a second global vector of one dimension from an intermediate layer; and determining a second relation layer for all value vectors of a value layer by performing addition-based attention on the second global vector and all the value vectors of the value layer. The method may further include performing addition of all query vectors of the query layer and all vectors of the second relation layer.

The generating of the matching point map may include generating a four-dimensional matching point map from the two-dimensional feature maps. The refining of the matching point map may include: converting the matching point map to a one-dimensional vector; and calculating a similarity by performing attention on the matching points. The refining of the matching point map may also include increasing the size of the refined matching point map by upsampling it.
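The addition-based attention described above (one-dimensional global vectors pooled from the query and intermediate layers, combined with the key and value layers by addition rather than by query-key multiplication) resembles additive-attention designs such as Fastformer. The sketch below is one possible reading of that structure, not the patented implementation; the pooling weights `wq` and `wk` are hypothetical learned parameters.

```python
import numpy as np

def _softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(Q, K, V, wq, wk):
    """One additive-attention step over N tokens of dimension d.
    Q, K, V: (N, d) query/key/value layers; wq, wk: (d,) pooling weights."""
    alpha = _softmax(Q @ wq)        # (N,) attention scores over queries
    q_global = alpha @ Q            # (d,) first global vector
    relation1 = K + q_global        # first relation layer (addition-based)
    beta = _softmax(relation1 @ wk) # scores over the intermediate layer
    k_global = beta @ relation1     # (d,) second global vector
    relation2 = V + k_global        # second relation layer
    return Q + relation2            # add query vectors and relation layer
```

Because the global vectors collapse each layer to a single d-dimensional summary, this structure avoids the quadratic query-key score matrix of standard attention, which matters when the tokens are the very numerous matching points of a 4-D correlation map.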
The extracting of the feature maps may include extracting the feature maps corresponding to a result of a bottleneck layer of the convolutional network. The extracting of the matching point may include extracting the matching point between the two images by calculating a dense flow field using the refined matching point map. The method may include training a transformer neural network used for the refining of the matching point map by using a loss function for the extracted matching point and a labeled matching point between the two images.

In one or more general aspects, a non-transitory computer-readable storage medium stores instructions that, when executed by a processor, configure the processor to perform any one, any combination, or all of the operations and/or methods described herein.

In one or more general aspects, a processor-implemented method includes: extracting feature maps respectively corresponding to a plurality of channels based on a convolutional network with respect to two training images; generating a matching point map from the feature maps; refining the matching point map through an attention structure based on addition between layers comprised in a transformer neural network; extracting a matching point between the two training images from the refined matching point map; and training the transformer neural network for refining the matching point map by using a loss function for the extracted matching point and a labeled matching point between the two training images.
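Extracting a dense flow field from a refined matching map is commonly done with a soft-argmax over the similarity scores; the patent does not specify the exact operator, so the sketch below assumes that approach, with illustrative names and shapes.

```python
import numpy as np

def dense_flow(corr, temperature=1.0):
    """Soft-argmax flow from a refined 4-D matching map of shape
    (H, W, H, W): for each source location, the expected matched
    coordinate in the target image minus the source coordinate."""
    h, w = corr.shape[:2]
    flat = corr.reshape(h, w, -1) / temperature
    flat = flat - flat.max(axis=-1, keepdims=True)     # stable softmax
    prob = np.exp(flat)
    prob /= prob.sum(axis=-1, keepdims=True)           # (H, W, H*W)
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    grid = np.stack([ys.ravel(), xs.ravel()], axis=1)  # (H*W, 2)
    expected = prob @ grid.astype(float)               # (H, W, 2)
    return expected - np.stack([ys, xs], axis=-1)      # dense flow field
```

With a map that strongly matches each location to itself, the resulting flow is near zero everywhere; sharper maps (or a lower temperature) push the soft-argmax toward a hard nearest-neighbor assignment.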