KR-20260063652-A - DEVICE AND METHOD FOR OBJECT-CENTERED REPRESENTATION LEARNING THROUGH UNSUPERVISED SEMANTIC SEGMENTATION

Abstract

The present invention relates to an object-centered representation learning device through unsupervised semantic segmentation, comprising: an image encoding module that receives an input image and generates a feature map; an eigencluster module that calculates eigenvectors representing the semantic structure of patches within the input image based on the color affinity and semantic similarity of the input image, and generates patch clusters for the patches within the input image through the eigenvectors; and an object-centered contrastive learning module that generates object prototypes based on the patch clusters and distinguishes objects within the input image through semantic consistency based on contrastive learning for the object prototypes.
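
The eigencluster step summarized above can be sketched roughly as follows, using the operations that the dependent claims recite: a color affinity matrix, an inner-product semantic similarity matrix, a merged Laplacian, its eigendecomposition, and K-means on the eigenvectors. This is a minimal NumPy sketch under stated assumptions, not the patented implementation: the Gaussian color kernel, the weighted-sum merge controlled by `color_weight`, the symmetric normalized Laplacian, the farthest-point K-means initialization, and all function names are hypothetical choices made for illustration.

```python
import numpy as np

def color_affinity(patch_colors, sigma=0.1):
    """Gaussian affinity between mean patch colors (kernel choice is an assumption)."""
    diff = patch_colors[:, None, :] - patch_colors[None, :, :]
    return np.exp(-(diff ** 2).sum(-1) / (2.0 * sigma ** 2))

def semantic_similarity(features):
    """Inner product between L2-normalized patch features, negatives clipped to 0."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    return np.clip(f @ f.T, 0.0, None)

def spectral_embedding(adjacency, k):
    """Eigenvectors of the symmetric normalized Laplacian for the k smallest eigenvalues."""
    deg = adjacency.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg + 1e-8)
    lap = np.eye(len(adjacency)) - d_inv_sqrt[:, None] * adjacency * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(lap)  # eigh returns eigenvalues in ascending order
    return vecs[:, :k]

def kmeans(x, k, iters=50):
    """Plain Lloyd's algorithm with farthest-point initialization (assumed)."""
    centers = [x[0]]
    for _ in range(1, k):
        dist = np.min([((x - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(x[np.argmax(dist)])
    centers = np.stack(centers)
    for _ in range(iters):
        labels = np.argmin(((x[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels

def eicue(patch_colors, features, k, color_weight=1.0):
    """Merge both matrices, embed patches spectrally, and cluster them."""
    adjacency = semantic_similarity(features) + color_weight * color_affinity(patch_colors)
    return kmeans(spectral_embedding(adjacency, k), k)
```

Here `patch_colors` is assumed to be an (N, 3) array of mean patch colors in [0, 1] and `features` an (N, D) array of patch features taken from the feature map; the returned array assigns each of the N patches to one of `k` clusters.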

Inventors

황성재
김찬영
한우정
주다윤

Assignees

Industry-Academic Cooperation Foundation, Yonsei University (연세대학교 산학협력단)

Dates

Publication Date: 20260507
Application Date: 20241030

Claims (11)

1. An object-centered representation learning device through unsupervised semantic segmentation, comprising: an image encoding module that receives an input image and generates a feature map; an eigencluster module that calculates eigenvectors representing the semantic structure of patches within the input image based on the color affinity and semantic similarity of the input image, and generates patch clusters for the patches within the input image through the eigenvectors; and an object-centered contrastive learning module that generates an object prototype based on the patch clusters and distinguishes objects within the input image through semantic consistency based on contrastive learning for the object prototype.
2. The device of claim 1, wherein the image encoding module receives, through a Vision Transformer (ViT), an original image and a transformed image obtained by transforming the original image as the input images.
3. The device of claim 2, wherein the image encoding module extracts key features of different layers from the original image and the transformed image and integrates the key features to generate the feature map.
4. The device of claim 1, wherein the eigencluster module divides the input image into patches and generates a color affinity matrix by calculating color affinity based on the color information of each of the patches.
5. The device of claim 4, wherein the eigencluster module generates a semantic similarity matrix, indicating how semantically similar the patches are to one another, by computing inner products between patches in the feature map.
6. The device of claim 5, wherein the eigencluster module generates a Laplacian matrix by merging the color affinity matrix and the semantic similarity matrix, and calculates the eigenvectors by eigendecomposition of the Laplacian matrix.
7. The device of claim 6, wherein the eigencluster module performs K-means clustering on the patches within the input image using the eigenvectors and classifies similar patches as the same object to generate the patch clusters (EiCue).
8. The device of claim 1, wherein the object-centered contrastive learning module determines the object prototype by selecting a center vector from the patch cluster or by calculating an average vector.
9. The device of claim 8, wherein the object-centered contrastive learning module learns the semantic consistency of the object by performing intra-image contrastive learning and inter-image contrastive learning on the object prototype.
10. The device of claim 9, wherein the object-centered contrastive learning module learns the semantic distinction of the object through contrastive learning between the patch clusters.
11. An object-centered representation learning method through unsupervised semantic segmentation, performed in an object-centered representation learning device through unsupervised semantic segmentation, the method comprising: an image encoding step of receiving an input image and generating a feature map; an eigenclustering step of calculating eigenvectors representing the semantic structure of patches within the input image based on the color affinity and semantic similarity of the input image, and generating patch clusters for the patches within the input image through the eigenvectors; and an object-centered contrastive learning step of generating an object prototype based on the patch clusters and distinguishing objects within the input image through semantic consistency based on contrastive learning for the object prototype.
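
As a rough illustration of the prototype and contrastive learning steps recited in claims 8 to 10, the sketch below averages the patch features of each cluster into a prototype (the "average vector" option) and scores each patch against all prototypes. The claims do not state the form of the loss, so the InfoNCE-style cross-entropy, the `temperature` parameter, and the function names here are assumptions made for illustration only.

```python
import numpy as np

def object_prototypes(features, labels, k):
    """Average the patch features of each cluster (the 'average vector' option)."""
    protos = np.zeros((k, features.shape[1]))
    for j in range(k):
        members = features[labels == j]
        if len(members):
            protos[j] = members.mean(axis=0)
    return protos

def l2_normalize(x):
    """Scale each row to unit length so inner products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

def prototype_contrastive_loss(features, labels, prototypes, temperature=0.1):
    """InfoNCE-style loss (assumed form): pull each patch toward the prototype
    of its own cluster and push it away from the other prototypes."""
    logits = l2_normalize(features) @ l2_normalize(prototypes).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()
```

Under the same assumptions, computing `prototypes` from the patches of the same image would correspond to the intra-image case, while taking them from the transformed view of claim 2 would correspond to the inter-image case.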

Description

Device and Method for Object-Centered Representation Learning Through Unsupervised Semantic Segmentation

The present invention relates to an object-centered representation learning technique through unsupervised semantic segmentation, and more specifically, to a device and method for object-centered representation learning through unsupervised semantic segmentation that generate object prototypes based on patch clusters and perform object-centered contrastive learning, distinguishing objects within an input image through semantic consistency based on contrastive learning for the object prototypes.

Object-Centered Representation Learning (OCR) is a learning method that identifies the features of specific objects within a scene or image and understands the relationships and compositions between objects. This technology plays a crucial role in understanding objects concretely and contextually, particularly in artificial intelligence fields such as computer vision, autonomous driving, and robotics. The key elements are as follows.

Through object detection and segmentation, objects are distinguished from the background, the image is divided into multiple parts, and the location and size of each object are defined. Through this process, the model learns the boundaries and shapes of specific objects within the scene.

Through representation learning, the key features of each object (color, shape, size, etc.) and their contextual relationships with the surrounding environment are identified and represented in vector form. This vector representation helps the model understand both the visual features and the semantic information of objects, enabling consistent object recognition across various scenes.

Unsupervised learning analyzes scenes without prior labels to learn similarities and relationships between objects. Like human observation, this is useful for identifying patterns in data and extending the model's understanding to new objects.

Through fine-grained feature analysis, the model learns the detailed characteristics of objects and the correlations between them so that it can accurately understand objects even in complex scenes. For example, in autonomous driving, it clearly distinguishes vehicles, pedestrians, and traffic lights on the road and identifies the relationships between them in real time.

These object-centered representation learning technologies are being applied to significantly improve object recognition accuracy in areas such as visual perception in autonomous vehicles, object manipulation in robots, and augmented reality. In addition, they can be used to accurately recognize objects in medical image analysis and 3D modeling. Object-centered representation learning provides an important foundation for the efficient processing and analysis of object-centered visual information, and its strength lies in high recognition performance, particularly in complex environments involving multiple objects.

Korean Published Patent No. 10-2022-0087567 (July 15, 2022) discloses an object recognition and re-identification technology based on unsupervised contrastive learning utilizing a camera and image tracklets. The learning method for object re-identification includes: a step of generating a camera-level subdomain by classifying data of the corresponding camera based on a camera ID; and a step of performing contrastive learning on the subdomain through a dataset using an object tracklet ID classified by the camera data as a virtual label.

FIG. 1 is a diagram illustrating an object-centered representation learning device through unsupervised semantic segmentation according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the functional configuration of the object-centered representation learning device of FIG. 1.
FIG. 3 is a diagram illustrating the system configuration of the object-centered representation learning device of FIG. 1.
FIG. 4 is a flowchart illustrating an object-centered representation learning method through unsupervised semantic segmentation according to the present invention.
FIG. 5 is a diagram of the EiCue generation process of the object-centered representation learning device of FIG. 1.
FIG. 6 is a visualization of the eigenvectors derived from S in the Eigen Aggregation module of the object-centered representation learning device of FIG. 1.
FIG. 7 is a diagram comparing the results of training the object-centered representation learning device of FIG. 1 using the ViT-S/8 and ViT-B/8 backbones, respectively, on the (a) COCO-Stuff and (b) Cityscapes datasets.
FIG. 8 is a diagram comparing K-means and EiCue in the object-centered representation learning device of FIG. 1.
Fi