KR-20260063894-A - Method and apparatus for learning an image-language model, object searching and tracking method and apparatus
Abstract
A method for training an image-language model, an object search and tracking method, and an apparatus are provided. The object search and tracking apparatus performs text embedding on input text using an image-language model. Here, the image-language model is a model trained by setting, as labels for an image group, text tokens obtained through a tokenization model that was itself trained using token embeddings combining at least two tokens representing text related to an image. The object search and tracking apparatus performs image embedding on object images using the trained image-language model, calculates the similarity between a text feature vector based on the text embedding and an image feature vector based on the image embedding, and selects, based on the similarity, at least one of the object images as the object image retrieved in correspondence with the text.
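The retrieval flow described in the abstract — embedding the text query and the object images with a shared image-language model, scoring pairs by similarity, and keeping the best-matching images — can be sketched as follows. This is a minimal illustration only: the encoders below are deterministic random stand-ins (the patent's trained model, tokenization scheme, and threshold values are not reproduced), and cosine similarity is assumed as the similarity measure.

```python
import zlib

import numpy as np

EMBED_DIM = 64

def _toy_encoder(key: str) -> np.ndarray:
    """Deterministic stand-in for the model's text/image encoders.

    Maps a string key to a unit-length feature vector. The real system
    would run the trained image-language model's text or image branch.
    """
    seed = zlib.crc32(key.encode("utf-8"))
    v = np.random.default_rng(seed).standard_normal(EMBED_DIM)
    return v / np.linalg.norm(v)

def search_objects(query_text: str, object_images: dict[str, str],
                   first_set_similarity: float = -1.0) -> list[tuple[str, float]]:
    """Rank object images against the text query by cosine similarity,
    keeping those at or above a threshold (the 'first set similarity')."""
    t = _toy_encoder("text:" + query_text)           # text embedding
    scored = []
    for image_id, descriptor in object_images.items():
        v = _toy_encoder("image:" + descriptor)      # image embedding
        scored.append((image_id, float(t @ v)))      # cosine sim (unit vectors)
    scored.sort(key=lambda p: p[1], reverse=True)    # sort by similarity
    return [(i, s) for i, s in scored if s >= first_set_similarity]

gallery = {"obj1": "person red jacket", "obj2": "person blue coat",
           "obj3": "bicycle"}
results = search_objects("person wearing a red jacket", gallery)
print(results[0][0])  # id of the most similar object image in this toy run
```

With a genuinely trained model, the top-ranked ids would reflect semantic similarity; here the ranking is arbitrary and only the pipeline shape is meaningful.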
Inventors
- 김병완
- 구경모
- 양성민
- 박창현
Assignees
- 주식회사 에스원
Dates
- Publication Date
- 20260507
- Application Date
- 20241031
Claims (13)
- An object search and tracking method, comprising: performing, by an object search and tracking device, text embedding on input text using a trained image-language model, the image-language model being a model trained by setting, as labels for an image group, text tokens obtained through a tokenization model trained using token embeddings that combine at least two tokens representing text related to an image; performing, by the object search and tracking device, image embedding on object images using the trained image-language model; calculating, by the object search and tracking device, a similarity between a text feature vector according to the text embedding and an image feature vector according to the image embedding; and selecting, by the object search and tracking device, based on the similarity, at least one of the object images as an object image retrieved in correspondence with the text.
- The method of claim 1, further comprising: performing image embedding on an input query image using the trained image-language model, the input query image being the selected object image or an input image; performing image embedding on object images extracted from video or object images retrieved from a database using the trained image-language model; calculating a similarity between an image feature vector according to the image embedding of the input query image and image feature vectors according to the embedding of the extracted object images; and tracking, based on the similarity, an object corresponding to the input query image among the extracted object images.
- The method of claim 2, wherein the selecting comprises sorting the object images by the similarity and selecting, among the sorted object images, an object image whose similarity is equal to or greater than a first set similarity as the object image retrieved in correspondence with the text, and wherein the tracking comprises tracking, among the extracted object images, an object image whose similarity is the maximum value and is equal to or greater than a second set similarity.
- The method of claim 1, wherein text tokens generated through the tokenization model are set as labels for an image group to train the image-language model, and, according to the training, re-identification features and object attributes of an image and the optimal text tokens describing them are stored and managed in correspondence with each other.
- A method for training an image-language model, comprising: training, by a learning device, a tokenization model that performs token embedding so as to find a text token representing the optimal class corresponding to an image; setting, by the learning device, the text token obtained through the trained tokenization model as a label of an image group; and training, by the learning device, the image-language model to match text embeddings and image embeddings using the labels set based on the text tokens.
- The method of claim 5, wherein the training of the tokenization model comprises: generating per-token feature vectors for text through the tokenization model and generating a text token feature vector by performing multi-token embedding that combines at least two of the per-token feature vectors; converting the text token feature vector into a text feature vector through a text encoder of the image-language model; generating an image feature vector for an input image through an image encoder of the image-language model; calculating a loss using the text feature vector and the image feature vector; and training the tokenization model so as to minimize the loss.
- The method of claim 5, wherein the training to match text embeddings and image embeddings of the image-language model comprises: setting the text token as a label for an image group, and fine-tuning the image-language model in a direction that maximizes the similarity between the text feature vector obtained according to the text embedding of the image-language model and the image feature vector obtained according to the image embedding of the image-language model.
- The method of claim 5, wherein text tokens generated through the tokenization model are set as labels for an image group to train the image-language model, and, according to the training, re-identification features and object attributes of an image and the optimal text tokens describing them are stored and managed in correspondence with each other.
- An object search and tracking device, comprising: an interface device; and a processor configured to perform object search and tracking using an image-language model based on text or an image input through the interface device, the image-language model being a model trained by setting, as labels for an image group, text tokens obtained through a tokenization model trained using token embeddings that combine at least two tokens representing text related to an image, wherein the processor is configured to perform: an operation of performing text embedding on input text using the trained image-language model; an operation of performing image embedding on object images using the trained image-language model; an operation of calculating a similarity between a text feature vector according to the text embedding and an image feature vector according to the image embedding; and an operation of selecting, based on the similarity, at least one of the object images as an object image retrieved in correspondence with the text.
- The device of claim 9, wherein the processor is further configured to perform: an operation of performing image embedding on an input query image using the trained image-language model, the input query image being the selected object image or an input image; an operation of performing image embedding on object images extracted from video or object images retrieved from a database using the trained image-language model; an operation of calculating a similarity between an image feature vector according to the image embedding of the input query image and image feature vectors according to the embedding of the extracted object images; and an operation of tracking, based on the similarity, an object corresponding to the input query image among the extracted object images.
- The device of claim 9, wherein the processor is further configured to perform an operation of training the image-language model by setting text tokens generated through the trained tokenization model as labels for image groups, the training operation comprising: an operation of training the tokenization model that performs token embedding to find a text token representing the optimal class corresponding to an image; an operation of setting the text token obtained through the trained tokenization model as a label of an image group; and an operation of fine-tuning the image-language model in a direction that maximizes the similarity between the text feature vector obtained according to the text embedding of the image-language model and the image feature vector obtained according to the image embedding of the image-language model, using the labels set based on the text tokens.
- The device of claim 11, wherein the operation of training the tokenization model comprises: an operation of generating per-token feature vectors for text through the tokenization model and generating a text token feature vector by performing multi-token embedding that combines at least two of the per-token feature vectors; an operation of converting the text token feature vector into a text feature vector through a text encoder of the image-language model; an operation of generating an image feature vector for an input image through an image encoder of the image-language model; an operation of calculating a loss using the text feature vector and the image feature vector; and an operation of training the tokenization model so as to minimize the loss.
- The device of claim 9, wherein text tokens generated through the tokenization model are set as labels for an image group to train the image-language model, and, according to the training, re-identification features and object attributes of an image and the optimal text tokens describing them are stored and managed in correspondence with each other.
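The training claims above call for fitting the model so that matched text and image embeddings have maximal similarity while a loss is minimized. The patent does not disclose its exact loss, so the sketch below uses a standard symmetric contrastive (CLIP-style) cross-entropy over cosine similarities — a common choice for this kind of objective — applied to toy feature batches; the `temperature` value and the random features are assumptions for illustration only.

```python
import numpy as np

def contrastive_loss(text_feats: np.ndarray, image_feats: np.ndarray,
                     temperature: float = 0.07) -> float:
    """Symmetric CLIP-style contrastive loss over a batch.

    Row i of text_feats and row i of image_feats form a matched pair; the
    loss is small when each matched pair has the highest cosine similarity
    in both the text-to-image and image-to-text directions.
    """
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    v = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    logits = (t @ v.T) / temperature                 # pairwise similarities
    labels = np.arange(len(logits))                  # pair i matches pair i

    def xent(lg: np.ndarray) -> float:
        """Row-wise cross-entropy against the diagonal targets."""
        lg = lg - lg.max(axis=1, keepdims=True)      # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return float(-logp[labels, labels].mean())

    return (xent(logits) + xent(logits.T)) / 2       # symmetric in both directions

rng = np.random.default_rng(0)
aligned = rng.standard_normal((8, 32))
loss_aligned = contrastive_loss(aligned, aligned)    # matched pairs identical
loss_random = contrastive_loss(aligned, rng.standard_normal((8, 32)))
print(loss_aligned, loss_random)
```

Minimizing such a loss by gradient descent over the encoder parameters drives matched text/image pairs together and mismatched pairs apart, which is the stated direction of the fine-tuning in the claims; perfectly aligned batches score a much lower loss than random pairings.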
Description
The present invention relates to object search and tracking and, more specifically, to a method for training an image-language model, an object search and tracking method, and an apparatus.

Due to the increasing demand for intelligent video surveillance, pedestrian attribute and person re-identification technologies are attracting significant attention in the field of computer vision. Pedestrian attributes play a crucial role in video surveillance as semantic descriptions that humans can search for, and are utilized in applications such as person re-identification (Re-ID), face verification, and human identification. Pedestrian attribute recognition aims to extract the attributes of a target person given a human image. Person re-identification has been extensively studied as the problem of searching for and tracking specific individuals using image features across non-overlapping cameras: given a query subject of interest, the goal is to identify or track the same person captured by different cameras at different locations, or by the same camera at different times. While research such as CLIP (Contrastive Language-Image Pre-training)-ReID, which utilizes image-language models for training in person re-identification, is underway, it remains difficult to apply to video surveillance cameras in diverse environments. A relevant prior art document is "Method and apparatus for real-time moving object tracking using a learning rank-based context-aware multi-regulation correlation filter in an image surveillance system," disclosed in Korean Patent Application Publication No. 2024-0145118.

FIG. 1 is a diagram showing the structure of an object search and tracking device according to an embodiment of the present invention. FIG. 2 is a flowchart of a method for training an image-language model according to an embodiment of the present invention. FIG. 3 is an exemplary diagram illustrating the process of training an image-language model according to an embodiment of the present invention. FIG. 4 is a flowchart of an object search and tracking method according to an embodiment of the present invention. FIG. 5 is an exemplary diagram illustrating an object search process according to an embodiment of the present invention. FIG. 6 is an exemplary diagram illustrating an object tracking process according to an embodiment of the present invention. FIG. 7 is a structural diagram illustrating a computing device for implementing a method according to an embodiment of the present invention.

Embodiments of the present invention are described below with reference to the attached drawings so that those skilled in the art can easily implement them. However, the present invention may be embodied in various different forms and is not limited to the embodiments described herein. Furthermore, to explain the present invention clearly, parts unrelated to the explanation have been omitted from the drawings, and similar parts are denoted by similar reference numerals throughout the specification.

Throughout the specification, when a part is described as "including" a certain component, this means that, unless specifically stated otherwise, it does not exclude other components but may include additional components. Expressions described in the singular in this specification may be interpreted as singular or plural unless explicit expressions such as "one" or "single" are used. Additionally, terms including ordinal numbers, such as first, second, etc., may be used in embodiments of the present invention to describe components, but the components should not be limited by these terms; the terms are used solely for the purpose of distinguishing one component from another.
For example, without departing from the scope of the present invention, a first component may be termed a second component, and similarly, a second component may be termed a first component.

Hereinafter, a method for training an image-language model, an object search and tracking method, and an apparatus according to an embodiment of the present invention are described. While vision-language models (VLMs) that link images and text are being actively researched, associating human objects with attributes for tracking and search remains a challenging task due to factors such as diverse viewpoints, low-resolution images, lighting variations, complex camera environments, and background clutter. In an embodiment of the present invention, an image-language model is jointly trained by connecting re-identification features (recognition images) and object attributes (attribute images) with high-level semantic features of text, and object search and tracking are performed using the image-language model trained in this way.

FIG. 1 is a diagram showing the structure of an object search and tracking devi