
KR-20260064157-A - Method and device for training artificial intelligence algorithm model using fused multi-modal

KR20260064157A

Abstract

A method performed by an electronic device using artificial intelligence according to an embodiment of the present invention comprises the steps of: receiving image data; and outputting data including attribute and object information for the image data by providing the image data as input to a pre-trained artificial intelligence algorithm model, wherein the pre-trained artificial intelligence algorithm model may be trained using, as input, prompt data including soft prompts and hard prompts together with image data.

Inventors

  • 정수아
  • 김동진

Assignees

  • 한양대학교 산학협력단 (Industry-University Cooperation Foundation of Hanyang University)

Dates

Publication Date
2026-05-07
Application Date
2024-10-31

Claims (15)

  1. A method performed by an electronic device using artificial intelligence, the method comprising: receiving image data; and outputting data including attribute and object information for the image data by providing the image data as input to a pre-trained artificial intelligence algorithm model, wherein the pre-trained artificial intelligence algorithm model is trained using, as input, prompt data including soft prompts and hard prompts together with image data.
  2. The method of claim 1, wherein the soft prompts include a soft prompt representing an attribute and a soft prompt representing an object, and the hard prompts include a hard prompt representing an attribute and an object simultaneously.
  3. The method of claim 1, wherein the pre-trained artificial intelligence algorithm model learns, using a cross-attention mechanism, text features extracted from the soft prompts and the hard prompts, respectively, and image features extracted from the image data.
  4. The method of claim 3, wherein the text features include an attribute text feature, an object text feature, and an attribute-object text feature, and the image features include an attribute image feature, an object image feature, and an attribute-object image feature.
  5. The method of claim 4, wherein the pre-trained artificial intelligence algorithm model is configured to, in a first structure, extract inter-fused text features by performing cross-attention using the text features as a query and the image features as a key and a value, and to extract inter-fused image features by performing cross-attention in the first structure using the image features as a query and the text features as a key and a value.
  6. The method of claim 5, wherein the pre-trained artificial intelligence algorithm model is configured to, in a second structure, extract intra-fused text features by performing cross-attention using the extracted inter-fused text features as a query and the text features as a key and a value, and to extract intra-fused image features by performing cross-attention in the second structure using the inter-fused image features as a query and the image features as a key and a value.
  7. The method of claim 6, wherein the pre-trained artificial intelligence algorithm model is configured to determine a prompt loss based on a similarity between the text features and the image features, determine an inter loss based on a similarity between the inter-fused text features and the inter-fused image features, and determine an intra loss based on a similarity between the intra-fused text features and the intra-fused image features.
  8. The method of claim 7, wherein the pre-trained artificial intelligence algorithm model is pre-trained based on the prompt loss, the inter loss, and the intra loss, and wherein the prompt loss, the inter loss, and the intra loss include a cross-entropy (CE) loss.
  9. An electronic device comprising: a memory; a modem; and a processor connected to the modem and the memory, wherein the processor is configured to receive image data and to output data including attribute and object information for the image data by providing the image data as input to a pre-trained artificial intelligence algorithm model, and wherein the pre-trained artificial intelligence algorithm model is trained using, as input, prompt data including soft prompts and hard prompts together with image data.
  10. The electronic device of claim 9, wherein the soft prompts include a soft prompt representing an attribute and a soft prompt representing an object, and the hard prompts include a hard prompt representing an attribute and an object simultaneously.
  11. The electronic device of claim 10, wherein the pre-trained artificial intelligence algorithm model is configured to, in a first structure, extract inter-fused text features by performing cross-attention using text features as a query and image features as a key and a value, and to extract inter-fused image features by performing cross-attention in the first structure using the image features as a query and the text features as a key and a value.
  12. The electronic device of claim 11, wherein the pre-trained artificial intelligence algorithm model is configured to, in a second structure, extract intra-fused text features by performing cross-attention using the extracted inter-fused text features as a query and the text features as a key and a value, and to extract intra-fused image features by performing cross-attention in the second structure using the inter-fused image features as a query and the image features as a key and a value.
  13. The electronic device of claim 12, wherein the pre-trained artificial intelligence algorithm model is configured to determine a prompt loss based on a similarity between the text features and the image features, determine an inter loss based on a similarity between the inter-fused text features and the inter-fused image features, and determine an intra loss based on a similarity between the intra-fused text features and the intra-fused image features.
  14. The electronic device of claim 13, wherein the pre-trained artificial intelligence algorithm model is pre-trained based on the prompt loss, the inter loss, and the intra loss, and wherein the prompt loss, the inter loss, and the intra loss include a cross-entropy (CE) loss.
  15. A program stored on a medium for analyzing data through an artificial intelligence algorithm executable by a processor, the program causing the processor to perform: receiving image data; and outputting data including attribute and object information for the image data by providing the image data as input to a pre-trained artificial intelligence algorithm model, wherein the pre-trained artificial intelligence algorithm model is configured to be trained using, as input, prompt data including soft prompts and hard prompts together with image data.
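The two-stage fusion and the three losses recited in claims 5 through 8 can be sketched as follows. This is a minimal illustrative sketch, not the claimed implementation: the feature dimensions, the use of three feature rows per modality (attribute, object, attribute-object), the cosine-similarity cross-entropy formulation, and the unweighted sum of losses are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key, value):
    # Scaled dot-product cross-attention: softmax(Q K^T / sqrt(d)) V
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ value

rng = np.random.default_rng(0)
d = 16
# Stand-in features; in the claims these come from the soft/hard prompts
# (text) and the image encoder (image): attribute, object, attribute-object.
text_feats = rng.normal(size=(3, d))
image_feats = rng.normal(size=(3, d))

# First structure (claim 5): inter-fusion across modalities.
inter_text = cross_attention(text_feats, image_feats, image_feats)
inter_image = cross_attention(image_feats, text_feats, text_feats)

# Second structure (claim 6): intra-fusion back into each modality.
intra_text = cross_attention(inter_text, text_feats, text_feats)
intra_image = cross_attention(inter_image, image_feats, image_feats)

def ce_loss(text, image):
    # Cross-entropy over cosine-similarity logits (an assumption): the
    # i-th text feature is treated as matching the i-th image feature.
    t = text / np.linalg.norm(text, axis=-1, keepdims=True)
    v = image / np.linalg.norm(image, axis=-1, keepdims=True)
    probs = softmax(t @ v.T, axis=-1)
    return -np.log(np.diag(probs)).mean()

prompt_loss = ce_loss(text_feats, image_feats)      # claim 7: prompt loss
inter_loss = ce_loss(inter_text, inter_image)       # claim 7: inter loss
intra_loss = ce_loss(intra_text, intra_image)       # claim 7: intra loss
total_loss = prompt_loss + inter_loss + intra_loss  # claim 8; equal weights assumed
```

In a trained system the three losses would be backpropagated to update the soft prompts and fusion layers; here the fused features and losses are simply computed forward once.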

Description

Method and device for training artificial intelligence algorithm model using fused multi-modal

The present disclosure describes a method and apparatus for training an artificial intelligence algorithm model using fused multi-modal data. Artificial intelligence algorithm models utilize various learning techniques. Among these are methods that allow a model to be trained without manually labeled data, a representative example being zero-shot learning. Zero-shot learning refers to a method that enables a model to predict classes or concepts it was not trained on. While traditional learning methods can identify new data only within pre-trained classes, zero-shot learning is characterized by its ability to identify classes different from the trained ones by generalizing through relationships or attributes between classes. Zero-shot learning leverages prior knowledge, which can take the form of verbal descriptions, attributes, or image features.

Furthermore, compositional zero-shot learning is a type of zero-shot learning in which a model learns new concepts through combinations of basic components. While conventional zero-shot learning simply performs predictions between classes, compositional zero-shot learning can predict new concepts through combinations of already learned basic components; for example, a model that has separately learned the attribute "red" and the object "apple" may recognize the unseen combination "red apple". Traditionally, research has focused on improving zero-shot learning by modifying prompt configurations or model structures. However, existing studies have shown limitations in accurately capturing subtle semantic nuances or precisely representing states and objects. Therefore, a method is required to implement a zero-shot learning approach effectively and efficiently.

A brief description of each drawing is provided to help better understand the drawings cited in the detailed description of the invention. FIG. 1 is a conceptual diagram illustrating the basic principles of artificial intelligence technology according to one embodiment of the present disclosure. FIG. 2 is a diagram illustrating the training process of an artificial intelligence algorithm model to which fused multi-modal data is applied, according to one embodiment of the present disclosure. FIG. 3 is a diagram showing the losses used in the training of an artificial intelligence algorithm model to which fused multi-modal data is applied, according to one embodiment of the present disclosure. FIG. 4 is a block diagram of an electronic device to which an artificial intelligence algorithm model according to one embodiment of the present disclosure is applied. FIG. 5 is a flowchart illustrating a method for performing artificial intelligence algorithm training according to one embodiment of the present disclosure.

The technical concept of the present invention is subject to various modifications and may have various embodiments. Specific embodiments are illustrated in the drawings and described in detail in the detailed description. However, this is not intended to limit the technical concept of the present invention to specific embodiments, and it should be understood to include all modifications, equivalents, and substitutions that fall within the scope of the technical concept of the present invention. In describing the technical concept of the present invention, detailed descriptions of related prior art are omitted if it is determined that such descriptions may unnecessarily obscure the essence of the invention. Furthermore, numbers used in this specification (e.g., first, second, etc.) are merely identification symbols to distinguish one component from another.
In addition, when a component is described in this specification as being "connected" or "coupled" to another component, it should be understood that the component may be directly connected or directly coupled to the other component, but unless otherwise specifically stated, it may also be connected or coupled through an intervening component. In addition, terms such as "~part," "~unit," "~device," and "~module" described in this specification refer to a unit that processes at least one function or operation, and may be implemented as hardware, software, or a combination of hardware and software, such as a processor, microprocessor, microcontroller, CPU (Central Processing Unit), GPU (Graphics Processing Unit), APU (Accelerated Processing Unit), DSP (Digital Signal Processor), ASIC (Application-Specific Integrated Circuit), or FPGA (Field-Programmable Gate Array), and may also be implemented in a form combined with memory that stores data necessary for processing at least one function or operation. Furthermore, it should be made clear that the classification of components in this specification is based merely on the primary function for which each component is responsible. That is, two or more components described below may be combined into a single component, or a single component may be divided into two or more components.
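The compositional zero-shot setting described in the background above can be illustrated with a toy sketch. Everything here is an illustrative assumption: the attribute and object names, the random stand-in embeddings (real systems would obtain them from text and image encoders), and the additive composition rule (real models, including the disclosed one, learn the composition, e.g. via prompts and attention).

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
# Hypothetical embedding tables for learned primitives.
attributes = {"red": rng.normal(size=d), "sliced": rng.normal(size=d)}
objects = {"apple": rng.normal(size=d), "tomato": rng.normal(size=d)}

def compose(attr_vec, obj_vec):
    # Simplest possible composition: normalized sum of primitive
    # embeddings (an assumption for illustration only).
    v = attr_vec + obj_vec
    return v / np.linalg.norm(v)

# Candidate attribute-object pairs, including combinations never seen
# together during training -- the compositional zero-shot label space.
pairs = [(a, o) for a in attributes for o in objects]
pair_embs = np.stack([compose(attributes[a], objects[o]) for a, o in pairs])

# A stand-in image embedding; in practice an image encoder produces it.
image_emb = compose(attributes["red"], objects["tomato"])

# Prediction: the attribute-object pair whose composed embedding is
# most similar to the image embedding.
scores = pair_embs @ image_emb
pred = pairs[int(np.argmax(scores))]
```

Because prediction ranges over all attribute-object compositions rather than a fixed class list, the model can label combinations it never observed, which is the property the disclosed prompt-fusion training aims to improve.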