KR-20260064340-A - Method And Device for Task-Specific Knowledge Distillation based on Vision-Language Model
Abstract
A method and apparatus for knowledge distillation based on a vision-language model are disclosed. According to one aspect of the present disclosure, a computer-implemented method is provided, comprising: converting one or more texts into one or more text representations using a text encoder of a vision-language model; converting one or more images into one or more first image representations using an image encoder of the vision-language model; converting the one or more images into one or more second image representations using a student model; compressing each of the one or more text representations and the one or more first image representations; and updating parameters of the student model using a loss derived based on the one or more text representations, the one or more compressed text representations, the one or more first image representations, the one or more compressed first image representations, and the one or more second image representations.
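The data flow summarized in the abstract can be sketched as follows (a minimal illustration only: the CLIP-like 512-dimensional shared teacher space, the 128-dimensional student space, and the random matrices standing in for the encoders and learnable projection layers are all assumptions, not specified by this disclosure):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: the teacher's shared text/image feature space is
# 512-d; the lightweight student produces lower-dimensional 128-d features.
D_TEACHER, D_STUDENT = 512, 128
N = 4  # batch of (text, image) pairs

# Stand-ins for the frozen teacher encoders and the trainable student
# (real encoders are neural networks; random features suffice to show shapes).
text_reps = rng.normal(size=(N, D_TEACHER))           # text encoder output
image_reps_teacher = rng.normal(size=(N, D_TEACHER))  # image encoder output
image_reps_student = rng.normal(size=(N, D_STUDENT))  # student model output

# Compression: projection layers (here plain random matrices) map the
# teacher's text and image representations down to the student's dimension.
W_text = rng.normal(size=(D_TEACHER, D_STUDENT)) / np.sqrt(D_TEACHER)
W_image = rng.normal(size=(D_TEACHER, D_STUDENT)) / np.sqrt(D_TEACHER)
compressed_text = text_reps @ W_text
compressed_image = image_reps_teacher @ W_image

# All three representation families are now dimensionally comparable, so a
# distillation loss over them can be used to update the student's parameters.
print(compressed_text.shape, compressed_image.shape, image_reps_student.shape)
```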
Inventors
- 장진성
- 마춘페이
- 이병원
Assignees
- 에스케이텔레콤 주식회사
Dates
- Publication Date
- 20260507
- Application Date
- 20241031
Claims (9)
- A computer-implemented method for distilling knowledge of a vision-language model including an image encoder and a text encoder into a student model, the method comprising: converting, using the text encoder, one or more texts into one or more text representations in a first feature space shared by the image encoder and the text encoder; converting, using the image encoder, one or more images corresponding to at least some of the one or more texts into one or more first image representations in the first feature space; converting, using the student model, the one or more images into one or more second image representations in a second feature space having a lower dimensionality than the first feature space; compressing each of the one or more text representations and the one or more first image representations to have the same number of dimensions as the second feature space; and updating parameters of the student model using a loss derived based on the one or more text representations, the one or more compressed text representations, the one or more first image representations, the one or more compressed first image representations, and the one or more second image representations.
- The method of claim 1, wherein the compressing comprises: projecting, using a first projection layer, the one or more text representations into a third feature space having the same number of dimensions as the second feature space; and projecting, using a second projection layer, the one or more first image representations into a fourth feature space having the same number of dimensions as the second feature space, and wherein the updating comprises updating the parameters of the student model, parameters of the first projection layer, and parameters of the second projection layer based on the loss.
- The method of claim 1, wherein the loss comprises a visual knowledge distillation loss term that measures the discrepancy between the relational structure of the second image representations corresponding to different images and the relational structure of the compressed first image representations corresponding to the different images.
- The method of claim 1, wherein the loss comprises a linguistic knowledge distillation loss term that measures the discrepancy between the similarity distribution of the one or more text representations with respect to a first image representation corresponding to a specific image and the similarity distribution of the one or more compressed text representations with respect to a second image representation corresponding to the specific image.
- The method of claim 1, wherein the student model is a task-specific classification model that classifies an input image into one or more classes defined for a specific task, and wherein the one or more second image representations are intermediate representations extracted, for each of the one or more images, by an encoder included in the student model.
- The method of claim 5, wherein the loss comprises a task-specific classification loss term derived based on ground-truth labels assigned to the one or more images and predictions of the student model for the one or more images.
- The method of claim 5, wherein each of the one or more texts comprises a natural-language sentence describing an image corresponding to one of the one or more classes.
- An apparatus comprising: a memory storing instructions; and at least one processor, wherein the at least one processor, by executing the instructions: converts, using the text encoder of a vision-language model including an image encoder and a text encoder, one or more texts into one or more text representations in a first feature space shared by the image encoder and the text encoder; converts, using the image encoder, one or more images corresponding to at least some of the one or more texts into one or more first image representations in the first feature space; converts, using a student model, the one or more images into one or more second image representations in a second feature space having a lower dimensionality than the first feature space; compresses each of the one or more text representations and the one or more first image representations to have the same number of dimensions as the second feature space; and updates parameters of the student model using a loss derived based on the one or more text representations, the one or more compressed text representations, the one or more first image representations, the one or more compressed first image representations, and the one or more second image representations.
- A computer program stored on a computer-readable recording medium to execute the processes included in the method according to any one of claims 1 through 7.
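The visual and linguistic knowledge distillation loss terms of claims 3 and 4 can be sketched as follows. This is an illustrative interpretation only, not the patent's exact formulation: "relational structure" is taken here as a pairwise cosine-similarity matrix, "similarity distribution" as a softmax over image-text similarities, and the MSE/KL choices and function names are assumptions.

```python
import numpy as np

def _normalize(x):
    # L2-normalize rows so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def _softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def visual_kd_loss(student_imgs, compressed_teacher_imgs):
    """Claim 3 (sketch): match the relational structure -- pairwise cosine
    similarities between different images -- of student and teacher features."""
    s = _normalize(student_imgs) @ _normalize(student_imgs).T
    t = _normalize(compressed_teacher_imgs) @ _normalize(compressed_teacher_imgs).T
    return float(np.mean((s - t) ** 2))

def linguistic_kd_loss(student_imgs, teacher_imgs, texts, compressed_texts):
    """Claim 4 (sketch): for each image, match the student's similarity
    distribution over compressed texts to the teacher's over original texts."""
    p_teacher = _softmax(_normalize(teacher_imgs) @ _normalize(texts).T)
    p_student = _softmax(_normalize(student_imgs) @ _normalize(compressed_texts).T)
    # KL(teacher || student), averaged over images.
    return float(np.mean(np.sum(p_teacher * np.log(p_teacher / p_student), axis=-1)))

# Identical inputs produce zero discrepancy, as expected.
rng = np.random.default_rng(1)
imgs = rng.normal(size=(4, 128))
txts = rng.normal(size=(6, 128))
print(visual_kd_loss(imgs, imgs))                   # 0.0
print(linguistic_kd_loss(imgs, imgs, txts, txts))   # 0.0
```

In a training loop, a weighted sum of these terms (plus the task-specific classification loss of claim 6) would drive gradient updates of the student and the two projection layers, per claim 2.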
Description
The present disclosure relates to a method and apparatus for knowledge distillation based on a vision-language model. The following description merely provides background information related to the present embodiment and does not constitute prior art.

Computer vision technology is widely applied across various fields. In particular, deep-learning-based models achieve outstanding performance in diverse computer vision tasks. However, most of these models are large in scale and complex in structure, with numerous parameters that require significant computational resources and memory, which poses significant limitations for real-world applications. For instance, it is extremely difficult to execute large models efficiently in resource-constrained environments such as mobile devices, IoT devices, or edge computing. Consequently, research on lightweight models that require minimal computation and storage is actively underway.

The need for lightweight models is particularly pronounced in applications with strict real-time requirements. For instance, lightweight models are required for fast and accurate processing in applications that demand real-time action recognition, such as video surveillance, sports analysis, and medical monitoring. However, because of the performance degradation that occurs during compression, making a model lightweight while maintaining high accuracy remains a challenging task.

Knowledge Distillation (KD) is a representative technique for model lightweighting. Knowledge distillation is a learning method that transfers knowledge from a large, high-performance teacher model to a lightweight student model. By compressing and transferring the teacher model's complex knowledge to the student model, it reduces the model's size and complexity while achieving results close to the teacher model's performance.
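The teacher-to-student transfer described above is commonly implemented by training the student to match the teacher's temperature-softened output distribution. The following is a generic sketch of that classic formulation, not this disclosure's specific method; the temperature value and logits are illustrative.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature T > 1 flattens the distribution, exposing the teacher's
    # relative preferences among non-top classes ("dark knowledge").
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, T=4.0):
    """Cross-entropy between the teacher's softened class distribution and
    the student's (the 'soft label' part of classic knowledge distillation)."""
    p = softmax(teacher_logits, T)  # soft targets from the teacher
    q = softmax(student_logits, T)
    return float(-np.sum(p * np.log(q)))

teacher = [8.0, 2.0, 1.0]
student_good = [7.5, 2.2, 0.9]  # mimics the teacher's relative preferences
student_bad = [1.0, 8.0, 2.0]   # disagrees with the teacher
print(distillation_loss(teacher, student_good)
      < distillation_loss(teacher, student_bad))  # True
```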
Conventional knowledge distillation techniques consist of a two-step process: a large teacher model is first trained on a specific task, and this knowledge is then transferred to a small student model in a separate distillation stage. In this approach, significant time and computational resources are consumed, because the teacher model must be retrained and re-distilled into the student model whenever a new task is given.

FIG. 1 is a block diagram schematically illustrating an exemplary student model to which the present disclosure may be applied. FIG. 2 is an exemplary diagram schematically illustrating a vision-language-model-based knowledge distillation framework according to one embodiment of the present disclosure. FIG. 3 is an illustrative diagram referenced to explain visual knowledge distillation according to one embodiment of the present disclosure. FIG. 4 is an illustrative diagram referenced to explain linguistic knowledge distillation according to one embodiment of the present disclosure. FIG. 5 is a flowchart illustrating a vision-language-model-based knowledge distillation method according to one embodiment of the present disclosure. FIG. 6 is a schematic block diagram of an exemplary computing device that can be used to implement the devices and methods described in the present disclosure.

Some embodiments of the present disclosure are described in detail below with reference to the exemplary drawings. In assigning reference numerals to the components of each drawing, the same components are given the same reference numerals whenever possible, even when they appear in different drawings. Furthermore, in describing the present disclosure, detailed descriptions of related known components or functions are omitted where such descriptions could obscure the essence of the present disclosure.
In describing the components of the embodiments according to the present disclosure, symbols such as first, second, i), ii), a), and b) may be used. These symbols are intended only to distinguish one component from another; the nature, order, or sequence of the components is not limited by them. When a part of the specification is described as 'comprising' or 'having' a component, this means that, unless explicitly stated otherwise, the part does not exclude other components but may include additional components.

The detailed description set forth below, together with the accompanying drawings, is intended to describe exemplary embodiments of the present disclosure and is not intended to represent the only embodiments in which the present disclosure can be practiced.

FIG. 1 is a block diagram schematically illustrating an exemplary computer vision model to which the present disclosure can be applied. The present disclosure proposes a technique for distilling knowledge of a Vision-Language Model (VLM) into a student model. Here, the student model may be a computer vision