CN-122020370-A - Visual language model classification self-adaption method based on transduction information maximization

CN 122020370 A

Abstract

The invention discloses a visual language model classification adaptation method based on transduction information maximization. It solves the prior-art problems of high computational cost and poor generality caused by unstable classifier initialization and model-parameter updates in zero-shot and few-shot scenarios, and effectively improves the classification performance of the classifier in few-shot scenarios. A joint class prototype is constructed from the true labels of the support set combined with the predicted class indication information of the query set, and serves as the initial parameter of the classifier. Taking the query set as the transduction object, a transduction-information-maximization objective function containing an entropy regularization term and a KL-divergence regularization term is constructed, and the classifier parameters are iteratively optimized to obtain the final model. Classification prediction is then performed on the query set with the final model and the results are output.

Inventors

  • LI YINGPING
  • ZOU YUTONG
  • GOU SHUIPING
  • LIU BO
  • WANG XINLIN
  • GUO ZHANG

Assignees

  • Xidian University (西安电子科技大学)

Dates

Publication Date
2026-05-12
Application Date
2026-01-19

Claims (10)

  1. A visual language model classification adaptation method based on transduction information maximization, comprising: extracting, using a pre-trained large visual language model, the visual feature vectors $f_i$ of the support set $\mathcal{S}$ and the query set $\mathcal{Q}$, as well as the text feature representation $t_k$ of each class, and constructing from the visual feature vectors a classifier predictive probability model parameterized by optimizable class prototype vectors; constructing a joint class prototype from the visual feature vectors of the support-set and query-set sample images, combining the true class labels of the support-set samples with the predicted class indication information of the query-set samples, and taking it as the initial parameter of the classifier predictive probability model; taking the query set $\mathcal{Q}$ as the transduction object, constructing a transduction-information-maximization objective function, and iteratively optimizing the initial parameters of the classifier predictive probability model according to the objective function to obtain the final classifier predictive probability model, wherein the transduction-information-maximization objective function comprises an entropy regularization term and a KL-divergence regularization term; and performing classification prediction on the query set based on the final classifier predictive probability model and outputting the classification result.
  2. The visual language model classification adaptation method based on transduction information maximization according to claim 1, wherein extracting the visual feature vectors $f_i$ of the support set $\mathcal{S}$ and the query set $\mathcal{Q}$ and the text feature representation $t_k$ of each class using the pre-trained large visual language model comprises: encoding each sample image in the support set $\mathcal{S}$ and the query set $\mathcal{Q}$ with the visual encoder of the large visual language model to obtain the corresponding visual feature vector $f_i$; and encoding each text description with the text encoder of the large visual language model to obtain the corresponding text feature representation $t_k$; wherein the visual feature vectors $f_i$ and the text feature representations $t_k$ lie in the same feature space and have been normalized.
  3. The visual language model classification adaptation method based on transduction information maximization according to claim 1, wherein the visual feature vector is expressed as $f_i = E_v(x_i)$ and the text feature representation as $t_k = E_t(d_k)$, where $x_i$ denotes the $i$-th input image, $E_v$ the visual encoder of the pre-trained large visual language model, $f_i$ the visual feature vector of the $i$-th input image, $d_k$ the text description of the $k$-th class, $E_t$ the text encoder of the pre-trained large visual language model, and $t_k$ the text feature representation of the $k$-th class.
  4. The visual language model classification adaptation method based on transduction information maximization according to claim 1, wherein the zero-shot predictive probability model of the pre-trained large visual language model is expressed as $p_0(y=k \mid x_i) = \exp(\tau f_i^{\top} t_k) / \sum_{j=1}^{K} \exp(\tau f_i^{\top} t_j)$, where $\tau$ denotes a temperature scaling factor, $f_i$ the visual feature vector of the $i$-th input image, $t_k$ the text feature representation of the $k$-th class, $p_0(y=k \mid x_i)$ the predictive probability that the pre-trained visual language model assigns the $i$-th input image to the $k$-th class, and $K$ the total number of classes.
  5. The visual language model classification adaptation method based on transduction information maximization according to claim 1, wherein the classifier predictive probability model is expressed as $p(Y=k \mid X=x_i; W) = \exp(\tau f_i^{\top} w_k) / \sum_{j=1}^{K} \exp(\tau f_i^{\top} w_j)$, where $Y$ is a random variable representing the class label, $k$ is the class index, $X$ is a random variable representing the input image, $x_i$ denotes the $i$-th input image, $W$ is the weight matrix of the classifier predictive probability model, $E_v$ is the visual encoder of the pre-trained large visual language model, $\tau$ is a temperature parameter, $f_i = E_v(x_i)$ is the visual feature vector of the $i$-th input image, $w_k$ is the feature prototype vector of the $k$-th class, and $p(Y=k \mid X=x_i; W)$ is the posterior probability that the $i$-th input image belongs to the $k$-th class.
  6. The visual language model classification adaptation method based on transduction information maximization of claim 1, wherein constructing the joint class prototype from the visual feature vectors of the support-set and query-set sample images, the true class labels of the support-set samples, and the predicted class indication information of the query-set samples, and taking it as the initial parameter of the classifier predictive probability model, comprises: performing zero-shot inference on the query set $\mathcal{Q}$ with the pre-trained large visual language model to obtain the predicted class indication information; and performing weighted aggregation over the visual feature vectors $f_i$ of the support set $\mathcal{S}$ and the query set $\mathcal{Q}$ based on the true class labels and the predicted class indication information to obtain the joint class feature prototype, which is taken as the initial parameter of the classifier predictive probability model.
  7. The visual language model classification adaptation method based on transduction information maximization according to claim 6, wherein in the few-shot learning scenario the few-shot joint class prototype is expressed as $w_k^{(0)} = \left(\sum_{x_i \in \mathcal{S}} y_{ik} f_i + \sum_{x_i \in \mathcal{Q}} \hat{y}_{ik} f_i\right) / \left(\sum_{x_i \in \mathcal{S}} y_{ik} + \sum_{x_i \in \mathcal{Q}} \hat{y}_{ik}\right)$, where $\mathcal{S}$ denotes the support set, $\mathcal{Q}$ the query set, $y_{ik}$ the true label of input image $x_i$ in the support set for the $k$-th class, $f_i$ the visual feature vector of the $i$-th input image, $\hat{y}_{ik}$ the hard-coded prediction of input image $x_i$ in the query set for the $k$-th class, $K$ the total number of classes, and $w_k^{(0)}$ the initial joint class prototype vector of the $k$-th class in the few-shot scenario.
  8. The visual language model classification adaptation method based on transduction information maximization according to claim 6, wherein in the zero-shot learning scenario the zero-shot joint class prototype is expressed as $w_k^{(0)} = \left(\sum_{x_i \in \mathcal{Q}} \hat{p}_{ik} f_i\right) / \left(\sum_{x_i \in \mathcal{Q}} \hat{p}_{ik}\right)$, where $\mathcal{Q}$ denotes the query set, $\hat{p}_{ik}$ the soft-coded prediction of input image $x_i$ in the query set for the $k$-th class, $f_i$ the visual feature vector of the $i$-th input image, $K$ the total number of classes, and $w_k^{(0)}$ the initial class prototype vector of the $k$-th class in the zero-shot scenario.
  9. The visual language model classification adaptation method based on transduction information maximization according to claim 1, wherein the transduction-information-maximization objective function is expressed as $\mathcal{L} = \lambda_1\, \mathrm{CE}_{\mathcal{S}} - I(f_{\mathcal{Q}}; y_{\mathcal{Q}}) + \lambda_2\, \mathrm{KL}(p \,\|\, p_0)$, where $\lambda_1$ denotes the first trade-off coefficient, $\mathrm{CE}_{\mathcal{S}}$ the cross-entropy loss on the support set $\mathcal{S}$, $I(f_{\mathcal{Q}}; y_{\mathcal{Q}})$ the mutual information term on the query set $\mathcal{Q}$ between its visual feature set $f_{\mathcal{Q}}$ and its predicted label set $y_{\mathcal{Q}}$, $\lambda_2$ the second trade-off coefficient, $\mathrm{KL}$ the KL divergence, $p$ the output probability distribution of the classifier predictive probability model, and $p_0$ the predictive probability distribution of the pre-trained large visual language model in the zero-shot scenario.
  10. The visual language model classification adaptation method based on transduction information maximization according to claim 1, wherein iteratively optimizing the initial parameters of the classifier predictive probability model according to the objective function to obtain the final classifier predictive probability model comprises: introducing auxiliary variables $z$ and iteratively solving the transduction-information-maximization objective function by the alternating direction method of multipliers; and, in each iteration, alternately updating the auxiliary variables $z$ and the classifier predictive probability model parameters $W$.
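Claims 2-5 describe encoding images and class texts into a shared, normalized feature space and scoring each image against each class with a temperature-scaled softmax over inner products. A minimal NumPy sketch of that scoring step, assuming the encoder outputs are already available as arrays (the names `visual_feats`, `text_feats`, and `tau` are illustrative, not from the patent):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Claim 2 requires image and text features in the same space, normalized.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def zero_shot_probs(visual_feats, text_feats, tau=100.0):
    """Zero-shot predictive probabilities in the form of claim 4:
    p(y=k | x_i) = softmax over temperature-scaled similarities f_i^T t_k."""
    f = l2_normalize(visual_feats)   # (N, d) image features f_i
    t = l2_normalize(text_feats)     # (K, d) class text features t_k
    return softmax(tau * (f @ t.T))  # (N, K) class probabilities
```

Replacing `text_feats` by a matrix of learnable class prototypes yields the classifier predictive probability model of claim 5 with the same functional form.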
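Claims 6-8 initialize the classifier with joint class prototypes: a weighted average of support features (true one-hot labels) and query features (hard one-hot predictions in the few-shot case of claim 7, soft zero-shot probabilities in the zero-shot case of claim 8). A hedged sketch under those assumptions; the function and argument names are illustrative:

```python
import numpy as np

def joint_prototypes(support_feats, support_labels, query_feats, query_probs,
                     hard=True):
    """Joint class prototypes per claims 6-8: per-class weighted mean of
    support features (weighted by true labels y_ik) and query features
    (weighted by predicted class indication information)."""
    K = query_probs.shape[1]
    if support_feats is None:
        # Zero-shot scenario (claim 8): only soft query predictions contribute.
        y_s = np.zeros((0, K))
        f_s = np.zeros((0, query_feats.shape[1]))
    else:
        y_s = np.eye(K)[support_labels]  # one-hot true support labels
        f_s = support_feats
    if hard:
        # Few-shot scenario (claim 7): hard-coded (argmax) query predictions.
        y_q = np.eye(K)[query_probs.argmax(axis=1)]
    else:
        y_q = query_probs                # soft-coded predictions
    num = y_s.T @ f_s + y_q.T @ query_feats   # (K, d) weighted feature sums
    den = y_s.sum(axis=0) + y_q.sum(axis=0)   # (K,) total weight per class
    return num / np.clip(den, 1e-12, None)[:, None]
```

The `np.clip` guard is a practical addition to avoid division by zero when a class receives no samples; the patent text does not specify this edge case.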
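The objective of claim 9 combines support-set cross-entropy, a query-set mutual-information term, and a KL term anchoring the classifier to the zero-shot prior. A sketch assuming the usual empirical estimate of the mutual information (marginal entropy minus mean conditional entropy); the names `lam1`/`lam2` stand in for the two trade-off coefficients and are illustrative:

```python
import numpy as np

def entropy(p, axis=-1):
    # Shannon entropy with a small epsilon for numerical safety.
    return -(p * np.log(p + 1e-12)).sum(axis=axis)

def tim_objective(probs_s, labels_s, probs_q, probs_zs, lam1=1.0, lam2=1.0):
    """Transduction-information-maximization objective in the form of claim 9:
    lam1 * CE(support) - I(query features; query labels) + lam2 * KL(p || p0)."""
    # Cross-entropy of the classifier on labeled support samples.
    ce = -np.log(probs_s[np.arange(len(labels_s)), labels_s] + 1e-12).mean()
    # Empirical mutual information: marginal entropy minus mean conditional entropy.
    mi = entropy(probs_q.mean(axis=0)) - entropy(probs_q).mean()
    # KL divergence from the classifier output to the zero-shot prior p0.
    kl = (probs_q * (np.log(probs_q + 1e-12)
                     - np.log(probs_zs + 1e-12))).sum(axis=1).mean()
    return lam1 * ce - mi + lam2 * kl
```

Minimizing this value rewards confident per-sample query predictions whose class marginal stays balanced, while the KL term keeps the classifier close to the zero-shot distribution.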
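Claim 10 specifies solving the objective by the alternating direction method of multipliers with auxiliary variables. The exact variable splitting is not given in this excerpt, so the following is only a simple alternating scheme in the same spirit, not the patented ADMM solver: step (1) updates auxiliary query posteriors given the prototypes, step (2) re-estimates the prototypes from support labels and those posteriors.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def adapt_prototypes(W0, f_s, y_s_onehot, f_q, tau=10.0, iters=20):
    """Alternating optimization sketch for claim 10 (not the ADMM splitting):
    alternately update auxiliary query posteriors z and prototypes W."""
    W = W0.copy()
    z = None
    for _ in range(iters):
        # (1) Auxiliary variables: soft posteriors of query samples under W.
        z = softmax(tau * (f_q @ W.T))
        # (2) Prototype update: weighted mean of support and query features.
        num = y_s_onehot.T @ f_s + z.T @ f_q
        den = y_s_onehot.sum(axis=0) + z.sum(axis=0)
        # Re-normalizing prototypes is a design choice here, not from the patent.
        W = l2_normalize(num / den[:, None])
    return W, z
```

After convergence, the final posteriors `z` give the classification prediction for the query set, matching the last step of claim 1.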

Description

Visual language model classification adaptation method based on transduction information maximization

Technical Field

The invention relates to the technical field of artificial intelligence and computer vision, and in particular to a visual language model classification adaptation method based on transduction information maximization.

Background

With the rapid development of artificial intelligence technology, computer vision methods based on deep learning have made remarkable progress in tasks such as image classification and object recognition. In recent years, Visual Language Models (VLMs), pre-trained on large-scale image-text data by jointly modeling visual information and natural language semantics, have acquired strong cross-modal representation and generalization ability; typical representatives include models such as CLIP. Such models show good potential in zero-shot classification and few-shot learning tasks and have become important foundation models in current research and applications. However, in practical application scenarios, the target task often suffers from scarce or entirely missing labeled samples and from obvious deviation between the data distribution and that of the pre-training stage, so directly running inference with the pre-trained visual language model easily degrades classification performance. To improve model adaptability on target data, researchers have proposed various downstream adaptation methods based on visual language models. One class of prior-art methods fine-tunes the model or performs prompt learning with a small number of labeled samples to guide the model to better fit the target task. Such methods typically rely on the true label information of the support set and achieve some performance improvement in few-shot scenarios.
Meanwhile, some of these methods need to update the parameters of the visual language model, which has drawbacks in computational cost, stability, and model reusability, and is unsuitable for application scenarios with frozen or black-box models. Another class of methods focuses on adaptive learning at test time, using unlabeled data in the target domain to adaptively optimize the model and mitigate the performance degradation caused by distribution shift. For example, some methods adjust the model by means of cluster analysis and pseudo-label generation, exploiting the structural information of unlabeled samples. Although such methods have certain advantages in the unlabeled case, their performance depends heavily on the accuracy of the pseudo labels or clustering results; under complex data distributions or heavy noise, errors easily accumulate, affecting model stability and the final classification result. In addition, in the model initialization stage, some existing methods construct the classifier only from the relatively small support set and cannot fully exploit the overall distribution information of the target query data, so the initialization parameters deviate from the true target distribution and limit the effect of subsequent adaptive optimization. In summary, existing classification methods based on large visual language models still have limitations in zero-shot and few-shot scenarios. On one hand, these methods generally depend strongly on labeled samples and adapt poorly to the target data distribution in unlabeled or weakly supervised settings; on the other hand, existing classifier initialization schemes are often insufficiently stable and can hardly capture the true class structure of the target data, which harms subsequent prediction performance.
In addition, some methods need to update the pre-trained visual language model parameters, which incurs high computational cost and reduces the generality and scalability of the method to a certain extent.

Disclosure of Invention

The invention provides a visual language model classification adaptation method based on transduction information maximization, which solves the prior-art problems of over-strong dependence on labeled samples, unstable classifier initialization, and the high computational cost and poor generality caused by model parameter updates in zero-shot and few-shot scenarios, and effectively improves the initial discrimination ability and stability of the classifier in few-shot scenarios. The method comprises the following steps: extracting, using a pre-trained large visual language model, the visual feature vectors $f_i$ of the support set $\mathcal{S}$ and the query set $\mathcal{Q}$, as well as the text feature representation $t_k$ of each class, and, based on the visual feature vectors, constructing a classifier predictive probability model that takes an optimizable class prototype vector as its parameter.