CN-121982382-A - Open-vocabulary multi-label image classification method based on hierarchical dual-granularity alignment
Abstract
The invention relates to an open-vocabulary multi-label image classification method based on hierarchical dual-granularity alignment, comprising the following steps: selecting a multi-label image dataset; constructing a prompt template and using it to process the image labels; obtaining visual embeddings and text embeddings from the images and the text labels; deriving corresponding feature vectors from the visual embeddings and the text embeddings, respectively; and using these feature vectors to obtain the final multi-label classification result of each image. Compared with traditional multi-label classification, the method of the invention classifies images more accurately.
Inventors
- HUANG CHENG
- HONG MINGJIAN
- LIU BEIYAN
- SHI HUAZHAN
- ZHANG XIN
- WANG MENGYAO
- QIU KEZHEN
- XU LING
- GE YONGXIN
- YANG MENGNING
Assignees
- Chongqing University (重庆大学)
Dates
- Publication Date
- 20260505
- Application Date
- 20260114
Claims (4)
- 1. An open-vocabulary multi-label image classification method based on hierarchical dual-granularity alignment, characterized by comprising the following steps:
S1: select a multi-label image dataset D containing w non-duplicate class labels, where x_i denotes the i-th image sample, y_i denotes the image labels corresponding to the i-th image sample, N denotes the total number of image samples, and each data item carries its own classification information; split D into a seen-label dataset Ds, containing d categories in total, and an unseen-label dataset Du, whose category labels do not overlap with those of Ds; construct a classification model M comprising a data encoding module, a visual-semantic bidirectional interaction module, and a hierarchical dual-granularity alignment module; the data encoding module comprises a visual encoder for encoding image data, a text encoder for encoding text data, and a category prototype extraction module for clustering the outputs of the text encoder; the visual-semantic bidirectional interaction module extracts features from the embedded features; the hierarchical dual-granularity alignment module computes the label prediction scores of an image.
S2: construct a prompt template P; pass the image labels through P and input the results into the text encoder to obtain the category semantic embedding set T of all image samples; apply a clustering operation to T to obtain several generalized category embedding sets, each of which is defined as a class prototype embedding: c_k = Cluster(T), k = 1, …, K, where c_k denotes the class prototype embedding of the k-th generalized category, K denotes the total number of categories after clustering, and Cluster(·) denotes the clustering operation.
S3: select from the seen-label dataset Ds the i-th image annotation y_i and the corresponding i-th image sample x_i; input x_i into the visual encoder and output its visual embedding F_i.
S4: input F_i and T jointly into the visual-semantic bidirectional interaction module for feature extraction, obtaining the class-specific features and the category-level visual features.
S5: input the class-specific features, the category-level visual features, and the class prototype embeddings into the hierarchical dual-granularity alignment module and compute the label prediction score s_j of x_i for each class j; define a score threshold θ_j for each class label in D and compare s_j with θ_j: when s_j ≥ θ_j, x_i is deemed to belong to the corresponding category and recorded as 1; otherwise it is recorded as 0; after s_j has been compared with all d score thresholds, the multi-label classification result ŷ_i of x_i is obtained.
S6: traverse D, repeating S3-S5, to obtain the multi-label classification prediction result set of all image samples.
S7: construct the loss function L of the classification model based on the asymmetric loss; train M with the predictions and the ground-truth labels as input, update the parameters of M by gradient-descent back-propagation, and stop training when the loss function converges or the maximum number of iterations is reached, yielding the trained model M'; L is computed as
L = -Σ_{j=1}^{d} [ y_{i,j} (1 - ŷ_{i,j})^{λ+} log ŷ_{i,j} + (1 - y_{i,j}) (ŷ_{i,j})^{λ-} log(1 - ŷ_{i,j}) ],
where λ+ and λ- both denote weights, d denotes the total number of categories defined as seen labels, ŷ_{i,j} denotes the multi-label prediction result of x_i for the j-th class, and y_{i,j} denotes the label of x_i for the j-th class.
S8: select an image A to be classified together with its candidate labels; pass the candidate labels of A through P to obtain A'; input A and A' into M' to obtain the multi-label classification prediction result of A.
- 2. The open-vocabulary multi-label image classification method based on hierarchical dual-granularity alignment according to claim 1, wherein the category semantic embedding set T in S2 is obtained as follows: for any image label l_j, category semantic extraction is performed with the prompt template to obtain the category semantic embedding t_j = E_t(P(l_j)) ∈ R^e, where P(l_j) denotes the prompt filled with the label l_j, E_t denotes the text encoder, and e denotes the dimension of the semantic embedding; all labels are traversed to obtain the category semantic embedding of each label, and all category semantic embeddings together constitute T.
- 3. The open-vocabulary multi-label image classification method based on hierarchical dual-granularity alignment according to claim 2, wherein the visual embedding F_i in S3 is obtained as follows: S3-1, divide x_i into n local patches, and denote x_i as I ∈ R^{C×H×W}, where C denotes the number of channels, H denotes the image height, and W denotes the image width; S3-2, combine the patch sequence and the global token as the input of the visual encoder, outputting the visual embedding F_i = E_v(I) = [f_i^g, f_i^1, …, f_i^n], where f_i^g ∈ R^e denotes the global embedding of x_i, f_i^k ∈ R^e denotes the k-th local embedding of x_i, E_v denotes the visual encoder, and e denotes the dimension of the visual embedding.
- 4. The open-vocabulary multi-label image classification method based on hierarchical dual-granularity alignment according to claim 3, wherein the label prediction score s_j of x_i in S5 is obtained as follows: S5-1, compute the cosine similarity s_j^cls between the category-level visual feature and the category semantic embedding t_j, the cosine similarity s_j^loc between the class-specific local features and t_j, and the cosine similarity s_j^pro between the class-specific features and the class prototype embeddings, where f_{i,k} denotes the class-specific feature of the k-th local patch, top_u(·) denotes the average of the first u values taken in descending order, and top_v(·) denotes the average of the first v values taken in descending order; S5-2, compute the mean of s_j^cls, s_j^loc, and s_j^pro: s_j = (s_j^cls + s_j^loc + s_j^pro) / 3, and take s_j as the final label prediction score.
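The dual-granularity scoring and thresholding of S5 can be sketched in a few lines of NumPy. This is a minimal illustration, not the patented implementation: the function names, the default values of u and v, and the exact pairing of features in the three similarity terms (in particular which features are matched against the class prototypes) are assumptions made for illustration.

```python
import numpy as np

def cos(a, b):
    # Cosine similarity between two vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def topk_mean(vals, k):
    # Average of the first k values in descending order (claim 4, S5-1).
    return float(np.sort(np.asarray(vals))[::-1][:k].mean())

def label_score(cls_feat, loc_feats, sem_emb, prototypes, u=2, v=2):
    """Label prediction score s_j for one image and one class (S5).

    cls_feat   : (e,)   category-level visual feature for class j
    loc_feats  : (n, e) class-specific features of the n local patches
    sem_emb    : (e,)   category semantic embedding t_j
    prototypes : (K, e) class prototype embeddings
    """
    s_cls = cos(cls_feat, sem_emb)                                # category level
    s_loc = topk_mean([cos(f, sem_emb) for f in loc_feats], u)    # local patches
    s_pro = topk_mean([cos(cls_feat, p) for p in prototypes], v)  # prototype level
    return (s_cls + s_loc + s_pro) / 3.0                          # S5-2: mean of the three

def to_multilabel(scores, thresholds):
    # S5 decision rule: score >= threshold -> 1, else 0, per class.
    return [1 if s >= t else 0 for s, t in zip(scores, thresholds)]
```

With identical unit vectors everywhere each similarity equals 1 and the score is exactly 1; in practice u, v, and the per-class thresholds θ_j would be chosen on the seen-label dataset Ds.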
Description
Open-vocabulary multi-label image classification method based on hierarchical dual-granularity alignment
Technical Field
The invention relates to the technical field of computer vision, and in particular to an open-vocabulary multi-label image classification method based on hierarchical dual-granularity alignment.
Background
Open-vocabulary multi-label image classification (OV-MLIC) is an emerging task in computer vision that aims to recognize unseen categories in real scenes by exploiting vision-language pre-training (VLP) models such as CLIP. The generalization ability of deep learning models on image classification tasks is often limited by the difficulty of acquiring task-specific annotated data. To address this challenge, the concept of open-vocabulary (OV) classification was proposed, with the goal of recognizing a wide variety of category sets by fusing external semantic knowledge. Unlike traditional zero-shot learning (ZSL) methods, which are typically trained on a fixed, predefined set of categories, open-vocabulary classification methods exploit rich, vision-related language representations and support recognition in an open category space. These methods typically cast the open-vocabulary classification task as a visual-semantic alignment problem or a sample-category matching problem. However, the inherent information gap between visual and semantic representations poses a significant challenge to open-vocabulary classification. Vision-language pre-training models (e.g., the contrastive language-image pre-training model CLIP) show great potential on open-vocabulary tasks, primarily owing to their strong visual-text modality alignment, which enables accurate recognition of unseen categories. In single-label open-vocabulary classification scenarios, using the output of the CLIP visual encoder as a global visual feature has become a widely adopted and empirically validated practice.
However, images in real scenes often contain multiple objects, which means the data is inherently multi-label. In such scenarios, using global visual features may cause feature coupling between different classes, confusing sample-category matching and ultimately degrading classification performance. Existing OV-MLIC methods adopt a compromise strategy for this problem, restricting visual-semantic matching to local regions with low category co-occurrence frequency. These approaches, however, still have a fundamental limitation in addressing category coupling: even within highly localized regions, interactions or co-occurrences between different categories may still occur. A more direct solution is to use semantic class information to decouple class-specific features from the visual representation. Yet multi-label image datasets often suffer from severe class imbalance, which lets high-frequency classes dominate the optimization process and biases the model towards them; such bias makes recognition of unseen classes overly dependent on the high-frequency classes rather than on the complete category set. In addition, because of significant occlusion and scale differences between categories within a single image, the extracted features differ markedly across samples. Such inconsistency degrades the accuracy and stability of visual-semantic alignment, making the final image classification inaccurate.
Disclosure of Invention
In view of the problems in the prior art, the invention aims to improve the classification accuracy and generalization of open-vocabulary multi-label image classification.
To solve the above technical problems, the invention adopts the following technical scheme: an open-vocabulary multi-label image classification method based on hierarchical dual-granularity alignment, comprising the following steps. S1: select a multi-label image dataset D containing w non-duplicate class labels, where x_i denotes the i-th image sample, y_i denotes the image labels corresponding to the i-th image sample, N denotes the total number of image samples, and each data item carries its own classification information; split D into a seen-label dataset Ds, containing d categories in total, and an unseen-label dataset Du, whose category labels do not overlap with those of Ds. Construct a classification model M comprising a data encoding module, a visual-semantic bidirectional interaction module, and a hierarchical dual-granularity alignment module. The data encoding module comprises a visual encoder, a text encoder, and a category prototype extraction module, wherein the visual encoder encodes image data, the text encoder encodes text data, and the category prototype extraction module clusters the outputs of the text encoder.
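The category prototype extraction step described above (prompt each label, encode it with the text encoder, and cluster the resulting embeddings into generalized class prototypes) can be sketched as follows. This is a hedged illustration: the prompt wording, the stub random-projection text encoder standing in for a real one such as CLIP's, and the choice of plain K-means (the patent does not name a specific clustering algorithm) are all assumptions.

```python
import numpy as np

def prompt(label):
    # Prompt template P from S2; the exact wording is an assumption.
    return f"a photo of a {label}"

def encode_text(texts, e=32, seed=0):
    # Stub text encoder: random embeddings standing in for a real
    # encoder such as CLIP's E_t, purely so the sketch runs end to end.
    rng = np.random.default_rng(seed)
    return rng.normal(size=(len(texts), e))

def kmeans(X, K, iters=50, seed=0):
    # Plain K-means as the clustering operation of the category
    # prototype extraction module; each center is one class prototype.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for k in range(K):
            pts = X[assign == k]
            if len(pts):
                centers[k] = pts.mean(0)
    return centers

labels = ["cat", "dog", "car"]
T = encode_text([prompt(l) for l in labels])  # category semantic embedding set T
prototypes = kmeans(T, K=2)                   # generalized class prototype embeddings
```

In the method itself, T would come from the real text encoder over all w labels, and K (the number of categories after clustering) would be a design choice smaller than w, so that each prototype summarizes a group of semantically related labels.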