CN-121999276-A - Memory-enhanced multi-layer prompt learning method for few-shot image classification
Abstract
The invention discloses a memory-enhanced multi-layer prompt learning method for few-shot image classification. The method comprises: constructing learnable global and multi-layer local text prompts for learning global and local representations of images; constructing a dynamic cross-sample multi-layer feature memory bank; enhancing the raw local features with a sparse cross-attention enhancement module to generate highly discriminative features; learning the constructed text prompts by jointly optimizing a global classification loss and a local classification loss; and finally fusing the prediction scores of the global and local branches to classify few-shot images. The invention integrates a cross-sample memory bank into the prompt learning framework, remedying the bottleneck of directly learning from low-quality raw local features. The memory enhancement mechanism fuses high-quality local features inside the model, avoiding external dependencies and suppressing background noise, so that the model learns a stable, highly discriminative decision boundary for the local prompts.
Inventors
- MA ZHONGCHEN
- WANG AICHEN
- ZHANG LIYUAN
- CHENG KEYANG
- GOU JIANPING
- SONG JIAHAO
- GAO HUIPING
- MAO QIRONG
- REN QINGHUA
- JIA HONGJIE
- WAN JIKANG
- HE WENWEN
- ZHU QINGZHEN
Assignees
- Jiangsu University (江苏大学)
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2025-12-29
Claims (8)
- 1. A memory-enhanced multi-layer prompt learning method for few-shot image classification, the method comprising the steps of: S1, constructing a learnable global text prompt and multi-layer local text prompts; S2, constructing a dynamic cross-sample multi-layer feature memory bank, and using a sparse cross-attention enhancement module to enhance the raw local features extracted from an image with the memory bank, generating memory-enhanced local features; S3, training the global text prompt and the multi-layer local text prompts by computing a global classification loss and a local classification loss and jointly optimizing the two losses, so as to determine the parameters of the global and multi-layer local text prompts; and S4, in the inference stage, fusing the prediction scores based on the global text prompt and the multi-layer local text prompts to classify few-shot images.
- 2. The method of claim 1, wherein the global text prompt and the multi-layer local text prompts each comprise learnable embedding vectors and category-related embedding vectors.
- 3. The method according to claim 1, wherein in step S2, constructing the dynamic cross-sample multi-layer feature memory bank comprises: (1) screening a subset of candidate features from the raw local features; and (2) assigning the candidate features to a memory bank composed of a plurality of clusters, and dynamically updating the cluster features according to a preset rule.
- 4. The method according to claim 3, wherein the screening determines the saliency of each raw local feature based on its attention weight in the visual encoder.
- 5. The method according to claim 1, wherein in step S2, the sparse cross-attention enhancement module: (1) takes the raw local feature to be enhanced as the query and the related memory features retrieved from the memory bank as the keys and values; and (2) computes enhancement information through a cross-attention mechanism and fuses it with the raw local feature to generate the memory-enhanced local feature.
- 6. The method of claim 5, wherein the cross-attention mechanism is sparsified by selecting a fixed number of memory features with the highest attention weights for the weighted summation.
- 7. The method according to claim 1, wherein in step S3, the joint optimization of the two losses comprises: (1) optimizing a global classification loss that narrows the distance between the global image feature and the corresponding global text prompt feature; and (2) optimizing a local classification loss that narrows the distance between the enhanced local feature extracted from a specific layer and the local text prompt constructed for that layer.
- 8. The method of claim 1, wherein fusing the prediction scores in step S4 comprises: (1) computing a global prediction score from the global image feature and the global text prompt; (2) computing a local prediction score from the memory-enhanced local features and the multi-layer local text prompts; and (3) combining the global and local prediction scores to produce the final classification decision.
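The memory-bank construction in claims 3-4 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the fixed keep ratio, the nearest-cluster assignment, and the momentum value are all assumptions; the claims only specify attention-weight screening and a preset update rule.

```python
def update_memory_bank(memory, candidates, attn_weights,
                       keep_ratio=0.5, momentum=0.9):
    """Dynamic cross-sample memory update (cf. claims 3-4):
    screen salient local features by their visual-encoder attention
    weight, then fold each survivor into its nearest cluster with a
    momentum rule. Keep ratio, assignment, and momentum are illustrative."""
    # (1) Screening: keep only the most-attended local features.
    n_keep = max(1, int(len(candidates) * keep_ratio))
    order = sorted(range(len(candidates)),
                   key=lambda i: attn_weights[i], reverse=True)
    kept = [candidates[i] for i in order[:n_keep]]
    # (2) Assign each kept feature to its nearest cluster (squared
    # Euclidean distance) and update that cluster with momentum.
    for feat in kept:
        dists = [sum((f - c) ** 2 for f, c in zip(feat, cluster))
                 for cluster in memory]
        j = min(range(len(memory)), key=lambda i: dists[i])
        memory[j] = [momentum * c + (1 - momentum) * f
                     for c, f in zip(memory[j], feat)]
    return memory
```

Because updates happen per batch, the clusters drift toward recent high-saliency features across samples, which is what makes the memory "dynamic" and "cross-sample" in the claims' sense.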
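The sparse cross-attention enhancement of claims 5-6 can be illustrated with a single-query sketch. This is a simplified assumption-laden example (plain Python lists, a residual-add fusion, and scaled dot-product scoring); the patent does not fix these details beyond top-k sparsification over attention weights.

```python
import math

def sparse_cross_attention(query, memory, k=2):
    """Enhance one raw local feature (the query) with the top-k most
    relevant memory features used as keys/values (cf. claims 5-6).
    The residual fusion at the end is an illustrative choice."""
    dim = len(query)
    # Scaled dot-product attention score against every memory feature.
    scores = [sum(q * m for q, m in zip(query, mem)) / math.sqrt(dim)
              for mem in memory]
    # Sparsify: retain only the k highest-scoring memory entries.
    top = sorted(range(len(scores)),
                 key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the retained scores only.
    exps = [math.exp(scores[i]) for i in top]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of the selected memory features = enhancement info.
    info = [sum(w * memory[i][d] for w, i in zip(weights, top))
            for d in range(dim)]
    # Fuse with the original local feature (simple residual add).
    return [q + e for q, e in zip(query, info)]
```

Restricting the softmax to the k best matches is what suppresses background noise: memory entries with low similarity to the query contribute nothing to the enhanced feature.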
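The inference-stage fusion of claim 8 can be sketched as a convex combination of the two branch scores. The equal averaging over local layers and the weight `alpha` are illustrative assumptions; the claim only requires that the global and local prediction scores be combined into a final decision.

```python
def fuse_scores(global_scores, local_scores, alpha=0.5):
    """Fuse global and per-layer local prediction scores (cf. claim 8
    / step S4). global_scores: {class: score}; local_scores: one
    {class: score} dict per local layer. alpha is an assumed weight."""
    fused = {}
    for c in global_scores:
        # Average the local branch over its layers, then mix with
        # the global branch.
        local_avg = sum(layer[c] for layer in local_scores) / len(local_scores)
        fused[c] = alpha * global_scores[c] + (1 - alpha) * local_avg
    # Final classification decision = argmax of the fused score.
    return max(fused, key=fused.get), fused
```

In this sketch the global branch breaks the tie between the two local layers, which mirrors the patent's intent that the branches complement rather than replace each other.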
Description
Memory-enhanced multi-layer prompt learning method for few-shot image classification
Technical Field
The invention relates to the technical field of computer vision, and in particular to a memory-enhanced multi-layer prompt learning method for few-shot image classification.
Background
In recent years, Vision-Language Models (VLMs), represented by the CLIP model, have achieved breakthroughs in the field of artificial intelligence. Through contrastive learning on large quantities of weakly annotated image-text pairs, these models acquire strong cross-modal semantic understanding and zero-shot generalization, providing a new paradigm for solving downstream visual tasks, especially image classification.

Despite this potential, efficiently adapting VLMs to a particular downstream task remains a challenge. Fully fine-tuning a large model containing hundreds of millions or even billions of parameters requires enormous computational resources and time; moreover, under data-scarce (few-shot) conditions the model easily forgets the general knowledge learned during pre-training and overfits the limited training samples, damaging its generalization performance.

To address these problems, prompt learning has emerged as a parameter-efficient adaptation strategy. Unlike the traditional fine-tuning paradigm, the core idea of prompt learning is to freeze the parameters of the pre-trained model and introduce only a small number of learnable "prompt" parameters. In natural language processing this usually takes the form of learnable text embeddings. In the vision-language field, typical prompt learning methods adapt the model's knowledge to downstream tasks by constructing learnable text prompts in place of manually designed fixed templates (e.g., "a photo of a {class}") and optimizing only these prompt parameters during training.

However, a deep analysis of existing mainstream prompt learning methods reveals a general and fundamental technical limitation: the learning objective relies mainly on aligning a single, highly condensed global image feature with the text prompt. The global feature is typically extracted from the [CLS] token of the visual encoder (ViT). This "global-text" alignment effectively captures the overall semantics of the image, but at the expense of ignoring the rich detail and attribute information embedded in the individual local patch-level features, which is critical for distinguishing fine-grained categories.

The root cause of this limitation lies in the pre-training objective: determining whether the overall content of an image matches the overall description of a text, which is essentially image-text alignment at the global level. This objective drives the visual encoder to aggregate the most representative information into the global feature rather than to perform alignment learning on the output local features. Consequently, these raw local features often exhibit semantic ambiguity (e.g., bird wings and leaf textures may look similar at the level of low-level features) and are entangled with background noise, so their quality is generally too low to serve directly as a reliable supervisory signal for learning robust and effective local prompts.

To overcome these limitations, two research directions have appeared in the prior art. The first introduces an external Large Language Model (LLM) to generate detailed attribute descriptions of the categories, hoping to enrich the semantics of the text prompts and indirectly direct the model's attention to local details. However, this approach introduces new system complexity, including significant inference latency, the need for complex post-processing mechanisms to filter noise in the LLM output, and heavy reliance on an external model, all of which limit the method's efficiency. The second approach builds a parallel global-local branch architecture to exploit both global and local features. However, because the local branch depends directly on unoptimized, noise-laden raw local features, its learning effect and final performance are fundamentally limited by this input bottleneck, and a substantial breakthrough is difficult to achieve.

In summary, the technical problem left unsolved by the prior art is how to establish, within the visual language model itself, a mechanism that does not depend on external knowledge or parameters, is computationally efficient and economical, and can actively and adaptively refine and enhance the discriminative quality of the raw local features, so that robust prompt learning can be performed on the resulting high-quality features.