CN-119091197-B - Multi-modal prompt learning second-order few-shot medical image classification method, system, storage medium and device
Abstract
The invention belongs to the technical field of computer vision and image processing and provides a multi-modal prompt learning second-order few-shot medical image classification method, system, storage medium and device. The method acquires few-shot medical image data and introduces auxiliary text input that further describes the image or concept categories, promoting cross-modal learning. It combines two classification heads: one is a shared classification head that processes the image class token from the vision encoder and the prompt representations encoded by the text encoder; the other classifies the distribution of visual token features from the vision encoder by aggregating the visual tokens with global covariance pooling under efficient matrix power normalization. The invention significantly improves the accuracy of few-shot medical image classification.
Inventors
- WANG Zhenwei
- ZHANG Qiang
- YU Shuo
- WANG Pengfei
- ZHAO Miaoyun
- YANG Yinchao
- SU Weijian
Assignees
- Dalian University of Technology (大连理工大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2024-08-13
Claims (9)
- 1. A multi-modal prompt learning second-order few-shot medical image classification method, characterized by comprising the following steps: S1, acquiring few-shot medical image data for model training, and introducing a text input for each image that further describes the image or concept category; S2, constructing a model, namely using CLIP as the backbone network, comprising an image encoder and a text encoder, taking medical images and category descriptions as the visual input and text input respectively and extracting features with the two encoders, wherein the image encoder generates a class token X_c and visual tokens X_v, and the text encoder generates a global sentence representation; S3, constructing a visual classifier comprising a shared classifier and a global covariance pooling classifier, wherein the shared classifier performs class prediction based on text features and the visual class token, and the global covariance pooling classifier performs linear classification of the feature distribution by combining first-order and second-order statistics; the shared classifier of step S3 performs category prediction based on the [EOS] token from the text encoder and the class token from the image encoder; the global covariance pooling classifier of step S3 predicts the category of the visual branch, wherein the CLIP model provides one class token and a plurality of visual tokens, the visual tokens come from the output of the block before the classifier of the pre-trained CLIP model, linear probing takes the class token or the average pooling of the visual tokens as input to generate a prediction, and the global covariance pooling classifier instead computes a covariance pooling of the visual tokens for second-order statistical modeling; computing the covariance pooling of the visual tokens means replacing the original average pooling operation with a function of the features obtained by modeling the feature distribution of the visual tokens, which characterizes the features more completely, and normalizing it by Newton-Schulz iteration to obtain a square-root normalized covariance matrix (an illustrative sketch of this pooling follows the claims); S4, updating only the parameters of the CoOp learnable prompts and of the classifiers in the training stage, and predicting only on new medical images in the inference stage; S5, outputting the medical image classification result.
- 2. The multi-modal prompt learning second-order few-shot medical image classification method according to claim 1, wherein step S1 performs data augmentation on the acquired few-shot medical image data before the text input is introduced for each image.
- 3. The multi-modal prompt learning second-order few-shot medical image classification method according to claim 1, wherein before model training a prompt method is selected from a prompt pool to generate a prompt for each class; then visual and text features are extracted using the respective encoders; the prompt pool comprises five different prompt methods: class names, common prompts, manually crafted prompts, prompts generated by GPT, and CoOp learnable prompts; the GPT-generated prompts rely on a Transformer architecture extensively pre-trained on a huge text corpus for coherent text generation and understanding, and a GPT-3.5-based interactive tool is used to create prompts for the different categories; specifically, two methods are employed to generate the prompts: first, the categories present in the dataset are confirmed before the prompts are requested; second, a template customized for the dataset is used to craft the prompts, namely GPT is queried with the template combined with a specific category and returns category-related content; the CoOp learnable prompt uses the CoOp framework, in which the input to the text encoder is expressed as t = [V]_1 [V]_2 ... [V]_M [CLASS], wherein each [V]_m (m ∈ {1, ..., M}) is a vector matching the dimension of the word embeddings and is shared among all categories, and M is a hyperparameter that determines the number of context tokens (a sketch of such learnable context vectors follows the claims).
- 4. The multi-modal prompt learning second-order few-shot medical image classification method according to claim 1, wherein in step S2 the text encoder generates a feature representation of the text description, the specific process being that each token in the text description is encoded into a unique numeric ID through a byte pair encoding algorithm, the numeric IDs are mapped to 512-dimensional word embedding vectors and input into a text encoder built from stacked Transformer layers, each text description is bracketed by an [SOS] token representing "start of sentence" and an [EOS] token representing "end of sentence", and finally the feature at the [EOS] token position is used, after linear projection, as the global sentence representation; in step S2, the image encoder of CLIP can adopt a ResNet or ViT architecture; when ViT is used as the image encoder, the image is divided into N non-overlapping patches that are projected into patch embeddings, and these embeddings together with a learnable class token are input into the Transformer layers; when ResNet is used as the image encoder, the image passes sequentially through convolution layers and multi-head attention pooling, and the learnable class token is combined with the visual features of the patches; the text encoder and the image encoder are frozen during training, and only the parameters of the learnable prompts of the CoOp method and of the classifiers are updated.
- 5. The multi-modal prompt learning second-order few-shot medical image classification method according to claim 1, wherein in step S4 the text labels are the same as the image labels, the text prompts serve as auxiliary samples that enhance the interaction between text and vision, and in the inference stage the shared classifier performs class prediction based on the [EOS] token from the text encoder and the class token from the image encoder.
- 6. The multi-modal prompt learning second-order few-shot medical image classification method according to claim 1, wherein the AdamW optimizer is used together with a warm-up phase followed by a cosine annealing learning rate schedule, the warm-up phase comprises 50 iterations, the whole training spans 12,800 iterations, and three different random seeds are used to sample a specified small number of instances from the training set as training samples for the model.
- 7. A multi-modal prompt learning second-order few-shot medical image classification system, comprising: a data acquisition module for acquiring few-shot medical image data; a text prompt generation module for generating a text prompt, namely selecting one prompt method from the prompt pool to generate the text prompt, wherein the different prompt methods in the prompt pool comprise class names, common prompts, manually crafted prompts, prompts generated by GPT, and CoOp learnable prompts; a model building module for using CLIP as the backbone network, comprising an image encoder and a text encoder, taking medical images and category descriptions as the visual input and text input respectively, and extracting features through the image encoder and the text encoder; a visual classifier comprising a shared classifier and a global covariance pooling classifier, used for classification prediction of visual features; a training module for using the text prompts as auxiliary samples in the training stage, enhancing the interaction between text and vision, and performing classification prediction; an inference module for predicting only on the medical image in the inference stage, simplifying the inference flow; and an output module for outputting the few-shot medical image classification result; wherein the shared classifier performs category prediction based on the [EOS] token from the text encoder and the class token from the image encoder; the global covariance pooling classifier predicts the category of the visual branch, wherein the CLIP model provides one class token and a plurality of visual tokens, the visual tokens come from the output of the block before the classifier of the pre-trained CLIP model, linear probing takes the class token or the average pooling of the visual tokens as input to generate a prediction, and the global covariance pooling classifier instead computes a covariance pooling of the visual tokens for second-order statistical modeling; computing the covariance pooling of the visual tokens means replacing the original average pooling operation with a function of the features obtained by modeling the feature distribution of the visual tokens, which characterizes the features more completely, and normalizing it by Newton-Schulz iteration to obtain a square-root normalized covariance matrix.
- 8. A computer readable storage medium having a program stored thereon, wherein the program, when executed by a processor, implements the steps of the multi-modal prompt learning second-order few-shot medical image classification method of any one of claims 1-6.
- 9. An electronic device comprising a memory, a processor, and a program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the multi-modal prompt learning second-order few-shot medical image classification method of any one of claims 1-6.
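The following is a minimal, illustrative PyTorch sketch (not part of the claims) of the square-root normalized global covariance pooling referenced in claims 1 and 7: the visual tokens are aggregated into a covariance matrix whose matrix square root is obtained by a coupled Newton-Schulz iteration rather than an explicit eigendecomposition. The shapes, iteration count, and function names are assumptions for illustration, not the patented implementation.

```python
# Hypothetical sketch: square-root normalized covariance pooling of visual
# tokens via coupled Newton-Schulz iteration (iSQRT-COV-style). Shapes and
# the iteration count (5) are illustrative assumptions.
import torch

def sqrt_cov_pool(tokens: torch.Tensor, num_iter: int = 5) -> torch.Tensor:
    """tokens: (B, N, D) visual tokens from the image encoder.
    Returns the upper triangle of the square-root normalized covariance
    matrix, flattened to a (B, D*(D+1)//2) feature vector."""
    B, N, D = tokens.shape
    # Center the tokens and form the second-order (covariance) statistic.
    centered = tokens - tokens.mean(dim=1, keepdim=True)
    cov = centered.transpose(1, 2) @ centered / (N - 1)        # (B, D, D)
    # Pre-normalize by the trace so the Newton-Schulz iteration converges.
    trace = cov.diagonal(dim1=1, dim2=2).sum(-1).view(B, 1, 1)
    A = cov / trace
    # Coupled Newton-Schulz iteration: Y -> A^{1/2}, Z -> A^{-1/2}.
    Y = A
    Z = torch.eye(D, device=tokens.device).expand(B, D, D).contiguous()
    I3 = 3.0 * torch.eye(D, device=tokens.device).expand(B, D, D)
    for _ in range(num_iter):
        T = 0.5 * (I3 - Z @ Y)
        Y = Y @ T
        Z = T @ Z
    # Undo the trace pre-normalization: sqrt(cov) = sqrt(trace) * sqrt(A).
    sqrt_cov = Y * trace.sqrt()
    # The matrix is symmetric, so keep only its upper triangle.
    idx = torch.triu_indices(D, D)
    return sqrt_cov[:, idx[0], idx[1]]
```

The returned vector would then feed a linear classification layer, replacing the average pooling that ordinary linear probing applies to the visual tokens.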
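Likewise illustrative only: a minimal sketch of the CoOp-style learnable prompt of claim 3, where M shared context vectors [V]_1 ... [V]_M are prepended to each class-name embedding. The module name, initialization, and dimensions (M = 16, 512-dim embeddings per the description) are assumptions.

```python
# Hypothetical sketch of CoOp learnable context: t = [V]_1 ... [V]_M [CLASS].
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    def __init__(self, class_name_embeddings: torch.Tensor, n_ctx: int = 16):
        """class_name_embeddings: (num_classes, L, 512) embedded class names,
        i.e. the [CLASS] portion of each prompt."""
        super().__init__()
        dim = class_name_embeddings.shape[-1]
        # M context vectors, each matching the word-embedding dimension,
        # shared among all categories.
        self.ctx = nn.Parameter(torch.empty(n_ctx, dim))
        nn.init.normal_(self.ctx, std=0.02)
        self.register_buffer("cls_emb", class_name_embeddings)

    def forward(self) -> torch.Tensor:
        num_classes = self.cls_emb.shape[0]
        # Prepend the shared context to every class-name embedding.
        ctx = self.ctx.unsqueeze(0).expand(num_classes, -1, -1)
        return torch.cat([ctx, self.cls_emb], dim=1)  # (num_classes, M+L, 512)
```

During training only self.ctx (together with the classifier heads) would receive gradients; the CLIP text encoder consuming these embeddings stays frozen, consistent with step S4.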
Description
Multi-modal prompt learning second-order few-shot medical image classification method, system, storage medium and device

Technical Field

The invention relates to the technical field of medical image processing, in particular to a multi-modal prompt learning second-order few-shot medical image classification method, system, storage medium and device.

Background

Multi-modal recognition technology fuses data from multiple modalities, such as text, images and sound, and enhances performance on tasks such as image recognition. The vision-language pre-trained model CLIP learns transferable visual models from natural language supervision by constructing a multi-modal framework, expanding the application range of such models. By introducing learnable text prompt vectors, the original model can be adapted without modifying its pre-trained weights, making it suitable for diverse dataset challenges.

Medical image classification methods rely on recent progress in deep learning; with the development of deep learning techniques, especially the introduction of convolutional neural networks (CNNs) and Transformers, the accuracy of medical image classification has improved significantly. However, traditional supervised learning approaches face challenges due to the scarcity and labeling difficulty of medical image data. Few-shot learning trains well-performing models from limited sample data and provides a solution for medical image classification under data scarcity. Vision-language pre-trained models perform multi-modal learning through the correspondence between images and text and deliver excellent zero-shot and few-shot performance. The CLIP model achieves alignment between images and text through extensive pre-training. However, these models have so far seen limited application in medical image analysis, so developing vision-language models for medical images is of great importance. The method combines a vision-language model with medical image classification and improves the accuracy and generalization of medical image classification by introducing multi-modal learning and prompt learning techniques.

Prompt learning guides models to adapt to new tasks by designing or learning prompts. In medical image classification, prompt learning helps the model better understand and identify objects in the medical image. The present patent explores a variety of prompting methods, including manually designed class names, common prompts, and manually crafted prompts; uses GPT to generate descriptions related to the data categories as prompts, providing additional information; and uses learnable prompts to improve classification performance.

Furthermore, CLIP uses only first-order information as the overall representation, which ignores potential higher-order correlations among features. Second-order pooling captures global information by computing covariance matrices between feature maps. In medical image classification, second-order pooling helps the model exploit higher-order statistical information to improve classification accuracy.
The method applies second-order pooling to medical image classification, combines multi-modal learning and prompt learning techniques to further improve model performance, discusses the characteristics and applicability of different second-order pooling methods, and provides a new approach for large-scale vision-language pre-trained models in medical image analysis tasks.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides a multi-modal prompt learning second-order few-shot medical image classification method that further describes the images or concept categories by introducing text input, promotes few-shot learning on medical images, and improves the accuracy and generalization of medical image classification. To achieve the above purpose, the present invention adopts the following technical scheme: a multi-modal prompt learning second-order few-shot medical image classification method comprising the following steps.

S1, acquiring few-shot medical image data for model training. A text input is introduced for each image, further describing the image or concept category. Further, step S1 performs data augmentation on the acquired few-shot medical image data before the text input is introduced for each image.

S2, constructing a model, wherein the model uses CLIP as the backbone network and comprises an image encoder and a text encoder. The medical image and the category description serve as the visual input and the text input respectively; features are extracted by the image encoder and the text encoder, the image encoder generates a class token X_c and visual tokens X_v, and the text encoder generates a global sentence representation (this dual-encoder prediction path is sketched below). Further, before model training, a prompt method needs to be selected from the prompt pool to generate a prompt for each class.
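As a rough illustration of the S2-S3 prediction path, the sketch below uses OpenAI's open-source clip package to extract the text [EOS] representation and the image class token and scores classes by cosine similarity, which is the role of the shared classifier. The class descriptions, image path, and ViT-B/16 checkpoint are placeholder assumptions, and the global covariance pooling branch is omitted here.

```python
# A minimal sketch of the S2-S3 dual-encoder path, assuming OpenAI's
# open-source `clip` package; prompts, checkpoint, and image path are
# placeholders, and the covariance-pooling branch is omitted.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)  # assumed backbone

# One auxiliary text description per class (placeholder prompts).
class_descriptions = ["a photo of a benign skin lesion",
                      "a photo of a malignant skin lesion"]
text_tokens = clip.tokenize(class_descriptions).to(device)

# Placeholder image path.
image = preprocess(Image.open("lesion.png")).unsqueeze(0).to(device)

with torch.no_grad():
    # Global sentence representation taken at the [EOS] position.
    text_feat = model.encode_text(text_tokens)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    # Class token X_c produced by the image encoder.
    img_feat = model.encode_image(image)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

    # Shared-classifier style prediction: cosine similarity between the
    # image class token and each class's text representation.
    logits = 100.0 * img_feat @ text_feat.t()
    pred = logits.argmax(dim=-1).item()

print(class_descriptions[pred])
```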