CN-122023935-A - Class increment image classification method based on image-text double-guide classifier expansion
Abstract
The invention discloses a class-incremental image classification method based on image-text dual-guided classifier expansion, aiming at improving image classification accuracy. The method constructs a class-incremental image classification system consisting of a text feature extraction module, an image feature extraction module, a contrastive learning classification module, a contrast text generation module, an image-text dual-guided classifier, a classification result merging module, a feature mixing module and an entropy-guided loss weighting module. The text and image feature extraction modules are trained first; the image-text dual-guided classifier then undergoes preliminary and enhancement training. During enhancement training, the loss is weighted by the entropy values of the enhanced training features computed by the entropy-guided loss weighting module, while the classification result merging module, the contrast text generation module and the feature mixing module cooperatively generate the enhanced training feature set. The trained class-incremental image classification system classifies images to obtain their classes. The method effectively reduces the prediction entropy and improves image classification accuracy.
Inventors
- CHEN WEI
- DING RUIHUA
- HE YULIN
- ZHOU WENJUAN
- XUN TIANCI
- TANG MINGXIN
- LI LIN
Assignees
- 中国人民解放军国防科技大学 (National University of Defense Technology of the Chinese People's Liberation Army)
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-02-25
Claims (12)
- 1. A class-incremental image classification method based on image-text dual-guided classifier expansion, characterized by comprising the following steps: First, construct a class-incremental image classification system based on image-text dual-guided classifier expansion, the system comprising a text feature extraction module, an image feature extraction module, a contrastive learning classification module, an image-text dual-guided classifier module, a classification result merging module, a contrast text generation module, a feature mixing module and an entropy-guided loss weighting module. Second, collect incremental-learning scene images as the class-incremental image classification dataset and divide it into a training set and a test set; divide the training set again by training stage to obtain a training-set list, and divide the test set by training stage to obtain a test-set list, where T, a positive integer, is the number of training stages of the class-incremental training process. Third, initialize the training stage t = 1. Fourth, let the class-label sequence of the t-th training stage be given, with a known number of categories contained in stage t and a cumulative total category count through stage t; for each category label n, the text feature extraction module applies a generic text-prompt template to the real category name corresponding to n to build the generic text-prompt list of the t-th training stage. Fifth, train the image feature extraction module and the text feature extraction module on the t-th training set: the text feature extraction module extracts the generic text features from the stage-t generic text-prompt list, the image feature extraction module extracts the first image features from the training images, the contrastive learning classification module computes the first classification loss, and the trainable parameters of both modules are optimized by gradient back-propagation based on that loss; the trained image and text feature extraction modules are then frozen. Sixth, perform preliminary training of the image-text dual-guided classifier: the trained image feature extraction module extracts image features to obtain the prototype sequence of all stage-t categories, the prototype sequence yields the classifier weights of the t-th training stage, the trained image feature extraction module extracts the second image features of each image batch, the image-text dual-guided classifier computes the second classification loss from them, and the trainable parameters of the classifier are optimized by gradient back-propagation to obtain the preliminarily trained image-text dual-guided classifier. Seventh, the trained text feature extraction module extracts the second generic text features; the contrastive learning classification module computes the text classification prediction probability of the t-th training stage; the image-text dual-guided classifier computes the image classification prediction probability; the classification result merging module merges the two to obtain the merged classification
prediction probability; the contrast text generation module sorts the classification prediction probabilities of each image to obtain the Top-N maximum prediction labels, generates the corresponding contrast text prompts with a large language model, and passes them to the text feature extraction module to obtain the contrast text features. The method is as follows: Step 7.1, set the training parameters, including the initial learning rate and the learning-rate adjustment function, selecting Adam as the model training optimizer; set the batch size batch_size and the maximum training step maxepoch of network training. Step 7.2, let the training round number epoch = 1. Step 7.3, take the training set of the t-th training stage from the training-set list. Step 7.4, let the batch number b = 1, where b indexes the image batches, each batch containing batch_size images, with B batches in total. Step 7.5, initialize the loss of the current image batch to zero. Step 7.6, the text feature extraction module extracts the second generic text features and sends the real labels corresponding to the training-set images to the contrast text generation module and the feature mixing module. Step 7.7, the contrastive learning classification module receives the second image features and the second generic text features, computes the text classification prediction probability of the t-th training stage, and sends it to the classification result merging module. Step 7.8, the image-text dual-guided classifier receives the second image features from the image feature extraction module, computes the image classification prediction probability, and sends it to the classification result merging module. Step 7.9, the classification result merging module adds the two probabilities to obtain the merged classification probability Logit, each row of which corresponds to one image, and sends the Logit to the contrast text generation module and the entropy-guided loss weighting module. Step 7.10, let the numbering of images within batch b be m = 1. Step 7.11, the contrast text generation module receives the merged classification result sent by the classification result merging module, sorts the prediction probabilities of each image, and selects the N largest to obtain the Top-N predicted labels of the m-th image of batch b, where the a-th predicted label is ranked a-th by prediction probability, a ranges over [1, 2, ..., N], and N is a positive integer. Step 7.12, the contrast text generation module receives the real label of the m-th training-set image from the text feature extraction module and obtains the Top-N predicted labels of the m-th image. Step 7.13, design the corresponding question template. Step 7.14, construct the N corresponding questions and combine them into the question list of the m-th image, the a-th entry being the a-th question. Step 7.15, the large language model in the contrast text generation module answers the question list to obtain the contrast text-prompt list of the m-th image and sends the contrast text-prompt list to the text feature extraction module. Step 7.16, the text feature extraction module receives the contrast text-prompt list, lets h = 1, and initializes an empty prompt-feature storage list for the contrast text prompts. Step 7.17, the text feature extraction module takes the h-th contrast text prompt of the m-th image. Step 7.18, the text feature extraction module extracts its text feature and adds it to the storage list. Step 7.19, let h = h + 1; if h does not exceed the number of contrast text prompts, turn to step 7.17; otherwise perform a vector-average operation on all feature vectors in the storage list to obtain the contrast text feature of the m-th image, a one-dimensional vector of dimension D, D a positive integer, and send the
contrast text feature to the contrast text generation module. Step 7.20, the contrast text generation module receives the contrast text feature of the m-th image from the text feature extraction module and forwards it, together with the true label of the m-th image, to the feature mixing module. Eighth step, the feature mixing module mixes the image feature of the m-th image with its contrast text features to obtain the mixed features of the m-th image, and performs a vector-stack operation on the mixed features, the generic text features and the contrast text features of the m-th image to obtain the enhanced training features. Ninth step, the entropy-guided loss weighting module receives the enhanced training features and real labels from the feature mixing module and the merged classification probability Logit from the classification result merging module, performs enhancement training of the image-text dual-guided classifier, identifies high-entropy samples in the current class-incremental image classification system, and weights the loss of high-entropy samples according to their entropy values, as follows: Step 9.1, the entropy-guided loss weighting module receives the Logit from the classification result merging module. Step 9.2, apply the softmax function to the Logit to obtain the normalized probability distribution Prob. Step 9.3, derive the predicted probability distribution of the m-th image from Prob and compute the corresponding entropy value of the m-th image. Step 9.4, the entropy-guided loss weighting module receives the enhanced training features from the feature mixing module. Step 9.5, if the entropy of the m-th image exceeds the normalized entropy threshold, set the weight of the corresponding loss to the elevated value; otherwise set the loss weight to 1. Step 9.6, initialize the enhancement loss value of the m-th image to zero. Step 9.7, let the enhancement numbering s = 1. Step 9.8, fetch the s-th enhanced training feature and send it to the image-text dual-guided classifier. Step 9.9, the image-text dual-guided classifier receives the s-th enhanced training feature, computes the corresponding image classification prediction probability, and sends it to the entropy-guided loss weighting module. Step 9.10, the entropy-guided loss weighting module receives the prediction probability, reads the predicted probability that the enhanced feature of image m belongs to its true label, and thereby obtains the corresponding loss value. Step 9.11, accumulate the loss value into the enhancement loss of the m-th image. Step 9.12, let s = s + 1; if s does not exceed the number of enhanced training features, turn to step 9.8; otherwise the losses of all enhanced features of image m are computed, the final enhancement loss value of the m-th image is obtained, and the method turns to step 9.13. Step 9.13, add the weight-scaled enhancement loss of the m-th image to the final loss value. Step 9.14, let m = m + 1; if m does not exceed batch_size, turn to step 7.11; otherwise the losses of all images of batch b are computed, and the method turns to step 9.15. Step 9.15, the entropy-guided loss weighting module sends the final loss value to the image-text dual-guided classifier. Step 9.16, the image-text dual-guided classifier optimizes its trainable parameters by gradient back-propagation based on the final loss value. Step 9.17, let b = b + 1; if b does not exceed B, turn to step 7.5; otherwise the training of this round is finished and the method turns to step 9.18. Step 9.18, let epoch = epoch + 1; if epoch ≤ maxepoch, turn to step 7.3; if epoch > maxepoch, the enhancement training of the image-text dual-guided classifier is finished, the final classifier parameters of the t-th training stage are obtained, and the method turns to the tenth step. Tenth step, let t = t + 1; if t ≤ T, turn to the fourth step; if t > T, the T training stages are finished, the trained class-incremental image classification system is obtained, and the method turns to the eleventh step. Eleventh step, classify the images to be classified input by the user with the trained class-incremental image classification
system to obtain the prediction result, as follows: Step 11.1, the image feature extraction module receives the image to be classified input by the user. Step 11.2, the image feature extraction module extracts the image features of the image to be classified and sends them to the image-text dual-guided classifier and the contrastive learning classification module. Step 11.3, the image-text dual-guided classifier receives the image features, computes the image classification prediction probability of the image to be classified, and sends it to the classification result merging module. Step 11.4, the text feature extraction module extracts the generic text features corresponding to the generic text prompts of all class labels seen in training and sends them to the contrastive learning classification module. Step 11.5, the contrastive learning classification module receives the image features and the generic text features, computes the text classification prediction probability of the image to be classified, and sends it to the classification result merging module. Step 11.6, the classification result merging module receives the two probabilities and computes the predicted value; the category corresponding to the maximum value is the prediction result of the image to be classified, and the image classification is finished.
- 2. The class-incremental image classification method based on image-text dual-guided classifier expansion according to claim 1, characterized in that: the text feature extraction module is connected with the contrastive learning classification module, the feature mixing module and the contrast text generation module; it builds generic text prompts from the labels in the training set and extracts generic text features from them. The text feature extraction module consists of a text encoder, T text projection layers and a first adder. The text encoder uses the parameters of the CLIP pre-trained model under the Transformer architecture; these parameters are kept frozen and do not participate in training. The text encoder performs feature extraction on the generic text prompts to obtain preliminary text features and sends them to the T text projection layers; the T text projection layers are all fully connected layers and project the preliminary text features received from the text encoder in parallel to obtain T projected text features containing the corresponding task information; the first adder sums the T projected text features to obtain the text features. During training, the text feature extraction module sends the first text features, extracted from the generic text prompts corresponding to all labels in the training set, to the contrastive learning classification module; on receiving the contrast text-prompt list from the contrast text generation module, it extracts the contrast text features in the same manner and, after extracting the contrast text features, sends the real labels and the contrast text features corresponding to the training-set images to the contrast text generation module. The image feature extraction module is connected with the contrastive learning classification module, the image-text dual-guided classifier and the feature mixing module, and performs feature extraction on the training-set images to obtain image features. The image feature extraction module consists of an image encoder, T image projection layers and a second adder. The image encoder uses the parameters of the CLIP pre-trained model under the Transformer architecture; these parameters are kept frozen and do not participate in training. The image encoder performs feature extraction on the training-set images to obtain preliminary image features and sends them to the T image projection layers; the T image projection layers are all fully connected layers and project the preliminary image features obtained from the image encoder in parallel to obtain T projected image features containing the corresponding task information; the second adder sums the T projected image features to obtain the final image features. During training, the image feature extraction module sends the first image features extracted from the training-set images to the contrastive learning classification module. The contrastive learning classification module is connected with the text feature extraction module, the image feature extraction module and the classification result merging module; it computes the text classification prediction probability from the generic text features received from the text feature extraction module and the image features received from the image feature extraction module, and sends the text classification prediction probability to the classification result merging module;
the image-text dual-guided classifier is connected with the image feature extraction module, the classification result merging module and the entropy-guided loss weighting module. The image-text dual-guided classifier consists of a single fully connected layer with the bias term omitted; it predicts on the image features received from the image feature extraction module to obtain the image classification prediction probability and sends it to the classification result merging module, and it also predicts on the enhanced training features sent by the entropy-guided loss weighting module and sends the corresponding image classification prediction probability back to the entropy-guided loss weighting module. The classification result merging module is connected with the contrastive learning classification module, the image-text dual-guided classifier, the contrast text generation module and the entropy-guided loss weighting module; it merges the text classification prediction probability received from the contrastive learning classification module with the image classification prediction probability received from the image-text dual-guided classifier to obtain the merged classification prediction probability, and sends the merged classification prediction probability to the contrast text generation module and the entropy-guided loss weighting module. The contrast text generation module is connected with the classification result merging module, the feature mixing module and the text feature extraction module, and prepares the enhancement training of the image-text dual-guided classifier: it receives the merged classification prediction probability from the classification result merging module, obtains the Top-N results and thereby the Top-N predicted labels of the training-set images, and at the same time receives the real labels corresponding to the training-set images sent by the text feature extraction module; it uses a large language model to produce contrast descriptions of the Top-N predicted labels against the real labels, obtaining a contrast text-prompt list, which it sends to the text feature extraction module; it then receives the contrast text features extracted by the text feature extraction module from the contrast text-prompt list and sends the contrast text features and the real labels corresponding to the training-set images to the feature mixing module. The feature mixing module is connected with the text feature extraction module, the image feature extraction module, the contrast text generation module and the entropy-guided loss weighting module; it receives image features from the image feature extraction module, real labels and contrast text features from the contrast text generation module, and generic text features from the text feature extraction module, combines the image features, the generic text features and the contrast text features into the enhanced training feature set, and sends the enhanced training feature set and the real labels corresponding to the training-set images to the entropy-guided loss weighting module. The entropy-guided loss weighting module is connected with the feature mixing module, the classification result merging module and the image-text dual-guided classifier; it receives the merged classification prediction probability from the classification result merging module and the enhanced training feature set from the feature mixing module, computes the entropy value from the merged classification prediction probability, judges whether a sample image is a high-entropy sample, and forwards the enhanced training features in the enhanced training feature set to the image-text dual-guided classifier; the image-text dual-guided classifier computes the image classification prediction probabilities corresponding to the enhanced training features and sends them to the entropy-guided loss weighting module, which computes the corresponding sample losses, increasing the loss weight of the enhanced features of high-entropy samples, obtains the loss of the training set, and sends it to the image-text dual-guided classifier. All modules participate in training; the contrast text generation module, the feature mixing module and the entropy-guided loss weighting module participate only in training and take no part in classifying user-input images.
- 3. The class-incremental image classification method based on image-text dual-guided classifier expansion according to claim 1, characterized in that T is 10, and the large language model in the contrast text generation module is required to have strong language capability and to be able to describe the contrast characteristics of two classes accurately and in detail without access to any image.
- 4. The class-incremental image classification method based on image-text dual-guided classifier expansion according to claim 3, characterized in that said large language model is GPT-4 or ChatGLM.
- 5. The class-incremental image classification method based on image-text dual-guided classifier expansion according to claim 1, characterized in that the method of dividing the incremental-learning sample image classification dataset in the second step into a training set and a test set and dividing each again by training stage to obtain the training-set list and the test-set list is as follows: Step 2.1, collect incremental-learning scene images as the class-incremental image classification scene datasets: nine datasets, namely the small-picture dataset CIFAR, the general scene dataset ImageNet-R, the bird dataset CUB200, the food dataset Food101, the car dataset StanfordCars, the action dataset UCF101, the scene classification dataset SUN397, the airplane classification dataset air and the object recognition dataset ObjectNet, serve as the incremental-learning sample image classification datasets, each image of the nine datasets being annotated with the real label of the object class it contains. Step 2.2, divide the incremental-learning sample image classification dataset into a training set and a test set according to the original division standard of the dataset; let the total number of categories of the sample image classification dataset be C, each picture carrying the real label of its corresponding category drawn from the labels 1, 2, ..., C, where C is a positive integer. Step 2.3, divide the training set again by training stage with a set-partitioning method to obtain the training-set list, as follows: Step 2.3.1, let the training stage t = 1. Step 2.3.2, determine the number of categories contained in the t-th training stage and the class-label set contained in the t-th training stage. Step 2.3.3, find the training-set images belonging to the stage's label set, i.e. from the training set, find the images whose real labels match the labels in the stage's label set and add those images to the t-th training set. Step 2.3.4, if t < T, let t = t + 1 and turn to step 2.3.2; if t = T, the training-set list is complete, and the method turns to step 2.4. Step 2.4, divide the test set by training stage to obtain the test-set list, as follows: Step 2.4.1, let the training stage t = 1. Step 2.4.2, find the test-set images belonging to the stage's label set, i.e. find the images in the test set whose labels match the labels in the stage's label set and add those images to the t-th test set. Step 2.4.3, if t < T, let t = t + 1 and turn to step 2.4.2; if t = T, the final test-set list is obtained, and the method turns to the third step.
- 6. The class-incremental image classification method based on image-text dual-guided classifier expansion according to claim 1, characterized in that the method by which the text feature extraction module in the fourth step applies the generic text-prompt template to the real categories corresponding to the class labels of the t-th training stage to build the generic text-prompt list of the t-th training stage is as follows: Step 4.1, let n = 1 and initialize the generic text-prompt list of the t-th training stage as an empty list. Step 4.2, use the generic text-prompt template to build the generic text prompt for the real category corresponding to the class label n and add it to the list. Step 4.3, if n is less than the number of categories of the t-th training stage, let n = n + 1 and turn to step 4.2; otherwise the generic text prompts describing all categories of the t-th training stage have been constructed, the generic text-prompt list of the t-th training stage is obtained, and the method turns to the fifth step.
- 7. The class-incremental image classification method based on image-text dual-guided classifier expansion according to claim 1, characterized in that the method of training the image feature extraction module and the text feature extraction module with the training set in the fifth step is as follows: Step 5.1, initialize the model parameters of the image feature extraction module and the text feature extraction module; initialize the text encoder with the parameters of the pre-trained CLIP model under the Transformer framework and freeze them. Step 5.2, randomly initialize the image projection layer and the text projection layer of the t-th training stage; add the image projection layer to the projection-layer sequence of the image feature extraction module and the text projection layer to the projection-layer sequence of the text feature extraction module, so that the t-th training stage has an image projection-layer sequence and a text projection-layer sequence; connect the image projection-layer sequence of the t-th training stage with the frozen image encoder and the text projection-layer sequence of the t-th training stage with the frozen text encoder, so that the features extracted by the image encoder and the text encoder contain the stage-task information; in the t-th training stage, only the t-th image projection layer and the t-th text projection layer are trainable, while the remaining projection layers stay frozen. Step 5.3, set the training parameters: set the initial learning rate to 0.05, set the learning-rate adjustment function to the cosine-annealing dynamic learning-rate schedule, select stochastic gradient descent as the model training optimizer, let the batch size batch_size of network training be 64, and let the maximum training step maxepoch be 10. Step 5.4, the text feature extraction module extracts the generic text features from the stage-t generic text-prompt list, a two-dimensional matrix with feature dimension 512, and sends them to the contrastive learning classification module: the text encoder extracts features from all generic text prompts in the list, obtains the preliminary text features and sends them to the text projection-layer sequence; the first adder sums the t projected text features to obtain the final text features, i.e. the generic text features of the current training stage, which are sent to the contrastive learning classification module. Step 5.5, let the training round number epoch = 1. Step 5.6, let the batch number b = 1, where b indexes the image batches, each batch containing batch_size images, with B batches in total; if the last batch is not full, it consists of all images remaining after the first B - 1 batches. Step 5.7, the image feature extraction module fetches the training set of the t-th stage, reads the b-th batch of images, and records the batch in matrix form, where H denotes the height of the input images, W the width, and "3" the RGB three channels of the images. Step 5.8, the image feature extraction module extracts the image features of the batch to obtain the first image features, a two-dimensional matrix, and sends them to the contrastive learning classification module: the image encoder extracts features from the batch images, obtains the preliminary image features and sends them to the image projection-layer sequence; the preliminary image features are projected in parallel to obtain t projected image features containing the corresponding task information, and the second adder sums the t projected image features to obtain the final image features, which are sent to the contrastive learning classification module. Step 5.9, the contrastive learning classification module receives the first image features from the image feature extraction module and the generic text features from the text feature extraction module. Step 5.10, the contrastive learning classification module computes the first classification loss according to formula (1): L1 = -(1/batch_size) * Σ_i log( exp(cos(f_i, t_{y_i}) / τ) / Σ_n exp(cos(f_i, t_n) / τ) ) (1), where i indexes the images of the b-th batch, y_i denotes the true label of the i-th image of the batch, f_i the image feature of the i-th image, t_{y_i} the text feature of the label of the i-th image, cos(·,·) the cosine similarity, and τ the temperature coefficient, taken as 0.1. Step 5.11, the contrastive learning classification module optimizes the trainable parameters of the image feature extraction module and the text feature extraction module by gradient back-propagation based on the first classification loss. Step 5.12, let b = b + 1; if b ≤ B, turn to step 5.7; otherwise the training of this round is finished, and the method turns to step 5.13. Step 5.13, let epoch = epoch + 1; if epoch ≤ maxepoch, turn to step 5.6; if epoch > maxepoch, the training of the image feature extraction module and the text feature extraction module is finished, the trained image feature extraction module and text feature extraction module are obtained and frozen, and the method turns to the sixth step.
- 8. The class increment image classification method based on image-text double-guide classifier expansion according to claim 1, wherein the method for performing preliminary training on the image-text double-guide classifier in the sixth step, using the training-set image features extracted by the trained image feature extraction module, is as follows: Step 6.1, the trained image feature extraction module takes the training set of the t-th training stage out of the training set list; Step 6.2, the trained image feature extraction module obtains the prototype sequence of all classes in the t-th training stage, where each class prototype is the mean of that class's image features, and stores the prototype sequence of all classes of the t-th training stage into the total sequence P of the t-th stage, as follows: Step 6.2.1, let the class number k of the t-th training stage start from the first class; Step 6.2.2, read the images of the k-th class and record all images of the k-th class in matrix form; Step 6.2.3, the image feature extraction module extracts the image features of the k-th class; Step 6.2.4, average the image features of the k-th class to obtain the prototype of the k-th class and put it into the prototype sequence of the t-th training stage; Step 6.2.5, move to the next class; if classes remain, turn to step 6.2.2; otherwise the prototype sequences of all classes of the t-th training stage have been obtained, turn to step 6.3; Step 6.3, use the prototype sequence to initialize the weights of the image-text double-guide classifier in the t-th training stage: if t = 1, the classifier weights are the prototype sequence itself, as in formula (2); if t > 1, the new prototype rows are connected with the final image-text double-guide classifier weights of the (t−1)-th training stage, as shown in formula (3); Step 6.4, set the training parameters: the initial learning rate is 0.0005, the learning-rate adjustment function is the cosine-annealing dynamic learning-rate adjustment function, adaptive moment estimation (Adam) is selected as the model training optimizer, the batch size batch_size of network training is 64, and the maximum training step maxepoch is 3; Step 6.5, let the training round number epoch = 1; Step 6.6, let the batch number index the image batches, each batch holding batch_size images; Step 6.7, read the current batch of images and record the batch in matrix form; Step 6.8, the trained image feature extraction module adopts the image feature extraction method to extract the image features of the batch, obtaining the second image features, a two-dimensional matrix, and transmits them to the contrast learning classification module, the image-text double-guide classifier, and the feature mixing module; Step 6.9, the image-text double-guide classifier calculates the second classification loss as shown in formula (4): L₂ = −(1/B) Σᵢ log p(yᵢ | fᵢ), where fᵢ represents the image feature of the i-th image and p(yᵢ | fᵢ) represents the predicted probability that the i-th image belongs to its true label yᵢ; Step 6.10, the image-text double-guide classifier optimizes its trainable parameters by gradient back-propagation of the second classification loss; Step 6.11, advance to the next batch; if batches remain, turn to step 6.7, otherwise the training of this round is finished and turn to step 6.12; Step 6.12, let epoch = epoch + 1; if epoch ≤ maxepoch, turn to step 6.6; if epoch > maxepoch, the preliminary training of the image-text double-guide classifier is finished, turn to the seventh step.
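Steps 6.2 and 6.3 initialize the new classifier weights from class prototypes (the mean feature of each class, formula (2)) and, for stages t > 1, concatenate them with the previous stage's weights (formula (3)). A minimal sketch under that reading; the function and variable names are illustrative:

```python
import numpy as np

def expand_classifier(prev_weights, feats_by_class):
    """prev_weights: (old_classes, D) weight matrix of stage t-1, or None
    for t == 1. feats_by_class: list of (n_images_k, D) feature arrays,
    one per new class. Returns the stage-t classifier weight matrix."""
    # formula (2): each new class weight is the class prototype (mean feature)
    protos = np.stack([f.mean(axis=0) for f in feats_by_class])
    if prev_weights is None:
        return protos                              # t == 1: prototypes only
    # formula (3): connect old weights with the new prototype rows
    return np.concatenate([prev_weights, protos], axis=0)
```

Because the old rows are carried over unchanged, the classifier's outputs on previously learned classes are preserved at the moment of expansion, which is the point of the prototype-based initialization.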
- 9. The class increment image classification method based on image-text double-guide classifier expansion according to claim 1, wherein in step 7.1 the training parameters are set by letting the learning-rate adjustment function be the cosine-annealing dynamic learning-rate adjustment function starting from the initial learning rate, letting the batch size batch_size of network training be 64, and letting the maximum training step maxepoch be 3; the text classification prediction probability of the t-th training stage in step 7.7, a two-dimensional matrix, is calculated according to formula (5) as a normalization (softmax) of similarities scaled by a temperature parameter with value 0.1; the image classification prediction probability in step 7.8, a two-dimensional matrix, is calculated according to formula (6); the result in step 7.11 is a two-dimensional matrix in which each row is a probability distribution; the question template in step 7.13 is "What are unique visual features of [CLASS] i compared to [CLASS] j in a photo? Focus on their key visual differences."; and the N corresponding questions are constructed in step 7.14 by replacing [CLASS] i in the question template with the true class and sequentially replacing [CLASS] j with each class label.
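Steps 7.13 and 7.14 build the N contrast questions by filling the quoted template with the image's true class as [CLASS] i and each class label in turn as [CLASS] j. A minimal sketch; the template text is quoted from the claim, while the helper name is illustrative:

```python
def build_contrast_questions(true_class, all_classes):
    """Return one contrast question per class label: the true class fills
    the [CLASS] i slot, and each label in all_classes sequentially fills
    the [CLASS] j slot of the template from step 7.13."""
    template = ("What are unique visual features of [CLASS] i compared to "
                "[CLASS] j in a photo? Focus on their key visual differences.")
    return [template.replace("[CLASS] i", true_class).replace("[CLASS] j", c)
            for c in all_classes]
```

The resulting question list is what the contrast text generation module would consume to produce class-discriminative contrast text features.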
- 10. The class increment image classification method based on image-text double-guide classifier expansion according to claim 1, wherein the feature mixing module in the eighth step mixes the image feature of the m-th image with its contrast text feature to obtain the mixed feature of the m-th image, and performs a vector stack operation on the mixed feature, the general text feature, and the contrast text feature of the m-th image to obtain the enhanced training features, as follows: Step 8.1, the feature mixing module receives the true label and the contrast text feature of the m-th image from the contrast text generation module, receives the general text features from the text feature extraction module, receives the image features from the image feature extraction module, takes out the general text feature corresponding to the true label of the m-th image, and takes out the image feature of the m-th image; Step 8.2, the feature mixing module mixes the contrast text feature and the image feature to obtain the mixed feature as in formula (7), wherein the mixing ratio of the contrast text feature to the image feature is set to 0.75; Step 8.3, the feature mixing module performs a vector stack operation on the mixed feature, the general text feature, and the contrast text feature to form the enhanced training features of the m-th image of the current batch, a two-dimensional matrix containing 3 enhancement features in total, and sends the enhanced training features together with the true label of the m-th image to the entropy-guided loss weighting module.
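Step 8.2 mixes the contrast text feature and the image feature with ratio 0.75, and step 8.3 stacks the mixed, general-text, and contrast-text features into 3 enhancement features. Since formula (7) itself was not preserved, a convex combination is assumed here; names are illustrative:

```python
import numpy as np

def build_enhanced_features(contrast_text_feat, image_feat,
                            general_text_feat, lam=0.75):
    """lam = 0.75 is the mixing ratio from the claim. Each argument is a
    (D,) feature vector; the result is a (3, D) enhanced-feature matrix:
    row 0 the mixed feature, row 1 the general text feature, row 2 the
    contrast text feature (assumed stacking order)."""
    mixed = lam * contrast_text_feat + (1.0 - lam) * image_feat  # formula (7), assumed form
    return np.stack([mixed, general_text_feat, contrast_text_feat])
```

The stacked matrix plus the image's true label is exactly what step 8.3 forwards to the entropy-guided loss weighting module.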
- 11. The class increment image classification method based on image-text double-guide classifier expansion according to claim 1, wherein Prob in step 9.2 is a two-dimensional matrix of prediction probabilities; the prediction probability distribution of the m-th image in step 9.3 is a one-dimensional vector, and the corresponding entropy value of the m-th image is calculated as in formula (8): Hₘ = −Σⱼ p_{m,j} log p_{m,j}, where p_{m,j} is the prediction probability of the j-th label taken from the prediction probability distribution of the m-th image; the threshold in step 9.5 is set to 0.7; the weight vector in step 9.8 is a one-dimensional vector with dimension D; and the loss value in step 9.10 is calculated according to formula (9).
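The entropy of formula (8) is the Shannon entropy of the m-th image's predicted probability distribution: low entropy means a confident prediction, high entropy an ambiguous one, which is what the entropy-guided loss weighting exploits. A minimal sketch; the symbol names were lost in extraction and are illustrative here:

```python
import numpy as np

def prediction_entropy(probs):
    """Shannon entropy per row of a (batch, num_classes) probability
    matrix (or of a single (num_classes,) distribution), as in formula (8):
    H_m = -sum_j p_mj * log(p_mj)."""
    eps = 1e-12                      # guard against log(0) for one-hot rows
    return -(probs * np.log(probs + eps)).sum(axis=-1)
```

A uniform distribution over N classes attains the maximum entropy log N, while a one-hot prediction has entropy (numerically) zero, so thresholding at 0.7 as in step 9.5 separates confident from uncertain samples.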
- 12. The class increment image classification method based on image-text double-guide classifier expansion according to claim 1, wherein the quantity in step 11.2 is a two-dimensional matrix; the quantities in step 11.3 and step 11.5 are computed by their respective calculation formulas; and the predicted value in step 11.6 is calculated according to formula (10), wherein the combination coefficient takes the value 0.6 and is used to control the output proportions of the image-text double-guide classifier and the contrast learning classification module.
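Formula (10) merges the outputs of the image-text double-guide classifier and the contrast learning classification module with proportion coefficient 0.6. Since the exact formula was not preserved, a convex combination of the two probability outputs is assumed here; names are illustrative:

```python
import numpy as np

def merge_predictions(classifier_probs, contrast_probs, alpha=0.6):
    """alpha = 0.6 is the proportion coefficient from the claim: it weights
    the image-text double-guide classifier output against the contrast
    learning classification module output. Both inputs are
    (batch, num_classes) probability matrices; returns the predicted
    class indices and the merged distribution."""
    merged = alpha * classifier_probs + (1.0 - alpha) * contrast_probs
    return merged.argmax(axis=-1), merged
```

The argmax over the merged distribution is the final image class produced by the classification result merging module at inference time.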
Description
Class increment image classification method based on image-text double-guide classifier expansion Technical Field The invention relates to the field of class increment image classification, and in particular to a class increment image classification method based on the expansion of an image-text double-guide classifier. Background With the rapid development of artificial intelligence and big data technology, machine learning algorithms are widely used in fields such as image recognition, intelligent security, automatic driving, and medical image analysis. However, conventional image classification models typically require centralized training on annotated data containing all classes, and retraining of the entire model whenever the data distribution changes or new classes appear, which is impractical in real scenarios where data continues to grow and classes dynamically expand. For example, with new goods on an e-commerce platform, new recognition targets in a monitoring system, or personalized album classification in mobile applications, the traditional approach must regularly collect the full data and retrain on it, incurring high computational and time costs and making real-time response and efficient updating difficult to achieve. To solve this problem, class-incremental image classification (Class-Incremental Image Classification) was developed. Class-incremental image classification aims to enable a model to learn from the data of newly added classes only, without retraining on existing classes, while preserving the recognition capability of the learned classes as much as possible. This significantly reduces the computational cost and storage requirements, improves the model's capability for continuous learning and adaptation, and is better suited to real-world dynamic environments in which data arrives as a stream and categories grow step by step.
However, class-incremental image classification faces a core challenge: during the learning of new classes, model parameter updates tend to overwrite or attenuate the representations of already-learned classes, resulting in significant performance degradation, namely the catastrophic forgetting (Catastrophic Forgetting) phenomenon. This causes a fundamental contradiction in class incremental learning between stability, which refers to the ability of the model to retain old knowledge, and plasticity, which refers to its ability to adapt to new knowledge. A great deal of research has been devoted to alleviating the forgetting problem in pre-trained models. The paper "Zhou D W, Cai Z W, Ye H J, et al. Revisiting class-incremental learning with pre-trained models: Generalizability and adaptivity are all you need[J]. International Journal of Computer Vision, 2025, 133(3): 1012-1032." by Zhou et al. revisits class-incremental learning with pre-trained models; the paper "Zhou K, Yang J, Loy C C, et al. Learning to prompt for vision-language models[J]. International Journal of Computer Vision, 2022, 130(9): 2337-2348." by Zhou et al. learns how to construct prompts (L2P) for a vision-language model, combining the most relevant prompts retrieved according to the input, thereby slowing down forgetting without replaying data; the paper "Wang Z, Zhang Z, Ebrahimi S, et al. Dualprompt: Complementary prompting for rehearsal-free continual learning[C]//European conference on computer vision. Cham: Springer Nature Switzerland, 2022: 631-648." by Wang et al. (DualPrompt) decomposes prompts into two complementary types, "general + expert", taking both stability and plasticity into account in rehearsal-free continual learning; and the paper "Smith J S, Karlinsky L, Gutta V, et al. Coda-prompt: Continual decomposed attention-based prompting for rehearsal-free continual learning[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023: 11909-11919." by Smith et al. (CODA-Prompt) generates and selects prompts in a "decomposed attention" manner to improve the continual-task capability of the pre-trained model. In recent years, the advent of large-scale vision-language pre-training models has provided a powerful new basis for class-incremental image classification. CLIP (see the article "Alec Radford, et al. Learning transferable visual models from natural language supervision. International conference on machine learning. PMLR, 2021.", Alec Radford et al., learning transferable visual models from natural language supervision) is a classical vision-language large model that builds aligned cr