CN-117541855-B - Interactive semantic perception self-learning framework and interpretable visual recognition method
Abstract
The invention discloses an interactive semantic perception self-learning framework and an interpretable visual recognition method. The framework comprises a teacher module and a student module: the student module performs visual recognition, while the teacher module provides semantic guidance to the student module. The student module inputs computed slice-feature pairs into the teacher module, and the teacher module outputs computed semantically rich slices back to the student module. The teacher module comprises a first encoder, a category concept library and a similarity comparison sub-module; the student module comprises a slice library, a second encoder and a feature selection sub-module. The method accurately captures features at different granularities, offers strong generalization and accuracy, enhances the interpretability of aligning abstract semantic concepts with specific image regions, and is compatible with networks of different structures.
Inventors
- WANG KEZE
- JIANG HAO
- LIN JING
- LI HAOWEI
- CHEN JUNHAO
- WAN WENTAO
- XUE LEI
Assignees
- Sun Yat-sen University (中山大学)
Dates
- Publication Date
- 20260508
- Application Date
- 20231101
Claims (9)
- 1. An interpretable visual recognition method based on an interactive semantic perception self-learning system, characterized by comprising the following steps: S1, segmenting images, inputting the segmented images into a teacher module and a student module respectively, and initializing the teacher module and the student module; S2, the student module performing feature selection on the segmented images to generate slice-feature pairs; S3, querying the corresponding indexes in the teacher module with the slice-feature pairs and generating patch-specific feature responses; S4, the teacher module comparing the feature responses with the indexed concept features and identifying semantically rich slices, the specific process being as follows: (1) Similarity comparison: construct a similarity matrix S with B rows and N columns, where B is the batch size of the input batch to be processed and N is the number of slices in the original input sample, and compute each entry with cosine similarity, s_ij = (r_ij · c_{y_i}) / (‖r_ij‖ ‖c_{y_i}‖), where r_ij denotes the feature response of the j-th patch of the i-th sample in the input batch, c_{y_i} denotes the concept patch feature of the corresponding category, and the dot product sums over the corresponding j-th elements of the feature response and the concept feature vector; (2) Semantic patch optimization: further select in each sample the top-K patch features with the highest similarity as semantic patches, where x̃_i denotes the new sample after semantic patch optimization, x_i denotes the sample before semantic patch optimization, top-K denotes the K patch features with the highest similarity, p_i denotes the selected semantic patch features, and p̂_i denotes the normalized semantic patch features; (3) Sample optimization: construct a sample similarity vector v, a B-dimensional vector each element of which represents the similarity between the current sample feature and the class representation of the current sample's category in the class concept library, where B is the batch size of the input samples, and then select the sample features with the highest similarity; after the optimization, the teacher module generates the patches with semantics, i.e. the semantically rich slices; S5, updating the teacher module and the student module with the semantically rich slices; S6, classifying the pictures with the updated student module.
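The similarity comparison and semantic patch optimization of claim 1 can be sketched as follows. This is a minimal NumPy illustration, not the patent's implementation; names such as `select_semantic_patches`, `patch_features` and `concept` are assumptions introduced here:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_semantic_patches(patch_features, concept, k):
    """Rank a sample's N patch features against its category concept
    feature and keep the top-k most similar ones ("semantic patches")."""
    sims = np.array([cosine_similarity(f, concept) for f in patch_features])
    top_idx = np.argsort(sims)[::-1][:k]    # indices of the k highest similarities
    selected = patch_features[top_idx]
    # L2-normalize the selected semantic patch features.
    normalized = selected / np.linalg.norm(selected, axis=1, keepdims=True)
    return top_idx, normalized

rng = np.random.default_rng(0)
patches = rng.normal(size=(6, 4))   # N=6 patch features, dimension 4
concept = rng.normal(size=4)        # category concept feature
idx, sem = select_semantic_patches(patches, concept, k=3)
print(idx.shape, sem.shape)         # (3,) (3, 4)
```

The same top-K idea extends from patches within a sample to whole samples within a batch for the sample optimization step.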
- 2. The method for interpretable visual recognition based on the interactive semantic perception self-learning system according to claim 1, wherein in step S1 the images are segmented, input respectively into the teacher module and the student module, and the teacher module and the student module are initialized as follows: the input image is divided into slices of a preset size and forwarded to the student module and the teacher module; the student module stores the received slices into a slice library, and the teacher module inputs the received slices into an encoder for feature extraction and updates the category concept library with the obtained features.
- 3. The method for interpretable visual recognition based on the interactive semantic perception self-learning system according to claim 1, wherein in step S2 the student module performs feature selection on the segmented image to generate slice-feature pairs as follows: the student module inputs the slices in the slice library into the encoder in the student module for feature extraction, and inputs the obtained features into the feature selection sub-module to generate slice-feature pairs; first, an attention weight matrix A is obtained according to an attention mechanism, where x is the original input sample, θ represents the parameters in the encoder, ps denotes the patch selection process applied to the input original sample, N is the number of slices in the input original sample, and patches are indexed by i and j; second, the K elements with the highest attention weights are identified; finally, the features of the input image corresponding to the row and column indices of the top-K values in the attention matrix are selected.
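The attention-based patch selection of claim 3 can be sketched as below. This is a toy self-attention scoring, assumed for illustration only; the patent does not specify the exact attention form, and all names here are hypothetical:

```python
import numpy as np

def attention_patch_selection(features, k):
    """Toy attention-based patch selection: score the N patch features
    with a self-attention-style weight matrix, then return the row/column
    indices of the k largest attention weights."""
    n, d = features.shape
    # Attention weight matrix A = softmax(F F^T / sqrt(d)), shape (N, N).
    logits = features @ features.T / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # (row, col) indices of the k highest attention weights.
    flat = np.argsort(weights, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat, weights.shape)
    return list(zip(rows.tolist(), cols.tolist()))
```

The returned index pairs identify which patch features the student module keeps for the slice-feature pairs.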
- 4. The method for interpretable visual recognition based on the interactive semantic perception self-learning system according to claim 1, wherein step S3, querying the corresponding indexes in the teacher module with the slice-feature pairs and generating patch-specific feature responses, is performed as follows: (1) Query the corresponding indexes in the category concept library using the slice-feature pairs: the teacher module maintains a category concept library matrix C, each row of which is the feature vector obtained after the corresponding category passes through the teacher module; the concept feature of a sample x_i is looked up as c_{y_i} = Cᵀ e_{y_i}, where y_i denotes the category index of the sample x_i in the matrix C, and e_{y_i} is a one-hot column vector representing the category of x_i; (2) Generate feature responses using the slice-feature pairs: the teacher module is set to know part of the student's parameters; it receives the selected features from the feature selection sub-module as input and generates the feature response r, where θ_T is a learnable parameter of the teacher module and g(·) denotes the feature response generation process.
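The concept lookup and feature response generation of claim 4 can be sketched as follows. The one-hot lookup is equivalent to plain row indexing; the linear response head `feature_response` is an assumed stand-in, since the patent does not disclose the teacher's exact response function:

```python
import numpy as np

def lookup_concept(concept_matrix, label):
    """Fetch a category's concept vector via a one-hot lookup,
    c = C^T e_y, which is equivalent to indexing row `label` of C."""
    one_hot = np.zeros(concept_matrix.shape[0])
    one_hot[label] = 1.0
    return concept_matrix.T @ one_hot

def feature_response(selected_features, teacher_weights):
    # Hypothetical linear response head of the teacher module:
    # maps selected patch features to patch-specific feature responses.
    return selected_features @ teacher_weights
```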
- 5. The method for interpretable visual recognition based on the interactive semantic perception self-learning system according to claim 1, wherein the updating of the teacher module and the student module with the semantically rich slices in step S5 is performed as follows: (1) Update the category concept library using the semantically rich slices: the semantic patches p̂_i are used to update the category concept library according to the momentum mechanism, c_{y_i} ← m · c_{y_i} + (1 − m) · p̂_i, where C denotes the category concept library matrix and m is a hyperparameter that balances, when updating the category concept library, the weights of the corresponding category concept representation in the current sample and in the maintained category concept library; (2) the slice library is updated with the semantically rich slices.
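The momentum update of the category concept library in claim 5 can be sketched as a single blending step. This is a minimal sketch assuming the concept library is a matrix with one row per category; the function name and default m are illustrative:

```python
import numpy as np

def momentum_update(concept_matrix, label, semantic_feature, m=0.9):
    """Momentum update of one category's concept representation:
    c_y <- m * c_y + (1 - m) * p_hat, with hyperparameter m balancing
    the maintained concept against the current sample's semantic patch."""
    updated = concept_matrix.copy()
    updated[label] = m * concept_matrix[label] + (1.0 - m) * np.asarray(semantic_feature)
    return updated
```

A larger m makes the library change slowly and smooths noise from individual batches.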
- 6. The method for interpretable visual recognition based on the interactive semantic perception self-learning system according to claim 1, wherein step S6 uses the updated student module to classify the pictures as follows: the labels of the pictures are predicted using the semantically rich slices as input data of the student module, where θ_S is a learnable parameter of the student module, h_ps(·) denotes the prediction process with patch selection, and ŷ denotes the predicted label.
- 7. The method for interpretable visual recognition based on the interactive semantic perception self-learning system according to claim 6, wherein the step of predicting the labels of the pictures using the semantically rich slices as input data of the student module introduces a contrastive semantic loss as follows: the new loss function is expressed as L = L_ce + β · L_con, where β is a weight factor, L_ce is the original cross-entropy loss, and L_con is a contrastive learning loss defined over pairs of processed images: x̃_a and x̃_b are processed images for which patch selection has been completed, sim(x̃_a, x̃_b) is the cosine similarity of x̃_a and x̃_b, an indicator on the labels y_a and y_b of the two compared samples determines whether the pair is treated as similar, and N is the number of slices in the input original sample.
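The contrastive semantic loss of claim 7 can be illustrated with one standard formulation: same-label pairs are pulled toward high cosine similarity and different-label pairs are pushed toward low similarity. The patent's exact formula is not recoverable from the text, so this sketch is an assumption, not the claimed definition:

```python
import numpy as np

def contrastive_semantic_loss(feats, labels):
    """Illustrative pairwise contrastive loss over sample features:
    penalize low similarity for same-label pairs and high similarity
    for different-label pairs, averaged over all pairs."""
    n = len(feats)
    total, pairs = 0.0, 0
    for a in range(n):
        for b in range(a + 1, n):
            sim = np.dot(feats[a], feats[b]) / (
                np.linalg.norm(feats[a]) * np.linalg.norm(feats[b]))
            total += (1.0 - sim) if labels[a] == labels[b] else max(0.0, sim)
            pairs += 1
    return total / max(pairs, 1)

def total_loss(ce_loss, con_loss, beta=0.5):
    # Combined objective L = L_ce + beta * L_con from claim 7.
    return ce_loss + beta * con_loss
```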
- 8. An interactive semantic perception self-learning system for the interpretable visual recognition method according to any one of claims 1 to 7, wherein the system comprises a teacher module and a student module; the student module is used for visual recognition, and the teacher module is used for providing semantic guidance to the student module; the student module inputs computed slice-feature pairs into the teacher module, and the teacher module outputs computed semantically rich slices to the student module; the teacher module comprises a first encoder, a category concept library and a similarity comparison sub-module, wherein the first encoder is used for extracting features of an image, the category concept library is used for storing category concepts, and the similarity comparison sub-module is used for generating semantically rich slices, i.e. semantic guidance; the student module comprises a slice library, a second encoder and a feature selection sub-module, wherein the slice library is used for storing the image slices used for classification, and the feature selection sub-module is used for performing feature selection on the image slice features generated by the second encoder.
- 9. The interactive semantic perception self-learning system of claim 8, wherein the second encoder of the student module and the first encoder of the teacher module are the same encoder with the same network structure, so that the teacher module and the student module share one encoder; the encoder is a pre-trained feature extraction network, and the feature extraction network is a CNN, ViT or Swin-Transformer; the category concept library stores one global concept feature per image category, i.e. a category concept, and its structure is a dictionary comprising keys and values, wherein the keys are categories and the values are the global feature vectors corresponding to the categories.
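The dictionary-structured category concept library of claim 9 can be sketched as a small class that maps category keys to global feature vectors, reusing the momentum blend from claim 5 for updates. The class and method names are illustrative assumptions:

```python
import numpy as np

class ConceptLibrary:
    """Dictionary-structured category concept library: keys are class
    labels, values are global concept feature vectors for those classes."""

    def __init__(self, dim):
        self.dim = dim
        self.store = {}

    def update(self, label, feature, m=0.9):
        # Insert a new class concept, or momentum-blend into an existing one.
        feature = np.asarray(feature, dtype=float)
        if label not in self.store:
            self.store[label] = feature
        else:
            self.store[label] = m * self.store[label] + (1.0 - m) * feature

    def get(self, label):
        return self.store[label]
```

This mirrors the claimed key/value layout: a lookup by category returns that category's global feature vector.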
Description
Interactive semantic perception self-learning framework and interpretable visual recognition method Technical Field The invention relates to the technical field of interpretable visual recognition in computers, in particular to an interactive semantic perception self-learning framework and an interpretable visual recognition method. Background Interpretable visual recognition aims to generate interpretable and transparent feature representations, thereby enhancing the interpretability and traceability of visual recognition models. Currently, there are two main directions in interpretable visual recognition research: self-interpretation models and post-hoc analysis methods. A self-explanatory model is a model with built-in transparency and interpretability, from which the concepts generated by the visual recognition model can be extracted. Post-hoc analysis methods improve the interpretability of a model by analyzing its outputs, and some related studies have focused on learning the concentrated semantic regions of images using high-level convolution layers, such as ProtoPNet and ProtoPFormer. ProtoPNet is a method, based on a Convolutional Neural Network (CNN), that uses class-specific prototypes to accurately perceive and identify the discriminative parts of an object. ProtoPNet compares the input picture with the selected class-specific prototypes and computes their similarity to make the operation visually interpretable, but this method relies heavily on the choice of class prototypes, and that choice is relatively coarse. ProtoPFormer extends the ViT architecture with a prototype-based approach to achieve interpretable visual recognition. However, due to the lack of manual guidance in the model training process, the interpretability in aligning abstract semantic concepts with specific image regions is still limited, and the extraction of image regions is relatively coarse. 
Meanwhile, the existing interpretable methods are limited in their applicable frameworks and generally support only CNN or ViT. Disclosure of Invention The invention provides an interactive semantic perception self-learning framework and an interpretable visual recognition method, which overcome the defects of the prior art: the lack of manual guidance in the visual recognition model training process, the limited interpretability in aligning abstract semantic concepts with specific image regions, and the limited applicability of existing interpretable methods. The primary purpose of the invention is to solve these technical problems, and the technical scheme of the invention is as follows: The first aspect of the invention provides an interactive semantic perception self-learning framework, which comprises a teacher module and a student module; the student module is used for visual recognition, and the teacher module is used for providing semantic guidance to the student module; the student module inputs computed slice-feature pairs to the teacher module, and the teacher module outputs computed semantically rich slices to the student module; the teacher module comprises a first encoder, a category concept library and a similarity comparison sub-module, wherein the first encoder is used for extracting features of an image, the category concept library is used for storing category concepts, and the similarity comparison sub-module is used for generating semantically rich slices, i.e. semantic guidance; the student module comprises a slice library, a second encoder and a feature selection sub-module, wherein the slice library is used for storing the image slices used for classification, and the feature selection sub-module is used for performing feature selection on the image slice features generated by the second encoder. 
Further, the encoder is a pre-trained feature extraction network, and the feature extraction network can be a CNN, ViT or Swin-Transformer, wherein the second encoder of the student module and the first encoder of the teacher module are the same encoder with the same network structure, and the teacher module and the student module can share one encoder; the category concept library stores one global concept feature per image category, i.e. a category concept, and its structure is a dictionary comprising keys and values, wherein the keys are categories and the values are the global feature vectors corresponding to the categories. The second aspect of the invention provides an interpretable visual recognition method based on an interactive semantic perception self-learning framework, which comprises the following steps: S1, segmenting images, inputting the segmented images into a teacher module and a student module respectively, and initializing the teacher module and the student module; S2, the student module performing feature selection on the segmented images to generate slice-feature pairs; S3, querying the corresponding indexes in the teacher module with the fragment-f