CN-121982430-A - Image classification method based on joint embedded prediction architecture of local perception and global alignment
Abstract
The invention discloses a learning method based on a joint embedding prediction architecture with local perception and global alignment. The method first reads image samples from a data loader and generates an original view and an enhanced view. The original view is input to a target encoder, while the original and enhanced views are input to a context encoder and a predictor; an embedding-prediction consistency loss is computed from the predicted tokens and the real target tokens. The predicted representations of the two views are aggregated and passed through a discrimination projection head to obtain global embeddings, and a local-global semantic consistency regularization loss is computed from the local prediction features and the global embeddings. An uncertainty prediction head adaptively weights the target prediction errors, and finally the losses are combined by preset weights into a total loss that is back-propagated to update the trainable parameters. The method optimizes the alignment between local embedding prediction and global discrimination, improving the robustness of the feature representations and downstream recognition performance.
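The overall flow summarized in the abstract can be sketched as a single loss computation. The sketch below is a toy illustration only: the encoders, predictor and projection head are stand-in random linear maps (the patent uses learned deep encoders), and the contrastive and local-global terms are replaced by simple surrogates; the precise forms appear in claims 9, 11 and 12. The weights `w` and all shapes are assumptions, not the patent's values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the learned modules (random linear maps). The real
# method uses trained encoders, a predictor, and a discrimination
# projection head; these weights are purely illustrative.
D = 16
enc_tgt = rng.normal(size=(D, D))    # target encoder (frozen / EMA)
enc_ctx = rng.normal(size=(D, D))    # context encoder
predictor = rng.normal(size=(D, D))  # predictor over context features
proj = rng.normal(size=(D, D))       # discrimination projection head

def l2norm(z):
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def training_step(x, x_aug, w=(1.0, 1.0, 0.5)):
    """One loss computation following the abstract's flow: predict
    target tokens from context, aggregate predictions to a global
    embedding, and combine prediction, contrastive and consistency
    terms with preset weights w (values assumed)."""
    losses, globals_ = [], []
    for view in (x, x_aug):                       # tokens: (num_tokens, D)
        target = view @ enc_tgt                   # real target tokens
        pred = (view @ enc_ctx) @ predictor       # predicted tokens
        losses.append(np.mean((pred - target) ** 2))  # embedding-prediction loss
        h = pred.mean(axis=0)                     # aggregate local predictions
        globals_.append(l2norm(h @ proj))         # global embedding
    l_pred = float(np.mean(losses))
    l_con = 1.0 - float(globals_[0] @ globals_[1])          # toy contrastive surrogate
    l_lg = float(np.mean((globals_[0] - globals_[1]) ** 2))  # toy consistency surrogate
    return w[0] * l_pred + w[1] * l_con + w[2] * l_lg
```

The weighted-sum structure of the return value corresponds to the total loss of step S5 in claim 1.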
Inventors
- Li Youhuizi
- Hu Shuyuan
- Yin Yuyu
- Liang Tingting
- Sun Qianqian
- Li Yu
Assignees
- Hangzhou Dianzi University (杭州电子科技大学)
Dates
- Publication Date
- 20260505
- Application Date
- 20260403
Claims (13)
- 1. An image classification method based on a joint embedding prediction architecture with local perception and global alignment, characterized by comprising the following steps: S1, acquiring a batch of image samples, preprocessing them to generate an original view X and a corresponding enhanced view X', and generating, for each of X and X', a context index set indicating the visible region and a target index set indicating the occluded region; S2, inputting X and X' respectively to a target encoder to obtain real target representations, inputting X and X' to a context encoder to obtain context features, and inputting the context features together with target position information to a predictor to obtain predicted representations of the target regions; S3, performing uncertainty-aware dynamic feature fusion on the predicted representations and the real target representations to generate optimized target representations; S4, computing a prediction consistency loss from the optimized target representations and the real target representations, computing a contrast learning loss from global embedded representations obtained by aggregating the optimized target representations, and computing a local-global semantic consistency loss that constrains the global embedded representations to remain consistent with the local semantic features; S5, constructing a total loss function as a weighted sum of the prediction consistency loss, the contrast learning loss and the local-global semantic consistency loss, and updating model parameters based on the total loss function; and S6, repeating steps S1 to S5 for iterative training to obtain a trained joint embedding prediction model, and applying the model to perform classification tasks.
- 2. The method according to claim 1, wherein in step S1, the enhanced view X' is obtained by applying to the original view X at least one data augmentation operation among color jittering, rotation, random cropping, random horizontal flipping, and Gaussian blur.
- 3. The method according to claim 1, wherein in step S1, the original view X and the enhanced view X' generate a context index set indicating a visible region and a target index set indicating an occluded region by a random mask or a block mask strategy, respectively.
- 4. The method according to claim 1, wherein in step S2, the original view X and the enhanced view X' are each input to a target encoder, and the real target representations are extracted at the positions given by the target index set; the image regions designated by the context index set in X and X' are input to a context encoder to obtain context features, and the context features together with target position information are input to a predictor to obtain the predicted representations of the target regions.
- 5. The method according to claim 1, wherein in step S2, the parameters of the target encoder are either kept frozen during training or updated from the parameters of the context encoder by exponential moving average.
- 6. The method according to claim 1, wherein the uncertainty-aware dynamic feature fusion in step S3 specifically comprises: estimating, through a trainable uncertainty prediction head, the prediction uncertainty corresponding to the predicted representation and the measurement uncertainty corresponding to the real target representation, respectively; calculating a fusion gain based on the prediction uncertainty and the measurement uncertainty, wherein the fusion gain is a function of the prediction uncertainty, the measurement uncertainty, and a contrast learning temperature parameter; and performing a weighted summation of the predicted representation and the real target representation according to the fusion gain to generate the optimized target representation.
- 7. The method according to claim 1, wherein step S4 comprises the following sub-steps: S4-1, aggregating the optimized target representations to obtain a sample-level semantic feature vector, and inputting it into a discrimination projection head to obtain a normalized global embedded representation; S4-2, calculating the contrast learning loss based on the global embedded representations of the original view X and the enhanced view X', so as to enhance discriminability among different samples; S4-3, calculating the prediction consistency loss based on the difference between the optimized target representation and the real target representation; S4-4, calculating the local-global semantic consistency loss based on the sample-level semantic feature vector and the intermediate vector output by the discrimination projection head, so as to constrain the global embedded representation to remain faithful to the local prediction semantics.
- 8. The method according to claim 7, wherein in the step S5, a total loss function is constructed according to a weighted sum of the prediction consistency loss, the contrast learning loss and the local-global semantic consistency loss, and parameters of the context encoder, the predictor and the discriminating projection head are updated based on the total loss function.
- 9. The method according to claim 7, wherein in step S4-3, the prediction consistency loss is calculated using a smooth L1 loss function, namely the average, over all targets in the batch, of the smooth L1 norm of the difference between the optimized target representation and the real target representation.
- 10. The method of claim 7, wherein in the step S4-1, the aggregation is an average pooling operation along a target dimension, and the discriminating projection head comprises at least one affine transformation layer and a nonlinear activation function for mapping the sample-level semantic feature vector to a contrast learning space and performing L2 norm normalization to obtain the global embedded representation of a unit length.
- 11. The method of claim 7, wherein in step S4-2, the contrast learning loss is calculated using an InfoNCE loss function, constructed by taking the global embedded representations of different views of the same image as positive sample pairs, taking the global embedded representations of different images within a batch as negative samples, and taking as the loss value the negative logarithm of the ratio of the (exponentiated) cosine similarity of the positive pair to the exponentially weighted sum of the cosine similarities of all negative pairs.
- 12. The method according to claim 7, wherein in step S4-4, the local-global semantic consistency loss is obtained by calculating the mean square error between a linear mapping of the sample-level semantic feature vector and the pre-normalization intermediate vector output by the discrimination projection head.
- 13. The method according to claim 1, wherein in step S5, the total loss function is the weighted sum of the prediction consistency loss, the contrast learning loss and the local-global semantic consistency loss, with preset weight coefficients.
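The uncertainty-aware dynamic feature fusion of claim 6 can be sketched as follows. The patent only states that the fusion gain is a function of the two uncertainties and the contrast learning temperature; the softmax-over-negative-uncertainties form used here is an assumed instantiation, and `tau` is an assumed hyperparameter.

```python
import numpy as np

def uncertainty_fusion(pred, target, sigma_pred, sigma_tgt, tau=0.1):
    """Fuse predicted and real target tokens (claim 6).

    pred, target: (B, D) predicted and real target representations.
    sigma_pred, sigma_tgt: (B,) per-token uncertainties from the
    trainable uncertainty prediction head. The gain formula below is
    an assumed form: a softmax over negative uncertainties, scaled by
    the contrast learning temperature tau.
    """
    # Lower uncertainty -> larger weight for that branch.
    w_pred = np.exp(-sigma_pred / tau)
    w_tgt = np.exp(-sigma_tgt / tau)
    gain = w_pred / (w_pred + w_tgt)            # fusion gain in (0, 1)
    # Weighted sum of the predicted and real target representations.
    return gain[..., None] * pred + (1.0 - gain[..., None]) * target
```

With equal uncertainties the gain is 0.5 and the fusion reduces to an average; as the prediction becomes more confident than the measurement, the optimized target representation moves toward the prediction.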
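The contrast learning loss of claim 11 is a standard InfoNCE over the two views' global embeddings: positives are the same image across views, negatives are the other images in the batch. A minimal sketch, assuming L2-normalized inputs and an illustrative temperature `tau`:

```python
import numpy as np

def info_nce(z1, z2, tau=0.2):
    """InfoNCE over global embeddings of two views (claim 11).

    z1, z2: (B, D) L2-normalized global embeddings of views X and X'.
    Positive pairs sit on the diagonal of the similarity matrix;
    all off-diagonal entries in a row are negatives.
    """
    sim = z1 @ z2.T / tau                         # cosine similarities (unit-norm inputs)
    sim = sim - sim.max(axis=1, keepdims=True)    # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the positive pair, averaged over the batch.
    return -np.mean(np.diag(log_prob))
```

As a sanity check, with perfectly aligned positives and orthogonal negatives at `tau=1.0` and batch size 2, the loss is `log(1 + e^-1)`.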
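Claims 9 and 12 define the remaining two loss terms. The sketch below follows the common smooth L1 definition with switch point `beta` (an assumed hyperparameter), and models claim 12's linear mapping with explicit parameters `W`, `b` (hypothetical names for the learnable map):

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 prediction consistency loss (claim 9), averaged over
    all target tokens in the batch. beta is the usual quadratic/linear
    switch point (assumed hyperparameter)."""
    d = np.abs(pred - target)
    per_elem = np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta)
    return per_elem.mean()

def local_global_consistency(h_local, h_mid, W, b):
    """Local-global semantic consistency loss (claim 12): MSE between
    a linear mapping of the sample-level semantic vector h_local and
    the pre-normalization intermediate vector h_mid of the
    discrimination projection head. W, b name the (assumed) learnable
    linear-map parameters."""
    return np.mean((h_local @ W + b - h_mid) ** 2)
```

Both terms vanish when the compared representations agree, which is the alignment behavior the claims constrain.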
Description
Image classification method based on joint embedded prediction architecture of local perception and global alignment
Technical Field
The invention relates to the technical field of computer vision and deep learning, and in particular to an image classification method based on a joint embedding prediction architecture with local perception and global alignment. It is mainly applicable to self-supervised visual representation learning tasks and can be used for feature pre-training for image classification.
Background
In recent years, self-supervised learning methods based on deep neural networks have made remarkable progress in computer vision; their aim is to learn transferable, general-purpose visual representations from large-scale image data without manual labels. Existing mainstream self-supervised learning methods develop mainly along two technical routes: predictive methods and contrastive methods. Predictive self-supervised methods (e.g., I-JEPA, the Image Joint-Embedding Predictive Architecture) construct context-prediction tasks that train the model to predict the semantic representations of occluded regions (targets) from partially visible image regions (contexts). Because prediction takes place in the embedding space, these methods avoid the high-frequency-noise problems of direct pixel-level reconstruction and can effectively learn the local semantic structure and content correlations of an image. However, such methods usually focus on modeling and semantic recovery of local contexts, and their learning objective does not explicitly encourage the model to learn globally discriminative feature representations, resulting in limited performance when the learned features are applied directly to downstream tasks that rely on global feature discriminability, such as image classification and retrieval.
On the other hand, contrastive self-supervised methods construct different augmented views of an image as positive sample pairs and enlarge their distance from other image samples, guiding the model to learn global embedded representations that are robust to image transformations and highly discriminative. Such methods achieve excellent performance on downstream classification tasks. However, standard contrastive learning typically operates directly on the global feature vectors output by the encoder, and its learning process easily ignores the local semantic structure inside the image. More importantly, when the contrastive learning objective over-emphasizes inter-instance discriminability, the global embedded representation may be driven to drift gradually away from the image's original internal structure of local semantic content, producing a semantic drift phenomenon: the global representation is discriminative but loses consistency with the local content of the original image, which harms the interpretability of the representation and its generalization on fine-grained understanding tasks. The prior art therefore faces a prominent contradiction: predictive methods are good at learning faithful local semantics but lack global discriminability, while contrastive methods obtain strongly discriminative global representations but easily sacrifice local semantic consistency.
Under a self-supervised learning framework, cooperatively exploiting the advantages of the prediction task and the contrast task, so that the model learns accurate local semantic representations through the prediction mechanism, obtains strongly discriminative global embeddings through the contrast mechanism, and ensures that the globally discriminative features are built strictly on the basis of the locally predicted semantics, thereby avoiding semantic drift, is the technical problem urgently to be solved at present.
Disclosure of Invention
The invention aims, in an unsupervised/self-supervised visual representation learning setting that relies only on image samples and uses no manual labels, to improve the model's local semantic prediction ability and global embedding discriminability, and to alleviate the semantic drift that may arise after contrast learning is jointly introduced, in which the global embedding deviates from the local prediction semantic structure. To this end, the invention provides an image classification method based on a joint embedding prediction architecture with local perception and global alignment: on top of the I-JEPA embedding-space prediction mechanism, it introduces a view contrast enhancement scheme based on prediction features and adds a regularization constraint on local-global semantic consistency, so that the global embedding faithfully reflects the semantic information learned by the local prediction task, thereby improving the sta