
CN-122023999-A - Online distillation-based universal visual encoder-decoder pre-training method

CN-122023999-A

Abstract

The invention provides a universal visual encoder-decoder pre-training method based on online distillation, called GLID, which aims to improve model performance on a variety of downstream visual tasks by minimizing the difference between the pre-training and fine-tuning architectures. The method uses masked image modeling in the pre-training stage and, at fine-tuning time, replaces only the topmost linear transformation layer to adapt to a specific task, reducing the need for task-specific architectures. GLID provides high-level semantic information through online distillation, so the model attends to richer high-level semantics; compared with traditional methods that only predict low-level signals, the performance of the pre-trained model is significantly improved. Experimental results show that GLID achieves performance comparable to or better than specialized models on tasks such as object detection, image segmentation, pose estimation, and depth estimation, demonstrating its broad application potential and efficiency in the field of computer vision.
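The head-swap idea in the abstract (pre-train the whole encoder-decoder, then replace only the topmost linear layer per task) can be illustrated with a minimal sketch. This is a toy stand-in with hypothetical names and dimensions, not the patented implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

class GLIDSketch:
    """Toy stand-in for the pre-trained encoder-decoder; only `head` is swapped per task."""

    def __init__(self, dim=16):
        self.trunk = rng.standard_normal((dim, dim)) * 0.1  # frozen pre-trained weights
        self.head = rng.standard_normal((dim, dim)) * 0.1   # pre-training pixel-prediction head

    def replace_head(self, out_dim):
        """Fine-tuning: swap in a task-specific linear head; the trunk is untouched."""
        self.head = rng.standard_normal((self.trunk.shape[1], out_dim)) * 0.1

    def forward(self, x):
        return (x @ self.trunk) @ self.head

model = GLIDSketch()
pretrain_out = model.forward(rng.standard_normal((2, 16)))  # shape (2, 16): pixel targets
model.replace_head(out_dim=4)                               # e.g. 4 box coordinates
finetune_out = model.forward(rng.standard_normal((2, 16)))  # shape (2, 4): task output
```

Because everything except `head` is reused verbatim, the pre-training and fine-tuning architectures differ only in that last linear layer, which is the gap the method claims to minimize.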

Inventors

  • DOU QI
  • LI HONGSHENG
  • LIU JIHAO

Assignees

  • The Chinese University of Hong Kong

Dates

Publication Date
2026-05-12
Application Date
2024-10-31

Claims (15)

  1. A universal visual encoder-decoder pre-training method based on online distillation, comprising the steps of: S1, receiving an input image, dividing the input image into image patches and converting the image patches into image tokens, wherein each image token represents one part of the input image; S2, applying a random mask to the image tokens, selectively masking part of the image tokens to create masked image tokens and visible image tokens; S3, processing the visible image tokens with a Transformer encoder, extracting image features and outputting feature representations; S4, receiving, by a Transformer decoder, the feature representations output by the encoder, and reconstructing image features using a query-and-decode mechanism, wherein the input queries comprise the position information of the masked image tokens and an additional [CLS] token for capturing a global representation, so as to jointly guide the decoder to reconstruct the features of the masked regions; S5, converting the feature representations output by the decoder into predicted pixel values through a pre-training linear transformation layer, reconstructing the positions of the masked image tokens; S6, using the features output by a pre-trained teacher network as targets, computing the loss between the predicted pixel values and the original pixel values corresponding to the masked image tokens to train the Transformer encoder and decoder serving as the student network, wherein the teacher network provides target features containing high-level semantic information through online distillation to guide the training of the student network, so that the model attends to richer high-level semantics and the pre-training effect is improved; S7, after pre-training is complete, replacing the pre-training linear transformation layer with a task-specific linear head for fine-tuning on a downstream task so as to meet the requirements of the specific task, keeping the parameters of the Transformer encoder and decoder unchanged during fine-tuning and updating only the parameters of the task-specific linear head, so as to minimize the architectural difference between pre-training and fine-tuning.
  2. The universal visual encoder-decoder pre-training method based on online distillation of claim 1, wherein the parameters of the teacher network are updated by means of a moving average during online distillation.
  3. The universal visual encoder-decoder pre-training method based on online distillation according to claim 1 or 2, wherein the Transformer encoder is a multi-scale encoder architecture for extracting visual features at different scales.
  4. The universal visual encoder-decoder pre-training method based on online distillation according to any one of claims 1 to 3, wherein the pre-training task and the downstream tasks are uniformly modeled as a "query-answer" problem, wherein in the pre-training task the "query" is the location of the mask and the "answer" is the feature provided by the online teacher network at that masked location, while in a downstream task the "query" is the question corresponding to the downstream task and the "answer" is the target of the downstream task.
  5. The universal visual encoder-decoder pre-training method based on online distillation of claim 4, wherein the Transformer decoder comprises a cross-attention mechanism for decoding the "answers" of different tasks.
  6. The universal visual encoder-decoder pre-training method based on online distillation of any one of claims 1 to 5, wherein in step S4 a number of mask tokens are introduced as input queries to the decoder, with different positional embeddings added to them to indicate the unique masked locations, while an additional [CLS] token is added to the queries; this token is not associated with a specific masked location but is used to capture a global representation.
  7. The universal visual encoder-decoder pre-training method based on online distillation of claim 6, wherein in step S4 the queries are initialized to include multiple copies of the [CLS] token, each copy being used to capture a global representation of the input image; a learnable embedding is added to each [CLS] copy so that the model can distinguish between different queries and accommodate various downstream tasks; and the initialized queries are input to the Transformer decoder, which uses them together with the encoder's output feature representations to reconstruct the features of the masked regions.
  8. The universal visual encoder-decoder pre-training method based on online distillation of any one of claims 1 to 7, wherein in step S6 a mean squared error (MSE) loss is computed between the predicted pixel values P and the original pixel values Pt corresponding to the masked image tokens, using the features output by the pre-trained teacher network as targets.
  9. The universal visual encoder-decoder pre-training method based on online distillation according to any one of claims 1 to 8, wherein in step S7 said task-specific linear head is designed for at least one of the following downstream tasks in the fine-tuning phase: object detection, wherein the linear head outputs bounding boxes and class probabilities; pose estimation, wherein the linear head outputs heatmaps of keypoints; depth estimation, wherein the linear head outputs a depth map; and image segmentation, wherein the linear head outputs a segmentation mask.
  10. The universal visual encoder-decoder pre-training method based on online distillation of claim 9, wherein: for the object detection task, each query represents one object instance, and two linear layers convert the hidden features of the Transformer decoder into bounding boxes and class probabilities, following the design of DETR; for the image segmentation task, each query predicts one C-dimensional mask embedding and its class, and a binary mask prediction is obtained by taking the dot product of the mask embedding with the 1/4-scale feature map of the backbone network followed by sigmoid activation, following the design of Mask2Former; for the depth estimation task, depth regression is modeled as a "classification-regression" problem in which continuous predictions are obtained as linear combinations of bin centers: the hidden features of the queries are converted into bin lengths and bin embeddings by a depth head following the designs of AdaBins and BinsFormer, a probability distribution map is obtained by taking the dot product of the bin embeddings with the 1/4-scale feature map of the backbone network followed by a softmax function, and the depth map is finally obtained as a linear combination of the bin centers; for the pose estimation task, a heatmap-based approach is adopted in which each query outputs a heatmap of keypoints, and a pose head converts the query features into C-dimensional feature vectors representing the heatmaps of different keypoints, following ViTPose.
  11. The universal visual encoder-decoder pre-training method based on online distillation according to any one of claims 1 to 10, wherein the fine-tuning process is optimized using the following loss functions: for the object detection and image segmentation tasks, a bipartite matching loss is adopted; for the depth estimation task, a scale-invariant regression loss is adopted; for the pose estimation task, a smooth L1 loss is adopted; and during fine-tuning the pre-trained weights are retained and only the task-specific linear head parameters are adjusted.
  12. The universal visual encoder-decoder pre-training method based on online distillation according to any one of claims 1 to 11, wherein the Transformer encoder serves as the backbone network that processes the input image and extracts deep feature representations; feature integration is enhanced through a BiFPN structure to improve the fusion of multi-scale features; and the Transformer decoder, together with its associated query mechanism, reconstructs image features from the encoder output, strengthening the model's understanding and reconstruction of the input image.
  13. The universal visual encoder-decoder pre-training method based on online distillation according to any one of claims 1 to 12, wherein the Transformer encoder uses a Swin Transformer as the visual encoder.
  14. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the universal visual encoder-decoder pre-training method based on online distillation of any one of claims 1 to 13.
  15. A computer program product comprising a computer program which, when executed by a processor, implements the universal visual encoder-decoder pre-training method based on online distillation according to any one of claims 1 to 13.
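The teacher update in claim 2 and the loss in claim 8 can be sketched together. This is a hedged illustration with hypothetical parameter names; the claims fix neither the momentum value nor the parameter layout:

```python
import numpy as np

def ema_update(teacher, student, momentum=0.996):
    """Claim 2: teacher parameters track the student via a moving average."""
    return {k: momentum * teacher[k] + (1.0 - momentum) * student[k] for k in teacher}

def mse_loss(pred, target):
    """Claim 8: mean squared error between predictions P and teacher targets Pt."""
    return float(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))

student = {"w": np.ones(4)}
teacher = {"w": np.zeros(4)}
teacher = ema_update(teacher, student, momentum=0.9)  # each weight becomes 0.1
loss = mse_loss(np.full(4, 2.0), np.zeros(4))         # mean of (2 - 0)^2 = 4.0
```

Because only the student receives gradients while the teacher is a slowly-moving copy, the teacher's features are a stable, semantically richer target than raw pixels.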

Description

Online distillation-based universal visual encoder-decoder pre-training method

Technical Field

The invention relates to computer vision, and in particular to a universal visual encoder-decoder pre-training method based on online distillation.

Background

Self-supervised pre-training on large-scale unlabeled images has achieved great success in visual representation learning. Representative masked image modeling (MIM) methods show that pre-trained models scale well and can significantly improve performance on downstream visual tasks, including image classification, object detection, image segmentation, pose estimation, depth estimation, and the like. Despite significant advances in MIM pre-training, existing approaches face an architectural gap between upstream pre-training and downstream fine-tuning. In particular, existing methods focus mainly on pre-training the visual backbone. However, to address the downstream visual tasks described above, task-specific sub-architectures are still needed, such as new decoders for object detection and segmentation decoders for panoptic segmentation. These task-specific sub-architectures are complex and must be trained from scratch on downstream tasks, and thus cannot enjoy the benefits of large-scale pre-training. This design lags behind progress in natural language processing, where pre-training tasks and downstream tasks can be handled by the same architecture with minimal differences. Recently, research on general-purpose architectures has advanced significantly for diverse visual tasks, achieving competitive performance compared with task-specific specialized models. However, only the visual backbone is pre-trained in these generic architectures, while heavyweight decoders still need to be trained from scratch on downstream tasks. Thus, these general methods typically require a large amount of task-specific data to achieve satisfactory performance.
It should be noted that the information disclosed in the above background section is only for understanding the background of the application, and thus may include information that does not form prior art already known to those of ordinary skill in the art.

Disclosure of Invention

The main object of the present invention is to overcome the drawbacks of the prior art described above and to provide a universal visual encoder-decoder pre-training method based on online distillation. To achieve the above purpose, the present invention adopts the following technical scheme: a universal visual encoder-decoder pre-training method based on online distillation, comprising the steps of: S1, receiving an input image, dividing the input image into image patches and converting the image patches into image tokens, wherein each image token represents one part of the input image; S2, applying a random mask to the image tokens, selectively masking part of the image tokens to create masked image tokens and visible image tokens; S3, processing the visible image tokens with a Transformer encoder, extracting image features and outputting feature representations; S4, receiving, by a Transformer decoder, the feature representations output by the encoder, and reconstructing image features using queries and a decoding mechanism, wherein the input queries comprise the position information of the masked image tokens and an additional [CLS] token for capturing a global representation, so as to jointly guide the decoder to reconstruct the features of the masked regions; S5, converting the feature representations output by the decoder into predicted pixel values through a pre-training linear transformation layer, reconstructing the positions of the masked image tokens; S6, using the features output by a pre-trained teacher network as targets, computing the loss between the predicted pixel values and the original pixel values corresponding to the masked image tokens to train the Transformer encoder and decoder serving as the student network, wherein the teacher network provides target features containing high-level semantic information through online distillation to guide the training of the student network, so that the model attends to richer high-level semantics and the pre-training effect is improved; S7, after pre-training is complete, replacing the pre-training linear transformation layer with a task-specific linear head for fine-tuning on a downstream task so as to meet the requirements of the specific task, keeping the parameters of the Transformer encoder and decoder unchanged during fine-tuning and updating only the parameters of the task-specific linear head, so as to minimize the architectural difference between pre-training and fine-tuning. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the universal visual encoder-decoder pre-training method based on online distillation.
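The S1-S6 pipeline above can be sketched end to end in numpy, with simple linear projections standing in for the Transformer encoder, decoder, and teacher network. All names, dimensions, and the 0.75 mask ratio are illustrative assumptions, not values fixed by the claims:

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(image, patch=4):
    """S1: split an HxW image into non-overlapping patches, flattened into tokens."""
    h, w = image.shape
    grid = image.reshape(h // patch, patch, w // patch, patch)
    return grid.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def random_mask(n_tokens, ratio=0.75):
    """S2: randomly split token indices into masked and visible sets."""
    idx = rng.permutation(n_tokens)
    cut = int(n_tokens * ratio)
    return np.sort(idx[:cut]), np.sort(idx[cut:])

def pretrain_step(image, params, teacher_features, patch=4):
    """S3-S6 with linear stand-ins for the Transformer encoder and decoder."""
    tokens = patchify(image, patch)
    masked, visible = random_mask(len(tokens))
    encoded = tokens[visible] @ params["W_enc"]       # S3: encode visible tokens only
    context = encoded.mean(axis=0)                    # pooled context shared by all queries
    queries = params["pos_embed"][masked] + context   # S4: queries carry masked positions
    preds = queries @ params["W_head"]                # S5: pre-training linear head
    targets = teacher_features(tokens)[masked]        # S6: teacher output as target
    return float(np.mean((preds - targets) ** 2))     # MSE loss for the student

dim = 16
params = {
    "W_enc": rng.standard_normal((dim, dim)) * 0.1,
    "pos_embed": rng.standard_normal((16, dim)) * 0.1,
    "W_head": rng.standard_normal((dim, dim)) * 0.1,
}
W_teacher = rng.standard_normal((dim, dim)) * 0.1     # frozen teacher projection
loss = pretrain_step(rng.standard_normal((16, 16)), params, lambda t: t @ W_teacher)
```

Step S7 then amounts to replacing `params["W_head"]` with a task-specific linear head while the encoder and decoder parameters are reused as-is.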