CN-122023156-A - Image processing method and system based on masked-autoencoder self-supervised pre-training

CN122023156A

Abstract

The application provides an image processing method and system based on masked-autoencoder self-supervised pre-training. The method comprises: inputting a training-set degraded image into a preset image-processing-oriented masked autoencoder for random masking; inputting the masked image of the training-set degraded image into a preset CSformer model and pre-training the preset CSformer model to determine a trained CSformer model; optimizing parameters of the preset CSformer model with a preset Charbonnier loss function according to the reconstructed training-set degraded image and the training-set degraded image to determine an optimized CSformer model; and inputting a degraded image to be processed into the optimized CSformer model to determine a reconstructed image. The application improves the processing efficiency and performance of low-level vision tasks and achieves stronger robustness and higher image reconstruction quality.

Inventors

  • SUN NINGYU
  • YANG XIAOKANG
  • Yan Diechao
  • DUAN HUIYU
  • FU KANG

Assignees

  • Shanghai Jiao Tong University (上海交通大学)

Dates

Publication Date
2026-05-12
Application Date
2026-01-20

Claims (10)

  1. An image processing method based on masked-autoencoder self-supervised pre-training, comprising: inputting a training-set degraded image into a preset image-processing-oriented masked autoencoder for random masking, and determining a masked image of the training-set degraded image; inputting the masked image of the training-set degraded image into a preset CSformer model, outputting a reconstructed training-set degraded image, pre-training the preset CSformer model, and determining a trained CSformer model; optimizing parameters of the preset CSformer model with a preset Charbonnier loss function according to the reconstructed training-set degraded image and the training-set degraded image, and determining an optimized CSformer model; and inputting a degraded image to be processed into the optimized CSformer model, and determining a reconstructed image.
  2. The image processing method based on masked-autoencoder self-supervised pre-training according to claim 1, wherein inputting the training-set degraded image into the preset image-processing-oriented masked autoencoder for random masking and determining the masked image of the training-set degraded image comprises: dividing the training-set degraded image into mutually non-overlapping image patches, and determining the image patches of the training-set degraded image; and randomly masking the original image pixels of the image patches, and determining the masked image of the training-set degraded image.
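The patching-and-masking step of claim 2 can be sketched as follows. This is an illustrative NumPy implementation, not the patent's own code; the 8-pixel patch size and 75% mask ratio are assumptions chosen for the example (the claims do not fix these values).

```python
import numpy as np

def random_mask_patches(image, patch=8, mask_ratio=0.75, seed=0):
    """Split an (H, W, C) image into non-overlapping patches and zero out
    a random subset of them, mimicking the claimed random-masking step."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    gh, gw = h // patch, w // patch
    n = gh * gw
    rng = np.random.default_rng(seed)
    # Choose which patches to mask, without replacement.
    masked_idx = rng.choice(n, size=int(n * mask_ratio), replace=False)
    out = image.copy()
    for idx in masked_idx:
        r, col = divmod(idx, gw)
        out[r * patch:(r + 1) * patch, col * patch:(col + 1) * patch, :] = 0.0
    return out, masked_idx
```

A 32x32 image with 8-pixel patches yields a 4x4 grid of 16 patches, of which 12 are masked at the 75% ratio.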
  3. The image processing method based on masked-autoencoder self-supervised pre-training according to claim 1, wherein the preset CSformer model comprises a first convolutional layer, an encoder-decoder network, and a second convolutional layer, wherein skip connections are set between corresponding stages of the encoder and the decoder in the encoder-decoder network, wherein each stage in the encoder-decoder network comprises a plurality of CSformer blocks, and each CSformer block comprises a layer-normalization layer, a channel-attention layer, a multi-head self-attention layer, another layer-normalization layer, and a gated convolutional feed-forward network; the first convolutional layer is used for mapping the degraded image from the original pixel space to a high-dimensional feature space and extracting low-level feature embeddings; the encoder-decoder network is used for reconstructing the feature map of the degraded image; and the second convolutional layer is used for refining the feature map of the degraded image to obtain a residual image.
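The block structure of claim 3 (normalization, channel attention with a residual, then normalization and a gated feed-forward with a residual) can be sketched in simplified form. This NumPy sketch is an assumption-laden illustration: learned projection weights, the multi-head self-attention layer, and the convolutional form of the feed-forward network are all omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token over the channel (last) axis.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def channel_attention(x):
    # Squeeze-and-excitation style gate: global average pool per channel,
    # sigmoid, then rescale the channels (no learned weights in this sketch).
    gate = 1.0 / (1.0 + np.exp(-x.mean(axis=0, keepdims=True)))
    return x * gate

def csformer_block(x):
    """x: (tokens, channels). One simplified block:
    LN -> channel attention (+residual) -> LN -> gated FFN (+residual)."""
    x = x + channel_attention(layer_norm(x))
    h = layer_norm(x)
    # Element-wise sigmoid gate standing in for the claimed gated
    # convolutional feed-forward network.
    x = x + h * (1.0 / (1.0 + np.exp(-h)))
    return x
```

The block is shape-preserving, so stages can stack an arbitrary number of such blocks, as the claim describes.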
  4. The image processing method based on masked-autoencoder self-supervised pre-training according to claim 1, wherein inputting the masked image of the training-set degraded image into a preset CSformer model, outputting a reconstructed training-set degraded image, pre-training the preset CSformer model, and determining a trained CSformer model comprises: inputting the masked image of the training-set degraded image into the first convolutional layer, and determining low-level feature embeddings of the training-set degraded image; inputting the low-level feature embeddings of the training-set degraded image into an encoder in the encoder-decoder network, and determining a first feature image of the training-set degraded image; optimizing the encoder according to the mean-square-error loss between the original pixel values of the masked image of the training-set degraded image and the reconstructed pixel values of the first feature image, and determining a trained encoder; loading the encoder weights of the trained encoder; inputting the masked image of the training-set degraded image into the first convolutional layer again, and determining the low-level feature embeddings of the training-set degraded image; inputting the low-level feature embeddings into the encoder in the encoder-decoder network, and determining the first feature image; inputting the first feature image into a decoder in the encoder-decoder network, and determining a second feature image of the training-set degraded image; inputting the second feature image into the second convolutional layer to determine a residual image; superimposing the training-set degraded image and the residual image to determine the reconstructed training-set degraded image; and optimizing the decoder according to the root-mean-square-error loss between the original pixel values of the masked image of the training-set degraded image and the reconstructed pixel values of the reconstructed training-set degraded image, determining a trained decoder, and thereby determining the trained CSformer model.
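The encoder pre-training objective of claim 4 scores the reconstruction only against the original pixel values at the masked positions. A minimal sketch, assuming a binary mask array (1 = masked) rather than the patent's actual tensor layout:

```python
import numpy as np

def masked_mse(original, reconstructed, mask):
    """Mean-square error computed only over masked pixels (mask == 1),
    a simplified stand-in for the claimed encoder pre-training loss."""
    diff = (original - reconstructed) ** 2
    return float(diff[mask == 1].mean())
```

In the MAE setting this focuses the learning signal on the regions the model must in-paint, rather than on pixels it was shown directly.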
  5. The image processing method based on masked-autoencoder self-supervised pre-training according to claim 3, wherein inputting the degraded image to be processed into the optimized CSformer model and determining a reconstructed image comprises: inputting the degraded image to be processed into the first convolutional layer, and determining low-level feature embeddings of the degraded image to be processed; inputting the low-level feature embeddings of the degraded image to be processed into the encoder in the encoder-decoder network, and determining a first feature image; inputting the first feature image into the decoder in the encoder-decoder network, and determining a second feature image; inputting the second feature image into the second convolutional layer to determine a residual image; and superimposing the degraded image to be processed and the residual image to determine the reconstructed image.
  6. The image processing method based on masked-autoencoder self-supervised pre-training according to claim 5, wherein inputting the low-level feature embeddings of the degraded image to be processed into the encoder in the encoder-decoder network and determining a first feature image comprises: performing stage-by-stage downsampling encoding on the low-level feature embeddings with the encoder, and determining the downsampled features of each stage; storing the downsampled features of each stage in the skip connection between the corresponding stages of the encoder and the decoder; and performing a shuffle operation after the downsampling encoding of each stage is completed, with the last stage of the encoder outputting the first feature image; and wherein inputting the first feature image into the decoder in the encoder-decoder network and determining a second feature image comprises: performing stage-by-stage upsampling decoding on the first feature image with the decoder, and determining the upsampled features of each stage; and concatenating the upsampled features of each stage with the downsampled features of the corresponding stage stored in the skip connection between the encoder and the decoder, with the last stage of the decoder outputting the second feature image.
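The skip-connection flow of claim 6 (store each encoder stage's features, then concatenate them back during decoding) can be sketched on a 1-D feature map. This is a toy illustration under simplifying assumptions: average pooling replaces the learned downsampling, nearest-neighbour repetition replaces the learned upsampling, and the shuffle operation and CSformer blocks are omitted.

```python
import numpy as np

def down(x):
    # 2x average-pool downsampling along the spatial axis.
    return x.reshape(x.shape[0] // 2, 2, x.shape[1]).mean(axis=1)

def up(x):
    # 2x nearest-neighbour upsampling along the spatial axis.
    return np.repeat(x, 2, axis=0)

def unet_like(x, levels=2):
    """Sketch of the claimed encoder-decoder flow on a (length, channels)
    feature map: save each stage into a skip connection on the way down,
    then concatenate it back channel-wise on the way up."""
    skips = []
    for _ in range(levels):
        skips.append(x)        # stored in the skip connection
        x = down(x)
    for _ in range(levels):
        x = up(x)
        skip = skips.pop()
        x = np.concatenate([x, skip], axis=1)  # channel-wise splice
    return x
```

With two levels and 4 input channels, the output has 4 + 4 + 4 = 12 channels at the original length, showing how the concatenations widen the decoder features (a real network would project these back down with convolutions).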
  7. The image processing method based on masked-autoencoder self-supervised pre-training according to claim 1, wherein the preset Charbonnier loss function is: L = √(‖Î − I′‖² + ε²); where L denotes the Charbonnier loss, Î denotes the reconstructed training-set degraded image, I′ denotes the training-set degraded image, and ε denotes a constant.
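The Charbonnier loss of claim 7 is a smooth approximation of the L1 loss that stays differentiable at zero. A minimal per-pixel sketch; the claim only says ε is "a constant", so the value 1e-3 below is an illustrative assumption (a common choice in the restoration literature), and averaging over pixels is likewise assumed:

```python
import numpy as np

def charbonnier_loss(pred, target, eps=1e-3):
    """L = sqrt((pred - target)^2 + eps^2), averaged over all pixels.
    eps keeps the gradient finite where pred == target."""
    return float(np.mean(np.sqrt((pred - target) ** 2 + eps ** 2)))
```

For identical images the loss approaches ε rather than exactly zero, which is the price of the smoothing; for large errors it behaves like the absolute error.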
  8. An image processing system based on masked-autoencoder self-supervised pre-training, comprising: a random-masking module, used for inputting a training-set degraded image into a preset image-processing-oriented masked autoencoder for random masking, and determining a masked image of the training-set degraded image; a self-supervised pre-training module, used for inputting the masked image of the training-set degraded image into a preset CSformer model, outputting a reconstructed training-set degraded image, pre-training the preset CSformer model, and determining a trained CSformer model; a model optimization module, used for optimizing parameters of the preset CSformer model with a preset Charbonnier loss function according to the reconstructed training-set degraded image and the training-set degraded image, and determining an optimized CSformer model; and an image processing module, used for inputting a degraded image to be processed into the optimized CSformer model and determining a reconstructed image.
  9. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that the program, when executed by a processor, realizes the steps of the method according to any one of claims 1-7.
  10. An electronic device, comprising: a memory having a computer program stored thereon; and a processor for executing the computer program in the memory to implement the steps of the method according to any one of claims 1-7.

Description

Image processing method and system based on masked-autoencoder self-supervised pre-training

Technical Field

The application relates to the technical field of computer vision, and in particular to an image processing method and system based on masked-autoencoder self-supervised pre-training.

Background

Image processing (including image restoration, image enhancement, etc.) has long been an important computer vision task whose goal is to improve the quality of a degraded input image. A good image processor can not only generate images that better match human visual preferences, but also improve the performance of downstream computer vision tasks (e.g., recognition, detection, segmentation). Because of the ill-posed nature of this problem, a powerful image prior is usually required for effective processing. In recent years, deep learning has been widely applied to image processing tasks such as deraining, deblurring, denoising, and image enhancement, and has proven able to learn strong, generalizable priors from large-scale data, with leading performance. With the development of deep learning methods and the establishment of various vision benchmarks, data-driven convolutional neural network architectures have achieved state-of-the-art performance on a variety of image processing tasks compared with conventional methods. Architecture design plays an important role in improving performance, and many studies have developed generic or task-specific modules for various image processing applications. The encoder-decoder based U-Net architecture is widely used for image processing due to its high computational efficiency. In addition, many advanced components developed for high-level vision tasks have been introduced into low-level vision tasks and have demonstrated their effectiveness, such as residual and dense connections, channel attention, spatial attention, and multi-scale or multi-stage networks.
Convolutional neural networks (CNNs) have been used for many years for a variety of low-level vision tasks with remarkable results. However, CNNs are typically limited in capturing long-range pixel dependencies. The Transformer, originally developed for natural language processing (NLP) tasks, has been introduced into computer vision and has achieved state-of-the-art performance. The Transformer exhibits significant effectiveness in a variety of vision tasks, including both high-level and low-level vision. In recent years, masked autoencoders (MAEs) for feature pre-training have further unlocked the potential of Transformers, achieving state-of-the-art performance on a variety of high-level vision tasks. Many Transformer-based architectures have been proposed for high-level vision tasks, such as ViT, Swin, and PVT. Recent studies on the vision Transformer (ViT) explored its potential as an alternative to convolutional neural networks, given its effectiveness in capturing long-range dependencies. Some studies have also explored the advantages of the Transformer architecture in low-level vision tasks. Like DETR, IPT applies a standard Transformer encoder-decoder architecture to low-level vision tasks and proposes a low-level vision pre-training method. However, IPT relies on large-scale datasets and multi-task learning to achieve good performance, and has significant computational complexity. SwinIR and UFormer employ shifted-window-based local attention modules in low-level tasks. However, restricting spatial attention to a local window may limit the capture of long-range dependencies across the entire image. Restormer proposes a multi-Dconv-head transposed attention (MDTA) block that applies attention across the feature dimension to reduce computational complexity; this implicitly learns spatial correlation and can be regarded as a special variant of channel attention.
Self-supervised learning frameworks (e.g., DINO, MoCo-v3, MAE) further unlock ViT's potential and achieve higher performance on a variety of high-level vision tasks. Among them, masked autoencoders (MAEs) demonstrate excellent learning ability and scalability on a variety of high-level vision tasks by pre-training image models to predict masked tokens from visible tokens. However, few studies have generalized self-supervised pre-training strategies to image processing tasks. Masked language modeling (as in BERT and GPT) is a successful approach for NLP pre-training. Some work has explored masking strategies in pre-trained image models. iGPT treats image pixels as a sequence and predicts unknown pixels. BEiT and iBOT propose predicting discrete tokens instead of pixels to pre-train an image model. The masked autoencoder finds that masking input images at a high ratio facilitates meaningful self-supervised learning, and proposes an asymmetric encoder-decoder architecture to reduce pre-training time. SimMIM and ConvMAE extend masked image modeling to hierarchical models (such as the Swin Transformer and hybrid