CN-121981918-A - Large-model-oriented dual-stage document image universal restoration method and system


Abstract

The invention discloses a large-model-oriented dual-stage document image general restoration method and system. Restoration is performed by a dual-stage restoration model comprising a spatial correction stage and a pixel correction stage. In the spatial correction stage, the degraded image is processed by a dual-branch structure consisting of a layout branch and a text line branch: the layout branch predicts a sparse two-dimensional mapping field and a three-dimensional coordinate grid to recover the global geometric structure, while the text line branch generates a local modulation signal from text region features extracted by a pre-trained text segmentation model. The global and local information are fused to generate a dense two-dimensional mapping field, and a spatially corrected image is output by an inverse sampling operation. In the pixel correction stage, the spatially corrected image and its gradient map are input as conditions to a denoising diffusion implicit model architecture, which restores the pixel values step by step. Through this dual-stage restoration, the invention handles multiple kinds of degradation in a general way, thereby improving the ability of large models to read the content of document images.
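
The data flow at the end of the spatial correction stage (sparse mapping field, bilinear upsampling, modulation, inverse sampling) can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the neural networks that predict the sparse field and the modulation signal are left out, and the function names `upsample_bilinear`, `backward_sample`, and `spatial_correct` are hypothetical. Two further assumptions are made for illustration: the mapping field stores absolute source pixel coordinates as (x, y), and "weighted modulation" is taken to be an element-wise product.

```python
import numpy as np

def upsample_bilinear(field, out_h, out_w):
    """Bilinearly resize an (h, w, C) array to (out_h, out_w, C), corners aligned."""
    h, w = field.shape[:2]
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = field[y0][:, x0] * (1 - wx) + field[y0][:, x1] * wx
    bot = field[y1][:, x0] * (1 - wx) + field[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def backward_sample(image, dense_map):
    """Inverse sampling: for every output pixel, read the (H, W, C) image at the
    source coordinates stored in dense_map (H_out, W_out, 2) as (x, y)."""
    h, w = image.shape[:2]
    x = np.clip(dense_map[..., 0], 0, w - 1)
    y = np.clip(dense_map[..., 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (x - x0)[..., None]
    wy = (y - y0)[..., None]
    top = image[y0, x0] * (1 - wx) + image[y0, x1] * wx
    bot = image[y1, x0] * (1 - wx) + image[y1, x1] * wx
    return top * (1 - wy) + bot * wy

def spatial_correct(image, sparse_map, modulation):
    """Upsample the sparse field, modulate it, and warp the image."""
    h, w = image.shape[:2]
    dense = upsample_bilinear(sparse_map, h, w)
    dense = dense * modulation  # ASSUMPTION: modulation is an element-wise product
    return backward_sample(image, dense)

# Toy usage: a 2x3 sparse grid of control points describing the identity mapping.
img = np.arange(24, dtype=float).reshape(4, 6, 1)
gx, gy = np.meshgrid(np.linspace(0, 5, 3), np.linspace(0, 3, 2))
identity_sparse = np.stack([gx, gy], axis=-1)
restored = spatial_correct(img, identity_sparse, np.ones((4, 6, 1)))
```

With an identity mapping and a modulation signal of ones, the output equals the input; a real layout branch would instead predict control points that undo the page deformation.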

Inventors

XUE HUI
ZHU CHENJIE
ZHU SHIPENG

Assignees

Southeast University (东南大学)

Dates

Publication Date: 2026-05-05
Application Date: 2026-01-13

Claims (7)

1. A large-model-oriented dual-stage document image general restoration method, characterized in that image restoration is performed by a dual-stage restoration model comprising at least a spatial correction stage and a pixel correction stage: in the spatial correction stage, the degraded image is spatially corrected by two branches, namely a layout branch and a text line branch, wherein the layout branch predicts a sparse two-dimensional mapping field and a three-dimensional coordinate grid to recover the global geometric structure, and the text line branch extracts text region features to generate a local modulation signal; in the pixel correction stage, the spatially corrected image output by the spatial correction stage and its gradient map are input as conditions to a denoising diffusion implicit model, and a frequency domain analysis mechanism is introduced to perform pixel correction; the models of the spatial correction stage and the pixel correction stage are trained independently on their respective pre-processed datasets and can be used separately or in combination.
2. The large-model-oriented dual-stage document image general restoration method according to claim 1, wherein the restoration in the spatial correction stage comprises the following steps: S101, extracting features from the input degraded document image with a CNN encoder, and feeding the extracted features into the layout branch and the text line branch for processing; S102, in the layout branch, enlarging the receptive field with a dilated pyramid extractor, then predicting a sparse 2D mapping field with a 2D decoder and a sparse 3D grid with a 3D decoder, wherein the 3D grid represents the three-dimensional coordinates of each control point in the degraded image; S103, in the text line branch, obtaining text segmentation features with a pre-trained text segmentation model, and fusing the text segmentation features with the features extracted in step S101 through an attention mechanism, specifically as follows: [formula omitted in source]; where SA denotes a self-attention layer, CA denotes a cross-attention layer, and FFN denotes a feed-forward network; refining the fused features through N self-attention decoders to obtain modulation features, wherein the j-th decoder computes: [formula omitted in source]; and fusing the sparse 2D mapping field with the modulation features to generate a dense 2D mapping field; S104, applying the dense 2D mapping field obtained in step S103 to the input image through an inverse sampling operation to generate the spatially corrected image.
3. The large-model-oriented dual-stage document image general restoration method according to claim 2, wherein in step S103, fusing the sparse 2D mapping field with the modulation features to generate the dense 2D mapping field comprises: first upsampling the sparse 2D mapping field by bilinear interpolation, and then applying weighted modulation to the upsampled mapping field using the modulation features.
4. The large-model-oriented dual-stage document image general restoration method according to claim 1, wherein the restoration in the pixel correction stage comprises the following steps: S201, computing, with a Sobel operator, the gradient map of the image output by the spatial correction stage; S202, inputting the image and the gradient map as conditions to a denoising diffusion implicit model architecture, wherein a diffusion sampling sequence is defined for the inference process, the total number of sampling steps being related to the original number of diffusion steps [relation omitted in source]; a noisy image is initialized and then denoised step by step by the iterative process: [formula omitted in source]; where the network used in the iteration is a U-Net network with its model parameters [symbols omitted in source].
5. The large-model-oriented dual-stage document image general restoration method according to claim 1, wherein in the spatial correction stage, the loss function for model training is a weighted combination of the L1 losses on the sparse two-dimensional mapping field, the three-dimensional grid, and the dense mapping field, specifically: [formula omitted in source]; where the terms of the formula involve, respectively, the predicted sparse 2D mapping field, the predicted sparse 3D grid, and the predicted dense 2D mapping field.
6. The large-model-oriented dual-stage document image general restoration method according to claim 5, wherein in the pixel correction stage, the loss function for model training is a weighted combination of a pixel-level L1 loss and a temporal frequency switching loss, specifically: [formula omitted in source]; where the pixel-level loss is computed on the output image of the pixel correction stage; for the temporal frequency switching loss, the corrected image is separated by a Fourier decoupler into a low-frequency component and a high-frequency component, and the weights are adjusted dynamically according to the diffusion time step: [formulas omitted in source]; where the formulas involve the total number of diffusion steps, the corresponding ground-truth low-frequency and high-frequency components, the weight of the low-frequency component, and the weight of the high-frequency component.
7. A large-model-oriented dual-stage document image general restoration system, characterized by comprising at least a spatial correction module and a pixel correction module; the spatial correction module is a dual-branch spatial correction module whose two branches are a layout branch and a text line branch, wherein the layout branch predicts a sparse two-dimensional mapping field and a three-dimensional coordinate grid to recover the global geometric structure, the text line branch generates a local modulation signal from text region features extracted by a pre-trained text segmentation model, and the spatial correction module fuses the global and local information to generate a dense two-dimensional mapping field and outputs a spatially corrected image by an inverse sampling operation; the pixel correction module receives the spatially corrected image and its gradient map as conditional inputs for pixel correction and is based on a denoising diffusion implicit model architecture, wherein the total loss is a weighted combination of a pixel-level L1 loss and a temporal frequency switching loss, and the temporal frequency switching loss separates the high-frequency and low-frequency components of the image with a Fourier decoupler and dynamically activates the corresponding frequency band constraint according to the current diffusion time step.
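
As an illustration of the pixel correction stage described in claims 4, 6, and 7, the following NumPy sketch computes a Sobel gradient map, splits an image into low-frequency and high-frequency parts in the Fourier domain, and evaluates a time-dependent frequency loss. The weighting formulas are not reproduced in this text, so the schedule used here is an assumption made purely for illustration: a hard switch at half the total number of diffusion steps, constraining the low-frequency band at large time steps and the high-frequency band at small ones. The circular low-pass mask, its `radius`, and the names `sobel_gradient`, `fourier_decouple`, and `tfs_loss` are likewise hypothetical. The diffusion model itself is not included.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter3x3(img, kernel):
    """Cross-correlate a 2D image with a 3x3 kernel, replicating border pixels."""
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros((h, w), dtype=float)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out

def sobel_gradient(img):
    """Gradient magnitude map used as an extra condition for the diffusion model."""
    return np.hypot(filter3x3(img, SOBEL_X), filter3x3(img, SOBEL_Y))

def fourier_decouple(img, radius):
    """Split an image into low/high frequency parts with a circular spectral mask."""
    h, w = img.shape
    spectrum = np.fft.fftshift(np.fft.fft2(img))
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real
    return low, img - low

def tfs_loss(pred, target, t, total_steps, radius=4):
    """Time-dependent frequency loss with an ASSUMED hard switch at T / 2."""
    pred_low, pred_high = fourier_decouple(pred, radius)
    tgt_low, tgt_high = fourier_decouple(target, radius)
    w_low = 1.0 if t > total_steps / 2 else 0.0  # assumption, not the patent's schedule
    w_high = 1.0 - w_low
    return (w_low * np.abs(pred_low - tgt_low).mean()
            + w_high * np.abs(pred_high - tgt_high).mean())

# Toy usage: a prediction that differs from the target only by a checkerboard,
# i.e. by pure high-frequency content.
rows, cols = np.indices((16, 16))
target = np.zeros((16, 16))
pred = target + (-1.0) ** (rows + cols)
early = tfs_loss(pred, target, t=900, total_steps=1000)  # low band constrained
late = tfs_loss(pred, target, t=100, total_steps=1000)   # high band constrained
ramp_grad = sobel_gradient(cols.astype(float))
```

Under the assumed schedule, the checkerboard error is ignored at the early, noisy time step and penalized only at the late one, which is the kind of band-wise activation the claims describe.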

Description

Large-model-oriented dual-stage document image universal restoration method and system

Technical Field

The invention belongs to the technical field of image processing and restoration, relates to document image restoration technology, and in particular to a large-model-oriented dual-stage document image general restoration method and system.

Background

With the popularization of mobile devices and the rapid development of large-model technology, capturing document images with a mobile phone camera and passing them to large language models (LLMs) or multimodal large language models (MLLMs) for content analysis has become common in everyday office work, education, and research. However, digital document images captured in real scenes often suffer from various forms of degradation, including warped and spatially misaligned text lines caused by physical deformation such as bending or tilting of the paper, as well as shadows, overexposure, underexposure, blur, and noise caused by uneven lighting. These degradations not only affect human visual experience but also severely restrict the ability of downstream large models to understand and parse document content, which poses a significant challenge for fields such as finance, law, and medicine that rely on accurate document understanding. Developing a document image restoration method aimed specifically at improving the reading ability of large models is therefore of urgent practical significance.
The prior art falls largely into two categories: task-specific document enhancement methods and unified document restoration frameworks.
Among task-specific methods, techniques for spatial deformation mainly use a deep neural network to predict a per-pixel deformation field and correct geometric distortion through grid sampling; some of these methods also introduce text line features as additional guidance to improve correction in text regions. For pixel perturbation, existing methods such as shadow removal, illumination correction, and appearance enhancement usually focus on one specific degradation type, such as shadow, low contrast, strike-through, or blur, and cannot handle complex scenes in which multiple pixel perturbations coexist. Task-specific methods require training and deploying several independent models for different degradation types, which is resource-intensive, copes poorly with document images in real scenes that contain several degradations at once, and cannot jointly optimize the input characteristics required by large models.
Among unified restoration frameworks, some recent studies have attempted to handle complex degradation patterns within a single framework. These methods try to handle geometric distortion and illumination correction together, achieving fine-grained adjustment by dividing the whole image into small patches at test time. More recent approaches use dynamic prompts to handle five different enhancement tasks in a single model with shared weights.
However, these methods still have significant limitations. On the one hand, they often require multiple iterative enhancement passes during inference and rely on manually specifying the degradation type, which severely limits their convenience and automation in practical applications. On the other hand, they do not analyze and address degradation systematically from the intrinsic nature of digital images, have limited ability to handle complex degradation in which spatial deformation coexists with pixel perturbation, and completely neglect the effect of the restoration result on the reading ability of large models. Notably, the evaluation criteria of the prior art focus mainly on visual quality metrics such as PSNR and SSIM rather than on how accurately a large model understands the restored document, so even a visually good restoration may fail to improve the reading performance of a large model.
In addition, the lack of high-quality training data also holds back document restoration technology oriented to large-model optimization. Existing datasets either contain only a single type of degradation or are not sufficiently diverse in how spatial deformation and pixel perturbation are coupled. More importantly, no dedicated dataset uses the improvement in large-model reading ability as its evaluation criterion, which makes it difficult for researchers to systematically optimize restoration algorithms for the input needs of large models. In summary, the main problems faced by the prior art include (1) la