CN-121459358-B - OCR recognition method based on large model enhancement

CN121459358B

Abstract

The invention relates to the technical fields of artificial intelligence and computer vision, in particular to an OCR recognition method based on large-model enhancement. The method generates a directional texture feature map by applying multi-directional differential operations and smoothing to an input image; performs multi-scale feature extraction on the directional texture feature map and fuses the different-scale features with adaptive weights to generate a multi-scale aggregated feature map; computes energy statistics to generate spatial gating weights and applies them to the aggregated feature map to obtain a spatially enhanced feature map; applies feature transformations to the spatially enhanced feature map and the directional texture feature map and fuses deep and shallow features into a unified feature representation; and, based on that representation, reconstructs character morphology in a visual branch, performs semantic reasoning in a language branch, and fuses the two branch results to obtain the target recognition result. Through directional texture modeling, adaptive multi-scale fusion, and vision-language dual-branch cooperation, the invention significantly improves the accuracy and robustness of OCR recognition in complex scenes.

Inventors

  • YANG JINGYU
  • ZHU YEDONG

Assignees

  • 北京中科金财科技股份有限公司

Dates

Publication Date
2026-05-05
Application Date
2025-12-09

Claims (10)

  1. An OCR recognition method based on large-model enhancement, comprising: step S1, performing multi-directional differential operations and smoothing on an input image to generate a directional texture feature map; step S2, performing multi-scale feature extraction on the directional texture feature map and fusing the different-scale features with adaptive weights to generate a multi-scale aggregated feature map; step S3, computing energy statistics of the multi-scale aggregated feature map, generating spatial gating weights from those statistics, and applying the weights to the multi-scale aggregated feature map to obtain a spatially enhanced feature map; step S4, applying feature transformations to the spatially enhanced feature map and the directional texture feature map respectively, and combining deep and shallow features through fusion parameters to generate a unified feature representation; and step S5, reconstructing character morphology in a visual branch based on the unified feature representation, performing semantic reasoning in a language branch using a language model, and fusing the visual reconstruction result with the language reasoning result to obtain a target recognition result.
  2. The OCR recognition method based on large-model enhancement according to claim 1, wherein step S1 comprises computing F1(x, y) = w_h · mean(|D_h(x, y)|) + w_v · mean(|D_v(x, y)|) + w_d · mean(|D_d(x, y)|), wherein F1 denotes the directional texture feature map and F1(x, y) the base feature value output at coordinates (x, y); I(x, y) denotes the pixel intensity of the input image at coordinates (x, y); |D_h(x, y)| denotes the absolute-value response of the horizontal difference and |D_v(x, y)| that of the vertical difference; mean(·) denotes a local mean operator that averages the input scalar field over a finite neighborhood centered on (x, y); and w_h, w_v, and w_d are three directional weight coefficients that adjust the contributions of the horizontal, vertical, and principal-diagonal responses to the final texture energy.
  3. The OCR recognition method based on large-model enhancement according to claim 2, wherein the directional weight coefficients are learnable parameters, and a normalization constraint is applied so that their sum equals 1, avoiding directional response bias and maintaining numerical stability.
  4. The OCR recognition method based on large-model enhancement according to claim 3, wherein step S2 comprises: inputting the directional texture feature map in parallel to a plurality of scale branches, each branch executing a convolution with a sensing kernel of a different size to obtain response features at the corresponding scale; expanding and aligning the response features of each branch in the channel dimension to obtain channel-aligned features; extracting scale-selection confidences from the response features through a weight-prediction mechanism to generate dynamic weights for each spatial position; and computing a weighted sum of the channel-aligned features and the dynamic weights to generate the multi-scale aggregated feature map.
  5. The OCR recognition method based on large-model enhancement according to claim 4, wherein the weighted summation of the channel-aligned features and the dynamic weights to generate the multi-scale aggregated feature map comprises computing F2 = Σ_i w_i · align(R_i, c), wherein R_i denotes the scale response obtained by convolving the directional texture feature map with the region sensing kernel K_i of the i-th scale; K_i denotes the i-th-layer region sensing kernel; align(X, c) denotes the scaling or alignment of tensor X in the channel dimension; c denotes the channel-expansion ratio or target channel-alignment coefficient, usually a positive integer or the target channel count; w_i denotes the dynamic weight of the i-th scale branch; and F2 denotes the multi-scale aggregated feature map.
  6. The OCR recognition method based on large-model enhancement according to claim 5, wherein step S3 comprises: computing an average energy statistic and a fluctuation energy statistic over the multi-scale aggregated feature map; performing a linear weighted fusion of the average energy statistic and the fluctuation energy statistic with a learnable energy proportionality coefficient to obtain an energy indication map; applying a sigmoid nonlinear activation to the energy indication map to generate a spatial gating weight map with values in [0, 1]; and multiplying the spatial gating weight map element-wise with the multi-scale aggregated feature map to obtain the spatially enhanced feature map.
  7. The method of claim 6, wherein multiplying the spatial gating weight map element-wise with the multi-scale aggregated feature map to obtain the spatially enhanced feature map comprises computing F3(x, y) = sigmoid(α · avgpool(F2) + β · stdpool(F2)) ⊙ F2(x, y), wherein F2(x, y) denotes the vector or scalar value of the input feature map output by the region interaction-aggregation unit at coordinates (x, y); avgpool(F2) denotes average pooling of F2 and stdpool(F2) standard-deviation pooling of F2; α, the coefficient of the global average-energy term, adjusts the influence of avgpool(F2) on the gating, and β, the coefficient of the local fluctuation-energy term, adjusts the influence of stdpool(F2); sigmoid(·) denotes the S-shaped nonlinear mapping; ⊙ denotes element-wise multiplication; and F3(x, y) denotes the value of the spatially enhanced feature map at coordinates (x, y).
  8. The OCR recognition method based on large-model enhancement according to claim 7, wherein step S4 comprises: applying a convolution mapping to the spatially enhanced feature map and tanh nonlinear suppression to the convolution result to obtain deep structural features; applying a convolution mapping to the directional texture feature map and normalization to the convolution result to obtain shallow detail features; and forming a convex combination of the deep structural features and the shallow detail features through a learnable fusion parameter to generate the unified feature representation.
  9. The OCR recognition method based on large-model enhancement according to claim 8, wherein the convex combination of the deep structural features and the shallow detail features through the learnable fusion parameter comprises computing F_u = λ · tanh(W_d * F3) + (1 − λ) · Norm(W_s * F1), wherein tanh(W_d * F3) denotes applying convolution weights W_d, a learnable parameter, to the spatially enhanced feature map, tanh(·) being the hyperbolic-tangent nonlinearity with output range [−1, 1]; Norm(W_s * F1) denotes applying convolution weights W_s, a learnable parameter, to the directional texture feature map, Norm(·) being a layer-normalization operation; λ denotes the fusion proportion parameter with 0 ≤ λ ≤ 1, and (1 − λ) the complementary weight of the detail branch; and F_u denotes the unified feature representation.
  10. The OCR recognition method based on large-model enhancement according to claim 9, wherein step S5 comprises: in the visual branch, applying a deconvolution mapping to the unified feature representation to obtain high-resolution reconstruction features, applying morphological correction to those features, and generating a visual candidate result; in the language branch, embedding and mapping the unified feature representation into a sequence representation, inputting the sequence into a large language model for contextual semantic reasoning, and generating a semantic candidate result; and fusing the visual candidate result and the semantic candidate result through a confidence-weighting mechanism to generate the target recognition result.
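The spatial energy-gating described in claims 6 and 7 can be sketched numerically. The snippet below is a minimal NumPy illustration under stated assumptions: the learnable coefficients α and β are fixed constants, both pooling statistics are computed over a local 3×3 window, and the function name `spatial_gating` is hypothetical, not from the patent.

```python
import numpy as np

def spatial_gating(F2, alpha=0.6, beta=0.4, window=3):
    """Sketch of claims 6-7: energy statistics -> sigmoid gate -> element-wise
    enhancement. alpha/beta stand in for the learnable energy coefficients."""
    pad = window // 2
    padded = np.pad(F2.astype(float), pad, mode="edge")
    H, W = F2.shape
    avg = np.empty((H, W))   # average energy statistic (local mean)
    std = np.empty((H, W))   # fluctuation energy statistic (local std dev)
    for y in range(H):
        for x in range(W):
            patch = padded[y:y + window, x:x + window]
            avg[y, x] = patch.mean()
            std[y, x] = patch.std()
    energy = alpha * avg + beta * std        # energy indication map
    gate = 1.0 / (1.0 + np.exp(-energy))     # sigmoid -> gating weights in (0, 1)
    return gate * F2                         # spatially enhanced feature map
```

Because the gate lies strictly between 0 and 1, the enhancement attenuates low-energy (flat background) regions relative to high-energy (stroke-bearing) ones rather than amplifying absolute values.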

Description

OCR recognition method based on large model enhancement

Technical Field

The invention relates to the technical fields of artificial intelligence and computer vision, in particular to an OCR recognition method based on large-model enhancement.

Background

Existing OCR mostly follows two main lines. One is a pipeline built around convolutional features plus sequence modeling (such as CNN/CRNN with CTC or attention decoding), often assisted by fixed or lightweight super-resolution and a post-processing dictionary. The other is a Transformer-based OCR / document-understanding model built around self-attention, which performs context correction by combining multi-scale features with a language model. To improve definition, some methods stack an image-enhancement or super-resolution network at the front end; to improve robustness, some introduce channel/spatial attention, layout priors, and dictionary constraints. However, these schemes generally feed general features into unified decoding and lack a refined coupling design between character morphology and semantics. Problems of the prior art:

  1. Insensitivity to stroke direction: at low definition, thin strokes such as bottom edges and left-falling strokes are easily lost.
  2. Insufficient scale adaptation: over- or under-segmentation easily occurs when thick and thin fonts are mixed or strokes are adhered.
  3. Under low illumination, complex shading, and noise, high-frequency background interference is amplified, causing false edges and false detections.
  4. Statistical mismatch between deep and shallow features: directly adding them easily produces numerical conflicts.
  5. Relying only on language priors or only on visual evidence is error-prone: the former can be semantically fluent but morphologically wrong, while the latter can be visually similar but unreadable.
  6. The lack of auditable, controllable error-correction links makes it difficult to trace errors to their source and to iterate in an industrial process.

Disclosure of Invention

Accordingly, the present invention is directed to an OCR recognition method based on large-model enhancement that solves the foregoing problems of the prior art. To achieve this object, the invention provides an OCR recognition method based on large-model enhancement, comprising: step S1, performing multi-directional differential operations and smoothing on an input image to generate a directional texture feature map; step S2, performing multi-scale feature extraction on the directional texture feature map and fusing the different-scale features with adaptive weights to generate a multi-scale aggregated feature map; step S3, computing energy statistics of the multi-scale aggregated feature map, generating spatial gating weights from those statistics, and applying the weights to the multi-scale aggregated feature map to obtain a spatially enhanced feature map; step S4, applying feature transformations to the spatially enhanced feature map and the directional texture feature map respectively, and combining deep and shallow features through fusion parameters to generate a unified feature representation; and step S5, reconstructing character morphology in a visual branch based on the unified feature representation, performing semantic reasoning in a language branch using a language model, and fusing the visual reconstruction result with the language reasoning result to obtain a target recognition result.
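The step-S1 directional texture construction above can be sketched as follows. This is a minimal NumPy illustration, assuming first-order absolute differences in the horizontal, vertical, and principal-diagonal directions, a 3×3 local-mean smoothing window, and equal fixed weights in place of the patent's learnable, normalized coefficients; the function names are illustrative, not from the patent.

```python
import numpy as np

def local_mean(a, window=3):
    """Average a scalar field over a window x window neighborhood (edge-padded)."""
    pad = window // 2
    p = np.pad(a, pad, mode="edge")
    out = np.zeros_like(a)
    for dy in range(window):
        for dx in range(window):
            out += p[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out / (window * window)

def directional_texture(img, w_h=1/3, w_v=1/3, w_d=1/3):
    """Sketch of step S1: smoothed absolute differences in three directions,
    combined with weights the patent makes learnable and normalized to sum to 1."""
    img = img.astype(float)
    d_h = np.abs(np.diff(img, axis=1, prepend=img[:, :1]))  # horizontal difference
    d_v = np.abs(np.diff(img, axis=0, prepend=img[:1, :]))  # vertical difference
    d_d = np.abs(img - np.roll(np.roll(img, 1, axis=0), 1, axis=1))  # diagonal
    d_d[0, :] = 0.0  # zero the wrap-around border introduced by np.roll
    d_d[:, 0] = 0.0
    return w_h * local_mean(d_h) + w_v * local_mean(d_v) + w_d * local_mean(d_d)
```

On a constant image all three difference responses vanish, so the texture map is zero; responses concentrate along stroke edges, which is what the later multi-scale stage aggregates.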
Further, the process of step S1 includes computing F1(x, y) = w_h · mean(|D_h(x, y)|) + w_v · mean(|D_v(x, y)|) + w_d · mean(|D_d(x, y)|), wherein F1 denotes the directional texture feature map and F1(x, y) the base feature value output at coordinates (x, y); I(x, y) denotes the pixel intensity of the input image at coordinates (x, y); |D_h(x, y)| denotes the absolute-value response of the horizontal difference and |D_v(x, y)| that of the vertical difference; mean(·) denotes a local mean operator that averages the input scalar field over a finite neighborhood centered on (x, y); and w_h, w_v, and w_d are three directional weight coefficients that adjust the contributions of the horizontal, vertical, and diagonal responses to the final texture energy. Further, the directional weight coefficients are learnable parameters, and a normalization constraint makes their sum equal to 1, avoiding directional response bias and maintaining numerical stability. Further, step S2 includes: inputting the directional texture feature map in parallel to a plurality of scale branches, each branch executing a convolution with a sensing kernel of a different size to obtain response features at the corresponding scale; expanding and aligning the response