
CN-121545173-B - OCR (optical character recognition) method and system for large-model answer sheet based on detection and character segmentation

CN 121545173 B

Abstract

The invention relates to the technical field of optical character recognition, and in particular to a large-model answer sheet OCR recognition method and system based on detection and character segmentation. The method comprises: acquiring answer sheet images and constructing a fine-tuning dataset; fine-tuning a multimodal large model on that dataset and deploying the fine-tuned model to an edge device; performing lightweight text detection to obtain the text box region of each text line; applying character-interval segmentation to over-wide text lines, using a vertical projection algorithm to detect long blank regions between characters and dynamically determine segmentation points; stitching the resulting sub-images at high resolution in the original order to generate image fragments that fit the fixed input size of the edge device; performing OCR recognition with the large model; and rearranging the character-level recognition results according to their original positions. The invention enables efficient large-model processing of high-resolution answer sheet images on edge devices, resolves the conflict between limited video memory and semantic information loss, and significantly improves OCR recognition speed and accuracy.

Inventors

  • Deng Lei
  • Gong Ruifeng
  • Liu Juanxi
  • Pang Kai
  • Gu Jing
  • Fan Zhihong
  • Wu Jingwu

Assignees

  • 广州像素数据技术股份有限公司

Dates

Publication Date
2026-05-05
Application Date
2026-01-21

Claims (9)

  1. A large-model answer sheet OCR recognition method based on detection and character segmentation, characterized by comprising the following steps: acquiring an answer sheet image dataset and constructing a fine-tuning dataset for OCR recognition, wherein each sample in the fine-tuning dataset comprises an answer sheet image path and the corresponding text content annotation; fine-tuning a multimodal large model on the fine-tuning dataset to obtain an OCR recognition model dedicated to answer sheets, converting the model into a format supported by the edge device, and deploying it to the edge device; performing lightweight text detection on an input answer sheet image to locate the text box region of each line of text on the answer sheet; for any text box region whose width exceeds the input size limit of the edge device, performing a character-interval segmentation operation to obtain a plurality of sub-images; stitching the sub-images at high resolution in the original text order to generate image fragments conforming to the fixed input size of the edge device; feeding the image fragments into the OCR recognition model deployed on the edge device to obtain character-level recognition results; and rearranging the character-level recognition results according to the original character position information recorded at the character segmentation points, restoring an output consistent with the original text order of the answer sheet.
  2. The method of claim 1, wherein constructing the fine-tuning dataset for OCR recognition comprises: pre-labeling the answer sheet images with a multimodal large model by calling the large model's API to obtain OCR pre-labeling results; manually checking and correcting the pre-labeling results so that the text content corresponds exactly to the answer information in the images; and annotating each line of text as an independent text detection box with a line-break marker appended at the end of each line, thereby obtaining the fine-tuning dataset for OCR recognition.
  3. The method of claim 1, wherein fine-tuning the multimodal large model on the fine-tuning dataset to obtain an OCR recognition model dedicated to answer sheets comprises: selecting, as the base model, a multimodal large model that satisfies the compute and memory constraints of the edge device; fine-tuning the base model with LoRA, injecting trainable parameters only into the attention layers and fully connected layers of the base model; and feeding the fine-tuning dataset into the base model and training with preset training parameters until the loss function converges below a preset threshold, obtaining the OCR recognition model dedicated to answer sheets.
  4. The method of claim 1, wherein performing lightweight text detection on the input answer sheet image to locate the text box region of each line of text comprises: fine-tuning a lightweight detection model with the fine-tuning dataset to obtain a lightweight text detection model; and processing the input answer sheet image with the lightweight text detection model to output the rectangular box region of each text line.
  5. The method of claim 1, wherein performing the character-interval segmentation operation on a text box region whose width exceeds the input size limit of the edge device to obtain a plurality of sub-images comprises: cropping the text box regions obtained by text detection, each corresponding to one line on the answer sheet, to obtain cropped text line images; preprocessing each cropped text line image by converting it to grayscale and generating a binary image with a local-window adaptive binarization algorithm; computing the vertical projection of the binary image and applying median filtering to the projection values; computing a gap threshold from the mean and standard deviation of the vertical projection values and marking projection values below the gap threshold as blank segments; detecting continuous blank intervals and retaining those whose width is not less than a minimum blank width, the minimum blank width being the larger of a set pixel count and a set ratio of the text line width; and placing candidate segmentation points at the midpoints of the retained blank intervals, keeping only segmentation points whose distance to the adjacent segmentation point is not less than a set proportion of the average character width, and segmenting the text line image into a plurality of sub-images at those points.
  6. The method of claim 1, wherein stitching the sub-images at high resolution in the original text order to generate image fragments conforming to the fixed input size of the edge device comprises: traversing the sub-images obtained from each text line and accumulating their widths; when the accumulated width would exceed the input width of the edge device, stitching the preceding sub-images into a line segment and recording the index range of the sub-images contained in that segment; traversing all line segments in the original order of the answer sheet and accumulating their heights; when the accumulated height would exceed the input height of the edge device, vertically stitching the preceding line segments into an image fragment whose height does not exceed the input height of the edge device; and building a global position table recording, for each sub-image, the ID of the line segment it belongs to and its starting x and y coordinates within that segment.
  7. The method of claim 1, wherein rearranging the character-level recognition results according to the original character position information recorded at the character segmentation points to restore an output consistent with the original text order of the answer sheet comprises: receiving the character-level recognition result of an image fragment, the result comprising the characters and confidence of each sub-image in sequence; mapping the sub-image results back to their original text lines via the global position table and sorting the results belonging to the same text line by starting x coordinate; handling the boundary regions of adjacent sub-images by keeping the higher-confidence character when both sides recognize a character at the boundary, and treating the boundary region as blank when neither side recognizes anything, thereby obtaining the rearranged text; and correcting the rearranged text lines by replacing a character when its recognition confidence is below a confidence threshold and the language model assigns a higher probability to an alternative, outputting a text result consistent with the original layout of the answer sheet.
  8. A large-model answer sheet OCR recognition system based on detection and character segmentation, characterized by comprising: at least one processor; and at least one memory storing at least one program; wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any one of claims 1 to 7.
  9. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 7.
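The dataset construction of claim 2 (one detection box per text line, a line-break marker after each line, paired with the image path) can be sketched in Python. The record layout and file name below are illustrative assumptions, not the patent's actual format:

```python
import json

def build_finetune_sample(image_path, line_texts):
    """Build one fine-tuning sample pairing an answer sheet image path with
    its manually verified annotation: one entry per text line, with a
    line-break marker appended after every line (claim 2)."""
    return {"image": image_path, "text": "\n".join(line_texts) + "\n"}

# Hypothetical two-line answer region after manual correction.
sample = build_finetune_sample("sheets/0001.png", ["x = 3", "y = -2"])
print(json.dumps(sample, ensure_ascii=False))
```

In practice each such record would be appended to a JSONL file and the pre-labels would come from a large-model API call before the manual check.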
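Claim 3 injects trainable LoRA parameters only into the attention and fully connected layers. A minimal sketch of selecting those injection points by module name is shown below; the name patterns are illustrative assumptions (real module names depend on the chosen base model), and an actual run would hand the selected names to a fine-tuning library such as PEFT:

```python
def select_lora_targets(module_names):
    """Pick only attention projections and fully connected sub-layers as
    LoRA injection points, leaving all other modules frozen (claim 3)."""
    # Illustrative patterns; adjust to the base model's actual layer names.
    patterns = ("q_proj", "k_proj", "v_proj", "o_proj", "fc", "mlp")
    return [n for n in module_names if any(p in n for p in patterns)]

modules = ["embed_tokens", "layers.0.self_attn.q_proj",
           "layers.0.self_attn.v_proj", "layers.0.mlp.fc1", "lm_head"]
print(select_lora_targets(modules))
```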
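The vertical-projection split-point search of claim 5 can be sketched as follows. The input is the per-column dark-pixel count of an already binarized text line; the median window, the gap-threshold formula (mean minus standard deviation), and all default parameters are illustrative assumptions, not values stated in the patent:

```python
import statistics

def split_points(column_ink, min_blank_px=4, min_blank_ratio=0.01,
                 min_char_gap=8):
    """Find character-interval split points in one text line (claim 5).
    column_ink[i] is the dark-pixel count of column i in the binary image."""
    n = len(column_ink)
    # Median-filter the vertical projection to suppress noise (3-tap window).
    smoothed = [statistics.median(column_ink[max(0, i - 1):i + 2])
                for i in range(n)]
    # Gap threshold from the mean and standard deviation of the projection.
    gap_thr = max(0.0, statistics.fmean(smoothed) - statistics.pstdev(smoothed))
    blank = [v < gap_thr for v in smoothed]
    # Minimum blank width: larger of a fixed pixel count and a line-width ratio.
    min_blank = max(min_blank_px, int(n * min_blank_ratio))
    points, run_start = [], None
    for i, b in enumerate(blank + [False]):      # sentinel flushes a final run
        if b and run_start is None:
            run_start = i
        elif not b and run_start is not None:
            if i - run_start >= min_blank:
                mid = (run_start + i) // 2       # candidate at the run midpoint
                # Keep only points far enough from the previous one.
                if not points or mid - points[-1] >= min_char_gap:
                    points.append(mid)
            run_start = None
    return points
```

The line image would then be cut at each returned column index to produce the sub-images.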
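The horizontal packing step of claim 6 (accumulate sub-image widths, flush a line segment when the device input width would be exceeded, and record each sub-image's segment ID and start coordinate in a global position table) can be sketched as below; the data shapes are assumptions for illustration, and vertical packing of segments into fragments would follow the same greedy pattern on heights:

```python
def pack_line_segments(sub_widths, max_width):
    """Greedily pack consecutive sub-image widths into line segments no
    wider than the edge device's input width (claim 6). Returns the
    segments as (start, end) index ranges plus a global position table
    mapping each sub-image index to (segment id, start x in segment)."""
    segments, start, acc = [], 0, 0
    for i, w in enumerate(sub_widths):
        # Flush the preceding sub-images when this one would overflow;
        # `i > start` lets an oversized sub-image form its own segment.
        if acc + w > max_width and i > start:
            segments.append((start, i))
            start, acc = i, 0
        acc += w
    segments.append((start, len(sub_widths)))
    table = {}
    for seg_id, (lo, hi) in enumerate(segments):
        x = 0
        for idx in range(lo, hi):
            table[idx] = (seg_id, x)
            x += sub_widths[idx]
    return segments, table
```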
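The rearrangement of claim 7 can be sketched in two parts: sorting a line's sub-image results by their start x coordinate from the global position table, and resolving a contested boundary region between adjacent sub-images by confidence. The `(start_x, text)` and `(char_or_None, confidence)` representations are illustrative assumptions:

```python
def rearrange_line(sub_results):
    """Restore one text line: sort its sub-image results by the start x
    recorded in the global position table, then concatenate (claim 7).
    Each item is (start_x, recognized_text)."""
    return "".join(text for _, text in sorted(sub_results))

def merge_boundary(left, right):
    """Resolve one boundary region read by both adjacent sub-images
    (claim 7): keep the higher-confidence character when both sides
    recognized one, and treat the region as blank when neither did.
    Each side is a (char_or_None, confidence) pair."""
    (lc, lconf), (rc, rconf) = left, right
    if lc is None and rc is None:
        return ""                        # neither side saw a character
    if lc is None:
        return rc
    if rc is None:
        return lc
    return lc if lconf >= rconf else rc  # keep the more confident reading

# Hypothetical usage: two sub-images of one line, out of fragment order.
print(rearrange_line([(150, "2x+1"), (0, "y=")]))   # reading-order text
```

A final pass would apply the claim's language-model correction to low-confidence characters before output.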

Description

OCR (optical character recognition) method and system for large-model answer sheet based on detection and character segmentation

Technical Field

The invention relates to the technical field of optical character recognition, and in particular to a large-model answer sheet OCR recognition method and system based on detection and character segmentation.

Background

The recognition accuracy of optical character recognition (OCR) technology based on multimodal large models in complex scenes is significantly better than that of traditional OCR methods. However, in practical application scenarios such as K12 education and personal examinations, answer sheet OCR recognition faces the following technical challenges. On the one hand, handwriting on answer sheets varies greatly in character style and writing habits, with special phenomena such as cursive connection, corrections, and blurring; general-purpose OCR models depend too heavily on training data, and their recognition results struggle to meet practical application requirements. A multimodal large model can effectively address these problems, but its huge parameter count and computational complexity place extremely high demands on hardware resources. On the other hand, when large models are deployed on edge devices, they face severe resource constraints: limited compute, small memory capacity, and no support for directly inputting high-resolution images. A large model usually requires fixed-size input; if a high-resolution answer sheet image is directly resized or cut into fixed windows, characters can be broken and semantic information lost, greatly reducing recognition accuracy.
When the prior art addresses these problems, a simple uniform segmentation strategy is often adopted that cannot adapt to the actual distribution of the text content, so characters are easily mis-segmented and the final recognition result suffers. Therefore, how to fully exploit the OCR capability of a large model on a resource-constrained edge device while avoiding information loss during image preprocessing has become a key technical problem to be solved.

Disclosure of Invention

To solve the above technical problems, the invention provides a large-model answer sheet OCR recognition method and system based on detection and character segmentation, which combine lightweight text detection with character-interval segmentation to process high-resolution answer sheet images efficiently on edge devices and significantly improve OCR recognition speed and accuracy. To achieve the above purpose, the invention provides the following technical solutions. In one aspect, an embodiment of the invention provides a large-model answer sheet OCR recognition method based on detection and character segmentation, comprising the following steps: acquiring an answer sheet image dataset and constructing a fine-tuning dataset for OCR recognition, wherein each sample in the dataset comprises an answer sheet image path and the corresponding text content annotation; fine-tuning a multimodal large model on the fine-tuning dataset to obtain an OCR recognition model dedicated to answer sheets, converting the model into a format supported by the edge device, and deploying it to the edge device; performing lightweight text detection on an input answer sheet image to locate the text box region of each line of text on the answer sheet; for any text box region whose width exceeds the input size limit of the edge device, performing a character-interval segmentation operation to obtain a plurality of sub-images; stitching the sub-images at high resolution in the original text order to generate image fragments conforming to the fixed input size of the edge device; feeding the image fragments into the OCR recognition model deployed on the edge device to obtain character-level recognition results; and rearranging the character-level recognition results according to the original character position information recorded at the character segmentation points, restoring an output consistent with the original text order of the answer sheet. Optionally, constructing the fine-tuning dataset for OCR recognition comprises: pre-labeling the answer sheet images with a multimodal large model by calling the large model's API to obtain OCR pre-labeling results; manually checking a