
CN-121982737-A - Table data extraction method and device based on synergy of OCR and visual language model

CN 121982737 A

Abstract

Embodiments of this specification provide a table data extraction method and device based on the synergy of OCR and a visual language model. By fusing the precise localization capability of optical character recognition with the deep semantic understanding of a visual language model, the method builds a complete collaborative processing framework from image preprocessing to structured data output, overcoming limitations of the prior art. It abandons dependence on fixed templates and large-scale annotated data, determines table types through adaptive structure analysis, and uses a multi-layer progressive fault-tolerance mechanism to remain robust under a variety of abnormal conditions, markedly improving generalization to tables of different formats and qualities. In addition, comprehensive verification covering format, semantics, and consistency, together with intelligent error correction, ensures that the output data is not only accurate at the character level but also sound and reliable at the business-logic level.

Inventors

  • TANG KANGJIE
  • DU CHAO
  • WANG HONGMEI
  • LI JIANHUA
  • LI JINQING
  • ZHANG ZHI

Assignees

  • 北京易智时代数字科技有限公司

Dates

Publication Date
2026-05-05
Application Date
2026-04-01

Claims (10)

  1. A method for extracting table data based on the cooperation of OCR and a visual language model, comprising: performing preprocessing and optical character recognition on an input table document image to obtain a plurality of text boxes together with their coordinates and recognized text content; detecting the table structure of the image, partitioning the bounding boxes of a plurality of cells, matching the coordinates of each text box against the cell bounding boxes, and associating the recognized text content with the corresponding cell to form a cell set containing a text content set; filtering the cell set to obtain a valid data cell set, performing feature extraction and structure analysis on the valid data cell set, and determining the overall structure type of the table; based on the overall structure type, inputting the image regions corresponding to the valid data cell set, together with structured prompts, into a visual language model for batch recognition to obtain a primary recognition result containing field values and semantic data types, and executing multi-layer fault-tolerant processing on the primary recognition result to obtain field recognition results after supplementation or correction; and performing format verification, semantic verification, and consistency verification on the field recognition results, applying intelligent error correction to problems found during verification, and generating and outputting final structured data based on the overall structure type and the verified and corrected field recognition results.
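The coordinate-matching step of claim 1 (associating OCR text boxes with detected cell bounding boxes) can be sketched as a center-point containment test. The function name and `(x1, y1, x2, y2)` box format are illustrative assumptions, not the patent's implementation:

```python
def match_text_to_cells(text_boxes, cell_boxes):
    """Assign each OCR text box to the cell whose bounding box
    contains the text box's center point (hypothetical helper
    illustrating the coordinate-matching step of claim 1).

    text_boxes: list of (text, (x1, y1, x2, y2)) tuples from OCR
    cell_boxes: list of (x1, y1, x2, y2) cell bounding boxes
    Returns a dict mapping cell index -> list of text fragments.
    """
    cells = {i: [] for i in range(len(cell_boxes))}
    for text, (x1, y1, x2, y2) in text_boxes:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        for i, (bx1, by1, bx2, by2) in enumerate(cell_boxes):
            if bx1 <= cx <= bx2 and by1 <= cy <= by2:
                cells[i].append(text)
                break  # a text box belongs to at most one cell
    return cells
```

Center-point containment sidesteps ambiguity when a text box slightly overlaps a cell border, which is common with skewed scans.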
  2. The method of claim 1, wherein performing feature extraction and structure analysis on the valid data cell set to determine the overall structure type of the table comprises: extracting semantic features from the text content of the valid data cell set and detecting whether grouping keywords exist; extracting visual features from the image and detecting whether horizontal and vertical separator lines exist; analyzing the spatial distribution of the valid data cell set in the image and extracting layout features; and, based on a fusion analysis of the semantic, visual, and layout features, determining by preset rules whether the overall structure type of the table is a simple table or a grouped table.
  3. The method of claim 2, wherein inputting the image regions corresponding to the valid data cell set and the structured prompts into the visual language model for batch recognition based on the overall structure type comprises: batching the valid data cell set according to the overall structure type and the context-length constraint of the visual language model; constructing input data for each batch of valid data cells, wherein the input data comprises cropped image regions and the structured prompts, and the structured prompts define the fields to be extracted, the output format, and business-rule constraints; and invoking the visual language model to process the input data and output the field value and semantic data type of each valid data cell in the batch.
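The batching constraint in claim 3 (grouping cells so each request fits the model's context window) can be sketched as a greedy packer. The fixed per-cell token cost is a simplifying assumption; a real system would query the model's tokenizer:

```python
def batch_cells(cells, prompt_tokens, token_budget, tokens_per_cell):
    """Greedily group cells into batches so that the prompt plus the
    cell payload stays within the model's context window (illustrative
    sketch of the batching step in claim 3)."""
    batches, current, used = [], [], prompt_tokens
    for cell in cells:
        if current and used + tokens_per_cell > token_budget:
            batches.append(current)           # flush the full batch
            current, used = [], prompt_tokens  # restart with prompt cost
        current.append(cell)
        used += tokens_per_cell
    if current:
        batches.append(current)
    return batches
```

Each batch pays the prompt cost once, so larger context windows amortize the structured prompt over more cells per call.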
  4. The method of claim 1, wherein the multi-layer fault-tolerant processing comprises the following layers, executed in sequence: a first layer that, when the confidence of the visual language model's batch recognition result falls below a threshold, inputs the image region of the corresponding cell and a prompt containing richer context into the visual language model for single-cell re-recognition; a second layer that, when single-cell re-recognition fails, performs matching extraction from the text content set associated with the cell using a regular expression corresponding to the field type; a third layer that, when regular-expression matching fails, invokes a rule engine and uses the global logical relationships among successfully recognized fields to infer and complete the value; and a fourth layer that marks fields which still lack a valid value after the third layer as requiring manual review.
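The layered fallback of claim 4 can be sketched as a cascade that records which layer produced each value. This is a simplified illustration: layer 1's re-recognition with a richer prompt is folded into the initial confidence check, and the rule engine is represented by a caller-supplied inference hook:

```python
import re

def recover_field(vlm_result, confidence, cell_text, field_pattern,
                  infer_from_context, threshold=0.8):
    """Fault-tolerance cascade sketch for claim 4:
    (1) accept a sufficiently confident VLM result;
    (2) regex-match the field from the cell's OCR text;
    (3) infer the value from already-recognized fields via a rule hook;
    (4) otherwise flag the field for manual review.
    Returns (value, source_layer)."""
    if vlm_result is not None and confidence >= threshold:
        return vlm_result, "vlm"
    m = re.search(field_pattern, cell_text or "")
    if m:
        return m.group(0), "regex"
    inferred = infer_from_context()
    if inferred is not None:
        return inferred, "rule_engine"
    return None, "manual_review"
```

Tracking the source layer per field is useful downstream: regex- and rule-derived values can be routed to stricter verification than high-confidence VLM output.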
  5. The method of claim 1, wherein performing format verification, semantic verification, and consistency verification on the field recognition results and applying intelligent error correction to problems found during verification comprises: performing format verification to check whether the numeric value or text format of each field recognition result conforms to the predefined semantic data type standard; performing semantic verification to check whether the value of each field recognition result is reasonable under the business logic; performing consistency verification to check whether the logical relationships among different field recognition results conform to predefined global constraints; and correcting the format errors or logical contradictions found during verification based on common character-confusion patterns and a domain knowledge base.
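The character-confusion correction in claim 5 can be sketched for the numeric case: a value that fails its format check is retried after substituting commonly confused characters. The confusion map is an illustrative subset, and semantic and cross-field consistency checks would layer on top of this:

```python
import re

# Common OCR character confusions (illustrative subset only).
CONFUSION_MAP = {"O": "0", "l": "1", "S": "5", "B": "8"}

def verify_and_correct(value, semantic_type):
    """Format-verification plus error-correction sketch for claim 5.
    Returns (possibly corrected value, passed_verification)."""
    if semantic_type != "number":
        return value, True  # only the numeric path is sketched here
    pattern = r"-?\d+(\.\d+)?"
    if re.fullmatch(pattern, value):
        return value, True
    # Retry after substituting commonly confused characters.
    corrected = "".join(CONFUSION_MAP.get(ch, ch) for ch in value)
    if re.fullmatch(pattern, corrected):
        return corrected, True
    return value, False  # unresolved: surface for consistency checks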
  6. The method of claim 1, wherein extracting document-level global information from the image comprises: cropping a preset top region of the image and inputting the image of that region together with a first prompt into the visual language model to extract at least one item of metadata among a title, a number, and a date; and, when visual language model extraction fails, matching and extracting the metadata from all recognized text content obtained by optical character recognition using predefined regular-expression patterns.
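The regex fallback of claim 6 can be sketched as a pattern table scanned against the full OCR text. These patterns are hypothetical placeholders; a real deployment would tailor them to its document corpus:

```python
import re

# Illustrative fallback patterns for document-level metadata
# (hypothetical formats, not from the patent).
METADATA_PATTERNS = {
    "date":   r"\d{4}[-/.]\d{1,2}[-/.]\d{1,2}",
    "number": r"(?:No\.?|#)\s*([A-Z0-9-]+)",
}

def extract_metadata_fallback(full_text):
    """Regex fallback for when VLM extraction of the header region
    fails: scan all recognized OCR text for metadata patterns."""
    found = {}
    for key, pat in METADATA_PATTERNS.items():
        m = re.search(pat, full_text)
        if m:
            # Prefer the capture group when the pattern defines one.
            found[key] = m.group(1) if m.groups() else m.group(0)
    return found
```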
  7. The method of claim 1, wherein filtering the cell set comprises: position-based filtering that removes cells located in a preset percentage region at the top of the image and in preset percentage regions at the left and right edges of the image; size-based filtering that removes cells whose width or height is smaller than a preset pixel threshold; and content-based filtering that removes cells whose text content set is empty.
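The three filters of claim 7 compose naturally into a single pass. The claim only specifies "preset" thresholds, so the percentage and pixel values below are illustrative defaults:

```python
def filter_cells(cells, img_w, img_h, top_pct=0.1, edge_pct=0.05,
                 min_px=5):
    """Position/size/content filtering sketch for claim 7.
    cells: list of ((x1, y1, x2, y2), [text, ...]) tuples.
    Thresholds are illustrative; the claim leaves them preset."""
    kept = []
    for (x1, y1, x2, y2), texts in cells:
        if y2 <= img_h * top_pct:
            continue  # position filter: top band (title/header area)
        if x2 <= img_w * edge_pct or x1 >= img_w * (1 - edge_pct):
            continue  # position filter: left/right margins
        if (x2 - x1) < min_px or (y2 - y1) < min_px:
            continue  # size filter: too small to hold real data
        if not any(t.strip() for t in texts):
            continue  # content filter: empty text content set
        kept.append(((x1, y1, x2, y2), texts))
    return kept
```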
  8. A table data extraction apparatus based on OCR in conjunction with a visual language model, comprising: a content recognition module configured to perform preprocessing and optical character recognition on an input table document image to obtain a plurality of text boxes together with their coordinates and recognized text content; a set partitioning module configured to detect the table structure of the image, partition the bounding boxes of a plurality of cells, match the coordinates of each text box against the cell bounding boxes, and associate the recognized text content with the corresponding cell to form a cell set containing a text content set; a type determination module configured to extract document-level global information from the image, filter the cell set to obtain a valid data cell set, perform feature extraction and structure analysis on the valid data cell set, and determine the overall structure type of the table; a result recognition module configured to input the image regions corresponding to the valid data cell set and structured prompts into a visual language model for batch recognition based on the overall structure type to obtain a primary recognition result containing field values and semantic data types, and to execute multi-layer fault-tolerant processing on the primary recognition result to obtain field recognition results after supplementation or correction; and a data generation module configured to perform format verification, semantic verification, and consistency verification on the field recognition results, apply intelligent error correction to problems found during verification, and generate and output final structured data based on the overall structure type and the verified and corrected field recognition results.
  9. A computing device, comprising a memory and a processor, wherein the memory is configured to store computer-executable instructions that, when executed by the processor, implement the steps of the table data extraction method based on OCR and a visual language model according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the table data extraction method based on OCR and a visual language model according to any one of claims 1 to 7.

Description

Table data extraction method and device based on synergy of OCR and visual language model

Technical Field

Embodiments of this specification relate to the technical field of image recognition, and in particular to a table data extraction method based on the cooperation of OCR and a visual language model.

Background

In the ongoing digital transformation across industries, a large amount of key data is stored in table documents in paper or electronic image form, and automated, high-precision extraction of that data is a core requirement for improving operational efficiency. Prior-art solutions fall mainly into two categories. The first combines optical character recognition with fixed rule templates; it achieves a degree of automation but depends heavily on predefined coordinate or keyword templates, cannot understand field semantics, is extremely sensitive to layout changes, and lacks the ability to handle complex layouts and logical relationships, resulting in poor generalization and high maintenance costs. The second is end-to-end table recognition models based on deep learning, trained on large amounts of annotated data to understand document structure; these reduce dependence on fixed templates to some extent, but their performance is severely limited by the scale and quality of training data, and they face the inherent drawbacks of high annotation cost, insufficient generalization capability, poor interpretability, and difficulty incorporating dynamic business rules for logical verification.
In addition, the prior art generally lacks an effective multi-level fault-tolerance mechanism and intelligent data verification and error-correction capability, and struggles to deliver stable, highly robust, highly accurate output in real scenarios with poor image quality, varied layouts, and complex field logic, which limits large-scale practical application of table data extraction technology. A better solution is therefore needed.

Disclosure of the Invention

In view of this, embodiments of this specification provide a table data extraction method based on OCR in conjunction with a visual language model. One or more embodiments of this specification are also directed to a table data extraction apparatus, a computing device, a computer-readable storage medium, and a computer program based on OCR in conjunction with a visual language model, to address the technical drawbacks of the related art.

According to a first aspect of the embodiments of this specification, there is provided a table data extraction method based on OCR in conjunction with a visual language model, comprising: performing preprocessing and optical character recognition on an input table document image to obtain a plurality of text boxes together with their coordinates and recognized text content; detecting the table structure of the image, partitioning the bounding boxes of a plurality of cells, matching the coordinates of each text box against the cell bounding boxes, and associating the recognized text content with the corresponding cell to form a cell set containing a text content set; filtering the cell set to obtain a valid data cell set, performing feature extraction and structure analysis on the valid data cell set, and determining the overall structure type of the table; based on the overall structure type, inputting the image regions corresponding to the valid data cell set and structured prompts into a visual language model for batch recognition to obtain a primary recognition result containing field values and semantic data types; and performing format verification, semantic verification, and consistency verification on the field recognition results, applying intelligent error correction to problems found during verification, and generating and outputting final structured data based on the overall structure type and the verified and corrected field recognition results.

In one possible implementation, performing feature extraction and structure analysis on the valid data cell set to determine the overall structure type of the table includes: extracting semantic features from the text content of the valid data cell set and detecting whether grouping keywords exist; analyzing the spatial distribution of the valid data cell set in the image and extracting layout features; and, based on a fusion analysis of the semantic, visual, and layout features, determining by preset rules whether the overall structure type of the table is a simple table or a grouped table.

In one possible implementation, inputting the image regions corresponding to the valid data cell set and the structured prompts into the visual language model for batch recognition based on the overall structure type includes: batching the set of valid da