CN-121706767-B - End-side self-adaptive document structure understanding method and system


Abstract

The invention provides an end-side self-adaptive document structure understanding method comprising: uniformly rendering and normalizing a document to be parsed, and outputting page-level pixel grids and basic metadata; performing lightweight layout analysis and region classification to obtain the bounding box, reading order and region type label of each region in a page; routing each document region to its corresponding dedicated analysis channel for parallel parsing, each analysis channel outputting a structured intermediate result and a confidence score; performing consistency checking and completion reasoning on the intermediate results output by the channels, and generating a traceable verification evidence chain for low-confidence fragments; fusing the checked channel results into a unified document-level structured output; and, when a new document type or a persistent low-confidence pattern is detected, starting an adaptation flow based on a parameter-efficient fine-tuning technique to generate a channel-level incremental weight packet and update the model parameters of the analysis channel. The method enables parallel and accurate parsing of different elements such as tables, formulas and text.
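The routing and parallel parsing summarized in the abstract can be pictured with a minimal Python sketch. Everything named here is an assumption made for illustration: the `Region` and `ChannelResult` records, the stub channels built by `make_stub_channel`, the fixed confidence values, and the use of a thread pool are not taken from the patent, which does not specify how the channels are implemented or scheduled.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Region:
    bbox: tuple         # (x0, y0, x1, y1) in page pixels
    reading_order: int  # position in the page's reading order
    label: str          # "text", "table", "formula" or "chart"


@dataclass
class ChannelResult:
    region: Region
    payload: dict       # structured intermediate result
    confidence: float   # 0.0 .. 1.0


def make_stub_channel(kind, confidence):
    """Build a placeholder channel; a real one would run a dedicated model."""
    def parse(region):
        return ChannelResult(region, {"kind": kind, "bbox": region.bbox}, confidence)
    return parse


# Hypothetical channel registry keyed by region type label.
CHANNELS = {
    "text": make_stub_channel("text", 0.95),
    "table": make_stub_channel("table", 0.90),
    "formula": make_stub_channel("formula", 0.85),
    "chart": make_stub_channel("chart", 0.80),
}


def route_and_parse(regions, max_workers=4):
    """Send each region to the channel matching its label and parse in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(CHANNELS[r.label], r) for r in regions]
        results = [f.result() for f in futures]
    # Restore reading order so later checking and assembly see regions in sequence.
    return sorted(results, key=lambda res: res.region.reading_order)
```

In this sketch a region whose label has no registered channel raises `KeyError` when the tasks are submitted; how unrecognized region types should be handled is left open here.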

Inventors

GUO SHA
CEN JIE
HAO FANG

Assignees

深圳行胜数字技术有限公司

Dates

Publication Date: 2026-05-12
Application Date: 2026-02-11

Claims (6)

1. An end-side adaptive document structure understanding method, comprising the following steps:
Step 1, obtaining a document to be parsed, uniformly rendering and normalizing it, and outputting a page-level pixel grid and basic metadata;
Step 2, performing lightweight layout analysis and region classification on the page-level pixel grid to obtain the bounding box, reading order and region type label of each region in the page;
Step 3, routing each document region, according to its type label, to a corresponding dedicated analysis channel for parallel parsing, wherein the dedicated analysis channels comprise at least a text analysis channel, a table analysis channel, a formula analysis channel and a chart analysis channel;
Step 4, outputting, by each analysis channel, a structured intermediate result and a confidence score, wherein the text analysis channel performs multi-level dynamic semantic segmentation on a text region, comprising: generating candidate segmentations at multiple granularities based on layout features and semantic features; scoring each candidate boundary with a segmentation confidence evaluation module; adaptively adjusting the segmentation boundaries according to the scores; performing closed-loop correction of the segmentation boundaries based on feedback from downstream entity recognition and relation extraction; and outputting a "semantic block-entity-relation" structure;
Step 5, performing, by a cross-channel cooperative controller, consistency checking and completion reasoning on the intermediate results output by the channels, and generating a traceable verification evidence chain for low-confidence fragments, wherein the checks performed by the cross-channel cooperative controller comprise: consistency between table header terms and the text description; consistency of the numerical units of the same indicator in text and tables; logical closure between statistical rows and data areas; consistency between chart data and table values; and consistency between chart data and text values;
Step 6, fusing, by a structured assembly engine, the checked channel results into a unified document-level structured output; and
Step 7, when a new document type or a persistent low-confidence pattern is detected, starting an adaptation flow based on a parameter-efficient fine-tuning technique to generate a channel-level incremental weight packet and update the model parameters of the corresponding analysis channel, so that the new document type is adapted quickly without affecting existing capabilities and the packet is loaded on demand in subsequent inference, wherein the parameter-efficient fine-tuning technique is Low-Rank Adaptation (LoRA), and the adaptation flow comprises: injecting trainable low-rank matrices only into the attention modules of the model for incremental training; training with a small labeled sample set; generating the incremental weight packet; and dynamically loading the corresponding incremental weight packet according to the document type during inference.
2. The end-side adaptive document structure understanding method according to claim 1, wherein in step 1, the document to be parsed includes an image file, a PDF file and a document file, and the basic metadata includes page number, resolution and orientation.
3. The end-side adaptive document structure understanding method according to claim 2, wherein in step 2, the region type label includes a text region, a table region, a formula region and a chart region.
4. The end-side adaptive document structure understanding method according to claim 3, wherein in step 4, the formula analysis channel performs symbol detection and structure parsing on the formula region, and outputs a LaTeX sequence and alignment information with page coordinates.
5. The end-side adaptive document structure understanding method according to claim 4, wherein in step 4, the chart analysis channel performs legend recognition, coordinate axis recognition and curve data extraction on the chart region, and outputs a structurable numerical sequence, units and source evidence.
6. An end-side adaptive document structure understanding system for implementing the end-side adaptive document structure understanding method of any one of claims 1-5, comprising:
a document rendering and normalization module for converting an input document into a page-level pixel grid and basic metadata;
a layout analysis and classification module for identifying document regions and their types;
a plurality of dedicated analysis channels for parsing document regions of the corresponding types, comprising at least a text analysis channel, a table analysis channel, a formula analysis channel and a chart analysis channel;
a cross-channel cooperative controller for performing consistency checking and evidence chain generation on the outputs of the text analysis channel, the table analysis channel, the formula analysis channel and the chart analysis channel;
a structured assembly engine for fusing the results of the text analysis channel, the table analysis channel, the formula analysis channel and the chart analysis channel and outputting unified structured data; and
a LoRA adaptation module for performing parameter-efficient fine-tuning on the text analysis channel, the table analysis channel, the formula analysis channel and the chart analysis channel when a new document type or a low-confidence pattern is detected.
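The incremental weight packet of the LoRA adaptation flow in claim 1 can be illustrated with a small pure-Python sketch. It uses the standard LoRA formulation, in which a frozen weight matrix W is combined with a low-rank update as W + (alpha / r) * B * A. The function names, the `PACKETS` registry keyed by document type, the initialization values and the list-of-lists matrices are all assumptions made for illustration; the training loop is omitted, and the patent does not disclose this code.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def make_lora_packet(d_out, d_in, rank, alpha, seed=0):
    """Create an untrained packet: A is small random, B is zero, so the update starts at zero."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
    B = [[0.0] * rank for _ in range(d_out)]
    return {"A": A, "B": B, "scale": alpha / rank}


def effective_weight(W, packet):
    """Return W + scale * (B @ A) as a new matrix; the base weight W is not modified."""
    delta = matmul(packet["B"], packet["A"])  # shape d_out x d_in
    s = packet["scale"]
    return [[w + s * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


# Hypothetical registry: document type -> incremental weight packet.
PACKETS = {}


def weights_for(doc_type, W):
    """Pick the packet for a document type at inference time; fall back to the base weights."""
    packet = PACKETS.get(doc_type)
    return W if packet is None else effective_weight(W, packet)
```

Because the base matrix is never overwritten, document types without a registered packet keep the original behavior, which is one way to read the claim's requirement that existing capabilities are not affected. Only the two small matrices need to be stored per document type, rather than a full copy of the weights.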

Description

End-side self-adaptive document structure understanding method and system

Technical Field

The invention relates to the technical field of artificial intelligence computing, and in particular to an end-side self-adaptive document structure understanding method and system.

Background

With the growing demand for intelligent processing in fields such as office automation and medical record analysis, end-side (on-device) document understanding has become a key technology. The prior art falls into three categories: traditional methods based on rules and templates, which are effective for fixed-format documents but generalize poorly; methods based on an end-to-end vision-language model (VLM), such as IBM Granite-Docling, which attempt to process all types of document elements with a single unified neural network; and methods based on a pipeline of specialized models, such as the Docling library, which extracts structured information by chaining several single-function models, such as a table recognition model (TableFormer), a formula parser and an OCR engine.

The main problems and disadvantages of the prior art are:

1. The accuracy-efficiency paradox of the end-to-end VLM. A unified VLM (such as Granite-Docling) is elegantly designed, but it must accurately understand heterogeneous elements such as text, tables, formulas and charts while maintaining complex structural relations, which places extremely high demands on model capacity. To preserve accuracy, the model is hard to make extremely lightweight, so end-side operation still faces computation latency and memory pressure; moreover, the one-size-fits-all approach does not match the depth of a specialized model on certain tasks (such as parsing the logical relations of a complex table).

2. Error propagation and coordination overhead of the pipeline architecture. The multi-model pipeline scheme decomposes the task, and although it can achieve optimal accuracy on each subtask, it carries a serious risk of error accumulation. For example, a layout analysis error in an upstream module may directly cause subsequent table or formula parsing to fail. In addition, running several models in sequence accumulates latency, and data exchange and coordination among the models require complex engineering scheduling, which increases system complexity and introduces instability.

3. Poor adaptability and high deployment cost. When a new document type appears (such as a new bill style or report template), both retraining the end-to-end VLM and adjusting the whole pipeline require full-parameter fine-tuning with a large amount of newly annotated data and computing resources; this is costly and slow (usually taking several weeks) and cannot meet the business need for rapid iteration.

Therefore, there is a need for an end-side document parsing scheme that combines high performance, high efficiency and high adaptability.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides an end-side self-adaptive document structure understanding method which, through an innovative split-flow processing architecture, achieves parallel and accurate parsing of different elements such as tables, formulas and text, greatly lowers the deployment and maintenance threshold, unifies current parsing capability with future self-evolution capability, and solves the problems of low performance, low efficiency and low adaptability of end-side document parsing in the prior art.
The invention discloses an end-side self-adaptive document structure understanding method, applied to a system formed by a cloud server and a mobile terminal, comprising the following steps:
Step 1, obtaining a document to be parsed, uniformly rendering and normalizing it, and outputting a page-level pixel grid and basic metadata;
Step 2, performing lightweight layout analysis and region classification on the page-level pixel grid to obtain the bounding box, reading order and region type label of each region in the page;
Step 3, routing each document region, according to its type label, to a corresponding dedicated analysis channel for parallel parsing, wherein the dedicated analysis channels comprise at least a text analysis channel, a table analysis channel, a formula analysis channel and a chart analysis channel;
Step 4, outputting, by each analysis channel, a structured intermediate result and a confidence score;
Step 5, performing, by the cross-channel cooperative controller, consistency checking and completion reasoning on the intermediate results output by the channels, and generating a traceable