CN-122021571-A - Multi-mode document processing method, system and equipment

Abstract

The application provides a multimodal document processing method, system and device. The method comprises: acquiring multi-source heterogeneous files and converting them into standard documents; performing layout element analysis on the standard documents to determine a layout set; analyzing the spatial relationships, visual flow and semantic associations among the document semantic blocks in the layout set to generate a global content sequence that conforms to the reading order; constructing an optimal parsing pipeline according to the types of the document semantic blocks in the global content sequence; parsing the document semantic blocks with the optimal parsing pipeline to determine parsing results; and generating structured data from the parsing results and outputting it. By correcting the reading order, the application preserves the logical continuity of the sequence and avoids the semantic discontinuity that arises when ordering relies on spatial position alone. By integrating the scattered recognition results into structured data, it addresses the problems of scattered information and difficult reuse.

Inventors

WANG XIAOHU
HU LEIXIN
LIANG YUQIAN

Assignees

广域铭岛数字科技有限公司
浙江吉利控股集团有限公司

Dates

Publication Date: 2026-05-12
Application Date: 2026-03-30

Claims (10)

1. A multimodal document processing method, comprising: acquiring a multi-source heterogeneous file and converting the multi-source heterogeneous file into a standard document; performing layout element analysis on the standard document to determine a layout set, wherein the layout set comprises the geometric coordinates and type label corresponding to each document semantic block; analyzing, according to the geometric coordinates and type label of each document semantic block in the layout set, the spatial relationships, visual flow and semantic associations among the document semantic blocks to generate a global content sequence conforming to the reading order; constructing an optimal parsing pipeline according to the type of each document semantic block in the global content sequence, and parsing each document semantic block based on the optimal parsing pipeline to determine a parsing result; and generating structured data from the parsing result and outputting the structured data.
2. The method of claim 1, further comprising, prior to performing layout element analysis on the standard document: extracting features from at least a part of the standard document to determine a feature profile, wherein the feature profile comprises a document type, a layout style and an element distribution; if the document type in the feature profile is a preset type, preprocessing the standard document to obtain an enhanced standard document, wherein the preprocessing comprises resolution enhancement, page rotation correction and image quality optimization; and if the document type in the feature profile is not the preset type, directly outputting the standard document.
3. The method of claim 1, wherein performing layout element analysis on the standard document to determine a layout set comprises: calling a preset document layout analysis model to perform layout element analysis on the standard document and determine the geometric coordinates and type label corresponding to each document semantic block; and associating the geometric coordinates of each document semantic block with its type label to form independent data, and packaging the independent data of the document semantic blocks to form the layout set, wherein the type label comprises at least one of a title, a text paragraph, a table, a mathematical formula, a chart, a list, a code block and an annotation.
4. The multimodal document processing method according to claim 1, wherein analyzing the spatial relationships, visual flow and semantic associations among the document semantic blocks according to the geometric coordinates and type label of each document semantic block in the layout set to generate a global content sequence conforming to the reading order comprises: analyzing the spatial relationships, visual flow and semantic associations among the document semantic blocks in the layout set through a preset reading order ranking model to generate a spatial relationship score, a visual flow score and a semantic association score; performing a weighted calculation on the spatial relationship score, the visual flow score and the semantic association score of each document semantic block according to preset weight coefficients to determine a weighted score; and sorting the document semantic blocks by their weighted scores and reorganizing them from the highest score to the lowest to generate the global content sequence conforming to the reading order.
5. The multimodal document processing method according to claim 1, wherein analyzing the spatial relationships, visual flow and semantic associations among the document semantic blocks according to the geometric coordinates and type label of each document semantic block in the layout set to generate a global content sequence conforming to the reading order comprises: determining a spatial relationship matrix among the document semantic blocks based on the geometric coordinates of each document semantic block; establishing a semantic association matrix among the document semantic blocks based on the type labels and the spatial relationship matrix; determining a visual flow path according to the layout structure formed by the spatial relationship matrix of the document semantic blocks and a preset global reading direction; and performing weighted fusion of the spatial relationship matrix, the visual flow path and the semantic association matrix to determine the global content sequence conforming to the reading order.
6. The method of claim 1, wherein constructing an optimal parsing pipeline according to the type of each document semantic block in the global content sequence, and parsing each document semantic block based on the optimal parsing pipeline to determine a parsing result, comprises: performing character recognition, through a preset optical character recognition model, on the document semantic blocks whose type label is at least one of a title, a text paragraph, a code block and an annotation, to determine a document recognition result; locating, through a preset formula localization model, the document semantic blocks whose type label is a mathematical formula, and parsing the located mathematical formula through a preset formula parsing model to determine a formula recognition result; classifying, through a preset table classification model, the document semantic blocks whose type label is a chart, a list or a table, to determine a wired table and a wireless table; identifying the type of each document semantic block in the global content sequence so as to construct the optimal parsing pipeline; and calling the preset parsing models according to the optimal parsing pipeline to parse each document semantic block, and taking at least one of the document recognition result, the table recognition result and the formula recognition result as the parsing result.
7. The method of claim 6, wherein determining the table recognition result by recognizing the respective contents of the wired table and the wireless table comprises: performing line-frame detection on the wired table through a preset wired table parsing model to determine reconstructed cells; performing text alignment and semantic spacing inference on the wireless table through a preset wireless table parsing model to determine a table structure; and fusing the recognition results of the reconstructed cells and the table structure to generate a table recognition result containing complete row-column logic and two-dimensional structured data.
8. The method according to any one of claims 1 to 7, further comprising, after generating the structured data from the parsing result: optimizing the structured data based on reading order and semantic analysis, wherein the optimizing comprises automatically joining paragraphs, list items or tables that are broken across pages and filling in missing logical connectives, and the reading order and the semantic analysis are determined from the global content sequence; and checking whether the document structure in the optimized structured data has discontinuous heading numbers, missing chart references or missing key chapters, and generating a check report.
9. A multimodal document processing system, comprising: an acquisition module, configured to acquire a multi-source heterogeneous file and convert the multi-source heterogeneous file into a standard document; a layout analysis module, configured to perform layout element analysis on the standard document to determine a layout set, wherein the layout set comprises the geometric coordinates and type label corresponding to each document semantic block; a sequence reorganization module, configured to analyze the spatial relationships, visual flow and semantic associations among the document semantic blocks according to the geometric coordinates and type label of each document semantic block in the layout set, so as to generate a global content sequence conforming to the reading order; an intelligent parsing module, configured to construct an optimal parsing pipeline according to the type of each document semantic block in the global content sequence, and to parse each document semantic block based on the optimal parsing pipeline to determine a parsing result; and a structured data module, configured to generate structured data from the parsing result and output the structured data.
10. An electronic device comprising a processor, a memory and a communication bus for connecting the processor and the memory, the processor being adapted to execute a computer program stored in the memory for implementing the multi-modal document processing method according to any one of claims 1 to 8.
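
For illustration only, the weighted ordering recited in claim 4 can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the names `LayoutBlock`, `BlockScores`, `weighted_score` and `global_content_sequence`, the `(x0, y0, x1, y1)` box convention, and the weight values `(0.5, 0.3, 0.2)` are all hypothetical, and the three per-block scores are assumed to have been produced already by the preset reading order ranking model.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class LayoutBlock:
    """One document semantic block: geometric coordinates plus a type label."""
    block_id: str                            # hypothetical identifier
    bbox: Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1) convention
    type_tag: str                            # e.g. "title", "text_paragraph", "table"


@dataclass(frozen=True)
class BlockScores:
    """Per-block scores, assumed to come from a reading order ranking model."""
    spatial: float
    visual_flow: float
    semantic: float


def weighted_score(scores: BlockScores,
                   weights: Tuple[float, float, float] = (0.5, 0.3, 0.2)) -> float:
    """Combine the three scores with preset weight coefficients (values assumed)."""
    w_spatial, w_flow, w_semantic = weights
    return (w_spatial * scores.spatial
            + w_flow * scores.visual_flow
            + w_semantic * scores.semantic)


def global_content_sequence(blocks: List[LayoutBlock],
                            scores: Dict[str, BlockScores],
                            weights: Tuple[float, float, float] = (0.5, 0.3, 0.2),
                            ) -> List[LayoutBlock]:
    """Order blocks from the highest weighted score to the lowest."""
    return sorted(blocks,
                  key=lambda b: weighted_score(scores[b.block_id], weights),
                  reverse=True)
```

Because `sorted` is stable, blocks with equal weighted scores keep their input order; a real system would likely need an explicit tie-breaking rule, which the claims do not specify.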

Description

Multi-mode document processing method, system and equipment

Technical Field

The present application relates to the field of intelligent document processing, and in particular to a multimodal document processing method, system and device.

Background

With the rapid development of information and intelligent technologies, document parsing and generation technologies are widely applied in fields such as office automation, data processing and content generation. Document parsing technology has now reached a multimodal fusion stage, which combines visual, textual and layout information to improve understanding and handles complex tasks through model integration or a pipeline.

However, existing document parsing technology still has the following shortcomings. First, the processing flow and strategy are fixed and cannot be adjusted dynamically to the large differences in document type, quality and complexity, so simple documents are over-processed and resources are wasted, while complex or low-quality documents are under-processed. Second, stacking many single-purpose models or relying on a standalone heavy model leads to low computational efficiency, and each newly added document type requires re-annotated data and retraining, so scalability and maintainability are poor. Third, the models operate in isolation and lack an effective collaboration mechanism and knowledge sharing capability, so they can neither adapt quickly to small samples nor evolve autonomously based on feedback.
Therefore, there is a need for a method that can autonomously evaluate documents, intelligently schedule multi-model resources and parse documents adaptively, so as to break through the bottlenecks of current technology in flexibility, efficiency and generalization capability.

Disclosure of Invention

The application provides a multimodal document processing method, system and device, to solve the problem in the prior art that multi-source heterogeneous documents cannot be processed accurately and efficiently according to their type and complexity.

In a first aspect, the multimodal document processing method comprises: acquiring a multi-source heterogeneous file and converting the multi-source heterogeneous file into a standard document; performing layout element analysis on the standard document to determine a layout set, wherein the layout set comprises the geometric coordinates and type label corresponding to each document semantic block; analyzing, according to the geometric coordinates and type label of each document semantic block in the layout set, the spatial relationships, visual flow and semantic associations among the document semantic blocks to generate a global content sequence conforming to the reading order; constructing an optimal parsing pipeline according to the type of each document semantic block in the global content sequence, and parsing each document semantic block based on the optimal parsing pipeline to determine a parsing result; and generating structured data from the parsing result and outputting the structured data.
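
The type-driven construction of the parsing pipeline summarized above can be sketched as follows. This is a hedged illustration rather than the applicant's implementation: the block record layout, the label strings, the function names `build_pipeline` and `run_pipeline`, and the three placeholder parsers are hypothetical stand-ins for the preset models. Routing charts, lists and tables to the same parser follows the grouping used in the claims.

```python
from typing import Callable, Dict, List

Block = Dict[str, str]                      # hypothetical record: {"id": ..., "type": ...}
Parser = Callable[[Block], Dict[str, str]]


# Placeholder parsers standing in for the preset models (all hypothetical).
def parse_text(block: Block) -> Dict[str, str]:
    return {"id": block["id"], "parser": "ocr"}


def parse_formula(block: Block) -> Dict[str, str]:
    return {"id": block["id"], "parser": "formula"}


def parse_table(block: Block) -> Dict[str, str]:
    return {"id": block["id"], "parser": "table"}


# Assumed label strings; the source names the categories, not these identifiers.
PARSERS_BY_TYPE: Dict[str, Parser] = {
    "title": parse_text,
    "text_paragraph": parse_text,
    "code_block": parse_text,
    "annotation": parse_text,
    "formula": parse_formula,
    "table": parse_table,
    "chart": parse_table,
    "list": parse_table,
}


def build_pipeline(sequence: List[Block]) -> List[Parser]:
    """Pick one parser per block, in reading order, based on its type label."""
    pipeline: List[Parser] = []
    for block in sequence:
        parser = PARSERS_BY_TYPE.get(block["type"])
        if parser is None:
            raise ValueError(f"no parser registered for type {block['type']!r}")
        pipeline.append(parser)
    return pipeline


def run_pipeline(sequence: List[Block]) -> List[Dict[str, str]]:
    """Parse every block with its selected parser, preserving the reading order."""
    pipeline = build_pipeline(sequence)
    return [parser(block) for parser, block in zip(pipeline, sequence)]
```

Building the pipeline before running it means an unsupported type label is reported before any model is invoked, which is one possible design choice; the source does not say how unknown types are handled.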
In some possible embodiments of the first aspect, before performing layout element analysis on the standard document, the method further comprises: extracting features from at least a part of the standard document to determine a feature profile, wherein the feature profile comprises a document type, a layout style and an element distribution; if the document type in the feature profile is a preset type, preprocessing the standard document to obtain an enhanced standard document, wherein the preprocessing comprises resolution enhancement, page rotation correction and image quality optimization; and if the document type in the feature profile is not the preset type, directly outputting the standard document.

In some possible embodiments of the first aspect, performing layout element analysis on the standard document to determine a layout set comprises: calling a preset document layout analysis model to perform layout element analysis on the standard document and determine the geometric coordinates and type label corresponding to each document semantic block; and associating the geometric coordinates of each document semantic block with its type label to form independent data, and packaging the independent data of the document semantic blocks to form the layout set, wherein the type label comprises at least one of a title, a text paragraph, a table, a mathematical formula, a chart, a list, a code block and an annotation.

In some possible embodiments of the first aspect, according to the geometric coordinates and type