CN-121811440-B - Analysis method and system for demonstration document


Abstract

The invention provides a method and a system for parsing a presentation document. The method comprises: obtaining recognition results for all pages of the presentation document, the recognition results comprising intermediate-state JSON data and Markdown-format text content for all pages; obtaining a structured information extraction outline of the presentation document according to the user intent and the recognition results; determining the key pages of the presentation document that correspond to the structured information extraction outline; and inputting the original page image, the text recognition result and the chart structure code of the key pages into a visual large model to output a structured parsing result of the presentation document that conforms to the user intent. Because the structured information extraction outline is generated from the user intent, the parsing path of the presentation document can be planned automatically without the user having to anticipate the structure or content of the document, which greatly lowers the barrier to use.
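The key-page step can be illustrated with a small ranking sketch. Claim 3 names BM25 for finding the page most relevant to the outline; the code below is a minimal, self-contained Okapi BM25 scorer, not the patented implementation. The tokenizer, the parameters `k1` and `b`, the IDF variant, and the sample pages and outline are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase word tokenizer; text without whitespace
    # word boundaries (e.g. Chinese) would need a proper segmenter.
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, pages, k1=1.5, b=0.75):
    """Score each page's text against the query with Okapi BM25."""
    docs = [tokenize(p) for p in pages]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0

    # Document frequency: number of pages containing each term.
    doc_freq = Counter()
    for d in docs:
        doc_freq.update(set(d))

    # Simplification: each distinct query term is counted once.
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalisation; falls back to k1 if every page is empty.
        norm = k1 * (1 - b + b * len(d) / avg_len) if avg_len else k1
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Non-negative IDF variant (assumed here).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


# Hypothetical Markdown text of three pages and one outline entry.
pages = [
    "# Company overview\nFounded in 2010, offices in three cities",
    "# Project budget\nTotal project amount and construction period by phase",
    "# Thanks\nQuestions and contact",
]
outline_entry = "project amount, construction period"

scores = bm25_scores(outline_entry, pages)
best_page = max(range(len(pages)), key=scores.__getitem__)
print(best_page, scores)
```

In this sketch only the second page shares terms with the outline entry, so it receives the only non-zero score. The rule-based selection of a second page described in claim 3 is not modelled here.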

Inventors

XIA TIAN
ZHANG YIWEN
BAI QI
YU DINGDING
XU QING
WANG HAORAN
CAO PEI
SHEN XULI
ZHAO SHUANG
LI HANWEN
ZHANG TAIYU

Assignees

华院计算技术(上海)股份有限公司

Dates

Publication Date: 20260512
Application Date: 20260310

Claims (9)

1. A method for parsing a presentation document, characterized by comprising the following steps: obtaining recognition results for all pages of the presentation document, wherein the recognition results comprise intermediate-state JSON data and Markdown-format text content of all pages; obtaining a structured information extraction outline of the presentation document according to a user intent and the recognition results; determining key pages of the presentation document corresponding to the structured information extraction outline; and inputting the original page image, the text recognition result and the chart structure code of the key pages into a visual large model to output a structured parsing result of the presentation document that conforms to the user intent; wherein the step of obtaining the recognition results for all pages of the presentation document comprises: performing raster rendering on all pages of the presentation document to obtain original page images of all pages; inputting the original page image of any page into a layout analysis model to output bounding box coordinates of the target elements of that page; inputting the bounding box coordinates of the target elements of that page into a text recognition model to output the text content of the target elements of that page; determining the reading order of all target elements of that page through a topological sorting algorithm, and generating the intermediate-state JSON data and the Markdown-format text content of that page according to the reading order; and determining the intermediate-state JSON data and the Markdown-format text content of all pages as the recognition results for all pages of the presentation document.
2. The method of parsing a presentation document according to claim 1, wherein the step of obtaining the structured information extraction outline of the presentation document according to the user intent and the recognition results comprises: obtaining an intent label of the user, wherein the intent label represents a broad user intent; and inputting the intent label and the recognition results into a large language model to output the structured information extraction outline of the presentation document.
3. The method of parsing a presentation document according to claim 1, wherein the step of determining key pages of the presentation document corresponding to the structured information extraction outline comprises: determining, through a BM25 algorithm, a first page of the presentation document having the highest relevance to the structured information extraction outline; comparing the structured information extraction outline with the recognition results, and determining a second page according to a preset rule; and determining the first page and the second page as the key pages.
4. The method of parsing a presentation document according to claim 1, wherein the original page image is obtained by raster rendering; and/or the text recognition result is the Markdown-format text content of the key page; and/or the chart structure code is obtained by inputting the charts in the key page into an ultra-lightweight document parsing model.
5. The method of parsing a presentation document according to claim 1, wherein the target element includes at least one of a title, a text paragraph, a picture, a table, a header and a footer; and/or the step of determining the reading order of all target elements of any page through the topological sorting algorithm comprises: traversing all target elements and establishing directed edges representing reading precedence relations based on a preset ordering rule; and executing Kahn's algorithm on the directed edges to sort all target elements and obtain the reading order of all target elements.
6. A parsing system for a presentation document, comprising: an acquisition module, configured to obtain recognition results for all pages of the presentation document, wherein the recognition results comprise intermediate-state JSON data and Markdown-format text content of all pages; an outline module, configured to obtain a structured information extraction outline of the presentation document according to a user intent and the recognition results; a determining module, configured to determine key pages of the presentation document corresponding to the structured information extraction outline; and a parsing module, configured to input the original page image, the text recognition result and the chart structure code of the key pages into a visual large model to output a structured parsing result of the presentation document that conforms to the user intent; wherein the parsing system further comprises: a rasterization module, configured to perform raster rendering on all pages of the presentation document to obtain original page images of all pages; a layout analysis module, configured to input the original page image of any page into a layout analysis model to output bounding box coordinates of the target elements of that page; a text recognition module, configured to input the bounding box coordinates of the target elements of that page into a text recognition model to output the text content of the target elements of that page; and a sorting module, configured to determine the reading order of all target elements of that page through a topological sorting algorithm and to generate the intermediate-state JSON data and the Markdown-format text content of that page according to the reading order; and wherein the determining module is further configured to determine the intermediate-state JSON data and the Markdown-format text content of all pages as the recognition results for all pages of the presentation document.
7. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method of parsing a presentation document according to any one of claims 1 to 5.
8. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the parsing method of a presentation document according to any one of claims 1 to 5.
9. A computer program product comprising a computer program which, when executed by a processor, implements a method of parsing a presentation document according to any one of claims 1 to 5.
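The reading-order step of claims 1 and 5 (directed edges from a preset ordering rule, then Kahn's algorithm) could be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the `Element` type, the `precedes` rule ("entirely above, or vertically overlapping and entirely to the left") and the fallback used when the rule produces a cycle are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    name: str
    x0: float  # left
    y0: float  # top
    x1: float  # right
    y1: float  # bottom


def precedes(a, b):
    # Assumed ordering rule: a is read before b if a lies entirely above b,
    # or the two overlap vertically and a lies entirely to the left of b.
    if a.y1 <= b.y0:
        return True
    vertical_overlap = a.y0 < b.y1 and b.y0 < a.y1
    return vertical_overlap and a.x1 <= b.x0


def reading_order(elements):
    n = len(elements)
    successors = [[] for _ in range(n)]
    indegree = [0] * n

    # Traverse all pairs and add a directed edge i -> j when i precedes j.
    for i, a in enumerate(elements):
        for j, b in enumerate(elements):
            if i != j and precedes(a, b):
                successors[i].append(j)
                indegree[j] += 1

    def position(i):
        return (elements[i].y0, elements[i].x0)

    # Kahn's algorithm: repeatedly emit an element with no unread predecessor.
    queue = deque(sorted((i for i in range(n) if indegree[i] == 0), key=position))
    order = []
    while queue:
        i = queue.popleft()
        order.append(i)
        for j in successors[i]:
            indegree[j] -= 1
            if indegree[j] == 0:
                queue.append(j)

    # Assumed fallback: this pairwise rule can produce cycles, which Kahn's
    # algorithm cannot order; leftover elements are appended by position.
    if len(order) < n:
        order.extend(sorted(set(range(n)) - set(order), key=position))
    return [elements[i] for i in order]


# Hypothetical page: a title, a two-column body and a footer, given unordered.
page = [
    Element("footer", 0, 90, 100, 100),
    Element("picture", 55, 20, 100, 80),
    Element("title", 0, 0, 100, 10),
    Element("body", 0, 20, 45, 80),
]
names = [e.name for e in reading_order(page)]
print(names)  # ['title', 'body', 'picture', 'footer']
```

The pairwise edge construction is quadratic in the number of elements, which is unlikely to matter for a single slide. Intermediate-state JSON and Markdown text could then be emitted by walking the returned list in order.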

Description

Analysis method and system for demonstration document

Technical Field

The disclosure relates to the technical field of intelligent document parsing, and in particular to a parsing method and a parsing system for presentation documents.

Background

In digital-era scenarios such as government document processing, enterprise knowledge management, intelligent bid analysis and automated auditing, presentations (PPT/PPTX) serve as a core carrier of information and hold a large amount of key business data, such as system architecture diagrams, fund flow charts and project milestone plans. Unlike linear, flowing documents such as Word files and Portable Document Format (PDF) files, PPT documents are characterized by highly free layout, predominantly visual information and strong logical jumps, which makes it difficult for conventional rule-based or single-modality document parsing techniques to achieve good results.

In the prior art, structured processing and extraction of PPT documents faces three main technical bottlenecks, which prevent the technical scheme from forming a closed loop:

First, reliance on manually defined extraction rules means there is no automated closed-loop capability. Existing document parsing schemes (e.g., RAG systems or conventional ETL tools) typically use a "fill-in-the-blank" interaction mode that requires the user to configure detailed extraction fields (a schema) in advance. For example, when processing project initiation materials, the user must explicitly instruct the system to extract specific fields such as "project amount" and "construction period". This mode relies heavily on the user's prior knowledge.
Faced with massive numbers of documents of unknown structure and varied business types (such as historical project initiation materials, bid analysis reports and industry research reports), users cannot predict which core sections a document contains. The system therefore cannot automatically map out the knowledge structure or achieve an unattended closed loop from document input to knowledge output, which greatly limits its application in knowledge base construction.

Second, heterogeneous information is fused in a fragmented way, and multi-modal synergy is poor. Mainstream schemes generally adopt a divide-and-conquer strategy: text is extracted with optical character recognition (OCR), pictures are cropped with an object detection model, and the results are finally either simply concatenated at the text level or stored separately. However, text and graphics in a PPT are often complementary or even strongly coupled. For example, the text may say "see the figure on the right for the specific flow", while the architecture diagram on the right shows the specific module interactions. If the OCR text and the picture are separated, or if only the OCR result is passed to a large language model (LLM) for extraction, the visual semantics of the picture (such as states represented by colors and flow directions represented by arrows) are lost. If the picture alone is processed by a multi-modal large model (VLM), the lack of high-precision OCR text as an aid leads to hallucinations when recognizing small text. The existing "LLM fusion" schemes cannot fully exploit the native ability of new-generation VLMs to process "visual pixels" and "auxiliary text" simultaneously, so the final parsing accuracy is often determined by the weakest modality, and a "1+1>2" effect cannot be achieved.

Third, the semantic understanding of complex specialized charts is not deep enough.
PPT documents contain a large number of specialized charts, such as software architecture diagrams, network topology diagrams, business line charts and Gantt charts. These charts contain not only visual information but also strict logical topological relationships. Although a general visual large model (VLM) can describe natural images well, it often struggles to understand the specific topology of a professional architecture diagram that contains dense text labels and complex connections, and logic inversions frequently occur (for example, a data flow of A -> B is misjudged as B -> A). Existing schemes lack an intermediate-state conversion mechanism dedicated to charts and cannot accurately convert the direction of a visual arrow into a logical data flow, so the parsing of technical-proposal PPTs stops at surface description and accurate structured data cannot be extracted.

Disclosure of Invention

The technical problem to be solved by the present disclosure is to overcome the defects of the prior art, namely reliance on manually defined extraction rules, lack of automated closed-loop capability, heterogeneou