CN-121997922-A - Multidimensional PDF document analysis and metadata extraction method
Abstract
The invention discloses a multidimensional PDF document analysis and metadata extraction method and relates to the technical field of data processing. The invention constructs a layered, collaborative, and adaptive hybrid parsing framework. First, the PDF document is preprocessed and its language and type attributes are determined. The document is then routed according to the result: Grobid is used preferentially for English text-type PDFs, and an OCR-engine-based parsing scheme is used for Chinese or picture-type PDFs. For page areas that these parsers leave uncovered, or for low-confidence metadata, a refined extraction strategy based on a large language model and a multi-OCR voting mechanism is introduced to improve accuracy. The method further comprises text reorganization based on coordinate mapping and structured parsing of references based on three-layer progressive rules. The invention achieves high-precision, automatic extraction of metadata, text, and reference information from PDF documents while keeping cost under control.
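The language and type routing summarized in the abstract can be illustrated with a minimal sketch. It assumes a character-ratio test on sampled text; the threshold value, the function names `detect_language` and `choose_path`, and the `grobid_parse_ok` flag are hypothetical and are not taken from the patent, which only mentions "a preset threshold".

```python
import re

# Hypothetical threshold: share of Chinese characters above which the
# document is treated as Chinese. The patent does not state a value.
CHINESE_RATIO_THRESHOLD = 0.3

CJK = re.compile(r"[\u4e00-\u9fff]")   # CJK Unified Ideographs (basic block)
LATIN = re.compile(r"[A-Za-z]")


def detect_language(sample_text: str,
                    threshold: float = CHINESE_RATIO_THRESHOLD) -> str:
    """Return 'zh', 'en', or 'unknown' from the Chinese/Latin letter ratio."""
    zh = len(CJK.findall(sample_text))
    en = len(LATIN.findall(sample_text))
    total = zh + en
    if total == 0:
        return "unknown"
    return "zh" if zh / total > threshold else "en"


def choose_path(language: str, grobid_parse_ok: bool) -> str:
    """Route English text-type PDFs to Grobid and everything else to OCR."""
    if language == "en" and grobid_parse_ok:
        return "grobid"
    return "ocr"


if __name__ == "__main__":
    lang = detect_language("Deep learning for PDF parsing")
    print(lang, choose_path(lang, grobid_parse_ok=True))
```

In this sketch a document with no recognizable letters falls through to the OCR path, which is one possible choice rather than something the source specifies.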
Inventors
- WANG LEI
- ZHENG LEI
- LI TING
- CHANG QING
- CAI WENHUI
Assignees
- 合肥机数量子科技有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20251225
Claims (10)
- 1. A multidimensional PDF document analysis and metadata extraction method, characterized by comprising the following steps: Step 1, preprocessing the PDF document and streaming it page by page, wherein each page is independently converted into a high-resolution byte stream or image buffer format and recorded in an ordered list [ page-1, page-2, ], providing standard, unified input for subsequent analysis; Step 2, judging the language attribute and the type attribute of the document through sampling statistics and a heuristic analysis strategy; Step 3, receiving and validating a user-defined metadata template uploaded by the user and, after validation succeeds, injecting it into the subsequent extraction stages as the priority basis for metadata extraction; Step 4, for a document judged in Step 2 to be of English text type, starting a Grobid-based analysis path, parsing the document with a Grobid server, performing coarse metadata extraction on the page areas not covered by the Grobid analysis, and judging the file type by combining rules; Step 5, for a document judged in Step 2 to be of Chinese text type or picture type, starting an OCR-engine-based analysis path, processing all pages with an OCR engine having layout analysis capability, performing coarse metadata extraction on the page areas not covered by the OCR analysis, and judging the file type by combining rules; Step 6, performing fine metadata extraction on the uncovered page areas based on a large language model and a voting mechanism; Step 7, mapping the analysis results to a standardized coordinate system, then globally sorting and splicing the text blocks according to their coordinates to reorganize the text structure; Step 8, identifying and structurally parsing reference blocks through three layers of progressive rules; Step 9, fusing the refined metadata, the structured text, and the references, and synthesizing a final structured document object according to a predefined template for output.
- 2. The multidimensional PDF document analysis and metadata extraction method of claim 1, wherein Step 2 includes: Step 2.1, language attribute judgment, namely selecting several representative pages with a sampling strategy, scanning them quickly with a lightweight OCR engine, counting the proportion of Chinese characters to English characters in the extracted plain text, determining the dominant language of the document according to a preset threshold, and taking the dominant language as the language attribute of the document; Step 2.2, type attribute judgment, namely, when the language attribute is judged to be English, heuristically attempting to parse the document with Grobid, marking the document as text type if the parsing succeeds, and judging the document to be picture type if the parsing fails.
- 3. The multidimensional PDF document analysis and metadata extraction method of claim 1, wherein Step 4 includes: Step 4.1, submitting all pages to the Grobid server for analysis; Step 4.2, using the coordinate information output by the Grobid server to mark the successfully parsed areas as covered and the remaining areas as uncovered page areas page-s; Step 4.3, analyzing the uncovered page areas page-s with an OCR engine to obtain preliminary text, and using a regular expression parser, in combination with the Grobid analysis results, to extract metadata and judge the file type.
- 4. The multidimensional PDF document analysis and metadata extraction method of claim 1, wherein Step 5 includes: Step 5.1, processing all pages with an OCR engine having layout analysis capability and outputting a recognition result with a hierarchical structure and coordinate information; Step 5.2, defining rules according to prior knowledge of academic document structure, identifying the core content blocks and marking them as covered, and marking the remaining areas as uncovered page areas page-s; Step 5.3, applying a targeted set of regular expression rules to extract basic metadata and make a preliminary judgment of the file type.
- 5. The multidimensional PDF document analysis and metadata extraction method of claim 3 or 4, wherein the file type judgment outputs a confidence score from the combination of matched features, outputs "academic paper" or "patent document" if the score is greater than a set threshold, and outputs a custom type if the score is less than or equal to the set threshold.
- 6. The multidimensional PDF document analysis and metadata extraction method of claim 1, wherein Step 6 includes: Step 6.1, performing text recognition on the uncovered page areas page-s with several different lightweight OCR engines to obtain different text versions of the same content; Step 6.2, generating instruction prompts for large-model metadata extraction according to the preliminarily judged file type or the user-defined metadata template; Step 6.3, submitting the prompts to a large language model for independent metadata extraction to obtain a group of metadata extraction results; Step 6.4, for each metadata field, collecting its values across all extraction results and selecting a final value with a decision fusion strategy based on frequency statistics; Step 6.5, merging the fused value of each field with the stable information extracted by the main analysis path and resolving conflicts, so as to form a complete document metadata dictionary.
- 7. The multidimensional PDF document analysis and metadata extraction method of claim 6, wherein in Step 6.4 the decision fusion strategy based on frequency statistics includes majority voting and weighted averaging, and the value with the highest frequency of occurrence among the metadata values extracted via the multiple OCR engines is selected as the final value.
- 8. The multidimensional PDF document analysis and metadata extraction method of claim 7, wherein in Step 6.4, when the frequencies of occurrence are the same or all results are null, fallback strategies are applied in sequence: a value is first selected according to each OCR engine's historical confidence score on that metadata field, and if the value cannot be determined from the confidence scores, the result of the regular-expression-based extraction is used.
- 9. The multidimensional PDF document analysis and metadata extraction method of claim 1, wherein the three-layer progressive rules in Step 8 include: a first layer that combines title features, format features, and position features to identify the reference region; a second layer that, within the identified reference region, splits the continuous text into individual references based on sequence number patterns or line-start alignment features; and a third layer that, for each reference, applies regular expressions and heuristic rules to extract the author, title, source, year, and DOI/ISBN subfields.
- 10. The method according to claim 1, wherein in Step 9, if the system fails to identify any known type, the document is defined as a "general document", its title is extracted, and the complete recognized text content is provided.
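The frequency-based decision fusion of claims 6 to 8 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it covers majority voting only (not the weighted averaging also named in claim 7), and the function name `fuse_field`, the engine names, and the shape of the confidence table are hypothetical.

```python
from __future__ import annotations

from collections import Counter
from typing import Optional


def fuse_field(
    candidates: dict[str, Optional[str]],
    engine_confidence: dict[str, float],
    regex_value: Optional[str] = None,
) -> Optional[str]:
    """Pick one value for a metadata field from several extractions.

    candidates maps a (hypothetical) OCR engine name to the value the
    language model extracted from that engine's text, or None if empty.
    engine_confidence holds an assumed historical score per engine for
    this field. regex_value is the rule-based fallback result.
    """
    # Drop empty results and normalize surrounding whitespace.
    values = {e: v.strip() for e, v in candidates.items() if v and v.strip()}
    if not values:
        # All results are null: nothing to rank, use the regex result.
        return regex_value

    # Majority vote: keep the value(s) with the highest frequency.
    counts = Counter(values.values())
    top = max(counts.values())
    winners = [v for v, c in counts.items() if c == top]
    if len(winners) == 1:
        return winners[0]

    # Tie: prefer the value from the engine with the best known score.
    scored = [
        (engine_confidence[e], v)
        for e, v in values.items()
        if v in winners and e in engine_confidence
    ]
    if scored:
        best = max(score for score, _ in scored)
        best_values = {v for score, v in scored if score == best}
        if len(best_values) == 1:
            return best_values.pop()

    # Confidence cannot decide: fall back to the regex-based extraction.
    return regex_value


if __name__ == "__main__":
    title = fuse_field(
        {"engine_a": "Deep Nets", "engine_b": "Deep Nets", "engine_c": "Deep Nots"},
        engine_confidence={},
    )
    print(title)
```

Exact string matching is used for the vote here; a real system would probably normalize case and punctuation before counting, which is a design choice the source does not describe.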
Description
Multidimensional PDF document analysis and metadata extraction method
Technical Field
The invention relates to the technical field of data processing, in particular to a multidimensional PDF document analysis and metadata extraction method suitable for intelligent parsing of, and metadata extraction from, multi-language and multi-type PDF documents.
Background
With the rapid development of artificial intelligence, particularly large language model technology, massive high-quality text data has become key to model training and optimization. Academic papers and patent literature, as a highly condensed form of human knowledge, are extremely valuable sources of training corpus and data. However, PDF, the standard format for disseminating academic literature, is designed for reading and printing and generally lacks machine-readable structured semantic information. Automatically and accurately extracting structured information such as metadata (e.g., title, author, abstract, keywords, publication date), text content, and references from unstructured PDF documents is a long-standing and complex technical challenge. Although the prior art offers several approaches to PDF parsing, each has obvious shortcomings when applied alone and struggles to meet current requirements for large scale, high precision, and multiple scenarios.
For example, dedicated parsing tools (such as Grobid) are efficient on English text-type PDFs that resemble their training corpus, but suffer from language incompatibility and incomplete metadata extraction on Chinese documents or PDFs with complex layouts. Large language models have strong comprehension and can cross language and format barriers, but incur high computational cost and latency when used directly to parse long documents, which makes them unsuitable for massive data processing. Traditional OCR technology can recognize multilingual text, but it lacks understanding of the logical structure of a document and relies only on fixed rules to extract information, so its precision is low, its generalization is poor, and it cannot reliably extract rich metadata, references, and other complex structures. Together, these drawbacks constitute a technical bottleneck for the automated, low-cost acquisition of high-quality structured data from diverse PDF documents.
Disclosure of Invention
1. Technical problem to be solved by the invention
In view of the shortcomings of the prior art, the invention provides a multidimensional PDF document analysis and metadata extraction method and designs a layered, collaborative, and adaptive hybrid parsing framework. The framework selects and combines the most suitable technology stack according to intrinsic characteristics of the document such as language and file type. By adopting strategies such as coarse-grained positioning, fine-grained extraction, rules first with the model as fallback, and multi-source cross-validation, it aims to maximize the accuracy and completeness of the extracted information while maintaining processing efficiency and controlling total cost.
2. Technical solution
To achieve the above purpose, the technical solution provided by the invention is a multidimensional PDF document analysis and metadata extraction method comprising the following steps:
Step 1, preprocessing the PDF document and streaming it page by page, wherein each page is independently converted into a high-resolution byte stream or image buffer format and recorded in an ordered list [ page-1, page-2, ], providing standard, unified input for subsequent analysis;
Step 2, judging the language attribute and the type attribute of the document through sampling statistics and a heuristic analysis strategy;
Step 3, receiving and validating a user-defined metadata template uploaded by the user and, after validation succeeds, injecting it into the subsequent extraction stages as the priority basis for metadata extraction;
Step 4, for a document judged in Step 2 to be of English text type, starting a Grobid-based analysis path, parsing the document with a Grobid server, performing coarse metadata extraction on the page areas not covered by the Grobid analysis, and judging the file type by combining rules;
Step 5, for a document judged in Step 2 to be of Chinese text type or picture type, starting an OCR-engine-based analysis path, processing all pages with an OCR engine having layout analysis capability, performing coarse extraction of metadata on an area of an OCR analysis uncovered