CN-121640501-B - File analysis method and system applied to an OFD reader

Abstract

The invention relates to the technical field of file analysis, and discloses a file analysis method and a file analysis system applied to an OFD reader. The method comprises: performing primary clustering on pages based on a content distribution array, and acquiring structural features and content features of the pages for each class of pages after primary clustering; performing secondary clustering on the pages according to the structural features and content features, and synchronously determining an analysis score for each page according to the primary clustering process and the secondary clustering process; selecting pages according to the analysis scores and executing an analysis process; constructing a simplified analysis process for similar pages after secondary clustering based on the analyzed content; and performing gradient analysis on each page based on the simplified analysis process. The invention performs a preliminary recognition of the OFD file and clusters the pages according to the preliminary recognition results; in addition to executing the conventional analysis process, it performs a simplified analysis operation on similar pages in advance, using the analysis parameters of pages already processed, thereby greatly improving analysis efficiency.

Inventors

ZHANG CHUN

Assignees

北京联合永道软件股份有限公司

Dates

Publication Date: 20260512
Application Date: 20260204

Claims (5)

1. A file parsing method applied to OFD readers, the method comprising:
receiving an OFD file, splitting the OFD file by page, and performing object analysis on the content of each page to construct a content distribution array, wherein the content distribution array represents the data positions and data amounts of different data structures;
comparing the content distribution arrays of different pages, and calculating an array distance;
determining a page distance according to the array distance and the page sequence difference of the pages;
performing primary clustering on the pages based on the page distance, and acquiring structural features and content features of the pages for each class of pages after primary clustering, wherein the structural features comprise page text density, effective text ratio, character confidence and structural complexity, and the content features comprise page keyword concentration and a semantic topic consistency score;
for each class of pages after primary clustering, comparing the structural features and content features of any two pages, and calculating an information distance;
performing secondary clustering within each class of pages after primary clustering based on the information distance;
reading the total number of pages in each class of pages after secondary clustering, and determining a reference score according to the total number of pages;
determining a floating score for each page according to the structural features and the content features;
accumulating the reference score and the floating score to obtain an analysis score for each page, wherein the analysis score represents the analysis priority of the page;
selecting a page according to the analysis score, executing an analysis process, and recording analysis parameters;
each time an analysis process is executed, acquiring the executed quantity for each class of pages after secondary clustering;
recursively updating the reference score according to the executed quantity, wherein the analysis score is updated whenever the reference score is updated;
for each class of pages after secondary clustering, randomly querying the analysis parameters corresponding to an analyzed page, and constructing a simplified analysis process; and
taking the simplified analysis process as a process parallel to the original analysis process, and pre-analyzing the similar pages after secondary clustering.
2. The file parsing method applied to OFD readers according to claim 1, wherein the step of receiving an OFD file, splitting the OFD file by page, and performing object analysis on the content of each page to construct a content distribution array comprises: receiving an OFD file, traversing the OFD file, and locating Page nodes in the OFD file; splitting the OFD file based on the Page nodes to obtain pages carrying a page sequence; performing object recognition on each page, and determining an object type and a position, wherein the object type comprises image and text; and mapping the position to a row and column position of a matrix, and counting object types based on the row and column position to obtain the content distribution array, wherein the content distribution array is a two-dimensional matrix.
3. The file parsing method applied to OFD readers according to claim 1, wherein the structural features are obtained as follows:
the page text density is the ratio of the area occupied by text objects in the page to the total display area of the page;
the effective text ratio is the ratio of the number of characters of continuously spliced text to the total number of characters of the text, wherein the number of characters of continuously spliced text is the number of characters whose character spacing and line spacing are smaller than preset thresholds;
the character confidence is inversely related to the number of abnormal characters;
the structural complexity is directly proportional to the number of special formats, wherein the special formats comprise tables, notes, footnotes and rotated text;
the content features are obtained as follows:
keywords are extracted based on TF-IDF values, the total number of extracted keywords is calculated, and the ratio of the total number of keywords to the total number of words in the page is used as the page keyword concentration;
the semantic topic consistency score is determined by locating the text blocks in a page, converting the text blocks into semantic vectors, comparing the semantic vectors of the text blocks in pairs to calculate semantic similarities, and taking the average of all the semantic similarities as the semantic topic consistency score;
and the floating score is determined as follows:
the structural features and content features are combined using preset weights to obtain the floating score, wherein the floating score is directly proportional to the page text density, effective text ratio, character confidence, page keyword concentration and semantic topic consistency score, and inversely proportional to the structural complexity.
4. A file parsing system for OFD readers, the system comprising:
a distribution condition determining module, configured to receive an OFD file, split the OFD file by page, and perform object analysis on the content of each page to construct a content distribution array, wherein the content distribution array represents the data positions and data amounts of different data structures;
a page feature extraction module, configured to perform primary clustering on the pages based on the content distribution array, and acquire structural features and content features of the pages for each class of pages after primary clustering;
an analysis score determining module, configured to perform secondary clustering on the pages according to the structural features and content features, and synchronously determine an analysis score for each page according to the primary clustering process and the secondary clustering process, wherein the analysis score represents the analysis priority of the page; and
a gradient processing module, configured to select pages according to the analysis scores, execute an analysis process, construct a simplified analysis process for similar pages after secondary clustering based on the analyzed content, and perform gradient analysis on each page based on the simplified analysis process;
wherein the page feature extraction module comprises:
an array comparison unit, configured to compare the content distribution arrays of different pages and calculate an array distance;
a page distance determining unit, configured to determine a page distance according to the array distance and the page sequence difference of the pages; and
an extraction execution unit, configured to perform primary clustering on the pages based on the page distance, and acquire structural features and content features of the pages for each class of pages after primary clustering, wherein the structural features comprise page text density, effective text ratio, character confidence and structural complexity, and the content features comprise page keyword concentration and a semantic topic consistency score;
wherein the analysis score determining module comprises:
an information distance calculation unit, configured to compare, for each class of pages after primary clustering, the structural features and content features of any two pages, and calculate an information distance;
a secondary clustering unit, configured to perform secondary clustering within each class of pages after primary clustering based on the information distance;
a reference score determining unit, configured to read the total number of pages in each class of pages after secondary clustering and determine a reference score according to the total number of pages;
a floating score determining unit, configured to determine a floating score for each page according to the structural features and the content features; and
a score accumulation output unit, configured to accumulate the reference score and the floating score to obtain the analysis score of each page;
and wherein selecting pages according to the analysis scores, executing an analysis process, constructing a simplified analysis process for similar pages after secondary clustering based on the analyzed content, and performing gradient analysis on each page based on the simplified analysis process comprises:
selecting a page according to the analysis score, executing an analysis process, and recording analysis parameters;
each time an analysis process is executed, acquiring the executed quantity for each class of pages after secondary clustering;
recursively updating the reference score according to the executed quantity, wherein the analysis score is updated whenever the reference score is updated;
for each class of pages after secondary clustering, randomly querying the analysis parameters corresponding to an analyzed page, and constructing a simplified analysis process; and
taking the simplified analysis process as a process parallel to the original analysis process, and pre-analyzing the similar pages after secondary clustering.
5. The file parsing system applied to OFD readers as claimed in claim 4, wherein the distribution condition determining module comprises: a node positioning unit, configured to receive an OFD file, traverse the OFD file, and locate Page nodes in the OFD file; a file splitting unit, configured to split the OFD file based on the Page nodes to obtain pages carrying a page sequence; an object recognition unit, configured to perform object recognition on each page and determine an object type and a position, wherein the object type comprises image and text; and an array generating unit, configured to map the position to a row and column position of a matrix, and count object types based on the row and column position to obtain the content distribution array, wherein the content distribution array is a two-dimensional matrix.
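
The scoring and selection loop recited in claims 1 and 3 can be illustrated with a minimal Python sketch. The claims do not fix any formula, so everything concrete here is an assumption made for illustration only: the `Page` record, the `WEIGHTS` table, the division by `1 + structure_complexity`, and the halving rule in `reference_score` are all hypothetical choices, not the patented implementation.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical weights for the features the floating score grows with.
WEIGHTS: Dict[str, float] = {
    "text_density": 0.2,
    "effective_text_ratio": 0.2,
    "character_confidence": 0.2,
    "keyword_concentration": 0.2,
    "topic_consistency": 0.2,
}


@dataclass
class Page:
    index: int                  # position of the page in the file
    cluster: int                # id of its class after secondary clustering
    features: Dict[str, float]  # feature values, assumed normalized to [0, 1]


def floating_score(features: Dict[str, float]) -> float:
    """Weighted sum of the positive features, damped by structural complexity."""
    positive = sum(w * features[name] for name, w in WEIGHTS.items())
    # Assumed form of "inversely proportional to structural complexity".
    return positive / (1.0 + features["structure_complexity"])


def reference_score(cluster_size: int, executed: int) -> float:
    """Assumed rule: larger classes start higher; each analyzed page halves the score."""
    return cluster_size / (2 ** executed)


def parse_order(pages: List[Page]) -> List[int]:
    """Return page indices in the order they would be selected for analysis."""
    sizes: Dict[int, int] = {}
    for page in pages:
        sizes[page.cluster] = sizes.get(page.cluster, 0) + 1
    executed = {cluster: 0 for cluster in sizes}
    remaining = list(pages)
    order: List[int] = []
    while remaining:
        # Analysis score = reference score + floating score, recomputed every round.
        best = max(
            remaining,
            key=lambda p: reference_score(sizes[p.cluster], executed[p.cluster])
            + floating_score(p.features),
        )
        remaining.remove(best)
        executed[best.cluster] += 1  # this changes the class's reference score
        order.append(best.index)
    return order
```

Under these assumed rules, a class with many similar pages is sampled first, and its reference score then decays so that smaller classes also get an analyzed page early, which is what allows a simplified analysis process to be built for every class.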

Description

File analysis method and system applied to an OFD reader

Technical Field

The invention relates to the technical field of file analysis, in particular to a file analysis method and a file analysis system applied to OFD readers.

Background

OFD (Open Fixed-layout Document) is an established, public, general-purpose open document format standard. It is mainly used for electronic documents, archive storage and exchange, and provides a document format that does not depend on specific software or hardware, ensuring that documents are displayed consistently across different devices and platforms. OFD files not only contain structured data but also support rich graphic elements, and can precisely define the layout and content of a document, including text, pictures, vector graphics and comments. The format is highly versatile, and OFD files are involved in most interaction scenarios. However, OFD files also have drawbacks: an OFD file must be analyzed before it can be displayed on different devices, and when the file is large the analysis is slow, which degrades the user experience. Users who are relatively unfamiliar with such files may even assume that the file has been deleted when the analysis is slow. How to improve the analysis speed of OFD files is therefore the technical problem that the present invention aims to solve.

Disclosure of Invention

The invention aims to provide a file analysis method and a file analysis system applied to OFD readers, so as to solve the problems described in the background above.
In order to achieve the above purpose, the present invention provides the following technical solution: a file analysis method and system applied to OFD readers, the method comprising: receiving an OFD file, splitting the OFD file by page, and performing object analysis on the content of each page to construct a content distribution array, wherein the content distribution array represents the data positions and data amounts of different data structures; performing primary clustering on the pages based on the content distribution array, and acquiring structural features and content features of the pages for each class of pages after primary clustering; performing secondary clustering on the pages according to the structural features and the content features, and synchronously determining an analysis score for each page according to the primary clustering process and the secondary clustering process, wherein the analysis score represents the analysis priority of the page; and selecting pages according to the analysis scores, executing an analysis process, constructing a simplified analysis process for similar pages after secondary clustering based on the analyzed content, and performing gradient analysis on each page based on the simplified analysis process.
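
The two clustering stages in the solution above can be sketched in Python as follows. The source does not name a clustering algorithm, so the greedy threshold rule used here is an assumption, and the names `threshold_cluster`, `two_stage_cluster`, `page_dist`, `info_dist` and both thresholds are hypothetical and introduced only for illustration.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def threshold_cluster(
    items: Sequence[T],
    dist: Callable[[T, T], float],
    threshold: float,
) -> List[List[T]]:
    """Greedy clustering (an assumed stand-in for the unspecified method).

    Each item joins the first cluster whose representative (its first
    member) lies within `threshold`; otherwise it starts a new cluster.
    """
    clusters: List[List[T]] = []
    for item in items:
        for cluster in clusters:
            if dist(cluster[0], item) <= threshold:
                cluster.append(item)
                break
        else:
            clusters.append([item])
    return clusters


def two_stage_cluster(
    pages: Sequence[T],
    page_dist: Callable[[T, T], float],
    info_dist: Callable[[T, T], float],
    primary_threshold: float,
    secondary_threshold: float,
) -> List[List[List[T]]]:
    """Primary clustering on page distance, then secondary clustering
    on information distance inside each primary class."""
    primary = threshold_cluster(pages, page_dist, primary_threshold)
    return [
        threshold_cluster(group, info_dist, secondary_threshold)
        for group in primary
    ]
```

The point of the nesting is that the information distance is only ever computed between pages that already share a layout class, which keeps the more expensive feature comparison local to each primary class.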
As a further aspect of the invention, the step of receiving an OFD file, splitting the OFD file by page, and performing object analysis on the content of each page to construct a content distribution array comprises: receiving an OFD file, traversing the OFD file, and locating Page nodes in the OFD file; splitting the OFD file based on the Page nodes to obtain pages carrying a page sequence; performing object recognition on each page, and determining an object type and a position, wherein the object type comprises image and text; and mapping the position to a row and column position of a matrix, and counting object types based on the row and column position to obtain the content distribution array, wherein the content distribution array is a two-dimensional matrix.
As a further aspect of the invention, the step of performing primary clustering on the pages based on the content distribution array and acquiring structural features and content features of the pages for each class of pages after primary clustering comprises: comparing the content distribution arrays of different pages, and calculating an array distance; determining a page distance according to the array distance and the page sequence difference of the pages; and performing primary clustering on the pages based on the page distance, and acquiring structural features and content features of the pages for each class of pages after primary clustering, wherein the structural features comprise page text density, effective text ratio, character confidence and structural complexity, and the content features comprise page keyword concentration and a semantic topic consistency score.
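
The construction of the content distribution array and the page distance described above can be illustrated with a short Python sketch. The source states only that positions are mapped to matrix rows and columns and that the page distance depends on the array distance and the page sequence difference. The grid size, the `TYPE_WEIGHT` table, the sum-of-absolute-differences array distance, and the additive `order_weight` term are therefore assumptions, and all names are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical object record: (kind, x, y), with x and y given as fractions
# of the page width and height in the range [0, 1].
PageObject = Tuple[str, float, float]

# Assumed per-type weights so that one matrix reflects both object types.
TYPE_WEIGHT = {"text": 1, "image": 2}


def content_distribution_array(
    objects: Sequence[PageObject], rows: int = 4, cols: int = 4
) -> List[List[int]]:
    """Map each object to a grid cell and accumulate its type weight there."""
    grid = [[0] * cols for _ in range(rows)]
    for kind, x, y in objects:
        row = min(int(y * rows), rows - 1)  # clamp y == 1.0 into the last row
        col = min(int(x * cols), cols - 1)  # clamp x == 1.0 into the last column
        grid[row][col] += TYPE_WEIGHT[kind]
    return grid


def array_distance(a: List[List[int]], b: List[List[int]]) -> int:
    """Assumed metric: sum of absolute cell-wise differences."""
    return sum(
        abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)
    )


def page_distance(
    a: List[List[int]],
    b: List[List[int]],
    index_a: int,
    index_b: int,
    order_weight: float = 0.5,
) -> float:
    """Assumed combination: array distance plus a weighted page-order gap."""
    return array_distance(a, b) + order_weight * abs(index_a - index_b)
```

Including the page-order gap means that two pages with identical layouts but far apart in the file are treated as slightly less similar than adjacent ones, which matches the source's statement that the page distance depends on both quantities.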
As a further aspect of the invention, the step of performing secondary clustering on the pages according to the structural features and the content features, and synchronously determining the analysis score of each page according to the primary clustering process and the secondary clustering process, comprises: for each class of pages after primary clustering, comparing the structural features and content features of any two pages, and calculating the information distance; performing secondary clustering within each class of pages after primary clustering based on the informa