CN-121996802-A - Method for quickly retrieving key content of paper file

Abstract

A method for quickly retrieving the key content of a paper document. OCR technology converts the paper document into an electronic document. A pre-trained model deployed offline performs word frequency statistics, word sense analysis, document topic analysis, and table topic analysis on the electronic document to obtain a keyword list and a topic list, from which a "keyword-topic-page number" correspondence is generated. A two-dimensional table is then produced, with the key content (keywords) as columns, the key topics as rows, and the page number recorded at each row-column intersection, so that a reader can quickly locate a piece of key content and the page of its corresponding topic by consulting the table. The invention thus provides a document key content lookup table that is generated uniformly by software and inserted, as one or more pages, between the table of contents and the body of the paper document. Readers learn the document's key content and key topics from the table and quickly locate the content of interest and its corresponding topic by table lookup, which improves document content retrieval efficiency.

Inventors

Mou Jiazheng
Zhu Dapeng
Liu Caiyun
Feng Jingjing
Zhao Xixiang
Wang Cheng
Shen Dacheng
Lin Rongfei

Assignees

中船数字信息技术有限公司

Dates

Publication Date: 20260508
Application Date: 20260121

Claims (8)

1. A method for quickly retrieving the key content of a paper document, characterized in that: OCR technology is used to convert the paper document into an electronic document; word frequency statistics, word sense analysis, document topic analysis, and table topic analysis of the electronic document are performed by a pre-trained model deployed offline; a keyword list and a topic list are obtained; a keyword-topic-page number correspondence is generated; and a two-dimensional table is finally generated in which the key content forms the columns, the key topics form the rows, and the page number is recorded at each row-column intersection, so that a reader can quickly locate the key content of the document and the page number of its corresponding topic by consulting the two-dimensional table.
2. The method for quickly retrieving the key content of a paper document according to claim 1, characterized by comprising the following specific steps: (1) recognizing the paper document as an electronic document using OCR technology; (2) performing word segmentation and word frequency statistics on the electronic document using a deep neural network; (3) performing word sense analysis to generate a keyword list; (4) selecting the most suitable pre-trained model for the subsequent topic analysis; (5) performing topic analysis of text paragraphs and table content using large model technology, generating a text topic list, and recording the page range of each topic; and (6) generating a document key content lookup table and outputting it as a two-dimensional table, with the keyword list as columns, the topic list as rows, and each row-column intersection marked with the page number of the topic in which the corresponding keyword appears.
3. The method of claim 1, wherein in step (2), the deep neural network is used to perform word segmentation and word frequency statistics on the electronic document: the text content is split into meaningful words or phrases, the occurrence frequency of each word is counted, and the keyword candidate list is obtained by sorting the words by frequency from highest to lowest.
4. The method of claim 1, wherein in step (4), the pre-trained model is a model that has been trained in advance on a large amount of data and can be deployed offline for electronic document processing, and includes a model for word segmentation and word frequency statistics and a model for topic analysis.
5. The method of claim 1, wherein in step (5), topic analysis of text paragraphs and tables is performed using large model technology, the topic of each text paragraph or table is determined, a text topic list is generated, and the page range of each topic is recorded.
6. The method of claim 5, wherein the topic determined for each text paragraph or table in step (5) comprises a content profile, a concept definition, a quotation, a term requirement, and a technical requirement.
7. The method of claim 1, wherein in step (6), large language model technology is used to determine, for each word in the keyword list, the topic in the topic list under which it appears, the page position of the keyword is recorded, and a document key content lookup table composed of "keyword-topic-page number" entries is finally generated.
8. The method of claim 7, wherein the final table in step (6) is a two-dimensional table in which the keyword list forms the columns, the topic list forms the rows, and each row-column intersection is marked with the page number of the topic in which the corresponding keyword appears.
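
As an illustration of the two-dimensional table described in claims 7 and 8, the following is a minimal Python sketch, not the patented implementation. It assumes the "keyword-topic-page number" correspondence is already available as a list of triples; the function name `build_lookup_table`, the triple format, and the sample data are all hypothetical.

```python
from collections import defaultdict


def build_lookup_table(relations):
    """Turn (keyword, topic, page) triples into a 2D table.

    Columns are keywords, rows are topics, and each cell lists the
    page numbers where that keyword appears under that topic.
    """
    keywords, topics = [], []
    cells = defaultdict(set)
    for keyword, topic, page in relations:
        # Preserve first-seen order for both columns and rows.
        if keyword not in keywords:
            keywords.append(keyword)
        if topic not in topics:
            topics.append(topic)
        cells[(topic, keyword)].add(page)

    rows = [["topic"] + keywords]  # header row
    for topic in topics:
        row = [topic]
        for keyword in keywords:
            pages = sorted(cells.get((topic, keyword), ()))
            row.append(", ".join(str(p) for p in pages))
        rows.append(row)
    return rows


# Hypothetical sample correspondence (assumed, not from the patent).
relations = [
    ("interface", "Concept definition", 3),
    ("interface", "Technical requirement", 12),
    ("latency", "Technical requirement", 12),
    ("latency", "Technical requirement", 14),
]
table = build_lookup_table(relations)
for row in table:
    print(" | ".join(cell or "-" for cell in row))
```

A reader of the printed table would scan the column for a keyword and read the page numbers off the row of the topic of interest. Empty intersections mean the keyword does not appear under that topic.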

Description

Method for quickly retrieving key content of paper file

Technical Field

The invention relates to the technical field of document information retrieval, and in particular to a method for quickly retrieving the key content of a paper document.

Background

Currently, when writing a document, an author usually presents its outline and the main content of its paragraphs to the reader in a table of contents, indicates the position of charts in a list of charts, and conveys the subject of each chart through its name. However, when content is located through the table of contents and the list of charts, the entries give only the main content of the paragraphs and are therefore highly general. Readers other than the author find it hard to learn from them the main content of the document, and in particular where specific key information appears. To find content, a reader must infer from experience and from the entries what each paragraph contains and where the key information of interest is located, then manually turn to the relevant paragraph to consult it. Because outline structures differ from document to document, a reader unfamiliar with a particular document often spends some time on repeated rounds of guessing, page turning, and consultation, gradually becoming familiar with the structure along the way, before finding the one or more places where the information of interest appears. Content retrieval and positioning is therefore inefficient.

Disclosure of Invention

In view of the shortcomings of the prior art, the invention provides a method for quickly retrieving the key content of a paper document. It designs a document key content lookup table that is generated uniformly by software and inserts this table, consisting of one or more pages, between the table of contents and the body of the paper document. Readers learn the key content and key topics of the document from the table and quickly locate the content of interest and its corresponding topic by table lookup, which improves document content retrieval efficiency.

The technical problem to be solved by the invention is addressed by the following technical solution. The invention is a method for quickly retrieving the key content of a paper document. It uses OCR technology to convert the paper document into an electronic document; performs word frequency statistics, word sense analysis, document topic analysis, and table topic analysis on the electronic document with a pre-trained model deployed offline; obtains a keyword list and a topic list; generates a keyword-topic-page number correspondence; and finally generates a two-dimensional table in which the key content forms the columns, the key topics form the rows, and the page number is recorded at each row-column intersection, so that a reader can quickly locate the key content of the document and the page number of its corresponding topic by consulting the two-dimensional table.

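
The word frequency statistics mentioned above can be illustrated with a minimal Python sketch. This is an assumption-laden stand-in, not the method of the invention: the source performs word segmentation with a deep neural network, whereas the sketch uses a simple regular-expression tokenizer that only suits space-delimited English text. The function name `keyword_candidates`, the stop-word set, and the per-page input format (page number mapped to OCR text) are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word set; a real system would use a fuller list.
STOPWORDS = {"the", "a", "of", "and", "is", "to", "in"}


def keyword_candidates(pages, top_n=5):
    """Rank words by occurrence count across all OCR'd pages.

    pages: dict mapping page number -> recognized text of that page.
    Returns up to top_n (word, count) pairs, most frequent first.
    """
    counts = Counter()
    for text in pages.values():
        # Regex tokenizer as a stand-in for neural word segmentation.
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(top_n)


# Hypothetical OCR output for two pages.
pages = {
    1: "The interface defines the latency budget.",
    2: "Latency of the interface is measured; interface tests follow.",
}
candidates = keyword_candidates(pages, top_n=2)
print(candidates)
```

The ranked output is only a candidate list. In the method described here, word sense analysis would still be applied to it before the final keyword list is produced.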
The technical problem to be solved by the invention can be further addressed by the following technical solution, in which the method for quickly retrieving the key content of a paper document comprises the following specific steps:

(1) Recognizing the paper document as an electronic document using OCR technology;
(2) Performing word segmentation and word frequency statistics on the electronic document using a deep neural network;
(3) Performing word sense analysis to generate a keyword list;
(4) Selecting the most suitable pre-trained model for the subsequent topic analysis;
(5) Performing topic analysis of text paragraphs and table content using large model technology, generating a text topic list, and recording the page range of each topic;
(6) Generating a document key content lookup table and outputting it as a two-dimensional table, with the keyword list as columns, the topic list as rows, and each row-column intersection marked with the page number of the topic in which the corresponding keyword appears.

The technical problem to be solved by the invention can be further addressed by the following technical solution: in step (2) of the method for quickly retrieving the key content of a paper document, the deep neural network is used to perform word segmentation and word frequency statistics on the electronic document, the text content is split into meaningful words or phrases, the occurrence frequency of each word is counted, and the keyword candidate list is obtained by sorting according to the sequence from top to b