US-12626058-B2 - Systems and methods for structure and header extraction
Abstract
The present disclosure is directed towards systems and methods for extracting structure and headers from a body of text. This computational extraction is based on the visual and logical similarities between portions of text. Boilerplate is removed from chunks of text making up potential headers and the cleaned result is compared against other potential headers and the remainder of the body of text.
Inventors
- Richard Anthony Pito
Assignees
- THOMSON REUTERS ENTERPRISE CENTRE GMBH
Dates
- Publication Date: 20260512
- Application Date: 20231030
Claims (18)
- 1. A system comprising memory and one or more processors communicatively coupled to the memory, the one or more processors configured to: classify a text portion of a document as a potential header portion of the document based at least in part on a set of features of the text portion; identify a boilerplate text sequence in the potential header based at least in part on a comparison between text sequences of the text portion; remove the boilerplate text sequence from the text portion to form a header remainder portion of the potential header portion; determine a similarity score among the header remainder portion and one or more other header remainder portions of the document based at least in part on a comparison between an average number of characters in a group of potential header portions with similar features to a defined threshold related to a number of character edits required to transform each potential header portion in the group into a subsequent potential header portion in the group; confirm whether the potential header portion is a header portion of the document based at least in part on the similarity score; and determine a tree-structured hierarchy for the document in response to a determination that the potential header portion corresponds to a header portion of the document, including setting a depth and parent of the potential header portion.
- 2. The system of claim 1, wherein the one or more processors are further configured to: execute a document analysis process for the document in response to a determination that the potential header portion corresponds to a header portion of the document.
- 3. The system of claim 1, wherein the set of features include a defined text marking.
- 4. The system of claim 1, wherein the set of features include typography characteristics.
- 5. The system of claim 1, wherein the set of features include at least two or more of font family, font size, italic, bold, underline, space above, space left, space left first line, and justification.
- 6. The system of claim 1, wherein the set of features include orthography characteristics.
- 7. The system of claim 1, wherein the set of features include page layout.
- 8. The system of claim 1, wherein the set of features include at least two or more of typography characteristics, orthography characteristics and page layout.
- 9. A method, comprising: classifying a text portion of a document as a potential header portion of the document based at least in part on a set of features of the text portion; identifying a boilerplate text sequence in the potential header based at least in part on a comparison between text sequences of the text portion; removing the boilerplate text sequence from the text portion to form a header remainder portion of the potential header portion; determining a similarity score among the header remainder portion and one or more other header remainder portions of the document based at least in part on a comparison between an average number of characters in a group of potential header portions with similar features to a defined threshold related to a number of character edits required to transform each potential header portion in the group into a subsequent potential header portion in the group; and confirming whether the potential header portion is a header portion of the document based at least in part on the similarity score; and determining a tree-structured hierarchy for the document in response to a determination that the potential header portion corresponds to a header portion of the document, including setting a depth and parent of the potential header portion.
- 10. The method of claim 9, further comprising: executing a document analysis process for the document in response to a determination that the potential header portion corresponds to a header portion of the document.
- 11. The method of claim 9, wherein the set of features include a defined text marking.
- 12. The method of claim 9, wherein the set of features include typography characteristics.
- 13. The method of claim 9, wherein the set of features include at least two or more of font family, font size, italic, bold, underline, space above, space left, space left first line, and justification.
- 14. The method of claim 9, wherein the set of features include orthography characteristics.
- 15. The method of claim 9, wherein the set of features include page layout.
- 16. The method of claim 9, wherein the set of features include at least two or more of typography characteristics, orthography characteristics and page layout.
- 17. A computer program product, stored on a non-transitory computer readable medium, comprising instructions that when executed by one or more processors cause the one or more processors to: classify a text portion of a document as a potential header portion of the document based at least in part on a set of features of the text portion; identify a boilerplate text sequence in the potential header based at least in part on a comparison between text sequences of the text portion; remove the boilerplate text sequence from the text portion to form a header remainder portion of the potential header portion; determine a similarity score among the header remainder portion and one or more other header remainder portions of the document based at least in part on a comparison between an average number of characters in a group of potential header portions with similar features to a defined threshold related to a number of character edits required to transform each potential header portion in the group into a subsequent potential header portion in the group; and confirm whether the potential header portion is a header portion of the document based at least in part on the similarity score; and determine a tree-structured hierarchy for the document in response to a determination that the potential header portion corresponds to a header portion of the document, including setting a depth and parent of the potential header portion.
- 18. The computer program product of claim 17, wherein the one or more processors are further configured to: execute a document analysis process for the document in response to a determination that the potential header portion corresponds to a header portion of the document.
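By way of non-limiting illustration only, the edit-based comparison recited in the independent claims might be sketched as follows. This is one possible reading, not the claimed implementation: the function names, the use of Levenshtein distance as the "number of character edits", the ratio form of the comparison, and the `max_ratio` value are all assumptions introduced for this sketch.

```python
from statistics import mean


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def consecutive_edit_ratio(group: list[str]) -> float:
    """Mean edit distance between each item and the next one in the group,
    divided by the mean number of characters per item."""
    if len(group) < 2:
        return 1.0  # assumption: a lone candidate is treated as fully distinct
    edits = [edit_distance(a, b) for a, b in zip(group, group[1:])]
    avg_len = mean(len(s) for s in group)
    return mean(edits) / avg_len if avg_len else 0.0


def looks_repetitive(group: list[str], max_ratio: float = 0.2) -> bool:
    """Hypothetical test: True when few edits, relative to average length,
    turn each candidate into the next (e.g. running page footers)."""
    return consecutive_edit_ratio(group) <= max_ratio


footers = ["Page 1 of 9", "Page 2 of 9", "Page 3 of 9"]
headers = ["Definitions", "Term and Termination", "Governing Law"]
print(looks_repetitive(footers))  # True
print(looks_repetitive(headers))  # False
```

In this sketch a low ratio marks a group whose members are near-copies of one another. How such a signal would feed into the similarity score and the confirmation step is not specified by the sketch.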
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is a continuation of U.S. patent application Ser. No. 17/156,544, filed Jan. 23, 2021, which claims the benefit of and priority to U.S. Provisional Application Nos. 62/965,516, filed Jan. 24, 2020; 62/965,520, filed Jan. 24, 2020; 62/965,523, filed Jan. 24, 2020; and 62/975,514, filed Feb. 12, 2020, each of which are hereby incorporated by reference in their entireties.
This application for letters patent disclosure document describes inventive aspects that include various novel innovations (hereinafter “disclosure”) and contains material that is subject to copyright, mask work, and/or other intellectual property protection. The respective owners of such intellectual property have no objection to the facsimile reproduction of the disclosure by anyone as it appears in published Patent Office file/records, but otherwise reserve all rights.
BACKGROUND
The present innovations generally address tools for extracting structure and header information from documents.
Large professional documents such as those found in the legal domain are normally hierarchically structured into sections which contain sub-sections which further contain sub-sub-sections and so on. In addition, each of these sections may contain lists, with sub-lists, etc. This structure can convey important information when analyzing a document for many downstream tasks such as information retrieval, information extraction, document presentation and/or document navigation. Using a computer to reliably extract a document's structure for real world documents is challenging not only because many documents don't follow a consistent template but also because of errors introduced by document conversion and user error. Furthermore, the structure of a document can be obscured by boilerplate text such as page headers and footers that are captured during an optical character recognition (“OCR”) process and must be reliably identified and removed.
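As a simple illustration of the page header and footer problem described above, the following minimal sketch flags lines that recur on most pages once digits are masked. It is a generic frequency heuristic swapped in for illustration, not the boilerplate identification described in this disclosure. The function names, the digit-masking rule, and the `min_fraction` threshold are assumptions.

```python
import re
from collections import Counter


def normalize(line: str) -> str:
    """Lower-case, trim, and mask digit runs so 'Page 2 of 3' matches 'Page 3 of 3'."""
    return re.sub(r"\d+", "#", line.strip().lower())


def find_boilerplate(pages: list[list[str]], min_fraction: float = 0.6) -> set[str]:
    """Return normalized lines that appear on at least `min_fraction` of pages."""
    counts: Counter[str] = Counter()
    for lines in pages:
        # Count each distinct normalized line once per page; ignore blank lines.
        counts.update({normalize(line) for line in lines} - {""})
    needed = min_fraction * len(pages)
    return {key for key, n in counts.items() if n >= needed}


def strip_boilerplate(pages: list[list[str]], min_fraction: float = 0.6) -> list[list[str]]:
    """Drop every line whose normalized form was flagged as boilerplate."""
    boiler = find_boilerplate(pages, min_fraction)
    return [[line for line in lines if normalize(line) not in boiler]
            for lines in pages]


pages = [
    ["ACME Corp Confidential", "1. Definitions", "Page 1 of 3"],
    ["ACME Corp Confidential", "2. Term", "Page 2 of 3"],
    ["ACME Corp Confidential", "3. Governing Law", "Page 3 of 3"],
]
print(strip_boilerplate(pages))
# [['1. Definitions'], ['2. Term'], ['3. Governing Law']]
```

Masking digits means a section heading that differs from its neighbours only by number would also be flagged if it recurred on most pages, which is one reason a frequency count alone is unlikely to be sufficient for real documents.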
The existing literature about document structure analysis can be roughly divided into the identification of physical, logical and/or semantic structure. See Dengel and Shafait (Andreas Dengel and Faisal Shafait. [n.d.]. Analysis of the Logical Layout of Documents. In Handbook of Document Image Processing and Recognition, David Doermann and Karl Tombre (Eds.). Springer London, 177-222.) and Mao et al. (Song Mao, Azriel Rosenfeld, and Tapas Kanungo. [n.d.]. Document structure analysis algorithms: a literature survey, Tapas Kanungo, Elisa H. Barney Smith, Jianying Hu, and Paul B. Kantor (Eds.). 197-207.) for reviews, both of which are incorporated herein in their entireties. Physical structure extraction deals with capturing a digital representation of a paper document and involves image processing/enhancement, grouping the pixels of an image of a document into sections, identifying the type of each section (e.g. text or image) and performing OCR on text sections. Logical structure analysis involves identifying relationships between physical components, e.g. the caption of a figure, the agglomeration of coherent sections of text, the document's reading order and possibly its section hierarchy. Logical structure analysis may be performed on natively digital documents where structure information is not readily available, as in PDF documents. Semantic analysis normally involves identifying section types specific to a certain domain, although this is sometimes grouped under logical structure analysis. These processes are generally applied sequentially, and errors in one process can accumulate in downstream processes. The present inventions may fall into the domain of logical structure analysis: they take as input text blocks in reading order that are annotated with layout and formatting information, and produce a hierarchy of sections and/or list items in the form of a tree.
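The output described above, a hierarchy of sections in the form of a tree, can be illustrated with a minimal sketch. It assumes that each confirmed header already carries a depth (1 for top-level sections), which is a simplification introduced here. The `Node` type, the `build_tree` function, and the stack-based parent assignment are illustrative assumptions rather than the disclosed method.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    """One section header in the hierarchy; the root stands for the document."""
    text: str
    depth: int
    parent: Optional["Node"] = field(default=None, repr=False)
    children: list["Node"] = field(default_factory=list, repr=False)


def build_tree(headers: list[tuple[int, str]]) -> Node:
    """Build a tree from (depth, text) pairs given in reading order.

    Assumption: depths start at 1 and were assigned by an earlier step.
    """
    root = Node(text="<document>", depth=0)
    stack = [root]  # path from the root to the most recent header
    for depth, text in headers:
        if depth < 1:
            raise ValueError(f"header depth must be >= 1, got {depth}")
        # Pop back to the nearest header that is shallower than this one.
        while stack[-1].depth >= depth:
            stack.pop()
        node = Node(text=text, depth=depth, parent=stack[-1])
        stack[-1].children.append(node)
        stack.append(node)
    return root


root = build_tree([
    (1, "Article I"),
    (2, "Section 1.1"),
    (2, "Section 1.2"),
    (1, "Article II"),
])
print([child.text for child in root.children])              # ['Article I', 'Article II']
print([child.text for child in root.children[0].children])  # ['Section 1.1', 'Section 1.2']
```

Keeping only the current root-to-leaf path on a stack lets each header find its parent in a single pass over the reading order, and a header that skips a level simply attaches to the nearest shallower header.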
The present inventions deal, therefore, not only with scanned documents but also with natively electronic documents that do not have structure annotations. Tuarob et al. (S. Tuarob, P. Mitra, and C. L. Giles. [n.d.]. A hybrid approach to discover semantic hierarchical sections in scholarly documents. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR) (2015-08). 1081-1085.) identify and classify sections and create a hierarchy using domain specific rules for scholarly articles. Constantin et al. (Alexandru Constantin, Steve Pettifer, and Andrei Voronkov. [n.d.]. PDF: fully-automated PDF-to-XML conversion of scientific literature. ACM Press, 177.) identify the logical parts of scientific documents using rules based on some font characteristics. While both of these use font characteristics to identify section headings and/or boundaries, neither is completely sufficient. Rahman and Finin (Muhammad Mahbubur Rahman and Tim Finin. [n.d.]. Understanding the Logical and Semantic Structure of Large Documents. ([n.d.]). arXiv: 1709.00770) also work in the domain of scholarly articles, howev