US-12619828-B2 - Reading order detection in a document

Abstract

According to embodiments of the present disclosure, there is provided a solution for reading order detection in a document. In the solution, a computer-implemented method includes: determining a text sequence and layout information presented in a document, the text sequence comprising a plurality of text elements, the layout information indicating a spatial layout of the plurality of text elements in the document; generating a plurality of semantic feature representations corresponding to the plurality of text elements based at least on the text sequence and the layout information; and determining a reading order of the plurality of text elements in the document based on the plurality of semantic feature representations. According to the solution, the introduction of the layout information can better characterize a spatial layout manner of the text elements in a specific document, thereby determining the reading order more effectively and accurately.
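The data flow of the three steps above (determine text and layout, build per-element features, derive an order) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `TextElement` type, the hashed `text_embedding`, the box-based `layout_embedding`, and especially `toy_order_score`, which is a hand-written column-then-row heuristic standing in for the trained neural network models of the disclosure. It shows only how text and layout are combined per element before an order is derived.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextElement:
    text: str
    box: tuple  # (x0, y0, x1, y1), normalized to [0, 1]; origin at top-left


def text_embedding(text, dim=4):
    """Hypothetical stand-in for a learned text embedding: hashed character counts."""
    vec = [0.0] * dim
    for ch in text:
        vec[ord(ch) % dim] += 1.0
    return vec


def layout_embedding(box):
    """Hypothetical stand-in for a learned layout embedding: box corners, width, height."""
    x0, y0, x1, y1 = box
    return [x0, y0, x1, y1, x1 - x0, y1 - y0]


def concatenated_embedding(element, dim=4):
    """Concatenate the text embedding and the layout embedding of one element."""
    return text_embedding(element.text, dim) + layout_embedding(element.box)


def toy_order_score(feature, dim=4):
    """Hand-written placeholder for the trained models: left column first, then top to bottom."""
    x0, y0 = feature[dim], feature[dim + 1]
    column = 0 if x0 < 0.5 else 1
    return (column, y0)


def reading_order(elements, dim=4):
    """Return element indices sorted into the (toy) reading order."""
    features = [concatenated_embedding(e, dim) for e in elements]
    return sorted(range(len(elements)), key=lambda i: toy_order_score(features[i], dim))


# Two-column page, listed in the row-major order a naive extraction pass might produce.
elements = [
    TextElement("Intro", (0.05, 0.10, 0.45, 0.15)),
    TextElement("Results", (0.55, 0.10, 0.95, 0.15)),
    TextElement("Method", (0.05, 0.20, 0.45, 0.25)),
    TextElement("Summary", (0.55, 0.20, 0.95, 0.25)),
]
order = reading_order(elements)
ordered_text = [elements[i].text for i in order]
```

In this toy page the row-major input is rearranged so that the left column is read before the right one. A learned model would replace the fixed heuristic so that layouts not anticipated by a hand-written rule can also be handled.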

Inventors

Lei Cui
Yiheng XU
Yang Xu
Furu Wei
Zilong Wang

Assignees

MICROSOFT TECHNOLOGY LICENSING, LLC

Dates

Publication Date: 20260505
Application Date: 20220523
Priority Date: 20210630

Claims (13)

1. A computer-implemented method comprising: executing a feature extraction neural network model that has been trained using example documents to determine a text sequence and layout information presented in a document, the text sequence comprising a plurality of text elements, the layout information indicating a spatial layout of the plurality of text elements in the document; generating a plurality of semantic feature representations corresponding to the plurality of text elements based at least on the text sequence and the layout information; wherein using the trained feature extraction neural network model to generate the plurality of semantic feature representations comprises: converting the text sequence and the layout information into a first embedding representation and a second embedding representation, respectively; concatenating the first embedding representation and the second embedding representation, to obtain a concatenated embedding representation; and applying the concatenated embedding representation into the trained feature extraction neural network model to generate the plurality of semantic feature representations; and determining a reading order of the plurality of text elements in the document based on the plurality of semantic feature representations provided by the trained feature extraction neural network model.
2. The method of claim 1, wherein generating the plurality of semantic feature representations comprises: for a first text element in the text sequence, determining an attention weight for the first text element with respect to a second text element in the text sequence based on at least one of the following: a relative spatial positioning of the first text element with respect to the second text element in the document, and a relative ranking position of the first text element with respect to the second text element in the text sequence, the attention weight indicating an importance degree of the second text element to the first text element; and determining a semantic feature representation of the first text element by weighting an embedding representation of the second text element with the determined attention weight.
3. The method of claim 1, wherein generating the plurality of semantic feature representations comprises: determining an image-format file corresponding to the document; determining visual information from the image-format file, the visual information indicating visual appearances of the plurality of text elements presented in the document; and generating the plurality of semantic feature representations further based on the visual information.
4. A computer-implemented method comprising: executing a feature extraction neural network model that has been trained using example documents to determine a text sequence, layout information and order labeling information presented in a first sample document, the text sequence comprising a first set of text elements, the layout information indicating a spatial layout of the first set of text elements in the first sample document, the order labeling information indicating a ground-truth reading order of the first set of text elements in the first sample document; generating, using the feature extraction neural network model, respective semantic feature representations of the first set of text elements based at least on the text sequence and the layout information; executing an order determination neural network model to determine a predicted reading order of the first set of text elements in the first sample document based on the semantic feature representations; and training the feature extraction neural network model and order determination neural network model based on a difference between the predicted reading order and the ground-truth reading order.
5. The method of claim 4, wherein the first sample document comprises an editable text document, and determining the order labeling information comprises: determining format information corresponding to the editable text document, the format information at least indicating the ground-truth reading order of the first set of text elements.
6. The method of claim 4, wherein determining the layout information comprises: determining a vector file corresponding to the first sample document; and determining the layout information of the first set of text elements from the vector file.
7. The method of claim 6, wherein a plurality of text elements that occur at different positions in the first sample document and represent a same text are assigned with different indices, and the plurality of text elements are labeled with different colors in the vector file, the color with which each text element is labeled being determined based on the index assigned to the text element; and wherein determining the layout information of the first set of text elements from the vector file comprises: assigning, based on the indices and the colors assigned to the plurality of text elements, layout information determined from the vector file to the plurality of text elements extracted from the first sample document.
8. The method of claim 4, wherein generating the semantic feature representations comprises: determining visual information from a first image-format file corresponding to the first sample document, the visual information representing visual appearances of the first set of text elements presented in the first sample document; and generating, using the feature extraction neural network model, the semantic feature representations further based on the visual information.
9. The method of claim 4, further comprising obtaining the pre-trained feature extraction neural network model by: determining a second image-format file corresponding to a second sample document, the second sample document comprising a second set of text elements; generating, by masking at least one text element of the second set of text elements in the second image-format file, first masking information to indicate that the at least one text element is masked and other text elements of the second set of text elements are not masked; determining, using the feature extraction neural network model, respective semantic feature representations of the second set of text elements; determining second masking information based on the respective semantic feature representations of the second set of text elements, the second masking information indicating whether respective text elements of the second set of text elements are masked; and pre-training the feature extraction model based on a difference between the first masking information and the second masking information.
10. The method of claim 8, wherein the feature extraction neural network model is further configured to generate a first visual feature representation of the first image-format file based on the text sequence, the layout information and the visual information, wherein the method further comprises obtaining the pre-trained feature extraction neural network model by: determining a third sample document, a third image-format file, and match labeling information, the match labeling information indicating whether the third image-format file matches with the third sample document; generating, using the feature extraction neural network model, a second visual feature representation of the third image-format file based on the third sample document and the third image-format file; determining, based on the second visual feature representation, a match result indicating whether the third image-format file matches with the third sample document; and pre-training the feature extraction neural network model based on a difference between the match result and the match labeling information.
11. An electronic device, comprising: a processor, and a memory coupled to the processor and having instructions stored thereon, the instructions, when executed by the processor, causing the device to perform acts comprising: executing a feature extraction neural network model that has been trained using example documents to determine a text sequence and layout information presented in a document, the text sequence comprising a plurality of text elements, the layout information indicating a spatial layout of the plurality of text elements in the document; generating a plurality of semantic feature representations corresponding to the plurality of text elements based at least on the text sequence and the layout information; wherein using the trained feature extraction neural network model to generate the plurality of semantic feature representations comprises: for a first text element in the text sequence, determining an attention weight for the first text element with respect to a second text element in the text sequence based on at least one of the following: a relative spatial positioning of the first text element with respect to the second text element in the document, and a relative ranking position of the first text element with respect to the second text element in the text sequence, the attention weight indicating to the trained feature extraction neural network model an importance degree of the second text element to the first text element; and determining a semantic feature representation of the first text element by weighting an embedding representation of the second text element with the determined attention weight; and determining a reading order of the plurality of text elements in the document based on the plurality of semantic feature representations provided by the trained feature extraction neural network model.
12. An electronic device, comprising: a processor; and a memory coupled to the processor and having instructions stored thereon, the instructions, when executed by the processor, causing the device to perform acts comprising: executing a feature extraction neural network model that has been trained using example documents to determine a text sequence, layout information and order labeling information presented in a first sample document, the text sequence comprising a first set of text elements, the layout information indicating a spatial layout of the first set of text elements in the first sample document, the order labeling information indicating a ground-truth reading order of the first set of text elements in the first sample document; generating, using the feature extraction neural network model, respective semantic feature representations of the first set of text elements based at least on the text sequence and the layout information; executing an order determination neural network model to determine a predicted reading order of the first set of text elements in the first sample document based on the semantic feature representations; and training the feature extraction neural network model and order determination neural network model based on a difference between the predicted reading order and the ground-truth reading order.
13. The device of claim 12, wherein the first sample document comprises an editable text document, and determining the order labeling information comprises: determining format information corresponding to the editable text document, the format information at least indicating the ground-truth reading order of the first set of text elements.
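The attention weighting recited in claims 2 and 11 can be pictured with a short sketch. It assumes one common realization, namely scaled dot-product attention whose scores receive additive bias terms for relative ranking position and relative spatial position; the claims do not prescribe this form. The function names, the bias callables, and the use of raw embeddings as queries, keys and values (no learned projections) are all simplifications introduced here for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(embeddings, positions, rank_bias, spatial_bias):
    """Biased self-attention over text elements (illustrative, not the patented model).

    embeddings: one vector per text element, listed in text-sequence order
    positions:  one (x, y) coordinate per text element in the document
    rank_bias:  hypothetical callable mapping the index offset j - i to a score bias
    spatial_bias: hypothetical callable mapping (dx, dy) to a score bias
    """
    dim = len(embeddings[0])
    features, weights = [], []
    for i, query in enumerate(embeddings):
        scores = []
        for j, key in enumerate(embeddings):
            dot = sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            dx = positions[j][0] - positions[i][0]
            dy = positions[j][1] - positions[i][1]
            scores.append(dot + rank_bias(j - i) + spatial_bias(dx, dy))
        row = softmax(scores)  # attention weights of element i over all elements j
        weights.append(row)
        # Feature of element i: embeddings of the other elements weighted by attention.
        features.append(
            [sum(row[j] * embeddings[j][c] for j in range(len(embeddings))) for c in range(dim)]
        )
    return features, weights


# Three elements with identical content, so only the position biases differ.
embeddings = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
positions = [(0.0, 0.0), (0.1, 0.0), (0.9, 0.0)]
features, weights = attend(
    embeddings,
    positions,
    rank_bias=lambda offset: 0.0,                     # ranking bias switched off here
    spatial_bias=lambda dx, dy: -(abs(dx) + abs(dy)),  # nearer elements score higher
)
```

With the toy distance penalty, the first element attends more strongly to its near neighbour than to the distant one, even though all three embeddings are identical. In a trained model the bias terms would be learned rather than fixed.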

Description

CLAIM OF PRIORITY

This application is a U.S. National Stage Filing under 35 U.S.C. 371 of International Patent Application Serial No. PCT/US2022/030466, filed May 23, 2022, and published as WO 2023/278072 A1 on Jan. 5, 2023, which claims the benefit of priority to Chinese Patent Application No. 202110739466.3, filed Jun. 30, 2021, which applications and publication are incorporated herein by reference in their entirety.

BACKGROUND

Document understanding is a popular research field that aims to automatically read, understand and analyze documents. A document may be an electronically generated document or a scanned document, such as an image, an electronic file, a handwritten scanned document and so on. Understanding and analyzing documents, especially business documents, may greatly improve people's daily lives as well as business efficiency and productivity.

Rich-text documents appear in many applications. As compared with plain-text documents, the various types of information in rich-text documents are arranged in a more flexible format and layout, and therefore have a richer visual presentation. Examples of rich-text documents include various forms, invoices, receipts, financial statements, advertising documents, etc. Although different types of documents contain different forms of information, part of the information is usually presented in a natural language. Therefore, document understanding involves natural language processing (NLP), especially learning a semantic feature representation of the textual information presented by the document.

In document understanding applications, determining the reading order of a text sequence is an important task. The reading order describes the text sequence as it is naturally understood by human beings. However, it is challenging to determine the reading order in some documents, especially in rich-text documents.

SUMMARY

According to embodiments of the present disclosure, there is provided a solution for reading order detection in a document. In the solution, a computer-implemented method includes: determining a text sequence and layout information presented in a document, the text sequence comprising a plurality of text elements, the layout information indicating a spatial layout of the plurality of text elements in the document; generating a plurality of semantic feature representations corresponding to the plurality of text elements based at least on the text sequence and the layout information; and determining a reading order of the plurality of text elements in the document based on the plurality of semantic feature representations. According to the solution, the introduction of the layout information can better characterize a spatial layout manner of the text elements in a specific document, thereby determining the reading order more effectively and accurately.

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. The Summary is not intended to identify key features or essential features of the subject matter described herein, nor is it intended to be used to limit the scope of the subject matter described herein.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a block diagram of an environment in which various embodiments of the present disclosure can be implemented;
FIG. 2 illustrates an example document in accordance with some embodiments of the present disclosure;
FIG. 3 illustrates a block diagram of a model architecture for reading order detection in accordance with some embodiments of the present disclosure;
FIG. 4 illustrates an example of an input embedding representation of a feature extraction model in accordance with some embodiments of the present disclosure;
FIG. 5 illustrates an example of labeling a text reading order in a document in accordance with some embodiments of the present disclosure;
FIG. 6 illustrates an example architecture for training a reading order detection model in accordance with some embodiments of the present disclosure;
FIG. 7 illustrates an example of labeling a sample document upon training in accordance with some embodiments of the present disclosure;
FIG. 8 illustrates an example of self-attention masking in accordance with some embodiments of the present disclosure;
FIG. 9 illustrates an example architecture of pre-training of the feature extraction model in accordance with some embodiments of the present disclosure;
FIG. 10 illustrates a flow chart of a process of reading order detection in accordance with some embodiments of the present disclosure;
FIG. 11 illustrates a flow chart of a process for model training in accordance with some embodiments of the present disclosure; and
FIG. 12 illustrates a block diagram of a computing device that can implement some embodiments of the present disclosure.

Throughout the drawings, the same or similar reference symbols refer to the same or similar elements.

DETAILED DESCRIPTION