US-20260127377-A1 - Document Entity Extraction Using Machine-Learned Models

US 20260127377 A1

Abstract

Systems and methods for performing document entity extraction are described herein. The method can include receiving an inference document and a target schema. The method can also include generating one or more document inputs from the inference document and one or more schema inputs from the target schema. The method can further include, for each combination of a document input and a schema input, obtaining one or more extraction inputs by generating a respective extraction input based on the combination, providing the respective extraction input to the machine-learned model, and receiving a respective output of the machine-learned model based on the respective extraction input. The method can also include validating the extracted entity data based on reference spatial locations and inference spatial locations and outputting the validated extracted entity data.

Inventors

Vincent Perot
Nikolay Alexeevich Glushnev
Nan Hua
Yun-hsuan Sung
Michael Yiupun Kwong
Florian Luisier
Kai Kang
Ramya Sree Boppana
Jiaqi Mu
Xiaoyu Sun
Carl Elie Saroufim
Guolong Su
Hao Zhang

Assignees

GOOGLE LLC

Dates

Publication Date: 2026-05-07
Application Date: 2026-01-05

Claims (20)

1.-20. (canceled)
21. A computer-implemented method for performing document entity extraction, the method comprising: receiving, by a computing system comprising a processor, an inference document, wherein the inference document comprises document data and one or more reference location tags respectively indicating one or more reference spatial locations of the document data within the inference document; obtaining one or more extraction outputs by: generating, by the computing system, a prompt including one or more reference spatial locations; providing, by the computing system, the prompt to a machine-learned model; and receiving, by the computing system, an output of the machine-learned model based on the prompt, wherein the output comprises entity data and one or more inference location tags corresponding to one or more inference spatial locations of the entity data within the inference document; validating, by the computing system, the entity data based on the reference spatial locations and the inference spatial locations; and outputting, by the computing system, the entity data.
22. The computer-implemented method of claim 21, wherein the inference document is based on an output of an optical character recognition system, and wherein the document data includes data representing optically-recognized characters in a rendering of the inference document.
23. The computer-implemented method of claim 22, the method comprising: receiving, by the computing system, an image input, wherein the image input is used to validate the output of the optical character recognition system.
24. The computer-implemented method of claim 21, wherein the inference document is an image representation of an electronic document.
25. The computer-implemented method of claim 21, wherein the one or more reference location tags respectively indicating one or more reference spatial locations of the document data within a rendering of the inference document are indicative of one or more bounding boxes containing a portion of the document data.
26. The computer-implemented method of claim 21, wherein validating the entity data based on the reference spatial locations and the inference spatial locations comprises: performing, by the computing system, normalized string matching between the entity data and document data at the reference spatial locations in the inference document as indicated by the one or more inference location tags corresponding to one or more inference spatial locations of the entity data within the inference document; determining, by the computing system, if the entity data matches the document data; and in response to determining that the entity data matches the document data, validating, by the computing system, the entity data.
27. The computer-implemented method of claim 26, the method comprising: in response to determining that the entity data does not match the document data: discarding, by the computing system, the entity data.
28. The computer-implemented method of claim 21, the method comprising: dividing, by the computing system, a target schema into a plurality of independent branches, each branch of the plurality of independent branches representing a data entity and subentities of the data entity, wherein each independent branch of the plurality of independent branches is a schema input of the target schema.
29. The computer-implemented method of claim 21, wherein the prompt includes one or more extraction instructions.
30. The computer-implemented method of claim 29, wherein the one or more extraction instructions include a description of a spatial location.
31. The computer-implemented method of claim 21, the method comprising: retrieving, by the computing system, at least one document from a document corpus; and adding, by the computing system, at least a portion of the at least one document to the prompt.
32. The computer-implemented method of claim 31, wherein the prompt includes an extraction representation of one or more data entities extracted from the portion of the at least one document.
33. The computer-implemented method of claim 32, further comprising repeating the providing and receiving steps to obtain a plurality of outputs, and determining a representative value wherein determining the representative value comprises determining a majority output from the plurality of outputs.
34. The computer-implemented method of claim 33, wherein a confidence score is generated based on the majority output and the plurality of outputs.
35. The computer-implemented method of claim 34, wherein the representative value is determined based at least in part on one or more received scores from the model.
36. A computing system for performing document entity extraction, the computing system comprising: one or more processors; and a non-transitory, computer-readable medium comprising instructions that, when executed by the one or more processors, cause the one or more processors to perform operations, the operations comprising: receiving an inference document, wherein the inference document comprises a rendering composed of a plurality of pixel values and one or more reference location tags respectively indicating one or more reference spatial locations of document data within the rendering; obtaining one or more extraction outputs by: generating a multimodal prompt including one or more reference spatial locations; providing the multimodal prompt to a machine-learned model; and receiving an output of the machine-learned model based on the multimodal prompt, wherein the output comprises entity data and one or more inference location tags corresponding to one or more inference spatial locations of the entity data within the rendering of the inference document; validating the entity data based on the reference spatial locations and the inference spatial locations; and outputting the entity data.
37. The computing system of claim 36, wherein validating the entity data based on the reference spatial locations and the inference spatial locations comprises: performing normalized string matching between the entity data and document data at the reference spatial locations in the rendering of the inference document as indicated by the one or more inference location tags corresponding to one or more inference spatial locations of the entity data within the rendering of the inference document; determining if the entity data matches the document data; and in response to determining that the entity data matches the document data, validating the entity data.
38. The computing system of claim 37, the operations comprising: in response to determining that the entity data does not match the document data: discarding the entity data.
39. A computer-implemented method for performing document entity extraction, the method comprising: receiving an inference document and a target schema; subdividing the inference document into a plurality of document chunks and the target schema into a plurality of schema inputs; generating a prompt for each document chunk, wherein each prompt is provided to a machine-learned model for a plurality of iterations to obtain a set of K completions for each document chunk; evaluating a consistency metric across the K completions for each respective document chunk to select a representative value; and outputting the selected representative values for each respective document chunk.
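The majority-output selection and confidence score recited in claims 33, 34, and 39 can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the application: the function names `normalize` and `majority_vote`, the whitespace and case normalization, and the choice of the vote fraction as the confidence score.

```python
from collections import Counter


def normalize(value: str) -> str:
    """Collapse whitespace and case so trivially different completions agree."""
    return " ".join(value.split()).casefold()


def majority_vote(completions: list[str]) -> tuple[str, float]:
    """Pick the most frequent completion and report its share of the votes.

    The vote share is one possible confidence score (an assumption here);
    the claims only say the score is based on the majority output and the
    plurality of outputs.
    """
    if not completions:
        raise ValueError("at least one completion is required")
    counts = Counter(normalize(c) for c in completions)
    winner, votes = counts.most_common(1)[0]
    # Return the first raw completion that normalizes to the winning value.
    representative = next(c for c in completions if normalize(c) == winner)
    return representative, votes / len(completions)


# K = 5 hypothetical completions for the same prompt.
completions = ["$1,250.00", "$1,250.00", "$1,250.00 ", "$1,520.00", "$1,250.00"]
value, confidence = majority_vote(completions)
print(value, confidence)
```

Here four of the five completions agree after normalization, so the sketch selects that value with a confidence of 0.8.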

Description

PRIORITY CLAIM

The present application is a continuation of U.S. application Ser. No. 18/453,236 having a filing date of Aug. 21, 2023. Applicant claims priority to and the benefit of each of such applications and incorporates all such applications herein by reference in their entirety.

FIELD

The present disclosure relates generally to document entity extraction. More particularly, the present disclosure relates to extracting data entities from documents into a target data schema.

BACKGROUND

Documents can contain large amounts of data. Data obtained from these documents may not be structured in a desired format. Certain portions of the data might be associated with semantically meaningful categories or labels, but such association may not be explicit in the raw data. Techniques that map data values to one or more desired labels are often described as performing document entity extraction.

SUMMARY

Aspects and advantages of embodiments of the present disclosure will be set forth in part in the following description, or can be learned from the description, or can be learned through practice of the embodiments.

One example aspect of the present disclosure is directed to a computer-implemented method for performing document entity extraction. The method can include receiving, by a computing system comprising a processor, an inference document and a target schema, wherein the inference document comprises document data and one or more reference location tags respectively indicating one or more reference spatial locations of the document data within a rendering of the inference document. The method can also include generating, by the computing system and based on an input dimension of a machine-learned model, one or more document inputs from the inference document and one or more schema inputs from the target schema.
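As a rough illustration of generating document inputs and schema inputs based on an input dimension, the sketch below treats the input dimension as a token budget and packs tagged text segments into chunks that fit it. The `Segment` type, the whitespace-based cost estimate, the greedy packing, and the one-branch-per-top-level-entity schema split are all assumptions for illustration; the application does not specify these details.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    bbox: tuple[int, int, int, int]  # hypothetical reference location (x0, y0, x1, y1)


def segment_cost(segment: Segment) -> int:
    """Crude cost proxy: whitespace-separated words plus one for the location tag."""
    return len(segment.text.split()) + 1


def chunk_document(segments: list[Segment], budget: int) -> list[list[Segment]]:
    """Greedily pack segments, in reading order, into chunks within the budget."""
    chunks: list[list[Segment]] = []
    current: list[Segment] = []
    used = 0
    for seg in segments:
        cost = segment_cost(seg)
        if cost > budget:
            raise ValueError("a single segment exceeds the budget")
        if current and used + cost > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(seg)
        used += cost
    if current:
        chunks.append(current)
    return chunks


def split_schema(schema: dict) -> list[dict]:
    """One schema input per top-level entity, keeping its subentities together."""
    return [{name: sub} for name, sub in schema.items()]


segments = [
    Segment("Invoice 1234", (10, 10, 120, 24)),
    Segment("Total due", (10, 40, 90, 54)),
    Segment("$1,250.00", (100, 40, 170, 54)),
    Segment("Acme Corp", (10, 70, 95, 84)),
]
schema = {
    "invoice_id": "string",
    "line_item": {"description": "string", "amount": "money"},
}

chunks = chunk_document(segments, budget=6)
branches = split_schema(schema)
print(len(chunks), len(branches))
```

A real system would presumably count tokens with the model's own tokenizer rather than by whitespace; the greedy packing is only the simplest policy that keeps each chunk within the budget.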
The method can further include, for each respective combination of the one or more document inputs and the one or more schema inputs, obtaining one or more extraction inputs by generating, by the computing system, a respective extraction input based on the respective combination, providing, by the computing system, the respective extraction input to the machine-learned model, and receiving, by the computing system, a respective output of the machine-learned model based on the respective extraction input, wherein the respective output comprises entity data extracted according to the target schema and one or more inference location tags corresponding to one or more inference spatial locations of the entity data within the rendering of the inference document. The method can also include validating, by the computing system, the extracted entity data based on the reference spatial locations and the inference spatial locations and outputting, by the computing system, the validated extracted entity data.

Another example aspect of the present disclosure is directed to a computing system for performing document entity extraction. The computing system can include one or more processors and a non-transitory, computer-readable medium comprising instructions that, when executed by the one or more processors, cause the one or more processors to perform operations. The operations can include receiving an inference document and a target schema, wherein the inference document comprises document data and one or more reference location tags respectively indicating one or more reference spatial locations of the document data within a rendering of the inference document. The operations can also include generating, based on an input dimension of a machine-learned model, one or more document inputs from the inference document and one or more schema inputs from the target schema.
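The per-combination extraction and the location-based validation described above can be sketched as follows. This is a minimal illustration under stated assumptions: short tag identifiers such as `t1` stand in for reference location tags, `stub_model` is a hypothetical stand-in for the machine-learned model, the prompt layout is invented, and the matching rule (normalized substring containment) is one plausible reading of normalized string matching rather than the application's definition.

```python
import re
from itertools import product


def normalize(text: str) -> str:
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]+", "", text.casefold())


def validate(entities, reference):
    """Keep entities whose value appears in the document text at the cited tag."""
    kept = []
    for field, value, tag in entities:
        source = reference.get(tag)  # document data at the inferred location
        needle = normalize(value)
        if source is not None and needle and needle in normalize(source):
            kept.append((field, value, tag))
        # Entities that fail the match are discarded.
    return kept


def stub_model(prompt: str):
    """Hypothetical model: one grounded answer and one hallucinated answer."""
    if prompt.endswith("Extract: total"):
        return [("total", "1,250.00", "t2")]
    return [("vendor", "Globex", "t1")]  # not present at t1


def extract(document_inputs, schema_inputs, model):
    """Run the model on every (document input, schema input) combination."""
    results = []
    for doc, branch in product(document_inputs, schema_inputs):
        lines = [f"[{tag}] {text}" for tag, text in doc.items()]
        prompt = "\n".join(lines) + f"\nExtract: {branch}"
        results.extend(validate(model(prompt), doc))
    return results


doc_inputs = [{"t1": "Acme Corp", "t2": "Total due: $1,250.00"}]
schema_inputs = ["total", "vendor"]
validated = extract(doc_inputs, schema_inputs, stub_model)
print(validated)
```

In this sketch the total survives because its normalized value is found in the text at tag `t2`, while the vendor is dropped because "Globex" does not occur at tag `t1`.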
The operations can further include, for each respective combination of the one or more document inputs and the one or more schema inputs, obtaining one or more extraction inputs by generating a respective extraction input based on the respective combination, providing the respective extraction input to the machine-learned model, and receiving a respective output of the machine-learned model based on the respective extraction input, wherein the respective output comprises entity data extracted according to the target schema and one or more inference location tags corresponding to one or more inference spatial locations of the entity data within the rendering of the inference document. The operations can also include validating the extracted entity data based on the reference spatial locations and the inference spatial locations and outputting the validated extracted entity data.

Another example aspect of the present disclosure is directed to a non-transitory, computer-readable medium comprising instructions that, when executed by one or more processors, cause the one or more processors to perform operations. The operations can include receiving an inference document and a target schema, wherein the inference document comprises document data and one or mor