US-20260127165-A1 - STRUCTURING-BASED KEY INFORMATION EXTRACTION IN MULTIMODAL MODELS FOR ENHANCING DOCUMENT UNDERSTANDING

Abstract

A system and method for extracting structured key information from diverse document types using large multimodal models (LMMs) is disclosed. The invention employs a zero-shot analysis to identify candidate keys within an input document, then selects a document schema from a document schema database based on the identified keys. The LMM is prompted with the selected document schema to generate structured key-value pairs, with field constraints enforced by the document schema. Relationships among extracted keys are mapped to a graph representation, enabling robust handling of complex document layouts. The system supports nested structures, tabular data, and alias definitions for fields, and can update document schemas based on ground truth feedback. The resulting structured output is provided in a machine-readable format, enabling reliable and scalable document understanding across varied domains such as invoices, health cards, and driving licenses.
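The schema-selection step summarized above can be illustrated with a minimal sketch. The source states only that selection is based on the identified candidate keys; the overlap score used here, along with the names `Field`, `DocumentSchema`, `normalize`, `overlap_score`, and `select_schema`, are assumptions made for illustration and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class Field:
    key: str
    aliases: tuple[str, ...] = ()  # alternative labels for the same field


@dataclass
class DocumentSchema:
    name: str
    fields: list[Field]

    def known_names(self) -> set[str]:
        """All normalized keys and aliases this schema recognizes."""
        names = set()
        for f in self.fields:
            names.add(normalize(f.key))
            names.update(normalize(a) for a in f.aliases)
        return names


def normalize(name: str) -> str:
    """Lowercase, treat underscores as spaces, collapse whitespace."""
    return " ".join(name.lower().replace("_", " ").split())


def overlap_score(candidate_keys, schema: DocumentSchema) -> float:
    """Fraction of candidate keys matching a schema key or alias (assumed metric)."""
    candidates = {normalize(k) for k in candidate_keys}
    if not candidates:
        return 0.0
    return len(candidates & schema.known_names()) / len(candidates)


def select_schema(candidate_keys, schemas):
    """Pick the schema with the highest overlap score."""
    return max(schemas, key=lambda s: overlap_score(candidate_keys, s))


# Hypothetical schemas and candidate keys, for illustration only.
invoice = DocumentSchema("invoice", [
    Field("invoice_number", ("Invoice No",)),
    Field("total_amount", ("Total",)),
])
license_schema = DocumentSchema("driving_license", [
    Field("license_number", ("DL No",)),
    Field("date_of_birth", ("DOB",)),
])
chosen = select_schema(["Invoice No", "Total", "Date"], [invoice, license_schema])
print(chosen.name)
```

A set-overlap score is only one plausible choice; an embedding-based similarity or a learned classifier could fill the same role.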

Inventors

Ashvini Kumar Sharma
Shirish Amit Bajpai
Amrit Bhaskar
Ankit Kumar Aggarwal
Reetesh Mukul

Assignees

ORACLE INTERNATIONAL CORPORATION

Dates

Publication Date: 20260507
Application Date: 20250825
Priority Date: 20241106

Claims (20)

1 . A method, comprising: receiving, by a computing system, an input document comprising at least one of image data or text data; generating, by a large multimodal model (LMM) of the computing system, a first set of candidate keys present in the input document via a first type of analysis; selecting, by the computing system, a document schema from a schema database based at least in part on the first set of candidate keys, the document schema comprising a plurality of fields, each field of the plurality of fields specifying at least one of a key, expected data type, or a default status; generating, by the LMM of the computing system, a structured output comprising key-value pairs from the input document, the LMM being prompted using the selected document schema to enforce output structure and field constraints; and outputting, by the computing system, the structured output in a machine-readable format according to the output structure and the field constraints.
2 . The method of claim 1 , wherein the document schema further specifies explicit data types and default values for each field, and wherein absent fields in the input document are assigned the respective default values in the structured output.
3 . The method of claim 1 , wherein the document schema supports nested structures and tabular data, and wherein the step of generating the structured output further comprises extracting and hierarchically organizing nested or tabular information from the document.
4 . The method of claim 1 , further comprising, prior to outputting the structured output: evaluating the structured output against ground truth data, and updating the document schema in the document schema database based on the evaluation if an accuracy metric falls below a predefined threshold.
5 . The method of claim 1 , wherein the LMM is prompted with specialized prompt templates tailored to the document type, the prompt templates being dynamically generated or adapted based on the selected document schema.
6 . The method of claim 1 , further comprising: mapping, by a graph-based module, relationships among extracted key-value pairs to a graph representation, wherein nodes correspond to keys and edges correspond to relationships between keys; and wherein the graph-based module identifies at least one subgraph comprising a subset of the extracted keys and relationships, the subgraph being selected to maximize a utility function defined over nodes and edges.
7 . The method of claim 1 , wherein the document schema includes alias definitions for at least one field, and wherein the LMM is configured to recognize and extract values corresponding to any of the alias definitions associated with the field.
8 . A system for extracting structured key information from a document, the system comprising: at least one processor; and a memory storing instructions that, when executed by the at least one processor, cause the system to: receive, by a computing system, an input document comprising at least one of image data or text data; generate, by a large multimodal model (LMM) of the computing system, a first set of candidate keys present in the input document via a first type of analysis; select, by the computing system, a document schema from a schema database based at least in part on the first set of candidate keys, the document schema comprising a plurality of fields, each field of the plurality of fields specifying at least one of a key, expected data type, or a default status; generate, by the LMM of the computing system, a structured output comprising key-value pairs from the input document, the LMM being prompted using the selected document schema to enforce output structure and field constraints; and output, by the computing system, the structured output in a machine-readable format according to the output structure and the field constraints.
9 . The system of claim 8 , wherein the document schema further specifies explicit data types and default values for each field, and wherein the instructions further cause the system to assign the respective default values to absent fields in the structured output.
10 . The system of claim 8 , wherein the document schema supports nested structures and tabular data, and wherein the instructions further cause the system to extract and hierarchically organize nested or tabular information from the input document.
11 . The system of claim 8 , wherein the instructions further cause the system to, prior to outputting the structured output, evaluate the structured output against ground truth data, and update the document schema in the document schema database based on the evaluation if an accuracy metric falls below a predefined threshold.
12 . The system of claim 8 , wherein the LMM is prompted with a prompt template that is dynamically generated or adapted based on the selected document schema and the document type identified.
13 . The system of claim 8 , wherein the instructions further cause the system to: map, by a graph-based module, relationships among extracted key-value pairs to a graph representation, wherein nodes correspond to keys and edges correspond to relationships between keys; wherein mapping relationships among extracted key-value pairs to a graph representation comprises identifying a subgraph corresponding to a subset of the extracted keys and relationships, the subgraph being selected to maximize a utility function defined over nodes and edges.
14 . The system of claim 8 , wherein the document schema includes alias definitions for at least one field, and wherein the instructions further cause the system to recognize and extract values corresponding to any of the alias definitions associated with the field.
15 . The system of claim 8 , wherein the machine-readable format comprises a structured data format selected from the group consisting of: JSON, XML, and YAML.
16 . A non-transitory computer-readable medium storing instructions that, when executed by at least one processor, cause a computing system to perform a method comprising: receiving, by a computing system, an input document comprising at least one of image data or text data; generating, by a large multimodal model (LMM) of the computing system, a first set of candidate keys present in the input document via a first type of analysis; selecting, by the computing system, a document schema from a schema database based at least in part on the first set of candidate keys, the document schema comprising a plurality of fields, each field of the plurality of fields specifying at least one of a key, expected data type, or a default status; generating, by the LMM of the computing system, a structured output comprising key-value pairs from the input document, the LMM being prompted using the selected document schema to enforce output structure and field constraints; and outputting, by the computing system, the structured output in a machine-readable format according to the output structure and the field constraints.
17 . The non-transitory computer-readable medium of claim 16 , wherein the document schema further specifies explicit data types and default values for each field, and wherein the method further comprises assigning the respective default values to absent fields in the structured output.
18 . The non-transitory computer-readable medium of claim 16 , wherein the document schema supports nested structures and tabular data, and wherein the method further comprises extracting and hierarchically organizing nested or tabular information from the input document.
19 . The non-transitory computer-readable medium of claim 16 , wherein the method further comprises, prior to outputting the structured output, evaluating the structured output against ground truth data, and updating the document schema in the document schema database based on the evaluation if an accuracy metric falls below a predefined threshold.
20 . The non-transitory computer-readable medium of claim 16 , wherein mapping relationships among extracted key-value pairs to a graph representation comprises identifying a subgraph corresponding to a subset of the extracted keys and relationships, the subgraph being selected to maximize a utility function defined over nodes and edges.
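The subgraph selection recited in the graph-related claims can be sketched as follows. The claims do not define the utility function or the search procedure; this sketch assumes a utility equal to the sum of per-node scores plus the scores of edges whose endpoints are both selected, and uses exhaustive search, which is practical only for small key graphs. The names `utility` and `best_subgraph` and all scores are hypothetical.

```python
from itertools import combinations


def utility(nodes, node_score, edge_score):
    """Sum of node scores plus scores of edges with both endpoints selected."""
    total = sum(node_score[n] for n in nodes)
    total += sum(w for (a, b), w in edge_score.items() if a in nodes and b in nodes)
    return total


def best_subgraph(node_score, edge_score):
    """Exhaustively search node subsets and return the highest-utility one."""
    keys = sorted(node_score)
    best_nodes, best_value = frozenset(), 0.0
    for size in range(1, len(keys) + 1):
        for subset in combinations(keys, size):
            chosen = frozenset(subset)
            value = utility(chosen, node_score, edge_score)
            if value > best_value:
                best_nodes, best_value = chosen, value
    return best_nodes, best_value


# Hypothetical scores: positive for keys likely to be relevant, negative for noise.
node_score = {
    "invoice_number": 0.9,
    "total_amount": 0.8,
    "tax": 0.3,
    "footer_text": -0.5,
}
# Hypothetical relationship strengths between pairs of keys.
edge_score = {
    ("total_amount", "tax"): 0.4,
    ("invoice_number", "footer_text"): 0.1,
}
nodes, value = best_subgraph(node_score, edge_score)
print(sorted(nodes), round(value, 2))
```

Negative node scores are what keep the search from trivially returning the whole graph; with only positive scores, a size limit or a per-node cost would be needed instead.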

Description

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of priority to Indian Provisional Patent Application No. 2024/41085101, filed on Nov. 6, 2024, entitled “Structuring-Based Key Information Extraction In Multimodal Models For Enhancing Document Understanding.” The entire disclosure of the Indian provisional application is hereby incorporated by reference in its entirety for all purposes.

FIELD

The present disclosure relates generally to automated document understanding systems and, more particularly, but not necessarily exclusively, to techniques for leveraging large multimodal models and structured schema-driven frameworks to extract, structure, and represent key information from diverse document types, including but not limited to invoices, health cards, and driving licenses.

BACKGROUND

Automated document understanding has become increasingly important as organizations manage large volumes of digital documents in domains such as finance, healthcare, and government administration. Historically, information extraction from documents has relied on rule-based systems, template matching, and traditional optical character recognition (OCR) technologies to identify and extract key fields or data points from structured or semi-structured documents, such as invoices, identification cards, and forms. While these conventional approaches have provided some automation benefits, they often require extensive customization for each document type, struggle to adapt to variations in layout and content, and may lack robustness in handling unstructured or complex formats. More recently, advances in machine learning and deep learning, including the use of multimodal models capable of processing both text and image data, have enabled improved performance in extracting information from diverse documents.
However, these systems may still face challenges in generalizing across document types, maintaining accuracy, and ensuring reliable structured outputs in varying real-world scenarios.

BRIEF SUMMARY

Various embodiments described herein relate to systems, methods, and computer program products for extracting structured key information from documents using large multimodal models (LMMs). In some examples, a system may include one or more computers, each of which can be configured by software, firmware, hardware, or any combination thereof, to perform operations as described. One or more computer programs, when executed by data processing apparatus, may cause the apparatus to carry out the described actions.

In various embodiments, techniques for extracting structured key information from documents may be provided using a computing system that can include at least one processor and a memory. In certain implementations, a computing system may be configured to receive an input document that can include image data, text data, or both. The system may use a large multimodal model to perform zero-shot analysis on the input document, generating a first set of candidate keys present in the document without any prior training specific to the document type. Based on the identified candidate keys, the system may select a document schema from a schema database. The selected schema may define multiple fields, with each field specifying a key, an expected data type, a required or default status, and one or more optional aliases. The LMM may be prompted using the selected schema to generate a structured output comprising key-value pairs extracted from the input document, where the schema can be used to enforce field structure and data-type constraints. In some embodiments, the schema can further specify explicit data types and default values for each field, and absent fields in the input document may be assigned their respective default values in the structured output.
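The field constraints and default handling described above can be sketched as a post-processing step applied to the key-value pairs returned by the model. The disclosure describes enforcing constraints by prompting the LMM with the schema; validating the output afterwards, as done here, is an assumption, as are the names `FieldSpec` and `apply_schema` and the sample field definitions.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class FieldSpec:
    key: str
    cast: Callable[[Any], Any]       # expected data type, expressed as a converter
    default: Any = None              # value used when the field is absent
    aliases: tuple[str, ...] = ()    # alternative labels for the same field


def apply_schema(raw: dict[str, Any], schema: list[FieldSpec]) -> dict[str, Any]:
    """Map raw key-value pairs onto schema keys, coercing types and filling defaults."""
    # Case-insensitive view of what the model returned.
    lowered = {k.strip().lower(): v for k, v in raw.items()}
    out = {}
    for spec in schema:
        value = spec.default
        for name in (spec.key, *spec.aliases):
            if name.lower() in lowered:
                try:
                    value = spec.cast(lowered[name.lower()])
                except (TypeError, ValueError):
                    value = spec.default  # value violates the type constraint
                break
        out[spec.key] = value
    return out


# Hypothetical invoice schema and hypothetical raw model output.
schema = [
    FieldSpec("invoice_number", str, default="", aliases=("Invoice No",)),
    FieldSpec("total_amount", float, default=0.0, aliases=("Total",)),
    FieldSpec("currency", str, default="USD"),
]
raw = {"Invoice No": "INV-17", "Total": "1250.50"}
print(json.dumps(apply_schema(raw, schema)))
```

Keys that the schema does not list are dropped, so the output always has exactly the schema's fields; whether an implementation should instead retain unknown keys is a design choice the source does not address.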
The schema may also support nested structures and tabular data, such that the structured output can include hierarchically organized information or tabular representations, as appropriate for the document type.

A graph-based module may be included to map relationships among the extracted key-value pairs to a graph representation, where nodes may correspond to keys and edges may represent relationships between keys. In some examples, the graph-based module can identify at least one subgraph comprising a subset of the extracted keys and relationships, where the subgraph may be selected to maximize a utility function defined over nodes and edges, thereby optimizing the accuracy and relevance of the extracted information.

The system or method may further include, prior to outputting the structured output, evaluating the structured output against ground truth data and updating the schema in the schema database based on the evaluation if an accuracy metric falls below a predefined threshold. The LMM may be prompted with specialized prompt templates that can be tailored to the document type, with such prompt templates being dynamically generated or adapted based o