US-20260127210-A1 - Generative AI Agent For Intelligent Document Processing And Management
Abstract
Systems and methods for an application using generative artificial intelligence and machine learning techniques to process documents in response to user prompts are provided. A method can include receiving a document and a user prompt including a request to perform a task. The method can include generating a semantic search query and a graph search query using the user prompt and the document. The method can include performing a semantic search of a vector database using the semantic search query and performing a graph search of a metadata graph using the graph search query. The method can include determining, using results of the semantic search and graph search, a context associated with the document and generating an output associated with the document by applying a machine learning model to an input comprising the context and document.
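The order of operations in the abstract can be sketched as a short pipeline. This is only an illustrative sketch, not the applicant's implementation: the `Context` class, the `process_document` function, the string-based queries, and the three callables passed in are all hypothetical stand-ins for the search services and the machine learning model.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Context:
    """Context assembled from search results (hypothetical structure)."""
    summaries: list[str] = field(default_factory=list)
    entities: list[str] = field(default_factory=list)
    keywords: list[str] = field(default_factory=list)


def process_document(
    document: str,
    prompt: str,
    semantic_search: Callable[[str], list[str]],
    graph_search: Callable[[str], list[str]],
    model: Callable[[str], str],
) -> str:
    # Step 1: derive both queries from the prompt and the document.
    # Assumption: plain strings; a real system would likely encode them.
    semantic_query = f"{prompt} {document[:200]}"
    graph_query = prompt

    # Step 2: run the semantic search and the graph search independently.
    passages = semantic_search(semantic_query)
    related = graph_search(graph_query)

    # Step 3: determine the context from both result sets.
    context = Context(summaries=passages, entities=related)

    # Step 4: apply the model to an input comprising context and document.
    model_input = (
        f"Task: {prompt}\n"
        f"Summaries: {context.summaries}\n"
        f"Entities: {context.entities}\n"
        f"Document: {document}"
    )
    return model(model_input)
```

Passing the searches and the model in as callables keeps the sketch independent of any particular vector database, graph store, or model provider.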
Inventors
- Zhihong Zeng
- Shivam Mittal
- Samriddhi Shakya
- Sushant Tiwari
- Narasimha Goli
Assignees
- IRON MOUNTAIN INCORPORATED
Dates
- Publication Date: 20260507
- Application Date: 20251031
Claims (20)
- 1. A computer-implemented method of document image processing, the method comprising: receiving a document and a user prompt including a request to perform a task; generating a semantic search query and a graph search query using the user prompt and the document; performing a semantic search of a vector database using the semantic search query and performing a graph search of a metadata graph using the graph search query; determining, using results of the semantic search and graph search, a context associated with the document, wherein the context includes one or more of summaries of sections of the document, entities associated with the document, and keywords associated with the document; and generating an output associated with the document by applying a machine learning model to an input comprising the context and document.
- 2. The computer-implemented method of claim 1, wherein the machine learning model comprises a large language model or a multimodal model.
- 3. The computer-implemented method of claim 1, wherein the output comprises a summary, a table, an answer to a question associated with the user prompt, or any combination thereof.
- 4. The computer-implemented method of claim 1, further comprising: generating for display, via a display of a user device, a user interface including an agent configuration section, a task input section, and a task output section; receiving, at the agent configuration section, a user selection of the document from a plurality of documents, a machine learning model selection from a plurality of machine learning models, and an optical recognition service from a plurality of optical recognition services; receiving, at the task input section, the user prompt associated with the task to be performed by the machine learning model; and in response to receiving the document, the machine learning model selection, the optical recognition service, and the user prompt: processing the document using the optical recognition service to thereby generate a processed document; invoking the machine learning model to perform the task on the processed document and generate an output associated with the task; and displaying, via the display of the user device, the output at the task output section of the user interface.
- 5. The computer-implemented method of claim 1, wherein generating the semantic search query comprises encoding the user prompt into a vector representation using a transformer-based language model, and wherein the semantic search comprises performing a nearest neighbor search in the vector database using cosine similarity or Euclidean distance metrics.
- 6. The computer-implemented method of claim 1, wherein the vector database stores precomputed vector embeddings of document sections, paragraphs, or entities, and wherein performing the semantic search returns content based at least in part on a similarity to an encoded user prompt.
- 7. The computer-implemented method of claim 1, wherein generating the graph search query comprises extracting entities and relationships from the user prompt and mapping them to nodes and edges in the metadata graph, and wherein the method further comprises: searching the metadata graph by traversing nodes and edges using a graph traversal algorithm to identify document elements that share metadata attributes associated with the user prompt.
- 8. The computer-implemented method of claim 1, further comprising: combining results of the semantic search and the graph search; ranking the results to extract a most relevant context for processing the document with the machine learning model; and labeling the most relevant context as the context.
- 9. A system comprising: one or more processors; and one or more memories storing computer-executable instructions that, when executed by the one or more processors, cause the one or more processors to: receive a document and a user prompt including a request to perform a task; generate a semantic search query and a graph search query using the user prompt and the document; perform a semantic search of a vector database using the semantic search query and perform a graph search of a metadata graph using the graph search query; determine, using results of the semantic search and graph search, a context associated with the document, wherein the context includes one or more of summaries of sections of the document, entities associated with the document, and keywords associated with the document; and generate an output associated with the document by applying a machine learning model to an input comprising the context and document.
- 10. The system of claim 9, wherein the machine learning model comprises a large language model or a multimodal model.
- 11. The system of claim 9, wherein the output comprises a summary, a table, an answer to a question associated with the user prompt, or any combination thereof.
- 12. The system of claim 9, wherein the instructions further cause the one or more processors to: generate for display, via a display of a user device, a user interface including an agent configuration section, a task input section, and a task output section; receive, at the agent configuration section, a user selection of the document from a plurality of documents, a machine learning model selection from a plurality of machine learning models, and an optical recognition service from a plurality of optical recognition services; receive, at the task input section, the user prompt associated with the task to be performed by the machine learning model; and in response to receiving the document, the machine learning model selection, the optical recognition service, and the user prompt: process the document using the optical recognition service to thereby generate a processed document; invoke the machine learning model to perform the task on the processed document and generate an output associated with the task; and display, via the display of the user device, the output at the task output section of the user interface.
- 13. The system of claim 9, wherein generating the semantic search query comprises encoding the user prompt into a vector representation using a transformer-based language model, and wherein the semantic search comprises performing a nearest neighbor search in the vector database using cosine similarity or Euclidean distance metrics.
- 14. The system of claim 9, wherein the vector database stores precomputed vector embeddings of document sections, paragraphs, or entities, and wherein performing the semantic search returns content based at least in part on a similarity to an encoded user prompt.
- 15. The system of claim 9, wherein generating the graph search query comprises extracting entities and relationships from the user prompt and mapping them to nodes and edges in the metadata graph, and wherein the instructions further cause the one or more processors to: search the metadata graph by traversing nodes and edges using a graph traversal algorithm to identify document elements that share metadata attributes associated with the user prompt.
- 16. The system of claim 9, wherein the instructions further cause the one or more processors to: combine results of the semantic search and the graph search; rank the results to extract a most relevant context for processing the document with the machine learning model; and label the most relevant context as the context.
- 17. A non-transitory computer-readable medium comprising instructions that are executable by one or more processors to cause the one or more processors to perform operations comprising: receiving a document and a user prompt including a request to perform a task; generating a semantic search query and a graph search query using the user prompt and the document; performing a semantic search of a vector database using the semantic search query and performing a graph search of a metadata graph using the graph search query; determining, using results of the semantic search and graph search, a context associated with the document, wherein the context includes one or more of summaries of sections of the document, entities associated with the document, and keywords associated with the document; and generating an output associated with the document by applying a machine learning model to an input comprising the context and document.
- 18. The non-transitory computer-readable medium of claim 17, wherein the machine learning model comprises a large language model or a multimodal model.
- 19. The non-transitory computer-readable medium of claim 17, wherein the output comprises a summary, a table, an answer to a question associated with the user prompt, or any combination thereof.
- 20. The non-transitory computer-readable medium of claim 17, wherein the instructions further cause the one or more processors to perform operations comprising: generating for display, via a display of a user device, a user interface including an agent configuration section, a task input section, and a task output section; receiving, at the agent configuration section, a user selection of the document from a plurality of documents, a machine learning model selection from a plurality of machine learning models, and an optical recognition service from a plurality of optical recognition services; receiving, at the task input section, the user prompt associated with the task to be performed by the machine learning model; and in response to receiving the document, the machine learning model selection, the optical recognition service, and the user prompt: processing the document using the optical recognition service to thereby generate a processed document; invoking the machine learning model to perform the task on the processed document and generate an output associated with the task; and displaying, via the display of the user device, the output at the task output section of the user interface.
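The nearest neighbor search recited in the claims, using cosine similarity or Euclidean distance over precomputed embeddings, can be illustrated with a brute-force scan. This is a minimal sketch under stated assumptions: the function names, the section identifiers, and the three-dimensional vectors are hypothetical, and the hand-written vectors stand in for embeddings that a transformer-based language model would produce. A production vector database would typically use an approximate index rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(query, index, k=2, metric="cosine"):
    """Return the ids of the k entries closest to `query` (brute force)."""
    if metric == "cosine":
        # Higher similarity means closer, so sort in descending order.
        ranked = sorted(
            index.items(),
            key=lambda kv: cosine_similarity(query, kv[1]),
            reverse=True,
        )
    elif metric == "euclidean":
        # Smaller distance means closer, so sort in ascending order.
        ranked = sorted(
            index.items(),
            key=lambda kv: euclidean_distance(query, kv[1]),
        )
    else:
        raise ValueError(f"unsupported metric: {metric}")
    return [entry_id for entry_id, _ in ranked[:k]]


# Toy precomputed embeddings keyed by hypothetical section ids.
index = {
    "section-1": [1.0, 0.0, 0.0],
    "section-2": [0.9, 0.1, 0.0],
    "section-3": [0.0, 1.0, 0.0],
}
# Stand-in for the encoded user prompt.
query_vector = [1.0, 0.05, 0.0]

print(nearest_neighbors(query_vector, index, k=2, metric="cosine"))
print(nearest_neighbors(query_vector, index, k=2, metric="euclidean"))
```

For this toy data both metrics return `section-1` followed by `section-2`; on unnormalized embeddings the two metrics can rank entries differently.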
Description
CROSS REFERENCE TO RELATED APPLICATIONS
This application claims priority to U.S. Provisional Application No. 63/715,346, filed Nov. 1, 2024, the entirety of which is incorporated by reference herein for all purposes.
TECHNICAL FIELD
The field of the present disclosure relates to document processing using machine learning techniques. In particular, the present disclosure relates to systems and methods for an application using generative artificial intelligence and machine learning techniques to process documents in response to user prompts.
BACKGROUND
Document processing is an important endeavor allowing for sorting documents, classifying documents, interpreting contents of documents, and preparing documents for analysis. Existing data processing methods using machine learning techniques may be inadequate as the techniques may not provide context-aware interactions and may not provide user-defined domain context.
SUMMARY
Certain embodiments involve document processing using a user-friendly generative artificial intelligence agent. In one example, a computer-implemented method of document image processing is provided. The method can include receiving a document and a user prompt including a request to perform a task; generating a semantic search query and a graph search query using the user prompt and the document; performing a semantic search of a vector database using the semantic search query and performing a graph search of a metadata graph using the graph search query; determining, using results of the semantic search and graph search, a context associated with the document, wherein the context includes one or more of summaries of sections of the document, entities associated with the document, and keywords associated with the document; and generating an output associated with the document by applying a machine learning model to an input comprising the context and document.
In another example, a system is provided.
The system can include one or more processors and one or more memories. The one or more memories can store computer-executable instructions that, when executed by the one or more processors, cause the one or more processors to: receive a document and a user prompt including a request to perform a task; generate a semantic search query and a graph search query using the user prompt and the document; perform a semantic search of a vector database using the semantic search query and perform a graph search of a metadata graph using the graph search query; determine, using results of the semantic search and graph search, a context associated with the document, wherein the context includes one or more of summaries of sections of the document, entities associated with the document, and keywords associated with the document; and generate an output associated with the document by applying a machine learning model to an input comprising the context and document.
In yet another example, a non-transitory computer-readable medium is provided. The non-transitory computer-readable medium can include instructions that are executable by one or more processors to cause the one or more processors to perform operations comprising: receiving a document and a user prompt including a request to perform a task; generating a semantic search query and a graph search query using the user prompt and the document; performing a semantic search of a vector database using the semantic search query and performing a graph search of a metadata graph using the graph search query; determining, using results of the semantic search and graph search, a context associated with the document, wherein the context includes one or more of summaries of sections of the document, entities associated with the document, and keywords associated with the document; and generating an output associated with the document by applying a machine learning model to an input comprising the context and document.
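The graph search of a metadata graph, and the step of determining a context from the results of both searches, can be sketched as follows. This is an illustrative sketch only. The disclosure does not name a traversal or ranking algorithm here, so breadth-first traversal and reciprocal rank fusion are swapped in as plausible choices; the graph contents, node names, and function names are hypothetical.

```python
from collections import deque


def traverse(graph, start_nodes, max_depth=2):
    """Breadth-first traversal; returns reachable nodes in visit order.

    `start_nodes` is assumed to contain no duplicates.
    """
    seen = set(start_nodes)
    order = []
    queue = deque((node, 0) for node in start_nodes)
    while queue:
        node, depth = queue.popleft()
        order.append(node)
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return order


def fuse_rankings(rankings, k=60):
    """Reciprocal rank fusion: score(item) = sum of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical metadata graph as an adjacency list.
metadata_graph = {
    "entity:Acme": ["section-2", "section-4"],
    "section-2": ["keyword:lease"],
    "section-4": [],
    "keyword:lease": [],
}

# Assume the prompt mentioned "Acme", mapped to the node "entity:Acme".
visited = traverse(metadata_graph, ["entity:Acme"])
graph_hits = [node for node in visited if node.startswith("section-")]

# Stand-in for ranked results returned by the semantic search.
semantic_hits = ["section-1", "section-2"]

context_ids = fuse_rankings([semantic_hits, graph_hits])
print(context_ids)
```

Here `section-2` ranks first because it appears in both result lists, which is the kind of agreement between the two searches that a combined ranking is meant to reward.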
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 shows a block diagram of a computing environment for providing a generative artificial intelligence agent, according to certain aspects of the present disclosure.
FIG. 2 shows a block diagram demonstrating an example flow diagram showing an example framework for providing a generative artificial intelligence agent, according to certain aspects of the present disclosure.
FIG. 3 shows a flow chart demonstrating a process for providing a generative artificial intelligence agent, according to certain aspects of the present disclosure.
FIG. 4 shows a block diagram demonstrating a scaled implementation of a generative artificial intelligence agent, according to certain aspects of the present disclosure.
FIG. 5 shows an example user interface for a generative artificial intelligence agent, according to certain aspects of the present disclosure.
FIG. 6 shows a block diagram of an example computing device, according to certain aspects of the present disclosure.
DETAILED DESCRIPTION
The subject matter of embodiment