US-20260127357-A1 - EXTRACTING RELEVANT INFORMATION FROM A DOCUMENT
Abstract
A document associated with a query is preprocessed including by deconstructing the query into individual components and understanding a relationship between the individual components. A query response is received. An annotated version of the document is outputted. The annotated version of the document includes a visual indication of one or more portions of the original document that correspond to the query response.
Inventors
- Shuhao Zhang
- Wenjie Hu
- MingYang Li
Assignees
- Tiny Fish Inc.
Dates
- Publication Date: 2026-05-07
- Application Date: 2025-06-20
Claims (20)
- 1. A method, comprising: preprocessing a document associated with a query including by deconstructing the query into individual components and understanding a relationship between the individual components; receiving a query response; and outputting an annotated version of the document, wherein the annotated version of the document includes a visual indication of one or more portions of the original document that correspond to the query response.
- 2. The method of claim 1, further comprising receiving the document associated with the query.
- 3. The method of claim 1, wherein the document associated with the query is a document in a portable document format, a text document, a slide deck, a spreadsheet, or a flowchart document.
- 4. The method of claim 1, wherein the query includes one or more variables.
- 5. The method of claim 4, wherein the query response maps information included in the document to the one or more variables.
- 6. The method of claim 1, wherein preprocessing the document associated with the query includes performing optical character recognition on the document.
- 7. The method of claim 1, wherein preprocessing the document associated with the query includes modifying a table to include one or more missing lines.
- 8. The method of claim 1, wherein preprocessing the document associated with the query includes modifying a size of a font associated with one or more words included in the document.
- 9. The method of claim 1, further comprising providing located sections of the document that include elements associated with the query and the query to a cloud service.
- 10. The method of claim 9, wherein the cloud service generates a prompt based at least in part on the provided sections of the document that include the elements associated with the query and the query.
- 11. The method of claim 10, wherein the cloud service provides the prompt to a large language model, wherein the large language model generates the query response based on the provided prompt.
- 12. A system, comprising: a processor configured to: preprocess a document associated with a query including by deconstructing the query into individual components and understanding a relationship between the individual components; receive a query response; output an annotated version of the document, wherein the annotated version of the document includes a visual indication of one or more portions of the original document that correspond to the query response; and a memory coupled to the processor and configured to provide the processor with instructions.
- 13. The system of claim 12, further comprising receiving the document associated with the query.
- 14. The system of claim 12, wherein the document associated with the query is a document in a portable document format, a text document, a slide deck, a spreadsheet, or a flowchart document.
- 15. The system of claim 12, wherein the query includes one or more variables.
- 16. The system of claim 15, wherein the query response maps information included in the document to the one or more variables.
- 17. The system of claim 12, wherein preprocessing the document associated with the query includes performing optical character recognition on the document.
- 18. The system of claim 12, wherein preprocessing the document associated with the query includes modifying a table to include one or more missing lines.
- 19. The system of claim 12, wherein preprocessing the document associated with the query includes modifying a size of a font associated with one or more words included in the document.
- 20. A computer program product embodied in a non-transitory computer readable medium and comprising computer instructions for: preprocessing a document associated with a query including by deconstructing the query into individual components and understanding a relationship between the individual components; receiving a query response; and outputting an annotated version of the document, wherein the annotated version of the document includes a visual indication of one or more portions of the original document that correspond to the query response.
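As an informal illustration only (not part of the claims), the idea in claims 1, 4, and 5 of a query with variables, a query response that maps document information to those variables, and an annotated output can be sketched as follows. Everything here is an assumption made for the example: the names `Span`, `locate_spans`, and `annotate` are hypothetical, exact string matching stands in for whatever location technique an embodiment uses, and plain-text `[[variable: value]]` markers stand in for the visual indication.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Character range in the document that supports one query variable."""
    variable: str
    start: int
    end: int


def locate_spans(document_text: str, query_response: dict[str, str]) -> list[Span]:
    """Find the first exact occurrence of each response value in the text.

    Assumption: exact substring matching; values that do not occur are skipped.
    """
    spans = []
    for variable, value in query_response.items():
        if not value:
            continue  # an empty value cannot be located meaningfully
        start = document_text.find(value)
        if start != -1:
            spans.append(Span(variable, start, start + len(value)))
    return sorted(spans, key=lambda s: s.start)


def annotate(document_text: str, spans: list[Span]) -> str:
    """Wrap each located span in [[variable: ...]] markers.

    The markers are a plain-text stand-in for a visual highlight.
    """
    pieces = []
    cursor = 0
    for span in spans:
        if span.start < cursor:
            continue  # skip spans that overlap one already emitted
        pieces.append(document_text[cursor:span.start])
        pieces.append(f"[[{span.variable}: {document_text[span.start:span.end]}]]")
        cursor = span.end
    pieces.append(document_text[cursor:])
    return "".join(pieces)


# Hypothetical document text and query response, for illustration only.
document = "Invoice 1042 issued to Acme Corp. Total due: $318.50"
response = {"invoice_number": "1042", "total": "$318.50"}
annotated = annotate(document, locate_spans(document, response))
print(annotated)
```

In this sketch the query response is a plain dictionary keyed by variable name, so the annotation step needs no knowledge of how the response was produced.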
Description
CROSS REFERENCE TO OTHER APPLICATIONS
This application is a continuation of U.S. patent application Ser. No. 18/939,331, entitled EXTRACTING RELEVANT INFORMATION FROM A DOCUMENT, filed Nov. 6, 2024, which is incorporated herein by reference for all purposes.
BACKGROUND OF THE INVENTION
Optical Character Recognition (OCR) is a technology that transforms various types of documents (such as PDFs, images, and word processing files) into editable and searchable digital text. OCR software identifies the shapes of letters and words in these images, converting them into digital characters. However, current software solutions lack the ability to interpret OCR processed documents with the contextual depth and nuance of a human reader. When humans extract data from a document, they don't review the entire document in detail to absorb all its textual and visual content. Instead, they quickly scan the document, focusing on specific information they need, using semantic and visual cues within the content to locate the relevant data efficiently.
BRIEF DESCRIPTION OF THE DRAWINGS
Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.
FIG. 1A illustrates a system to extract relevant information from a document in accordance with some embodiments.
FIG. 1B illustrates a system to extract relevant information from a document in accordance with some embodiments.
FIG. 2 illustrates a process to extract relevant information from a document in accordance with some embodiments.
FIG. 3 illustrates an example of a query in accordance with some embodiments.
FIG. 4 is an example of a query response in accordance with some embodiments.
FIG. 5 is an annotated version of a document in accordance with some embodiments.
DETAILED DESCRIPTION
The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.
Systems and methods to extract relevant information from a document are disclosed herein. A document, such as a PDF, text document, image, slide deck, spreadsheet, flowchart document, etc., may undergo OCR to generate an OCR processed document that includes editable and searchable digital text. A user may provide a query specifying the relevant information they want to extract from the OCR processed document. However, unlike HTML pages, OCR processed documents lack the structure needed to find the relevant information associated with a query. Utilizing the systems and methods disclosed herein, relevant information associated with a document is extracted and provided in response to a query. The systems and methods disclosed herein enable relevant information to be extracted from any document; that is, any type of query may be answered for any type of document. In other words, a structured query may be performed on any unstructured document.
FIG. 1A is a block diagram illustrating an embodiment of a system to extract relevant information from a document in accordance with some embodiments. In the example shown, system 100 includes client device 102, runtime agent 112, cloud service 122, large language model 132, and inference patterns store 142. C