US-12626015-B2 - Method for securely extracting data from invoices using AI cloud service
Abstract
A computer-implemented method for extracting information from invoices using a third-party cloud AI service that includes the steps of performing OCR on an invoice image to generate an OCR Results file containing original words from the invoice and corresponding positional data, generating a JSON file by including original words that appear in a pre-prepared table of permitted words for disclosure with their corresponding positional data, and by indicating only the corresponding positional data of original words that are unpermitted for disclosure, generating an image based on the generated JSON file, by including the permitted words and alternative words to the unpermitted words, along with their corresponding positional data, uploading the generated image to the AI cloud service for data extraction processing, receiving from the AI cloud service word classifications and corresponding positional data, and using the received word classifications and their corresponding positional data to classify unpermitted original words of the invoice.
Inventors
- Ofer Lavan
- Rotem Ben-Lulu
Assignees
- Ofer Lavan
- Rotem Ben-Lulu
Dates
- Publication Date: 2026-05-12
- Application Date: 2024-10-22
Claims (3)
- 1. A computer-implemented method for extracting information from invoices using third-party cloud AI service, the method comprising:
  (a) performing, by a computer system, Optical Character Recognition (OCR) on an invoice image to generate an OCR Results file containing original words from the invoice and their corresponding positional data;
  (b) generating, by the computer system, a JSON file based on the OCR Results file, by including original words that appear in a pre-prepared table of permitted words for disclosure, along with their corresponding positional data, and by indicating corresponding positional data of original words that do not appear in the pre-prepared table, while not including the unpermitted original words in the generated JSON file;
  (c) generating, by the computer system, an image based on the generated JSON file, by including the permitted original words and alternative words to the unpermitted original words, along with their corresponding positional data;
  (d) uploading, by the computer system, the generated image to a third-party cloud AI service for data extraction processing;
  (e) receiving, by the computer system, from the cloud AI service word classifications and their corresponding positional data;
  (f) using the received word classifications and their corresponding positional data to classify unpermitted original words of the invoice, while protecting the unpermitted words from exposure to third-party AI service systems;
  wherein the alternative words maintain length and context to said unpermitted original words, thereby preserving a document structure and ensuring uninterrupted AI processing.
- 2. A computer-implemented method for extracting information from invoices using third-party cloud AI service, the method comprising:
  (a) performing, by a computer system, Optical Character Recognition (OCR) on an invoice image to generate an OCR Results file containing original words from the invoice and their corresponding positional data;
  (b) generating, by the computer system, a JSON file based on the OCR Results file, by including original words that appear in a pre-prepared table of permitted words for disclosure, along with their corresponding positional data, and by including alternative words to original words that do not appear in the pre-prepared table and that are not included in the generated JSON file, along their corresponding positional data;
  (c) uploading, by the computer system, the generated JSON file to a third-party cloud AI service for data extraction processing;
  (d) receiving, by the computer system, from the cloud AI service word classifications and their corresponding positional data;
  (e) using the received word classifications and their corresponding positional data to classify unpermitted original words of the invoice, while protecting the unpermitted words from exposure to third-party AI service systems;
  wherein the alternative words maintain length and context to said unpermitted original words, thereby preserving a document structure and ensuring uninterrupted AI processing.
- 3. A computer system for securely extracting information from invoices using cloud AI service, comprising:
  (a) a local processing unit configured to perform Optical Character Recognition (OCR) on an invoice image to generate an OCR results file containing original words from the invoice and corresponding positional data;
  (b) a JSON generator module, operably connected to the local processing unit, configured to include original words that appear in a pre-prepared table of permitted words for disclosure, along with their corresponding positional data, and to indicate corresponding positional data of original words that do not appear in the pre-prepared table, while not including the unpermitted original words in the generated JSON file;
  (c) an image generation module, configured to create an image based on the generated JSON file, wherein the image retains the layout and appearance of the original invoice with unpermitted words replaced by alternative words, along their corresponding positional data;
  (d) a communication interface, configured to upload the image to a third-party cloud AI service for data extraction processing and for receiving word classifications and corresponding positional data;
  wherein the received word classifications and their corresponding positional data are intended to be used for classifying unpermitted original words of the invoice, while protecting the unpermitted words from exposure to third-party AI service systems;
  wherein the image generation module replaces unpermitted words with substitute words of identical character length to preserve the document structure during AI cloud processing.
Description
Technical Field
The present invention relates to the field of data extraction and processing, specifically to a method for extracting information from invoice documents using artificial intelligence (AI) while protecting sensitive company information from potential exposure during the process of uploading invoice data to third-party AI cloud systems or third-party document processing platforms.
Background Art
In many industries, companies rely on AI to extract relevant information from invoices for automation and data processing purposes. Typically, this process involves uploading an image of an invoice to a third-party AI cloud service or document processing platform, which extracts data from the image of the original document. However, uploading sensitive company information to external AI or platform services poses a significant privacy risk. Sensitive data, such as supplier names, invoice numbers, and payment amounts, may be exposed during transmission and processing by third-party providers. This creates a need for a method that allows companies to utilize AI-driven or platform-based data extraction while ensuring that critical information remains protected.
SUMMARY OF THE INVENTION
The present invention provides a computer-implemented method for securely extracting information from invoices using an AI cloud service, while preventing leakage of company information. The invention involves processing the invoice locally to censor sensitive information before uploading it to a third-party AI cloud service. The AI processes the censored data, and the original sensitive information is reintegrated after processing, ensuring both secure handling of the data and reliable extraction of information. The invention applies primarily to invoices but may also extend to other types of documents.
The term “invoice” in this disclosure and the claims refers to documents in general, and the term “AI” refers to AI platforms as well as other types of document processing platforms. The term “words” in this disclosure and the claims refers to words and numbers.
DESCRIPTION OF THE DRAWINGS
The attached drawings are not intended to limit the scope or application of the invention but merely to illustrate one possible implementation.
FIG. 1 is a schematic depiction of the system (10).
FIG. 2 is a flowchart illustrating the process (method).
DETAILED DESCRIPTION OF THE INVENTION
The invention comprises the following steps:
OCR Processing of the Original Invoice: A local OCR process is applied to an image of the original invoice. This process generates an OCR results file (in any format that can hold text and positional data, such as JSON, XML, or an equivalent file format), which contains the words and numbers from the invoice along with the positional data (coordinates) of each word and number. This file forms a map of the textual content of the invoice and its corresponding locations on the document.
Censoring Sensitive Data: The system identifies in the OCR Results file the unpermitted words that need to be protected from exposure. It does so by looking up each word in a predefined table of permitted words; every word that does not appear in the table is unpermitted and must be replaced with a substitute or alternative word. The substitute words may be taken from a predefined table or from dictionaries. They are preferably chosen to match the length and context of the original unpermitted words, to prevent disruption of the document layout and to ensure effective AI processing. For example, a valid substitute for a street name may be a different but meaningful street name.
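As an illustration only, the censoring step described above might be sketched in Python as follows. The OCR entry layout (`text`, `bbox`), the `PERMITTED` table, and the function names `substitute` and `censor` are hypothetical and do not come from this disclosure. The character-by-character random substitute is a simplified stand-in that preserves only length and character class; the disclosure prefers substitutes drawn from a predefined table or dictionary that also match context.

```python
import json
import random
import string

# Hypothetical table of words permitted for disclosure (assumption, lowercase).
PERMITTED = {"invoice", "date", "total", "street"}


def substitute(word, rng):
    """Return a stand-in with the same length and character classes as `word`."""
    out = []
    for ch in word:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            repl = rng.choice(string.ascii_lowercase)
            out.append(repl.upper() if ch.isupper() else repl)
        else:
            out.append(ch)  # keep punctuation and separators in place
    return "".join(out)


def censor(ocr_words, permitted, seed=None):
    """Split OCR output into a disclosable list and locally kept maps."""
    rng = random.Random(seed)
    disclosed = []    # entries that are safe to serialize
    originals = {}    # bbox -> original unpermitted word (never uploaded)
    substitutes = {}  # bbox -> same-length stand-in, used when rendering
    for entry in ocr_words:
        word, bbox = entry["text"], tuple(entry["bbox"])
        if word.lower() in permitted:
            disclosed.append({"text": word, "bbox": list(bbox)})
        else:
            # Only the position is recorded; the word itself is left out.
            disclosed.append({"text": None, "bbox": list(bbox)})
            originals[bbox] = word
            substitutes[bbox] = substitute(word, rng)
    return disclosed, originals, substitutes


ocr_words = [
    {"text": "Invoice", "bbox": [10, 10, 80, 30]},
    {"text": "Acme-42", "bbox": [90, 10, 170, 30]},
]
disclosed, originals, substitutes = censor(ocr_words, PERMITTED, seed=0)
print(json.dumps(disclosed, indent=2))
```

In this sketch only `disclosed` and, at rendering time, the stand-ins in `substitutes` would leave the local machine. The `originals` map stays local, keyed by position, so that classifications returned for a given position can later be mapped back to the original word.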
This step results in a JSON file in which the unpermitted words are omitted but their existence and positional data are indicated. The term “JSON file” in this disclosure and in the claims refers to any interchange file format such as JSON or XML.
Generation of the Image: An image is created based on the JSON file. This image mirrors the layout and appearance of the original invoice, but with unpermitted words replaced by substitute words. The image is now ready for upload to a third-party AI cloud service without exposing the unpermitted original words that contain sensitive information.
Uploading the Image to the AI Cloud Service: The image is uploaded to the third-party AI cloud service for data extraction. The AI cloud service performs its standard data extraction process and returns a structured output, such as a JSON file, containing word classifications, positional data, and contextual labels. This file is referred to as the extracted modified JSON file.
Reintegration of Original Sensitive Data: Using the extracted JSON file, the system matches the positional data and word classifications with the original unpermitted words. The system replaces the substitute words in the extracted JSON file with the corresponding original unpermitted words, thereby generatin