EP-4742196-A1 - SYSTEM AND METHOD FOR EXTRACTING KEY-VALUE INFORMATION USING SYNTHETIC TRAINING DATA
Abstract
An end-to-end system and methodology are employed to obtain key-value pairs from original documents and images. The system and methodology of the present invention may generate synthetic training data from a single source document image, or from a relatively small number of source document images, and use the aforesaid synthetic training data to train a model capable of extracting key-value pairs from documents even when the source documents do not contain a machine-readable source of ground truth data and/or when only a limited number of source documents is available for model training.
Inventors
- XIAO, FENG
- ANTO, James
- NAGABANDI, Badrinath
- ABREU, PABLO YSRRAEL
Assignees
- Socure, Inc.
Dates
- Publication Date: 2026-05-13
- Application Date: 2025-06-06
Claims (15)
- A system configured to extract key-value information appearing on a subject document, the system comprising: one or more processors configured to execute computer program modules comprising a first model and a physical storage capability; a training computer program module operative to receive at least one source of training document data and process said training document data to generate synthetic key-value training data, said synthetic key-value training data being implemented as a first model; a data extraction computer program module operative to extract said key-value information from said subject document through the use of said first model; wherein said processing of said training document data to generate synthetic key-value training data comprises the generation of at least one variation with respect to said at least one source of training document data.
- The system of claim 1 wherein said at least one source of training document data comprises data associated with at least one training document.
- The system of claim 2 wherein said generation of at least one variation with respect to said at least one source of training document data comprises the application of a random global offset to at least one field in said at least one training document.
- The system of claim 2 or claim 3 wherein said generation of at least one variation with respect to said at least one source of training document data comprises the application of a random rotation to at least one field in said at least one training document.
- The system of claim 2, claim 3 or claim 4 wherein said generation of at least one variation with respect to said at least one source of training document data comprises the application of a random aspect ratio skew to at least one field in said at least one training document.
- The system of any preceding claim further comprising a labeling computer program module operative to generate category and key-value information for a specific document template associated with said at least one source of training document data.
- The system of any preceding claim wherein said at least one source of training document data does not include ground truth information.
- The system of any preceding claim further comprising a second model, said second model implementing an optical character recognition functionality, and optionally wherein the output of said second model is provided to said data extraction computer program module as input to extract said key-value information from said subject document.
- The system of any preceding claim wherein said subject document comprises one of the following: driver's license, passport, social security card, or voter identification card.
- The system of any preceding claim wherein said first model is refined during production operation using data obtained in connection with said production operation.
- A computer-implemented method of extracting key-value information appearing on a subject document, the method being implemented in a computer system comprising one or more processors configured to execute computer program modules, the method comprising the steps of: receiving at least one source of training document data and processing said training document data to generate synthetic key-value training data, said synthetic key-value training data being implemented as a first model; extracting said key-value information from said subject document through the use of said first model; wherein said processing of said training document data to generate synthetic key-value training data comprises the generation of at least one variation with respect to said at least one source of training document data.
- The method of claim 11 wherein said at least one source of training document data comprises data associated with at least one training document, and optionally wherein said generation of at least one variation with respect to said at least one source of training document data comprises: the application of a random global offset to at least one field in said at least one training document, and/or the application of a random rotation to at least one field in said at least one training document and/or the application of a random aspect ratio skew to at least one field in said at least one training document.
- The method of claim 11 or claim 12 further comprising: the step of generating category and key-value information for a specific document template associated with said at least one source of training document data, and/or the step of implementing an optical character recognition functionality to train a second model, and optionally wherein the output of said second model is provided to said first model as input to extract said key-value information from said subject document.
- The method of any of claims 11 to 13 wherein said at least one source of training document data does not include ground truth information.
- The method of any of claims 11 to 13 wherein said subject document comprises one of the following: driver's license, passport, social security card, or voter identification card, and/or wherein said first model is refined during production operation using data obtained in connection with said production operation.
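The field-level variations recited in claims 3 through 5 and claim 12 (random global offset, random rotation, and random aspect-ratio skew) can be sketched as below. This is a minimal illustration only; the parameter ranges, function names, and corner-point representation are assumptions of this sketch and are not prescribed by the specification.

```python
import math
import random

def augment_field(box, rng):
    """Apply a random global offset, rotation, and aspect-ratio skew
    to one field's corner points (illustrative sketch; ranges are
    hypothetical)."""
    dx = rng.uniform(-5.0, 5.0)               # random global offset (claim 3)
    dy = rng.uniform(-5.0, 5.0)
    theta = math.radians(rng.uniform(-3, 3))  # random rotation (claim 4)
    sx = rng.uniform(0.95, 1.05)              # random aspect-ratio skew (claim 5)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = []
    for x, y in box:
        x = x * sx                            # skew width relative to height
        xr = x * cos_t - y * sin_t            # rotate about the origin
        yr = x * sin_t + y * cos_t
        out.append((xr + dx, yr + dy))        # then translate globally
    return out

# One template field (a rectangle) yields many synthetic variants.
rng = random.Random(0)
field = [(0, 0), (100, 0), (100, 20), (0, 20)]
variants = [augment_field(field, rng) for _ in range(3)]
```

Each variant preserves the field's label (its key and ground-truth value from the template), so a single source document can yield many labeled training examples.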
Description
FIELD OF THE DISCLOSURE

Disclosed embodiments relate to the extraction of data contained in documents for further processing, and more specifically, to the use of machine learning systems to extract key and value pair information contained within documents when limited or no ground truth data is available and/or when limited source documents are available.

BACKGROUND

Optical character recognition (OCR) functionality has been widely available for some time. These systems and methodologies take a document as input and produce the contextual data contained within that document as output. For example, an OCR system may scan a physical document, creating a temporary (or stored) file representing an electronic image of that document. In one case, this image file might be a PDF (Adobe Acrobat) file representing an image of the scanned document. The output of the OCR system is then processed by a second-stage module, such as a classification model, in an attempt to generate usable data, which may consist of individual data values or key and value pairs. In some cases, more complex relationships between extracted data elements are also possible. By way of example, a physical driver's license might be the document scanned, with the goal of extracting information from the license with no manual human intervention. It may be desirable, for example, to scan a driver's license and extract first name, last name, date of birth, driver's license number, expiration date and/or any other data contained within the license. The data can then be used by other systems, processes, programs, etc. where the contextual data is required, rather than the data being represented in image form, where it would not be usable in such downstream systems, processes, programs, etc. It may also be desirable to match these extracted values with a key which describes the nature of the data (a so-called key-value pair).
For instance, in order to allow for further processing of the extracted data, it may be helpful to match the actual last name (e.g., "SMITH") with the descriptor for that value (e.g., "Last Name"). Numerous drawbacks exist in connection with obtaining such key-value pairs, and with extracting data from images or documents generally, when this is accomplished using existing OCR-based systems. Typical solutions implement a two-stage process for capturing and generating key-value pairs from documents and images. First, the image or document is scanned and character recognition is performed by the OCR system. In the second stage, the process attempts, via a classification algorithm, to match the recognized characters with specific fields to form the key-value pairs. One problem that can occur is that if the character recognition stage fails, its errors are propagated to the second stage, such that the data classification occurring during the second stage cannot succeed given the bad input received. Another drawback associated with two-stage solutions for generating key-value pairs is the requirement that intermediate results be generated by the first stage and then operated upon by the second-stage processing. The requirement for intermediate results demands additional processing and file storage, and can thus burden the computing platform and slow down processing, making some applications that require real-time results under heavy processing loads impossible or very difficult to implement. Yet another drawback is that the models in existing two-stage systems are trained independently and, as a result, certain important context information may be ignored because each model is unaware of the context associated with the other model.
As a result, each of the models will not perform as well as desired because, for example, words or other constructs may not be readily identifiable without the context associated with those constructs, which is known to the other model with respect to a particular application. For example, in an image of an ID document, there may be a smudge on the word "name", which is the key of a key-value pair. In existing solutions, where the two models are trained independently, the key "name" may be lost because the smudge obscures it on the physical ID document and neither independently trained model has the context needed to recover it. In a co-pending application (US Serial No. 18/426,991, entitled "Generative AI System and Method for Key and Value Pair Information Extraction from Documents") assigned to the assignee of the present invention, a system and methodology addressing the foregoing drawbacks is described. The system and methodology described in the co-pending application uses machine learning and trained models to obtain key-value pairs from original documents. As described in the co-pending application, the original documents used to train the models typically contain ground truth data in the form of machine-readable data a
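The error-propagation problem of the conventional two-stage pipeline described in the Background can be illustrated with the following toy sketch. All function names and data here are hypothetical; the patent does not prescribe this structure.

```python
def stage1_ocr(image_tokens):
    # Stage 1 (hypothetical): character recognition. A smudged token is
    # misread, and the error is written into the intermediate result.
    return [t if t != "<smudge>" else "???" for t in image_tokens]

def stage2_classify(tokens):
    # Stage 2 (hypothetical): pair recognized tokens into key-value pairs.
    # It has no access to the image context, so an unreadable token
    # cannot be recovered as a meaningful key.
    pairs = {}
    it = iter(tokens)
    for key, value in zip(it, it):
        pairs[key] = value
    return pairs

# A scan where one key (e.g., "dob") is smudged on the physical document.
scan = ["name", "SMITH", "<smudge>", "1990-01-01"]
pairs = stage2_classify(stage1_ocr(scan))
```

Because the stages are trained and run independently, the misread key survives as "???" in the final output; neither stage has the cross-stage context that would allow it to be repaired.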