EP-4375955-B1 - METHOD AND SYSTEM FOR EVALUATING QUALITY OF A DOCUMENT

Inventors

DAS, Tarun Kumar
MALLICK, Triptesh
SINGH, MADHUSUDAN
KUMAR, Pragyesh
BALARAMAN, MRIDUL

Dates

Publication Date: 20260513
Application Date: 20230313

Claims (11)

A method (600) of determining quality of a document image, the method comprising: segmenting (604), by a computing device, the document image into a plurality of regions, wherein each of the plurality of regions comprises text data; classifying (606), by the computing device, each of the plurality of regions into one of a plurality of image quality classes, wherein the quality of the document image corresponds to the accuracy of optical character recognition, OCR, processes applied to the document image; wherein each of the plurality of regions is classified based on a determination of a highest confidence score from one of a plurality of machine learning models, wherein each of the plurality of machine learning models is trained corresponding to one of the plurality of image quality classes, wherein the training of each of the plurality of machine learning models corresponding to one of the plurality of image quality classes comprises determining a training dataset for each of the plurality of image quality classes, and wherein the determination of the training dataset for each of the plurality of image quality classes comprises: segmenting (702), by a computing device, a training image into a plurality of regions, wherein each of the plurality of regions comprises text data; for each of the plurality of regions, performing (704), by the computing device, OCR using two or more OCR systems to determine corresponding two or more OCR text data; determining (706), by the computing device, text matching scores based on a comparison among the two or more OCR text data using a plurality of string matching techniques; determining (708), by the computing device, a plurality of threshold values for the plurality of image quality classes based on a statistical analysis of the text matching scores based on the plurality of string matching techniques and for the plurality of regions; and clustering (710), by the computing device, the plurality of regions into one of the plurality of image quality classes based on the plurality of threshold values; computing (612), by the computing device, a cumulative quality score for the document image based on a weighted average of a number of regions classified into each of the plurality of image quality classes; and determining (614), by the computing device, the quality of the document image based on the cumulative quality score.
The method as claimed in claim 1, wherein the determination of the plurality of threshold values comprises: computing, for each of the plurality of regions, a minimum text matching score, a maximum text matching score, and an average text matching score based on the text matching scores for the plurality of string matching techniques; and determining the plurality of threshold values based on a statistical calculation of the minimum, the maximum, and the average text matching scores for the plurality of regions.
The method as claimed in claim 1, wherein the plurality of threshold values comprises a lower threshold value and an upper threshold value, and wherein the plurality of image quality classes comprises a bad image quality class for document images for the average text matching score less than or equal to the lower threshold value, a good image quality class for the average text matching score greater than or equal to the upper threshold value, and a medium image quality class for the average text matching score greater than the lower threshold value and less than the upper threshold value.
The method as claimed in claim 1, wherein the plurality of regions comprises at least one of a region with word level text data, a region with sentence level text data, a region with paragraph level text data, and a region with page level text data.
The method as claimed in claim 1, wherein the computing device is configured to determine quality of a document comprising a plurality of document images by: computing a cumulative quality score for each of the plurality of document images based on the weighted average of the number of regions classified into each of the plurality of image quality classes; and determining the quality of the document based on the cumulative quality score for each of the plurality of document images.
The method as claimed in claim 1, wherein the computing device is configured to re-train the plurality of machine learning models based on a variance in a confidence score from each of the plurality of machine learning models.
A system (100) for determining quality of a document image, comprising: one or more processors (108); a memory (110) communicatively coupled to the processors, wherein the memory stores a plurality of processor-executable instructions, which, upon execution, cause the processors to: segment the document image into a plurality of regions, wherein each of the plurality of regions comprises text data; classify each of the plurality of regions into one of a plurality of image quality classes, wherein the quality of the document image corresponds to the accuracy of optical character recognition, OCR, processes applied to the document image; wherein each of the plurality of regions is classified based on a determination of a highest confidence score from one of a plurality of machine learning models, wherein each of the plurality of machine learning models is trained corresponding to one of the plurality of image quality classes, wherein the training of each of the plurality of machine learning models corresponding to one of the plurality of image quality classes comprises determining a training dataset for each of the plurality of image quality classes, and wherein the one or more processors are further configured to determine the training dataset for each of the plurality of image quality classes by: segmenting, by a computing device, a training image into a plurality of regions, wherein each of the plurality of regions comprises text data; for each of the plurality of regions, performing OCR using two or more OCR systems to determine corresponding two or more OCR text data; determining text matching scores based on a comparison among the two or more OCR text data using a plurality of string matching techniques; determining a plurality of threshold values for the plurality of image quality classes based on a statistical analysis of the text matching scores for the plurality of string matching techniques and for the plurality of regions; and clustering the plurality of regions into one of the plurality of image quality classes based on the plurality of threshold values; compute a cumulative quality score for the document image based on a weighted average of a number of regions classified into each of the plurality of image quality classes; and determine the quality of the document image based on the cumulative quality score.
The system as claimed in claim 7, wherein the determination of the plurality of threshold values comprises: computing, for each of the plurality of regions, a minimum text matching score, a maximum text matching score, and an average text matching score based on the text matching scores for the plurality of string matching techniques; and determining the plurality of threshold values based on a statistical calculation of the minimum, the maximum, and the average text matching scores for the plurality of regions.
The system as claimed in claim 7, wherein the plurality of threshold values comprises a lower threshold value and an upper threshold value, and wherein the plurality of image quality classes comprises a bad image quality class for document images for the average text matching score less than or equal to the lower threshold value, a good image quality class for the average text matching score greater than or equal to the upper threshold value, and a medium image quality class for the average text matching score greater than the lower threshold value and less than the upper threshold value.
The system as claimed in claim 7, wherein the plurality of regions comprises at least one of a region with word level text data, a region with sentence level text data, a region with paragraph level text data, and a region with page level text data.
The system as claimed in claim 7, wherein the one or more processors are further configured to determine quality of a document comprising a plurality of document images by: computing a cumulative quality score for each of the plurality of document images based on the weighted average of the number of regions classified into each of the plurality of image quality classes; and determining the quality of the document based on the cumulative quality score for each of the plurality of document images.
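
The following is an illustrative sketch only and forms no part of the claims. It shows, under stated assumptions, how the claimed steps might fit together: comparing the outputs of two or more OCR systems with several string matching techniques, taking the minimum, maximum and average score per region, labeling a region against a lower and an upper threshold, picking the class whose model reports the highest confidence, and computing a cumulative score as a weighted average over region counts. The two matchers (`char_ratio`, `token_jaccard`), the class weights in `WEIGHTS`, and all function names are hypothetical choices made for illustration. The claims do not specify the statistical rule that yields the thresholds, so the thresholds are taken here as plain inputs.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean


def char_ratio(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (assumed matcher, from difflib)."""
    return SequenceMatcher(None, a, b).ratio()


def token_jaccard(a: str, b: str) -> float:
    """Word-set overlap in [0, 1] (assumed matcher); two empty texts count as equal."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


# Hypothetical stand-ins for "a plurality of string matching techniques".
MATCHERS = (char_ratio, token_jaccard)

# Hypothetical class weights for the cumulative quality score.
WEIGHTS = {"bad": 0.0, "medium": 0.5, "good": 1.0}


def region_scores(ocr_texts):
    """Return (min, max, average) matching score for one region.

    Every pair of OCR outputs is compared with every matcher.
    """
    if len(ocr_texts) < 2:
        raise ValueError("need the output of at least two OCR systems")
    scores = [m(a, b) for a, b in combinations(ocr_texts, 2) for m in MATCHERS]
    return min(scores), max(scores), mean(scores)


def label_region(avg_score, lower, upper):
    """Assign a training label from the average score and the two thresholds."""
    if avg_score <= lower:
        return "bad"
    if avg_score >= upper:
        return "good"
    return "medium"


def classify_region(confidences):
    """Pick the class whose per-class model reports the highest confidence."""
    return max(confidences, key=confidences.get)


def cumulative_quality_score(labels, weights=WEIGHTS):
    """Weighted average over the number of regions assigned to each class."""
    if not labels:
        raise ValueError("no regions to score")
    counts = Counter(labels)
    return sum(weights[c] * n for c, n in counts.items()) / len(labels)
```

With the assumed weights, a page whose regions are classified as two good, one medium and one bad would score (1 + 1 + 0.5 + 0) / 4 = 0.625. How that score maps to a final quality verdict is left open here, as it is in the claims.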

Description

TECHNICAL FIELD

This disclosure relates generally to image processing, and more particularly to a system and a method for determining quality of a document.

BACKGROUND

There is a constant requirement for performing Optical Character Recognition (OCR) in order to extract data from documents for various purposes. However, the correctness of data extracted from documents using OCR techniques depends on the quality of the documents. OCR systems tend to extract erroneous data from poor quality documents, for example documents with poor resolution or noise. Data extracted from such poor quality documents is not consistent and varies from one OCR algorithm to another. OCR systems cannot by themselves determine the quality of a document, which limits their capability and accuracy.

Document Li Hongyu et al. (Towards document image quality assessment: A text line framework and a synthetic text line Image dataset) relates to a robust document image quality assessment (DIQA) framework that evaluates image quality using detected text lines and a synthetic dataset to train a CNN model, improving resilience to background noise.

Document US 2021/319247 A1 relates to text classification, in particular to a text classifying apparatus, an optical character recognition apparatus, a text classifying method, and a program.

Document Le Kang et al. (A deep learning approach to document image quality assessment) relates to a deep learning approach for document image quality assessment.

Therefore, there is a requirement to determine the quality of documents before performing OCR in order to extract correct data from documents irrespective of their quality.

SUMMARY OF THE INVENTION

The aim of the present invention is to provide a method and a system that overcome the issues mentioned above. According to the present invention, a method and a system are provided, as defined in the annexed claims.
Various objects, features, aspects and advantages of the inventive subject matter will become more apparent from the following detailed description of preferred embodiments, along with the accompanying drawing figures in which like numerals represent like components.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles.

FIG. 1 is a block diagram of a document quality metric (DQM) determination system, in accordance with an embodiment of the present disclosure.
FIG. 2 illustrates a functional block diagram of the DQM determination device 102, in accordance with an embodiment of the present disclosure.
FIG. 3A depicts a dataset 300A comprising a plurality of regions comprising text data, in accordance with an embodiment of the present disclosure.
FIG. 3B is a table 300B depicting exemplary outputs received from the data segregation module of FIG. 2, in accordance with an exemplary embodiment of the present disclosure.
FIG. 4 is a data segregation flow diagram, in accordance with an embodiment of the present disclosure.
FIGS. 5A, 5B and 5C depict an exemplary bad training dataset, medium training dataset and good training dataset, respectively, generated from the input dataset, in accordance with an embodiment of the present disclosure.
FIG. 6 is a flowchart depicting a methodology of determining quality of a document image, in accordance with an embodiment of the present disclosure.
FIG. 7 is a flowchart of a method of determining training datasets, in accordance with an embodiment of the present disclosure.

DETAILED DESCRIPTION OF THE DRAWINGS

Exemplary embodiments are described with reference to the accompanying drawings. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts.
The accuracy of text extraction depends on the quality of the document image. Different optical character recognition (OCR) systems may give different results without providing any information about the accuracy of the extracted text. Therefore, determining document quality before performing OCR would allow for an accurate extraction of textual data from a document. The present disclosure provides methods and systems for determining a document quality metric of a document comprising one or more document images.

FIG. 1 is a block diagram of a document quality metric (DQM) determination system 100 for determining a document quality metric of a document comprising one or more document images, in accordance with an embodiment of the present disclosure. The DQM determination system 100 may include a Document Quality Metric (DQM) determination device 102 comprising one or more processors 108, a memory 110 and an input/output device 106. The DQM determination device 102 may be communicably connected to a database 104 and an external device 118 through a network 112. In an embodiment, the database 104 may be implemented as a cloud database or a physical database comprising training data such as document images of varying