EP-4736140-A1 - METHODS AND APPARATUS FOR EXTRACTING DATA FROM A DOCUMENT BY ENCODING IT WITH TEXTUAL AND VISUAL FEATURES AND USING MACHINE LEARNING

EP 4736140 A1

Abstract

An apparatus including a processor caused to receive document images, each including representations of characters. The processor is caused to parse each document image to extract, based on structure type, subsets of characters, to generate a text encoding for that document image. For each document, the processor is caused to extract visual features to generate a visual encoding for that document image, each visual feature associated with a subset of characters. The processor is caused to generate parsed documents, each parsed document uniquely associated with a document image and based on the text and visual encoding for that document image. For each parsed document, the processor is caused to identify sections uniquely associated with section type. The processor is caused to train machine learning models, each machine learning model associated with one section type and trained using a portion of each parsed document associated with that section type.

Inventors

  • XYLOURIS, TRIANTAFYLLOS

Assignees

  • Greenhouse Software, Inc.

Dates

Publication Date
2026-05-06
Application Date
2024-06-28

Claims (20)

  1. A non-transitory, processor-readable medium storing instructions that, when executed by a processor, cause the processor to: receive a plurality of document images each including a plurality of representations of characters; parse each document image from the plurality of document images to extract a plurality of subsets of characters from the plurality of representations of characters to generate a text encoding for that document image, each subset of characters being associated with a structure type from a plurality of structure types; for each document image from the plurality of document images, extract a plurality of visual features to generate a visual encoding for that document image, each visual feature from the plurality of visual features associated with at least one subset of characters from the plurality of subsets of characters; generate a plurality of parsed documents, each parsed document from the plurality of parsed documents uniquely associated with a document image from the plurality of document images and being based on the text encoding and the visual encoding for that document image; for each parsed document from the plurality of parsed documents, identify a plurality of sections, each section from the plurality of sections uniquely associated with a section type from a plurality of section types; and train a plurality of machine learning models to produce a plurality of trained machine learning models, each machine learning model from the plurality of machine learning models associated with one section type from the plurality of section types and trained using a portion of each parsed document that is from the plurality of parsed documents and that is associated with that section type.
  2. The non-transitory, processor-readable medium of claim 1, wherein the instructions to cause the processor to parse each document from the plurality of documents further comprise instructions to cause the processor to: define a bounding box associated with at least one of a word, phrase, date, time, or sentence, a subset of characters from the plurality of subsets of characters being associated with the bounding box.
  3. The non-transitory, processor-readable medium of claim 1, wherein the instructions to cause the processor to parse each document from the plurality of documents further comprise instructions to cause the processor to: define a bounding box associated with at least one of a word, phrase, date, time, or sentence, a subset of characters from the plurality of subsets of characters being associated with the bounding box, and a visual feature from the plurality of visual features being associated with the bounding box.
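Claims 2 and 3 tie each subset of characters, and optionally a visual feature, to a bounding box. A minimal sketch of such an encoding is below; the `BoundingBox`/`TextSpan` names and the shape of the OCR rows are hypothetical illustrations, not structures named by the claims.

```python
from dataclasses import dataclass, field

@dataclass
class BoundingBox:
    # Pixel geometry of a word, phrase, date, time, or sentence.
    left: int
    top: int
    width: int
    height: int

@dataclass
class TextSpan:
    # A subset of characters tied to one bounding box, plus the visual
    # features (bolding, font size, ...) observed inside that box.
    text: str
    box: BoundingBox
    structure_type: str  # e.g. "word", "date", "sentence"
    visual_features: dict = field(default_factory=dict)

def encode_spans(ocr_rows):
    """Turn raw OCR rows (text + geometry + style) into TextSpans."""
    spans = []
    for row in ocr_rows:
        box = BoundingBox(row["left"], row["top"], row["width"], row["height"])
        spans.append(TextSpan(
            text=row["text"],
            box=box,
            structure_type=row.get("structure_type", "word"),
            visual_features={k: row[k] for k in ("bold", "font_size") if k in row},
        ))
    return spans

rows = [{"text": "Experience", "left": 40, "top": 120, "width": 180,
         "height": 24, "bold": True, "font_size": 14,
         "structure_type": "word"}]
spans = encode_spans(rows)
print(spans[0].visual_features)  # {'bold': True, 'font_size': 14}
```

The text fields of the spans form the text encoding, and the per-box style dictionaries form the visual encoding associated with it.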
  4. The non-transitory, processor-readable medium of claim 1, wherein each trained machine learning model from the plurality of trained machine learning models is configured to extract at least a feature from the section type that is uniquely associated with that trained machine learning model.
  5. The non-transitory, processor-readable medium of claim 1, wherein each trained machine learning model from the plurality of trained machine learning models is configured to: extract at least a feature from the section type that is uniquely associated with that trained machine learning model; and populate a structured data file with the feature extracted by that trained machine learning model.
  6. The non-transitory, processor-readable medium of claim 1, the instructions further comprising instructions configured to cause the processor to: receive a first document image that is not from the plurality of document images; identify a first section in the first document image that is uniquely associated with a section type from the plurality of section types; and apply a first trained machine learning model from the plurality of trained machine learning models to the first section in the first document image to produce at least a portion of a structured data file that identifies a feature of the first section in the first document.
  7. The non-transitory, processor-readable medium of claim 1, wherein the instructions to cause the processor to train the plurality of machine learning models further include instructions to train at least one machine learning model from the plurality of machine learning models to: extract a feature from the section type that is uniquely associated with that trained machine learning model; and generate a confidence score associated with the feature extracted from the section type.
  8. The non-transitory, processor-readable medium of claim 1, the instructions further comprising instructions configured to cause the processor to: receive a first document image that is not from the plurality of document images; identify a first section in the first document image that is uniquely associated with a first section type from the plurality of section types; identify a second section in the first document that is uniquely associated with a second section type from the plurality of section types; apply a first trained machine learning model from the plurality of trained machine learning models to the first section in the first document image to produce at least a first portion of a structured data file that identifies a feature of the first section in the first document; and apply a second trained machine learning model from the plurality of trained machine learning models to the second section in the first document image to produce at least a second portion of the structured data file that identifies a feature of the second section in the first document.
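Claims 6 and 8 describe routing each identified section of a new document to the trained model for that section type. A minimal dispatch sketch follows; the two stand-in "models" are ordinary functions with hypothetical logic, used only to show the per-section-type routing.

```python
# Each section type has its own trained model; a new document is routed
# section by section. These callables stand in for real trained models.
def contact_model(section_text):
    return {"email": next((w for w in section_text.split() if "@" in w), None)}

def skills_model(section_text):
    return {"skills": [s.strip() for s in section_text.split(",")]}

MODELS_BY_SECTION_TYPE = {
    "contact": contact_model,
    "skills": skills_model,
}

def extract(parsed_sections):
    """parsed_sections: list of (section_type, section_text) pairs.
    Applies the model uniquely associated with each section type and
    merges the results into one structured record."""
    structured = {}
    for section_type, text in parsed_sections:
        model = MODELS_BY_SECTION_TYPE.get(section_type)
        if model is not None:
            structured.update(model(text))
    return structured

doc = [("contact", "Jane Doe jane@example.com"),
       ("skills", "Python, SQL, OCR")]
print(extract(doc))
# {'email': 'jane@example.com', 'skills': ['Python', 'SQL', 'OCR']}
```

Each portion of the structured output is produced by the model for exactly one section type, matching the first-model/second-model split in claim 8.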
  9. A non-transitory, processor-readable medium storing instructions that, when executed by a processor, cause the processor to: receive a document image that includes a plurality of representations of characters; parse the document image to extract a plurality of subsets of characters from the plurality of representations of characters to generate a text encoding for the document image, each subset of characters being associated with a structure type from a plurality of predefined structure types; extract a plurality of visual features based on the text encoding to generate a visual encoding for the document image, each visual feature from the plurality of visual features associated with at least one subset of characters from the plurality of subsets of characters; generate a parsed document based on the text encoding and the visual encoding; identify a plurality of sections based on the parsed document, each section from the plurality of sections uniquely associated with a section type from a plurality of predefined section types; execute, for each section from the plurality of sections, a machine learning model from a plurality of machine learning models that is uniquely associated with the section type of that section to extract at least a feature from a plurality of features from the parsed document, each feature from the plurality of features associated with a predefined feature type from a plurality of predefined feature types; and generate a structured data file based on the at least a feature from the plurality of features extracted from the parsed document.
  10. The non-transitory, processor-readable medium of claim 9, the instructions further comprising instructions to cause the processor to: generate a score for each feature from the plurality of features from the structured data; and remove a feature from the plurality of features based on the score for that feature being below a predetermined threshold.
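Claims 7 and 10 combine to score each extracted feature and drop those below a predetermined threshold. A sketch of that filtering step is below; the threshold value and the `(value, score)` pairing are illustrative assumptions, since the claims leave both unspecified.

```python
THRESHOLD = 0.5  # assumed value; the claim only requires a predetermined threshold

def filter_by_confidence(features, threshold=THRESHOLD):
    """Keep only features whose confidence score meets the threshold.
    features maps a feature name to a (value, confidence_score) pair."""
    return {name: (value, score)
            for name, (value, score) in features.items()
            if score >= threshold}

extracted = {
    "email": ("jane@example.com", 0.97),
    "phone": ("555-01", 0.31),  # low confidence: removed
}
kept = filter_by_confidence(extracted)
print(sorted(kept))  # ['email']
```

Filtering at this stage keeps low-confidence extractions out of the structured data file rather than forcing a user to correct them afterward, which is the user-intervention problem the background section identifies.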
  11. The non-transitory, processor-readable medium of claim 9, the instructions further comprising instructions to cause the processor to train the plurality of machine learning models.
  12. The non-transitory, processor-readable medium of claim 9, the instructions further comprising instructions to cause the processor to: receive a plurality of training document images; for each training document image from the plurality of training document images, identify at least one section associated with at least one section type from the plurality of predefined section types; and train a first machine learning model from the plurality of machine learning models that is associated with a first predefined section type from the plurality of predefined section types using data from the training document images that are associated with the first predefined section type.
  13. The non-transitory, processor-readable medium of claim 9, the instructions further comprising instructions to cause the processor to: receive a plurality of training document images; for each training document image from the plurality of training document images, identify at least one section associated with at least one section type from the plurality of predefined section types; for each training document image from the plurality of training document images having a first predefined section type from the plurality of predefined section types, extract, from the first section type, a text encoding and a visual feature associated with at least a portion of the text encoding; and train a first machine learning model from the plurality of machine learning models that is associated with the first predefined section type from the plurality of predefined section types with the text encoding and the visual feature.
  14. The non-transitory, processor-readable medium of claim 9, wherein the document image is a resume.
  15. The non-transitory, processor-readable medium of claim 9, wherein the structured data file is a JSON file.
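Claim 15 fixes the structured data file format as JSON. A minimal sketch of writing and reading back such a file is below; the section-to-feature layout of the record is an assumed example, as the claims do not prescribe a schema.

```python
import json
import tempfile, os

# Assumed (hypothetical) layout for a parsed resume; claim 15 only
# requires that the structured data file be JSON.
structured = {
    "contact": {"name": "Jane Doe", "email": "jane@example.com"},
    "experience": [{"title": "Engineer", "years": 3}],
}

path = os.path.join(tempfile.gettempdir(), "parsed_resume.json")
with open(path, "w") as f:
    json.dump(structured, f, indent=2)  # write the structured data file

with open(path) as f:
    restored = json.load(f)             # round-trip check
print(restored["contact"]["email"])     # jane@example.com
```

A round-trip read like this is a cheap sanity check that the extracted features survived serialization intact.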
  16. The non-transitory, processor-readable medium of claim 9, wherein the instructions to cause the processor to execute, for each section from the plurality of sections, a machine learning model from a plurality of machine learning models further include instructions to cause the processor to extract a first feature from the plurality of features from a first section type using a first machine learning model and extract a second feature from the plurality of features from a second section type using a second machine learning model.
  17. The non-transitory, processor-readable medium of claim 9, wherein the instructions to cause the processor to: execute, for each section from the plurality of sections, a machine learning model from a plurality of machine learning models further include instructions to cause the processor to extract a first feature from the plurality of features from a first section type using a first machine learning model and extract a second feature from the plurality of features from a second section type using a second machine learning model; and generate the structured data file further include instructions to cause the processor to include the first feature and the second feature in the structured data file.
  18. The non-transitory, processor-readable medium of claim 9, wherein the instructions to cause the processor to: execute, for each section from the plurality of sections, a machine learning model from a plurality of machine learning models further include instructions to cause the processor to extract a first feature from the plurality of features from a first section type using a first machine learning model and extract a second feature from the plurality of features from a second section type using a second machine learning model; and generate the structured data file further include instructions to cause the processor to include the first feature in the structured data file associated with a first section feature mapping and the second feature in the structured data file associated with a second section feature mapping.
  19. The non-transitory, processor-readable medium of claim 9, wherein the instructions to cause the processor to: execute, for each section from the plurality of sections, a machine learning model from a plurality of machine learning models further include instructions to cause the processor to extract a first feature from the plurality of features and a second feature from the plurality of features from a first section type using a first machine learning model; and generate the structured data file further include instructions to cause the processor to include the first feature and the second feature in the structured data file.
  20. An apparatus, comprising: a processor; and a memory operatively connected to the processor, the memory storing instructions to cause the processor to: receive a plurality of document images, each document image from the plurality of document images including a plurality of representations of characters; for each document image from the plurality of document images: parse that document image to extract a plurality of subsets of characters from the plurality of representations of characters, each subset of characters from the plurality of subsets of characters being associated with a structure type from a plurality of structure types, and to generate a text encoding for that document image based on the plurality of subsets of characters; for each document image from the plurality of document images, extract a plurality of visual features to generate a visual encoding for that document image, each visual feature from the plurality of visual features associated with at least one subset of characters from the plurality of subsets of characters; generate a plurality of parsed documents, each parsed document from the plurality of parsed documents uniquely associated with a document image from the plurality of document images and based on the text encoding and the visual encoding for that document image; for each parsed document from the plurality of parsed documents, identify a plurality of sections, each section from the plurality of sections uniquely associated with a section type from a plurality of section types; and train a plurality of machine learning models to produce a plurality of trained machine learning models, each machine learning model from the plurality of machine learning models associated with one section type from the plurality of section types and trained using a portion of each parsed document from the plurality of parsed documents, the portion of each parsed document associated with that section type.
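Claims 1, 12, 13, and 20 all train one model per section type using the portion of every parsed document associated with that type. A sketch of the grouping step that precedes such training is below; the `(section_type, portion)` pair representation is an assumed simplification of a parsed document.

```python
from collections import defaultdict

def group_training_data(parsed_documents):
    """Collect, per section type, the matching portion of every parsed
    document, so that one model can be trained per section type.
    Each parsed document is a list of (section_type, portion) pairs."""
    by_type = defaultdict(list)
    for doc in parsed_documents:
        for section_type, portion in doc:
            by_type[section_type].append(portion)
    return dict(by_type)

docs = [
    [("education", "BSc CS 2019"), ("skills", "Python, SQL")],
    [("education", "MSc EE 2021")],
]
groups = group_training_data(docs)
print(len(groups["education"]))  # 2
```

The per-type groups would then each feed the training of the model uniquely associated with that section type; the training procedure itself is left unspecified by the claims.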

Description

METHODS AND APPARATUS FOR EXTRACTING DATA FROM A DOCUMENT BY ENCODING IT WITH TEXTUAL AND VISUAL FEATURES AND USING MACHINE LEARNING

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] This application claims priority to provisional U.S. Patent Application No. 63/511,553, filed June 30, 2023, the entire contents of which are incorporated herein by reference.

FIELD

[0002] The present disclosure generally relates to the field of machine learning. In particular, the present disclosure is related to methods and apparatus for encoding a document image with textual and visual features and using machine learning to extract relevant data.

BACKGROUND

[0003] Optical character recognition (OCR) is a tool that can convert text-containing documents in various formats such as, for example, scanned copies, images, PDF files, etc., into a computer-readable, editable, and/or searchable format. Staffing and recruiting firms can use OCR to receive and analyze hundreds of resume documents of job candidates. In some implementations, resume documents can follow a similar structure or format such that computers can learn to identify and extract specific information across a resume document.

[0004] Some technologies fail to extract information accurately or to consider visual features, such as bolding, font size, color, etc., and therefore rely heavily on user intervention to amend incorrect resume document conversion. Certain phrases or titles can be misinterpreted without the correct context, resulting in inaccurate information extraction or organization. Certain resume documents can be organized in unfamiliar structures or formats. A need exists to determine textual and visual features of individual sections to extract and label information accurately.
SUMMARY

[0005] In one or more embodiments, a non-transitory, processor-readable medium stores instructions that, when executed by a processor, cause the processor to receive a set of document images each including a set of representations of characters. The processor is further caused to parse each document image from the set of document images to extract subsets of characters from the set of representations of characters to generate a text encoding for that document image. Each subset of characters is associated with a structure type from a set of structure types. For each document image from the set of document images, the processor is further caused to extract a set of visual features to generate a visual encoding for that document image. Each visual feature from the set of visual features is associated with at least one subset of characters from the subsets of characters. The processor is further caused to generate a set of parsed documents, each parsed document from the set of parsed documents uniquely associated with a document image from the set of document images and being based on the text encoding and the visual encoding for that document image. For each parsed document from the set of parsed documents, the processor is further caused to identify a set of sections. Each section from the set of sections is uniquely associated with a section type from a set of section types. The processor is further caused to train a set of machine learning models to produce a set of trained machine learning models. Each machine learning model from the set of machine learning models is associated with one section type from the set of section types and is trained using a portion of each parsed document that is from the set of parsed documents and that is associated with that section type.
[0006] In one or more embodiments, a non-transitory, processor-readable medium stores instructions that, when executed by a processor, cause the processor to receive a document image that includes a set of representations of characters. The processor is further caused to parse the document image to extract subsets of characters from the set of representations of characters to generate a text encoding for the document image. Each subset of characters is associated with a structure type from a set of predefined structure types. The processor is further caused to extract a set of visual features based on the text encoding to generate a visual encoding for the document image, each visual feature from the set of visual features associated with at least one subset of characters from the subsets of characters. The processor is further caused to generate a parsed document based on the text encoding and the visual encoding. The processor is further caused to identify a set of sections based on the parsed document. Each section from the set of sections is uniquely associated with a section type from a set of predefined section types. The processor is further caused to execute, for each section from the set of sections, a machine learning model from a set of machine learning models that is uniquely associated with the section type of that section to extract at least a feature from a set of features from