CN-115730032-B - Method, system and computer program product for adaptive document understanding
Abstract
A method, system, and program product are provided for creating a plurality of page clusters in a feature space from a plurality of feature vectors corresponding to a plurality of unstructured pages. The method, system, and program product assign one of a plurality of machine learning models to each of the plurality of page clusters based on a relationship in the feature space between the plurality of page clusters and a plurality of training clusters corresponding to the plurality of machine learning models. The method, system, and program product identify a page cluster of the plurality of page clusters that corresponds to a selected one of the plurality of unstructured pages and convert the selected unstructured page to a structured page using a selected one of the plurality of machine learning models assigned to the identified page cluster.
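The assignment and routing steps summarized above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes Euclidean distance between cluster centers (claim 1 recites only the "closest" training cluster center), takes a cluster center to be the mean of its members, and uses made-up model names and toy 2-D feature vectors.

```python
import numpy as np


def cluster_centers(vectors, labels):
    """Return {cluster_id: mean vector} for the labelled feature vectors."""
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    return {int(c): vectors[labels == c].mean(axis=0) for c in np.unique(labels)}


def assign_models(page_centers, training_centers):
    """Map each page cluster to the model whose training cluster center is closest."""
    assignment = {}
    for page_id, center in page_centers.items():
        assignment[page_id] = min(
            training_centers,
            key=lambda name: np.linalg.norm(center - training_centers[name]),
        )
    return assignment


def route_page(page_vector, page_centers, assignment):
    """Find the page cluster nearest to the page and return it with its assigned model."""
    vector = np.asarray(page_vector, dtype=float)
    nearest = min(
        page_centers,
        key=lambda cid: np.linalg.norm(vector - page_centers[cid]),
    )
    return nearest, assignment[nearest]


# Toy 2-D feature vectors for four pages, already grouped into two page clusters.
page_vectors = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.0, 5.2]]
page_labels = [0, 0, 1, 1]

# Hypothetical model names and training cluster centers (assumptions for illustration).
training_centers = {
    "two_column_model": np.array([0.0, 0.5]),
    "letter_model": np.array([6.0, 5.0]),
}

page_centers = cluster_centers(page_vectors, page_labels)
assignment = assign_models(page_centers, training_centers)
cluster_id, model_name = route_page([4.8, 5.1], page_centers, assignment)
```

The selected page would then be converted by whichever model `route_page` returns; the conversion itself is outside the scope of this sketch.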
Inventors
- Generalized vector
- Kakenawa Hiroshi
- LIU XIANGNING
- ONO ASAKO
Assignees
- International Business Machines Corporation
Dates
- Publication Date
- 20260505
- Application Date
- 20220808
- Priority Date
- 20210825
Claims (20)
- 1. A computer-implemented method, comprising: creating a plurality of page clusters in a feature space according to a plurality of feature vectors corresponding to a plurality of unstructured pages; assigning one of a plurality of machine learning models to each of the plurality of page clusters based on a relationship in the feature space between the plurality of page clusters and a plurality of training clusters corresponding to the plurality of machine learning models, wherein the assigning comprises: calculating a plurality of page cluster centers based on the plurality of page clusters; calculating a plurality of training cluster centers based on the plurality of training clusters corresponding to the plurality of machine learning models; selecting one of the plurality of page cluster centers; identifying one of the plurality of training cluster centers in the feature space that is closest to the selected page cluster center; and assigning the one of the plurality of machine learning models corresponding to the identified training cluster center to the page cluster corresponding to the selected page cluster center; identifying a page cluster of the plurality of page clusters corresponding to a selected one of the plurality of unstructured pages; and converting the selected unstructured page to a structured page using a selected one of the plurality of machine learning models assigned to the identified page cluster.
- 2. The method of claim 1, further comprising: dividing a plurality of unstructured documents into the plurality of unstructured pages; selecting one unstructured page of the plurality of unstructured pages; defining a set of character regions and a corresponding set of positions in the selected unstructured page; and calculating a set of character region feature vectors corresponding to the set of character regions based on the corresponding set of positions of the character regions and a set of content within the character regions corresponding to those positions.
- 3. The method of claim 2, further comprising: calculating a selected one of the plurality of feature vectors for the selected unstructured page based on the set of character region feature vectors; and mapping the selected feature vector to the feature space.
- 4. The method of claim 3, further comprising: performing hierarchical clustering on the selected feature vector, wherein the hierarchical clustering further comprises: identifying one of a plurality of page cluster centers, corresponding to the plurality of page clusters, that is closest in the feature space to the selected feature vector; and adding the selected feature vector to the one of the plurality of page clusters corresponding to the identified page cluster center.
- 5. The method of claim 1, further comprising: identifying different ones of the plurality of page clusters corresponding to different ones of the plurality of unstructured pages; and converting the different unstructured pages to different structured pages using different ones of the plurality of machine learning models assigned to the different page clusters.
- 6. The method of claim 1, further comprising: training the selected machine learning model using a portion of the plurality of unstructured documents corresponding to the identified page clusters; performing the conversion using the trained machine learning model; and adding the trained machine learning model to the plurality of machine learning models.
- 7. The method of claim 1, wherein the plurality of unstructured pages comprises a plurality of unstructured page types, and wherein each unstructured page type of the plurality of unstructured page types is assigned to one of the plurality of machine learning models to perform the conversion.
- 8. An information handling system, comprising: one or more processors; a memory coupled to at least one of the processors; and a set of computer program instructions stored in the memory and executed by at least one of the processors to perform the actions of: creating a plurality of page clusters in a feature space according to a plurality of feature vectors corresponding to a plurality of unstructured pages; assigning one of a plurality of machine learning models to each of the plurality of page clusters based on a relationship in the feature space between the plurality of page clusters and a plurality of training clusters corresponding to the plurality of machine learning models, wherein the assigning comprises: calculating a plurality of page cluster centers based on the plurality of page clusters; calculating a plurality of training cluster centers based on the plurality of training clusters corresponding to the plurality of machine learning models; selecting one of the plurality of page cluster centers; identifying one of the plurality of training cluster centers in the feature space that is closest to the selected page cluster center; and assigning the one of the plurality of machine learning models corresponding to the identified training cluster center to the page cluster corresponding to the selected page cluster center; identifying a page cluster of the plurality of page clusters corresponding to a selected one of the plurality of unstructured pages; and converting the selected unstructured page to a structured page using a selected one of the plurality of machine learning models assigned to the identified page cluster.
- 9. The information handling system of claim 8, wherein the processor performs further actions comprising: dividing a plurality of unstructured documents into the plurality of unstructured pages; selecting one unstructured page of the plurality of unstructured pages; defining a set of character regions and a corresponding set of positions in the selected unstructured page; and calculating a set of character region feature vectors corresponding to the set of character regions based on the corresponding set of positions of the character regions and a set of content within the character regions corresponding to those positions.
- 10. The information handling system of claim 9, wherein the processor performs further actions comprising: calculating a selected one of the plurality of feature vectors for the selected unstructured page based on the set of character region feature vectors; and mapping the selected feature vector to the feature space.
- 11. The information handling system of claim 10, wherein the processor performs further actions comprising: performing hierarchical clustering on the selected feature vector, wherein the hierarchical clustering further comprises: identifying one of a plurality of page cluster centers, corresponding to the plurality of page clusters, that is closest in the feature space to the selected feature vector; and adding the selected feature vector to the one of the plurality of page clusters corresponding to the identified page cluster center.
- 12. The information handling system of claim 8, wherein the processor performs further actions comprising: identifying different ones of the plurality of page clusters corresponding to different ones of the plurality of unstructured pages; and converting the different unstructured pages to different structured pages using different ones of the plurality of machine learning models assigned to the different page clusters.
- 13. The information handling system of claim 8, wherein the processor performs further actions comprising: training the selected machine learning model using a portion of the plurality of unstructured documents corresponding to the identified page clusters; performing the conversion using the trained machine learning model; and adding the trained machine learning model to the plurality of machine learning models.
- 14. The information handling system of claim 8, wherein the plurality of unstructured pages comprises a plurality of unstructured page types, and wherein each of the plurality of unstructured page types is assigned one of the plurality of machine learning models to perform the conversion.
- 15. A computer program product stored in a computer readable storage medium, comprising computer program code which, when executed by an information handling system, causes the information handling system to perform actions comprising: creating a plurality of page clusters in a feature space according to a plurality of feature vectors corresponding to a plurality of unstructured pages; assigning one of a plurality of machine learning models to each of the plurality of page clusters based on a relationship in the feature space between the plurality of page clusters and a plurality of training clusters corresponding to the plurality of machine learning models, wherein the assigning comprises: calculating a plurality of page cluster centers based on the plurality of page clusters; calculating a plurality of training cluster centers based on the plurality of training clusters corresponding to the plurality of machine learning models; selecting one of the plurality of page cluster centers; identifying one of the plurality of training cluster centers in the feature space that is closest to the selected page cluster center; and assigning the one of the plurality of machine learning models corresponding to the identified training cluster center to the page cluster corresponding to the selected page cluster center; identifying a page cluster of the plurality of page clusters corresponding to a selected one of the plurality of unstructured pages; and converting the selected unstructured page to a structured page using a selected one of the plurality of machine learning models assigned to the identified page cluster.
- 16. The computer program product of claim 15, wherein the information handling system performs further actions comprising: dividing a plurality of unstructured documents into the plurality of unstructured pages; selecting one unstructured page of the plurality of unstructured pages; defining a set of character regions and a corresponding set of positions in the selected unstructured page; and calculating a set of character region feature vectors corresponding to the set of character regions based on the corresponding set of positions of the character regions and a set of content within the character regions corresponding to those positions.
- 17. The computer program product of claim 16, wherein the information handling system performs further actions comprising: calculating a selected one of the plurality of feature vectors for the selected unstructured page based on the set of character region feature vectors; and mapping the selected feature vector to the feature space.
- 18. The computer program product of claim 17, wherein the information handling system performs further actions comprising: performing hierarchical clustering on the selected feature vector, wherein the hierarchical clustering further comprises: identifying one of a plurality of page cluster centers, corresponding to the plurality of page clusters, that is closest in the feature space to the selected feature vector; and adding the selected feature vector to the one of the plurality of page clusters corresponding to the identified page cluster center.
- 19. The computer program product of claim 15, wherein the information handling system performs further actions comprising: identifying different ones of the plurality of page clusters corresponding to different ones of the plurality of unstructured pages; and converting the different unstructured pages to different structured pages using different ones of the plurality of machine learning models assigned to the different page clusters.
- 20. The computer program product of claim 15, wherein the information handling system performs further actions comprising: training the selected machine learning model using a portion of the plurality of unstructured documents corresponding to the identified page clusters; performing the conversion using the trained machine learning model; and adding the trained machine learning model to the plurality of machine learning models.
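The feature-vector steps recited in claims 2 to 4 can be illustrated with a short sketch. The claims do not say how a character region is encoded or how region vectors are combined, so everything concrete here is an assumption made for illustration: the three content types, the encoding (normalized bounding box plus a one-hot content type), mean pooling into a page vector, and Euclidean distance. Only the "add to the nearest page cluster center" step of the hierarchical clustering is shown.

```python
import numpy as np

# Hypothetical content types; the claims do not enumerate them.
CONTENT_TYPES = ("text", "table", "image")


def region_feature(box, content_type, page_size):
    """Encode one character region as normalized position plus one-hot content type."""
    x, y, w, h = box
    page_w, page_h = page_size
    position = [x / page_w, y / page_h, w / page_w, h / page_h]
    one_hot = [1.0 if content_type == t else 0.0 for t in CONTENT_TYPES]
    return np.array(position + one_hot)


def page_feature(regions, page_size):
    """Combine region vectors into one page vector (mean pooling is an assumption)."""
    return np.mean([region_feature(b, t, page_size) for b, t in regions], axis=0)


def add_to_nearest_cluster(vector, centers, members):
    """Add the vector to the cluster with the closest center and refresh that center."""
    nearest = min(centers, key=lambda cid: np.linalg.norm(vector - centers[cid]))
    members[nearest].append(vector)
    centers[nearest] = np.mean(members[nearest], axis=0)
    return nearest


# Two seed pages: a two-column text page and a page with a figure above text.
page_size = (600, 800)
two_column = [((0, 0, 300, 800), "text"), ((300, 0, 300, 800), "text")]
figure_page = [((0, 0, 600, 400), "image"), ((0, 400, 600, 400), "text")]

seed_a = page_feature(two_column, page_size)
seed_b = page_feature(figure_page, page_size)
centers = {"A": seed_a.copy(), "B": seed_b.copy()}
members = {"A": [seed_a], "B": [seed_b]}

# A new page with a similar two-column layout should join cluster "A".
new_page = [((0, 0, 290, 800), "text"), ((310, 0, 290, 800), "text")]
chosen = add_to_nearest_cluster(page_feature(new_page, page_size), centers, members)
```

Mean pooling keeps the page vector at a fixed length regardless of how many regions a page has, which is what lets pages with different layouts share one feature space in this sketch.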
Description
Method, system and computer program product for adaptive document understanding
Background
Machine learning algorithms construct a machine learning model based on sample data (referred to as training data) to make predictions or decisions without being explicitly programmed. Training a machine learning model involves providing a machine learning algorithm with training data to learn from; the artifact created by the training process is the machine learning model. The training data includes the correct answers, known as targets or target attributes. The machine learning algorithm finds patterns in the training data that map the input data attributes to the target attributes, and outputs a machine learning model that captures those patterns.
Structured data refers to data that resides in fixed fields within a file or record and is therefore easy to analyze. Unstructured data (or unstructured information) is information that does not have a predefined data model or is not organized in a predefined manner. Unstructured information is typically text-heavy, but may also include data such as dates and numbers. Furthermore, unstructured data often has irregularities and ambiguities that are difficult for conventional programs to interpret.
A smart document understanding (SDU) method converts unstructured documents into structured data through machine learning. In SDU, a user annotates training documents extracted from the input documents, and a model is trained using each annotated document as a teaching image. However, a challenge with current SDU systems is that page formats often differ from page to page and from document to document. Some pages may be in a two-column format, others may include graphic images, and others may be conventional paragraph-based letters. As such, it is difficult for a single machine learning model to cover the different page formats, and conversion accuracy is reduced.
Furthermore, existing SDU training methods lack an effective way to extract a minimal amount of training data that is effective for training a machine learning model; for example, they rely on random sampling. As such, documents biased toward a particular format may be selected, which also reduces the conversion accuracy of the machine learning model.
Disclosure of Invention
According to one embodiment of the present disclosure, a method, system, and program product are provided in which a plurality of page clusters are created in a feature space from a plurality of feature vectors corresponding to a plurality of unstructured pages. The method, system, and program product assign one of a plurality of machine learning models to each of the plurality of page clusters based on a relationship in the feature space between the plurality of page clusters and a plurality of training clusters corresponding to the plurality of machine learning models. The method, system, and program product identify a page cluster of the plurality of page clusters that corresponds to a selected one of the plurality of unstructured pages and convert the selected unstructured page to a structured page using a selected one of the plurality of machine learning models assigned to the identified page cluster. In this embodiment, the method, system, and program product improve the accuracy of data conversion by adaptively selecting the best-fit machine learning model from the plurality of machine learning models to convert unstructured data to structured data.
According to another embodiment of the present disclosure, a method, system, and program product are provided for separating an unstructured document into a plurality of unstructured pages. The method, system, and program product select one of the plurality of unstructured pages and define a set of character regions and a corresponding set of positions in the selected unstructured page.
The method, system, and program product calculate a set of character region feature vectors corresponding to the set of character regions based on the corresponding set of positions of the character regions and a set of content within the corresponding character regions. In this embodiment, the method, system, and program product calculate a plurality of fine-grained feature vectors for each unstructured page based on content type and position within the page.
According to another embodiment of the present disclosure, a method, system, and program product are provided for calculating a selected one of the plurality of feature vectors for a selected unstructured page based on the set of character region feature vectors and mapping the selected feature vector to the feature space. In this embodiment, the method, system, and program product combine the multiple feature vectors for a particular unstructured page into a single page feature vector that best describes, at a fine-grained level, the content types and content positioning in the unstructured page.
According to another embodiment of the present disclosure, a method, system, and program product are provided in