
EP-4740117-A2 - AUTOMATED TRANSFORMATION OF INFORMATION FROM IMAGES TO TEXTUAL REPRESENTATIONS, AND APPLICATIONS THEREFOR


Abstract

Recent developments in machine learning (commonly called "artificial intelligence" or "AI") have vastly expanded applications for this technology, such as the myriad "chat" agents adept at understanding natural human language. While state-of-the-art generative models can parse text queries from a user and provide comprehensive, accurate responses (including generating images depicting desired content), current implementations struggle to understand all of the information present in images of documents, especially images of business documents. In particular, generative models fail to understand structured and semi-structured information, e.g., as conveyed by graphical information such as lines, geometric relationships (e.g., indicated by tables, graphs, figures, etc.), formatting, and other contextual cues that human readers easily and implicitly understand. The disclosed inventive concepts transform structured and semi-structured information, along with textual content, into a textual representation that allows generative models to better understand the textual content and non-textual structured information present in document images.

Inventors

  • THOMPSON, STEVE
  • LEVDIK, VERONIKA
  • VYMENETS, IURII
  • LEE, DONGHAN

Assignees

  • TUNGSTEN AUTOMATION CORPORATION

Dates

Publication Date
2026-05-13
Application Date
2024-07-03

Claims

What is claimed is:

1. A computer-implemented method for automated transformation of information present in one or more document images into a textual representation, the method comprising: hierarchically segmenting: a plurality of optical marks depicted in the one or more document images; a plurality of text elements depicted in the one or more document images; and a plurality of regions of the one or more document images, wherein the plurality of regions are defined by graphical lines depicted in the one or more document images; aggregating and reconciling the hierarchically segmented plurality of optical marks, the hierarchically segmented plurality of text elements, and the hierarchically segmented plurality of regions into a single hierarchy; and generating a textual narrative based on the single hierarchy.

2. The computer-implemented method as recited in claim 1, wherein hierarchically segmenting the plurality of optical marks comprises: determining a plurality of optical mark segments, wherein the plurality of optical mark segments comprise optical mark keys, optical mark selections, optical mark values, and optical mark groups; and determining relationships between some or all of the plurality of optical mark segments.

3. The computer-implemented method as recited in claim 1, wherein hierarchically segmenting the plurality of text elements comprises: determining a plurality of text blocks based at least in part on the plurality of text elements; and determining one or more text columns based at least in part on the plurality of text blocks.

4.
The computer-implemented method as recited in claim 1, wherein hierarchically segmenting the plurality of regions comprises: aligning some or all of the plurality of graphical lines; building a graph representing the aligned graphical lines; initializing a hierarchy representing the plurality of regions; and detecting additional elements and/or structures of the initialized hierarchy.

5. The computer-implemented method as recited in claim 4, wherein the graph is an orthogonal, quad-edge graph.

6. The computer-implemented method as recited in claim 4, wherein detecting the additional elements and/or structures of the initialized hierarchy comprises: finding a rectangular covering for the initialized hierarchy; searching for coverings within the initialized hierarchy; building the additional elements and/or structures into the initialized hierarchy; and merging child structures into the initialized hierarchy.

7. The computer-implemented method as recited in claim 1, wherein the aggregating and reconciling comprises: merging a first hierarchy representing the hierarchically segmented plurality of optical marks and a second hierarchy representing the hierarchically segmented plurality of text elements into a third hierarchy representing the hierarchically segmented plurality of regions to create the single hierarchy.

8. The computer-implemented method as recited in claim 1, wherein generating the textual narrative comprises: rendering the single hierarchy into a tree-based data structure.

9. The computer-implemented method as recited in claim 1, wherein the textual narrative is arranged according to a natural reading order of a language depicted in the one or more document images.

10. The computer-implemented method as recited in claim 1, wherein the single hierarchy represents structured information of the one or more document images and unstructured information of the one or more document images.

11.
A computer-implemented method for automated transformation of information present in one or more document images into a textual representation, the method comprising: interpreting, using an intelligent narrator, structured information and unstructured information present on one or more document images to generate a narrative representing the structured information and the unstructured information.

12. The computer-implemented method as recited in claim 11, wherein generating the narrative comprises: performing image segmentation and/or text segmentation on the one or more document images, wherein performing the image segmentation and/or text segmentation identifies: a plurality of segments represented on the one or more document images; and any content within each segment; ordering the identified segments based at least in part on: geometric relationships between the plurality of segments; and a natural reading order of a language in which the content is represented; ordering the content, if any, within each segment, wherein the ordering of the content is based at least in part on the natural reading order of a language in which the content is represented; and generating a data structure representing a structure of the one or more document images, wherein the data structure is generated based at least in part on either or both of: the ordering of the plurality of segments; and/or the ordering of the content.

13. The computer-implemented method as recited in claim 12, wherein the structure of the one or more document images comprises a hierarchy of the plurality of segments.

14. The computer-implemented method as recited in claim 12, wherein performing the text segmentation and/or the image segmentation further comprises determining a hierarchy of the plurality of segments.

15. The computer-implemented method as recited in claim 12, wherein each segment corresponds to a unique rectangular region represented within the one or more document images.

16.
The computer-implemented method as recited in claim 12, wherein the natural reading order of the language comprises: a natural starting position for reading the language; a natural end position for reading the language; and an orientation of textual elements according to the language.

17. The computer-implemented method as recited in claim 16, wherein the natural starting position is selected from the group consisting of: a top-left portion of the one or more document images, a top-right portion of the one or more document images, a bottom-left portion of the one or more document images, a bottom-right portion of the one or more document images, a top-left portion of a given segment depicted in the one or more document images, a top-right portion of the given segment depicted in the one or more document images, a bottom-left portion of the given segment depicted in the one or more document images, and a bottom-right portion of the given segment depicted in the one or more document images; wherein the natural end position is selected from the group consisting of: the top-left portion of the one or more document images, the top-right portion of the one or more document images, the bottom-left portion of the one or more document images, the bottom-right portion of the one or more document images, the top-left portion of the given segment depicted in the one or more document images, the top-right portion of the given segment depicted in the one or more document images, the bottom-left portion of the given segment depicted in the one or more document images, and the bottom-right portion of the given segment depicted in the one or more document images; and wherein the orientation of textual elements is selected from the group consisting of: left-to-right, right-to-left, top-to-bottom, bottom-to-top, and combinations thereof.

18.
The computer-implemented method as recited in claim 11, wherein the interpreting comprises employing, using the intelligent narrator, a plurality of expert modules, wherein each expert module is configured to perform one or more tasks selected from the group consisting of: image processing, text recognition, optical mark recognition, image segmentation, text segmentation, narrative generation, and page aggregation; and wherein the method further comprises: orchestrating execution of the plurality of expert modules; and resolving conflicts within some or all results of executing the plurality of expert modules to generate resolved results.

19. The computer-implemented method as recited in claim 18, wherein some or all of the expert modules independently comprise at least one neural network.

20. The computer-implemented method as recited in claim 18, wherein the optical mark recognition comprises: identifying a plurality of optical marks depicted within the one or more document images, wherein the plurality of optical marks comprise a plurality of graphical lines each independently associated with one or more text elements; determining a status of the plurality of optical marks; and building an optical mark element hierarchy of the plurality of optical marks.

21. The computer-implemented method as recited in claim 20, wherein identifying the plurality of optical marks utilizes a visual object detection machine learning technique.

22. The computer-implemented method as recited in claim 20, wherein identifying the plurality of optical marks utilizes a heuristic image processing technique.

23. The computer-implemented method as recited in claim 20, wherein identifying the plurality of optical marks utilizes a combination of a visual object detection machine learning technique and a heuristic image processing technique.

24.
The computer-implemented method as recited in claim 18, wherein the text segmentation comprises: creating a plurality of text segments based at least in part on text elements represented within the one or more documents; creating a plurality of text blocks based at least in part on the plurality of text segments; and creating a plurality of text columns based at least in part on the plurality of text blocks.

25. The computer-implemented method as recited in claim 18, wherein the image segmentation comprises: aligning a plurality of graphical lines depicted on the one or more document images; building a graph representing the aligned graphical lines; initializing a hierarchy based at least in part on the graph; and detecting additional elements and/or structures of the hierarchy based at least in part on the graph.

26. The computer-implemented method as recited in claim 25, further comprising aggregating and reconciling the additional elements and/or structures of the initialized hierarchy to generate a final hierarchy representing the structured information and the unstructured information present on the one or more document images.

27. The computer-implemented method as recited in claim 18, wherein the narrative generation comprises generating a data structure representing a hierarchy of the structured information and the unstructured information present on the one or more document images.

28. The computer-implemented method as recited in claim 27, wherein the hierarchy comprises: a plurality of optical mark recognition (OMR) elements recognized by performing the optical mark recognition task; a plurality of text segments, text blocks, and/or text columns identified by performing the text segmentation task; and a plurality of image segments identified by performing the image segmentation task.

29. The computer-implemented method as recited in claim 27, wherein the narrative is a textual representation of the hierarchy.

30.
The computer-implemented method as recited in claim 29, wherein the textual representation is arranged according to a natural reading order of a language depicted in the one or more document images.

31. A computer-implemented method for recognizing optical mark (OMR) elements within one or more document images, the method comprising: identifying a plurality of optical marks based at least in part on a plurality of graphical lines depicted within the one or more document images; determining a status of the plurality of optical marks; building an optical mark element hierarchy based at least in part on the plurality of optical marks; and ordering the plurality of optical marks.

32. The computer-implemented method as recited in claim 31, wherein the optical marks comprise one or more check boxes, and/or one or more radio buttons.

33. The computer-implemented method as recited in claim 31, wherein identifying the plurality of optical marks utilizes a visual object detection machine learning technique.

34. The computer-implemented method as recited in claim 31, wherein identifying the plurality of optical marks utilizes a heuristic image processing technique.

35. The computer-implemented method as recited in claim 31, wherein identifying the plurality of optical marks utilizes a combination of a visual object detection machine learning technique and a heuristic image processing technique.

36. The computer-implemented method as recited in claim 31, wherein determining the status of the plurality of optical marks comprises: determining a black pixel density of a region within each optical mark, wherein the region is defined by graphical lines of the optical mark; and evaluating the black pixel density of the region within each optical mark against a predetermined black pixel density threshold.

37. The computer-implemented method as recited in claim 31, further comprising assigning a textual element to each of the plurality of optical marks.

38.
The computer-implemented method as recited in claim 37, wherein the textual element assigned to each of the plurality of optical marks is either “selected” or “not selected”.

39. The computer-implemented method as recited in claim 31, wherein identifying the plurality of optical marks comprises: binarizing the one or more document images; applying a dilation algorithm to the one or more binarized document images; and applying an erosion algorithm to the one or more binarized document images.

40. The computer-implemented method as recited in claim 31, further comprising validating the plurality of optical marks.

41. The computer-implemented method of claim 40, wherein validating the plurality of optical marks comprises: identifying, from among the plurality of graphical lines, one or more horizontal graphical lines; identifying, from among the one or more horizontal graphical lines, one or more candidate optical marks; and designating one or more of the candidate optical marks as a valid optical mark in response to determining the one or more candidate optical marks independently satisfy a plurality of criteria of valid optical marks.

42. The computer-implemented method as recited in claim 41, wherein the plurality of criteria of valid optical marks comprise: a first member of a given one of the one or more pairs of horizontal graphical lines is characterized by a different y-axis coordinate value than a second member of the given one of the one or more pairs of horizontal graphical lines; and the first and second members of the pair of horizontal lines are characterized by a common x-axis coordinate value.

43. The computer-implemented method as recited in claim 42, wherein the plurality of criteria of valid optical marks further comprise: at least one vertical line oriented substantially perpendicular to the pair of horizontal lines intersects the pair of horizontal lines at a common x-axis coordinate thereof.

44.
A computer-implemented method for building a hierarchy of optical mark (OMR) elements depicted within one or more document images, the method comprising: identifying a parent optical mark and one or more child optical marks belonging to the parent optical mark; determining whether the parent optical mark has only one child optical mark or more than one child optical mark; in response to determining the parent optical mark has only one child optical mark, adding the one child optical mark to a definition of the parent optical mark; in response to determining the parent optical mark has more than one child optical mark: determining a bounding box area of each of the more than one child optical mark; and adding at least one of the more than one child optical mark to the definition of the parent optical mark; and determining whether any other of the one or more child optical marks are spatially located within a bounding box of the one child optical mark added to the definition of the parent optical mark or the at least one of the more than one child optical mark added to the definition of the parent optical mark; and in response to determining any other of the one or more child optical marks are spatially located within the bounding box of the one child optical mark added to the definition of the parent optical mark or the at least one of the more than one child optical mark added to the definition of the parent optical mark, adding the other of the one or more child optical marks as sub-children optical marks of the one child optical mark added to the definition of the parent optical mark or the at least one of the more than one child optical mark added to the definition of the parent optical mark.

45.
The computer-implemented method as recited in claim 44, wherein identifying the parent optical mark and the one or more child optical marks belonging to the parent optical mark is based at least in part on analyzing a plurality of graphical lines depicted within the one or more document images using a visual object detection machine learning technique and/or a heuristic image processing technique.

46. The computer-implemented method as recited in claim 44, further comprising validating the hierarchy of optical mark elements.

47. The computer-implemented method as recited in claim 46, wherein validating the hierarchy of optical mark elements comprises: evaluating a type of the optical mark elements; and evaluating relationships between the optical mark elements, wherein the relationships are established by the hierarchy.

48. The computer-implemented method as recited in claim 46, wherein validating the hierarchy of optical mark elements comprises: determining whether any child optical mark element having a type “OMR Key” or “OMR Selection” is not designated as a child of a parent optical mark element having a type “OMR Group”; and in response to determining any child optical mark element having the type “OMR Key” or “OMR Selection” is not designated as a child of a parent optical mark element having the type “OMR Group”, removing the child optical mark from the hierarchy.

49. The computer-implemented method as recited in claim 46, wherein validating the hierarchy of optical mark elements comprises: determining whether any child optical mark element having a type “OMR Key” or “OMR Selection” has a sub-child optical mark element; and in response to determining any child optical mark element having the type “OMR Key” or “OMR Selection” has a sub-child optical mark element, removing the child optical mark from the hierarchy.

50.
A computer-implemented method for ordering optical mark (OMR) elements, the method comprising: determining a y-coordinate value of a central vertical point for each of one or more child optical mark elements of a given parent optical mark element; grouping some or all of the one or more child optical mark elements into at least one line of child optical mark elements based at least in part on the determined y-coordinate value(s) thereof; determining an x-coordinate value of a central horizontal point for each of the one or more child optical mark elements of the given parent optical mark element; and ordering some or all of the one or more child optical mark elements based at least in part on the determined x-coordinate value(s) and y-coordinate value(s) thereof.

51. The computer-implemented method as recited in claim 50, further comprising sorting the one or more child optical mark elements based at least in part on: the determined x-coordinate value(s) thereof; the determined y-coordinate value(s) thereof; or both the determined x-coordinate value(s) thereof and the determined y-coordinate value(s) thereof.

52. A computer-implemented method for segmenting text depicted within one or more document images, the method comprising: identifying a plurality of text elements within the one or more document images; building a plurality of text segments based at least in part on the plurality of text elements; building a plurality of text blocks based at least in part on the plurality of text segments; and building one or more text columns based at least in part on the plurality of text blocks.

53.
The computer-implemented method as recited in claim 52, wherein each of the plurality of text elements independently comprises one or more connected components represented in the one or more document images; and wherein each of the plurality of text elements independently corresponds to one or more physical markings on a physical document depicted in the one or more document images.

54. The computer-implemented method as recited in claim 52, wherein the plurality of text segments each independently comprise an ordered plurality of some or all of the text elements; and wherein each ordered plurality of the some or all of the text elements are independently associated with one another in the one or more document images.

55. The computer-implemented method as recited in claim 52, wherein the plurality of text blocks each independently comprise a combination of two or more of the plurality of text segments that meet a predetermined set of geometric criteria and/or a predetermined set of visual criteria.

56. The computer-implemented method as recited in claim 52, wherein the one or more text columns each independently comprise a predetermined set of text blocks that meet: a predetermined set of geometric criteria; a predetermined set of visual criteria; a predetermined set of semantic criteria; or any combination of the predetermined set of geometric criteria, the predetermined set of visual criteria, and the predetermined set of semantic criteria.

57.
The computer-implemented method as recited in claim 52, wherein identifying the plurality of text elements within the one or more document images, building the plurality of text segments based at least in part on the plurality of text elements, building the plurality of text blocks based at least in part on the plurality of text segments, and building the one or more text columns based at least in part on the plurality of text blocks each independently utilize one or more predefined parameters selected from the group consisting of: a “Containing Percentage” parameter, a “Vertical Element Threshold” parameter, a “Vertical Distance Threshold” parameter, a “Horizontal Intersection Threshold” parameter, a “Vertical Distance Threshold for Columns” parameter, a “Horizontal Intersection Threshold for Columns” parameter, a “Horizontal Distance Threshold” parameter, a “Join Overlapping Text Blocks” parameter, a “Join Nested and/or Overlapping Columns” parameter, and combinations thereof.

58. A computer-implemented method for creating text blocks from text elements depicted within one or more document images, the method comprising: building one or more text segments from some or all of the text elements that satisfy a first set of predetermined criteria; joining some or all of the one or more text segments into a set of one or more joined text blocks based at least in part on evaluating the one or more text segments against a second set of predetermined criteria, and adding the set of one or more joined text blocks to a set of text blocks; joining two or more overlapping text blocks within the set of text blocks; and ordering the set of text blocks based at least in part on evaluating the set of text blocks against a third set of predetermined criteria.

59.
The computer-implemented method as recited in claim 58, further comprising adding, to the set of text blocks, one or more list blocks, wherein the one or more list blocks each independently comprise one or more text lines designated as a list using a layout analysis and zone identification technique.

60. The computer-implemented method as recited in claim 58, further comprising adding, to the set of text blocks, one or more related text blocks, wherein the one or more related text blocks each independently comprise one or more text elements within a same geometric neighborhood of one another.

61. The computer-implemented method as recited in claim 58, wherein the first set of predetermined criteria comprise, for a first of the text elements “A” and a second of the text elements “B”: (1) a leftmost x-coordinate of B is greater than or equal to a leftmost x-coordinate of A; (2) a common vertical span of A and B is greater than or equal to half of a height of B; (3) A and B are not separated by any vertical line(s); and (4) a horizontal distance between the two text elements A and B is less than a product of the value of a Horizontal Distance Threshold parameter and an average width of characters appearing in the one or more document images.

62.
The computer-implemented method as recited in claim 58, wherein the second set of predetermined criteria comprise, for a first of the text elements “A” and a second of the text elements “B”: (1) an uppermost y-coordinate of B is greater than or equal to an uppermost y-coordinate of A; (2) A and B are either left-aligned or center-aligned; (3) Len[(A_left, A_right) ∩ (B_left, B_right)] ≥ HIT × W, wherein A_left and A_right are the leftmost and the rightmost coordinates of A, B_left and B_right are the leftmost and the rightmost coordinates of B, Len is a length of an intersection of two intervals, HIT is a value of a Horizontal Intersection Threshold parameter, and W is a width of a document depicted in the one or more document images; (4) B_top − A_bottom ≤ VDT × H_avg, wherein B_top is a top coordinate of B, A_bottom is a bottom coordinate of A, VDT is a value of a Vertical Distance Threshold parameter, and H_avg is an average height of text elements in the document; (5) A and B are not separated by any horizontal graphical lines or any discovered text blocks; (6) A and B each independently contain more than two of the text elements or contain a text element with at least two non-punctuation characters; and (7) a value of a y-coordinate extent of A is at most twice a value of a y-coordinate extent of B, and a value of a y-coordinate extent of B is at most twice a value of a y-coordinate extent of A.

63. The computer-implemented method as recited in claim 58, wherein the third set of predetermined criteria comprise, for a first of the text elements “A” and a second of the text elements “B”: (3) A precedes B if (A_top < B_bottom) and (B_top < A_bottom) and (A_right ≤ B_left); wherein A_top, A_bottom, A_left, A_right are respectively a top, a bottom, a leftmost and a rightmost coordinate of A; and wherein B_top, B_bottom, B_left, B_right are respectively a top, a bottom, a leftmost and a rightmost coordinate of B.

64.
The computer-implemented method as recited in claim 58, further comprising identifying connected components of the set of text blocks.

65. A computer-implemented method for creating text columns from text blocks depicted within one or more document images, the method comprising: creating, from a plurality of text blocks, a set of one or more text columns based at least in part on evaluating the plurality of text blocks against one or more predetermined criteria; joining, from among the set of text columns, any nested columns and/or any overlapping columns based at least in part on evaluating connected component(s) thereof; splitting one or more columns within the set of text columns based at least in part on a predominant alignment thereof; adding, to the set of text columns, a new text column for each of any list block(s) that do not already belong to the set of text columns; designating one or more columns within the set of text columns as data text columns based at least in part on a presence of either: a data text element, a data text segment, a data text block, or any combination thereof, in the one or more columns within the set of text columns; in response to determining a column within the set of text columns overlaps vertically with a data text column, designating the overlapping text column as a table column; and discarding, from the set of text columns, the data text columns and the table columns.

66.
The computer-implemented method as recited in claim 65, wherein the predetermined set of criteria comprise, for a first text block “A” and a second text block “B” each in the plurality of text blocks: (1) an uppermost y-coordinate value of B is greater than or equal to a lowermost y-coordinate value of A; (2) Len[(A_left, A_right) ∩ (B_left, B_right)] ≥ HITC × min(W_A, W_B), wherein A_left is a leftmost x-coordinate of A, A_right is a rightmost x-coordinate of A, B_left is a leftmost x-coordinate of B, B_right is a rightmost x-coordinate of B, Len is a length of an intersection of two intervals, HITC is a value of a Horizontal Intersection Threshold for Columns parameter, W_A is a width of A, and W_B is a width of B; (3) B_top − A_bottom ≤ VDTC × H_avg, wherein B_top is a top y-coordinate of B, A_bottom is a bottom y-coordinate of A, VDTC is a value of a Vertical Distance Threshold for Columns parameter, and H_avg is an average height of text elements appearing in the one or more document images; and (4) A and B are not horizontally separated by any horizontal lines or text blocks.

67. The computer-implemented method as recited in claim 65, further comprising removing select first pairs of text blocks from the set of text blocks; wherein the select first pairs of text blocks initiate at a node of a directed edge graph representing one of the text blocks; and wherein more than one edge of the directed edge graph initiates at said node of the directed edge graph.

68. The computer-implemented method as recited in claim 67, further comprising removing select second pairs of text blocks from the set of text blocks; wherein the select second pairs of text blocks terminate at a node of a directed edge graph representing one of the text blocks; and wherein more than one edge of the directed edge graph terminates at said node of the directed edge graph.

69.
The computer-implemented method as recited in claim 65, further comprising defining connected components of the set of text columns.

70. The computer-implemented method as recited in claim 65, further comprising identifying one or more maximal subcolumns within the set of text columns based at least in part on a predominant alignment of the plurality of text blocks.

71. The computer-implemented method as recited in claim 65, wherein designating the overlapping text column as a table column is performed in response to determining the overlapping text column and the data text column overlap by at least 50% of a vertical span thereof.

72. The computer-implemented method as recited in claim 65, wherein the data text elements, the data text segments, and the data text blocks each independently have one or more characteristics selected from the group consisting of: at least a predetermined percentage of text elements being numerical; having a predefined format; being associated with a predefined symbol or substring; and combinations thereof.

73. A computer-implemented method for segmenting graphical elements depicted within one or more document images, the method comprising: aligning some or all of a plurality of graphical lines depicted within the one or more document images; building a graph representing the aligned graphical lines; initializing a hierarchy representing one or more regions of the one or more document images, wherein the one or more regions are defined by the aligned graphical lines; detecting additional elements and/or structures of the initialized hierarchy; detecting uniform grids within the hierarchy; and aggregating and reconciling the hierarchy.

74.
The computer-implemented method as recited in claim 73, wherein aligning the some or all of the plurality of graphical lines comprises: identifying one or more one-dimensional clusters based on connected components of the some or all of the plurality of graphical lines; and independently identifying one or more connected sub-components within the one or more one-dimensional clusters. 75. The computer-implemented method as recited in claim 74, wherein aligning the some or all of the plurality of graphical lines further comprises: sorting the some or all of the plurality of graphical lines based at least in part on x-coordinate values thereof and/or y-coordinate values thereof; joining overlapping ones of the plurality of graphical lines; and removing ones of the plurality of graphical lines that have one or more characteristics selected from the group consisting of: having a length less than a predetermined threshold; spanning an entire x-axis or an entire y-axis of the one or more document images; and combinations thereof. 76. The computer-implemented method as recited in claim 73, wherein the graph is an orthogonal, quad-edge graph. 77. The computer-implemented method as recited in claim 73, wherein detecting the additional elements and/or structures of the initialized hierarchy comprises: finding a rectangular covering for the initialized hierarchy; searching for coverings within the initialized hierarchy; building the additional elements and/or structures into the initialized hierarchy; and merging child structures into the initialized hierarchy. 78. The computer-implemented method as recited in claim 77, wherein searching for the coverings within the initialized hierarchy comprises detecting TEETH structures and embedding the TEETH structures into the initialized hierarchy. 79. The computer-implemented method as recited in claim 73, wherein aggregating and reconciling the hierarchy comprises adding optical marks and text to the hierarchy. 80.
The computer-implemented method as recited in claim 73, wherein the aggregated and reconciled hierarchy is a single, tree-based data structure representing text, optical marks, and graphical lines depicted in the one or more document images. 81. A computer-implemented method for building a graph representing regions defined by graphical elements depicted within one or more document images, the method comprising: identifying, within an orthogonal, quad-edge graph representing horizontal lines and vertical lines present in an image of a document, a plurality of vertices and a plurality of connections between adjacent vertices; creating, within the orthogonal quad-edge graph, a plurality of connections between endpoints of each of the horizontal lines and endpoints of each of the vertical lines; building, within the orthogonal quad-edge graph, a set of quad edges having edge relations based on the plurality of vertices and the plurality of connections; creating a plurality of faces based on the set of quad edges; and determining a covering rectangle for some or all of the plurality of faces. 82. The computer-implemented method as recited in claim 81, wherein adjacent vertices are characterized by being a closest pair of vertices located in either an upward, a downward, a leftward, or a rightward direction of one another. 83. The computer-implemented method as recited in claim 81, further comprising identifying intersections between some or all of the horizontal lines and some or all of the vertical lines. 84. The computer-implemented method as recited in claim 81, wherein building the set of quad edges comprises creating directed edges and directional relationships therebetween. 85.
The computer-implemented method as recited in claim 81, wherein creating the plurality of faces comprises creating a left face for each edge in the set of quad edges that does not already have a defined left face connecting all edges bounding said left face in a rectangular path. 86. The computer-implemented method as recited in claim 81, wherein creating the plurality of faces comprises defining a bounding path for each edge in the set of quad edges in a counterclockwise direction. 87. A computer-implemented method for detecting a hierarchy within a graph representing graphical lines depicted in one or more document images, the method comprising: finding a rectangular covering for the initialized hierarchy; searching for coverings within the initialized hierarchy; detecting and building additional elements and/or structures into the hierarchy; and merging child structures into the hierarchy. 88. The computer-implemented method as recited in claim 87, wherein finding the rectangular covering for the initialized hierarchy comprises extracting a rectangular covering for a given set of edges in the graph; wherein the extracting is based at least in part on a given starting rectangle; and wherein the rectangular covering is contained within a maximum rectangle. 89. The computer-implemented method as recited in claim 87, wherein searching for the coverings within the initialized hierarchy comprises building a covering hierarchy for: a given set of inner edges of the graph; a starting bounding rectangle of the graph, wherein the starting bounding rectangle delimits a plurality of inner edges and a plurality of bounding edges of a covering from edges external to the covering hierarchy; and a maximum rectangle. 90.
The computer-implemented method as recited in claim 87, wherein detecting and building the additional elements and/or structures into the hierarchy comprises: detecting a hierarchy for a given set of faces of the graph, wherein the given set of faces are within a given parental structure of the graph; and building the detected hierarchy into an initialized hierarchy, wherein the initialized hierarchy corresponds to an entirety of a page depicted in one of the one or more document images. 91. The computer-implemented method as recited in claim 87, wherein the hierarchy comprises a single, tree-based data structure representing the one or more regions of the one or more document images.
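The column-merge criteria recited in the claims above (a block B may follow a block A in a column when B starts at or below A's bottom edge, overlaps A horizontally by at least the HITC fraction of the narrower block's width, lies within VDTC average text heights of A, and is not separated from A by a horizontal line or another text block) can be sketched as follows. This is an illustrative sketch only, not the claimed implementation: the threshold values, the image-style coordinate system (y increasing downward), and the use of min() over the two widths are assumptions introduced here for demonstration.

```python
from dataclasses import dataclass

@dataclass
class TextBlock:
    left: float
    right: float
    top: float     # uppermost y-coordinate (y grows downward, image convention)
    bottom: float  # lowermost y-coordinate

    @property
    def width(self) -> float:
        return self.right - self.left

def intersection_len(a0: float, a1: float, b0: float, b1: float) -> float:
    """Length of the intersection of intervals [a0, a1] and [b0, b1]."""
    return max(0.0, min(a1, b1) - max(a0, b0))

def can_merge_into_column(a: TextBlock, b: TextBlock,
                          hitc: float = 0.5, vdtc: float = 2.0,
                          h_avg: float = 12.0, separated: bool = False) -> bool:
    """Decide whether block B may be placed below block A in one text column.
    `separated` stands in for criterion (4): a horizontal line or another
    text block lying between A and B (detecting that is out of scope here)."""
    # (1) B's top must be at or below A's bottom
    if b.top < a.bottom:
        return False
    # (2) horizontal overlap must reach HITC x the narrower block's width
    overlap = intersection_len(a.left, a.right, b.left, b.right)
    if overlap < hitc * min(a.width, b.width):
        return False
    # (3) vertical gap must not exceed VDTC x average text-element height
    if b.top - a.bottom > vdtc * h_avg:
        return False
    # (4) nothing may horizontally separate A and B
    return not separated
```

For example, with the assumed defaults, a block spanning x = 10..90 that begins 5 units below a block spanning x = 0..100 satisfies all four criteria, while a block shifted far to the right fails the horizontal-intersection test.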

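The line-alignment steps recited above (sort graphical lines by coordinate, join overlapping ones, then discard lines that are too short or that span an entire page axis) can be sketched for one pre-clustered group of collinear horizontal segments as below. The segment representation, the gap tolerance, and the filter thresholds are illustrative assumptions, not taken from the claims.

```python
def join_overlapping(segments, gap_tol=0.0):
    """Join overlapping (or touching) collinear segments.
    Each segment is an (x0, x1) pair sharing one y-coordinate; the cluster
    is assumed to have been formed already by the 1-D clustering step."""
    merged = []
    for x0, x1 in sorted(segments):  # sort by x-coordinate
        if merged and x0 <= merged[-1][1] + gap_tol:
            # Overlaps (or touches) the previous run: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], x1))
        else:
            merged.append((x0, x1))
    return merged

def filter_segments(segments, min_len, page_span):
    """Drop segments shorter than min_len or spanning the whole page axis."""
    return [(x0, x1) for x0, x1 in segments
            if min_len <= (x1 - x0) < page_span]
```

Sorting first guarantees that each new segment can only overlap the most recently merged run, so a single left-to-right pass suffices.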
Description

KFX1P122.P AUTOMATED TRANSFORMATION OF INFORMATION FROM IMAGES TO TEXTUAL REPRESENTATIONS, AND APPLICATIONS THEREFOR RELATED APPLICATIONS [0001] The present application claims the benefit of priority to U.S. Provisional Patent Application No. 63/524,745, filed July 3, 2023 and entitled “AUTOMATED TRANSFORMATION OF INFORMATION FROM IMAGES TO TEXTUAL REPRESENTATIONS, AND APPLICATIONS THEREFOR.” FIELD OF INVENTION [0002] The present invention relates to transforming information represented in image form into a textual representation (or equivalent thereof, such as an audio representation) to facilitate comprehension thereof, particularly comprehension of semi-structured and structured information present within the image representation. [0003] The inventive concepts presented herein have myriad applications, including but not limited to improving the function of Large Language Models (LLMs) by extending the ability to comprehend structured and semi-structured information presented in images, particularly images of documents, and even more particularly images of business documents. The descriptions provided herein accordingly refer to various applications of the inventive concepts in the context of LLMs. It shall be understood that such descriptions are provided by way of example, not limitation; other applications of the inventive concepts disclosed herein will be appreciated by skilled artisans upon reviewing the specification and drawings and are to be considered within the scope of the present application. BACKGROUND [0004] Images, particularly images of documents, and even more particularly images of “large” and/or “complex” documents like financial reports, medical charts, explanation-of-benefits documents, etc. often contain large volumes of diverse data.
The data are diverse with respect to the formatting, content, extent (e.g., single/multi-page), and/or layout, even among similar document types (e.g., the same type of document prepared by different entities and/or according to different conventions, organizational schemes, languages, etc., may exhibit drastically different organization, expression, arrangement, etc. despite depicting the same content or substantially the same content (also referred to herein as “unstructured information”)). [0005] Indeed, there is a long-felt need in the field of document analysis and processing for techniques, especially automated, computerized techniques, for accurately and faithfully processing and analyzing the information represented within images of documents despite the vast volume and extensive diversity with which that information may be presented. [0006] Existing tools such as character recognition of various types (particularly optical character recognition (OCR)), object recognition, etc. for analyzing information present in images have advanced to the point of being capable of detecting, extracting, and analyzing (comprehending) various aspects of images, especially (unstructured) textual information. In addition, advancement in natural language processing techniques, in particular techniques based on large neural networks (often referred to as “deep learning” or “artificial intelligence”), has recently resulted in development of so-called “generative” models such as OpenAI’s CHATGPT®, Google’s BARD®, etc. that display new capabilities to process textual input, respond to complex prompts and inquiries, and perform unique tasks such as creative composition of “new” material, such as essays, songs, poems, images, etc.
[0007] However, these generative models remain under extensive development, and ongoing efforts seek to improve the models’ capabilities, particularly regarding comprehension of input, and interpretation of context within the vast sources of information used to perform requested tasks. Exemplary problems observed to date include chat agents behaving inappropriately, from overt examples such as attempting to persuade users to undertake detrimental actions or acting in a hostile manner, to more insidious problems such as exposure of confidential information or improper application of bias. These problems can have significant real-world impact (a chat bot behaving like an obsessive stalker, comparing users to reviled historical figures, discriminating against certain populations, making threats, etc.) and must be addressed in order to realize safe, reliable operation of generative models including but not limited to LLMs. [0008] These problems, among myriad others that will be appreciated by those having utilized state of the art generative models, arise in part due to the limited capability of the model to fully understand the context of the document, particularly context represented in structured information. Current techniques for understanding information in documents include extensive supervised learning/training, in which a human user manually defines field(s) within a document, and provides guidance reg