CN-121981081-A - Document element merging method and device, electronic equipment and storage medium

Abstract

Embodiments of the present application provide a method and a device for merging document elements, an electronic device, and a storage medium, and belong to the technical field of artificial intelligence. The method comprises: acquiring an original document comprising at least two document pages, and performing target recognition on the page image of each document page to obtain page element regions and their frame coordinates; calculating region interval coordinates according to the frame coordinates; determining a horizontal interval coordinate interval and a vertical interval coordinate interval according to all the region interval coordinates, so as to determine a vertical demarcation region; dividing the page image based on the vertical demarcation region to obtain column blocks, and sorting the page element regions according to the column blocks; and, for every two adjacent column blocks, taking the last page element region of the previous column block and the first page element region of the next column block as target element regions according to the sorting identification, and merging the two target element regions according to their element types. Embodiments of the present application can improve the accuracy of merging document elements.

Inventors

ZHU JING
LI JI
JIANG YANG
WANG YAN
GONG KERUI
GUO RUNTING
MAO RUIBIN
YANG JIANMING

Assignees

深圳证券信息有限公司 (Shenzhen Securities Information Co., Ltd.)

Dates

Publication Date: 2026-05-05
Application Date: 2025-12-10

Claims (10)

1. A method for merging document elements, the method comprising: acquiring an original document, wherein the original document comprises at least two document pages, and performing target recognition on a page image of each document page to obtain at least two page element regions and frame coordinates of each page element region; calculating region interval coordinates according to the frame coordinates, wherein the region interval coordinates are coordinate points between every two page element regions that are adjacent in a preset horizontal direction; determining, according to all the region interval coordinates, a horizontal interval coordinate interval in the horizontal direction and a vertical interval coordinate interval in a vertical direction, and determining a vertical demarcation region according to the vertical interval coordinate interval and the horizontal interval coordinate interval, wherein the vertical direction is perpendicular to the horizontal direction; dividing the page image based on the vertical demarcation region to obtain at least two column blocks, and sorting the page element regions according to the column blocks to obtain a sorting identification of each page element region, wherein each column block comprises at least one page element region; for every two adjacent column blocks in the page image, taking the last page element region of the previous column block and the first page element region of the next column block as target element regions according to the sorting identification; and merging the two target element regions according to the element types of the target element regions.
2. The method according to claim 1, wherein the determining, according to all the region interval coordinates, of a horizontal interval coordinate interval in the horizontal direction and a vertical interval coordinate interval in the vertical direction, and the determining of a vertical demarcation region according to the vertical interval coordinate interval and the horizontal interval coordinate interval, comprise: determining a horizontal demarcation region from the page element regions according to the frame coordinates and the element types of the page element regions; dividing the page image along the horizontal direction based on the horizontal demarcation region to obtain at least two reference plates; for each reference plate, determining the horizontal interval coordinate interval according to the region interval coordinates; for each reference plate, determining the vertical interval coordinate interval according to the region interval coordinates; and determining the vertical demarcation region according to the vertical interval coordinate interval and the horizontal interval coordinate interval.
3. The method according to claim 2, wherein after the determining of the vertical demarcation region according to the vertical interval coordinate interval and the horizontal interval coordinate interval, the method further comprises: dividing the reference plate into regions based on the vertical demarcation region to obtain at least two column blocks; performing region sorting on each page element region according to the block coordinates of each column block and the frame coordinates; and performing the step in which, for every two adjacent column blocks in the page image, the last page element region of the previous column block and the first page element region of the next column block are taken as target element regions according to the sorting identification.
4. The method according to claim 3, wherein the performing of region sorting on each page element region according to the block coordinates of each column block and the frame coordinates comprises: performing block sorting on all the column blocks according to the block coordinates of each column block and a preset sorting rule to obtain a block sequence; and sorting the page element regions in turn according to the block sequence, the frame coordinates and the sorting rule to obtain the sorting identification of each page element region.
5. The method according to any one of claims 1 to 4, wherein the calculating of region interval coordinates according to the frame coordinates comprises: determining, as a reference element region, the page element region whose frame coordinates have the minimum coordinate value in the vertical direction, wherein the positive direction of the vertical direction is vertically downward; determining a frame height coordinate interval of each page element region in the vertical direction according to the frame coordinates; performing combined division on the page element regions according to the frame coordinates, comprising: taking the frame height coordinate interval of the reference element region as a reference coordinate interval; performing same-row region screening on the page element regions according to the reference coordinate interval and the other frame height coordinate intervals to obtain candidate element regions; determining, from the reference coordinate interval and the frame height coordinate intervals of the candidate element regions, the interval with the largest numerical range as the reference coordinate interval; for each page element region, calculating an overlap coordinate value between the reference coordinate interval and the frame height coordinate interval of the page element region, and calculating the ratio of the overlap coordinate value to the interval length of the frame height coordinate interval of the page element region to obtain a height ratio; screening the page element regions according to the height ratio to obtain intermediate element regions; and combining the intermediate element regions, the reference element region and the candidate element regions into one horizontal element combination; taking the page element region that is closest, in the vertical direction, to the page element region corresponding to the reference coordinate interval as a new reference element region, and returning to the step of performing combined division on the page element regions according to the frame coordinates until all the page element regions have been grouped; and, for each horizontal element combination, calculating the region interval coordinates between every two page element regions that are adjacent in the horizontal direction within the horizontal element combination.
6. The method according to any one of claims 1 to 4, wherein the merging of the two target element regions according to the element types of the target element regions comprises: if the element types of the two target element regions are the same and the element type is not image, performing semantic recognition on each target element region to obtain semantic features; performing association degree evaluation on the two semantic features to obtain an element association score; and if the element association score is greater than or equal to a preset association score threshold, merging the two target element regions.
7. The method according to any one of claims 1 to 4, wherein after the merging of the two target element regions according to the element types of the target element regions, the method further comprises a merging operation on adjacent document pages, specifically comprising: for every two page images that are adjacent in sequence in the original document, taking the last page element region of the previous page image and the first page element region of the next page image as cross-page element regions according to the sorting identification; and merging the two cross-page element regions according to the element types of the cross-page element regions.
8. A device for merging document elements, the device comprising: a target recognition module configured to acquire an original document, wherein the original document comprises at least two document pages, and to perform target recognition on a page image of each document page to obtain at least two page element regions and frame coordinates of each page element region; an interval calculation module configured to calculate region interval coordinates according to the frame coordinates, wherein the region interval coordinates are coordinate points between every two page element regions that are adjacent in a preset horizontal direction; a demarcation region determining module configured to determine, according to all the region interval coordinates, a horizontal interval coordinate interval in the horizontal direction and a vertical interval coordinate interval in a vertical direction, and to determine a vertical demarcation region according to the vertical interval coordinate interval and the horizontal interval coordinate interval, wherein the vertical direction is perpendicular to the horizontal direction; a sorting module configured to divide the page image based on the vertical demarcation region to obtain at least two column blocks, and to sort the page element regions according to the column blocks to obtain a sorting identification of each page element region; a target region determining module configured to, for every two adjacent column blocks in the page image, take the last page element region of the previous column block and the first page element region of the next column block as target element regions according to the sorting identification; and an element merging module configured to merge the two target element regions according to the element types of the target element regions.
9. An electronic device comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the method of any one of claims 1 to 7 when executing the computer program.
10. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 7.
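
As an illustration only, the row grouping and gap calculation recited in claim 5 might be sketched in Python roughly as follows. The names `Box`, `height_ratio`, `group_rows`, `horizontal_gaps` and the 0.5 ratio threshold are assumptions introduced for this sketch and do not appear in the application. The sketch also simplifies the claimed procedure: a row consists of the reference region plus every ungrouped region whose height ratio passes the threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Frame coordinates of one page element region; y grows downward."""
    x0: float
    y0: float
    x1: float
    y1: float


def height_ratio(box: Box, top: float, bottom: float) -> float:
    """Overlap with the reference interval divided by the box's own height."""
    height = box.y1 - box.y0
    if height <= 0:
        return 0.0
    overlap = min(box.y1, bottom) - max(box.y0, top)
    return max(overlap, 0.0) / height


def group_rows(boxes: list[Box], min_ratio: float = 0.5) -> list[list[Box]]:
    """Group boxes into horizontal element combinations, top to bottom."""
    remaining = sorted(boxes, key=lambda b: (b.y0, b.x0))
    rows = []
    while remaining:
        ref = remaining[0]  # topmost ungrouped region is the reference
        # Candidates: ungrouped regions whose height interval overlaps the reference.
        candidates = [ref] + [
            b for b in remaining[1:] if b.y0 < ref.y1 and b.y1 > ref.y0
        ]
        # Use the tallest of these intervals as the reference coordinate interval.
        tallest = max(candidates, key=lambda b: b.y1 - b.y0)
        row, rest = [ref], []
        for b in remaining[1:]:
            if height_ratio(b, tallest.y0, tallest.y1) >= min_ratio:
                row.append(b)
            else:
                rest.append(b)
        rows.append(sorted(row, key=lambda b: b.x0))
        remaining = rest
    return rows


def horizontal_gaps(row: list[Box]) -> list[tuple[float, float]]:
    """Gap (left edge, right edge) between horizontally adjacent boxes."""
    return [(a.x1, b.x0) for a, b in zip(row, row[1:]) if b.x0 > a.x1]


# Hypothetical two-column page with two rows of elements.
page = [
    Box(50, 100, 280, 200), Box(320, 105, 550, 195),
    Box(50, 220, 280, 300), Box(320, 230, 550, 310),
]
rows = group_rows(page)
gaps = [horizontal_gaps(row) for row in rows]
```

In this example each row yields one gap between the left and right elements, and such gaps would serve as the region interval coordinates from which a vertical demarcation region is later derived.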

Description

Document element merging method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular to a method and a device for merging document elements, an electronic device, and a storage medium.

Background

In structuring PDF documents, the various elements in a document (e.g., text, tables, and pictures) need to be merged. For example, the elements of a PDF document are typically identified and extracted page by page. As a result, when a paragraph of text spans multiple pages or is split by elements such as tables or pictures, text that belongs to the same paragraph may be incorrectly identified as multiple independent text elements. The text elements belonging to the same paragraph then need to be merged to ensure the continuity and accuracy of the document content. The prior art includes a method for merging cross-page tables using a deep learning model, but that method considers only table elements and is not suitable for some complex documents (such as multi-page PDFs in which other elements span columns within the same page), so the accuracy of merging document elements is low. Therefore, how to improve the accuracy of merging document elements has become a technical problem to be solved.

Disclosure of Invention

The main purpose of the embodiments of the present application is to provide a method and a device for merging document elements, an electronic device, and a storage medium, with the aim of improving the accuracy of merging document elements.
To achieve the above object, a first aspect of the embodiments of the present application provides a method for merging document elements, the method comprising: acquiring an original document, wherein the original document comprises at least two document pages, and performing target recognition on a page image of each document page to obtain at least two page element regions and frame coordinates of each page element region; calculating region interval coordinates according to the frame coordinates, wherein the region interval coordinates are coordinate points between every two page element regions that are adjacent in a preset horizontal direction; determining, according to all the region interval coordinates, a horizontal interval coordinate interval in the horizontal direction and a vertical interval coordinate interval in a vertical direction, and determining a vertical demarcation region according to the vertical interval coordinate interval and the horizontal interval coordinate interval, wherein the vertical direction is perpendicular to the horizontal direction; dividing the page image based on the vertical demarcation region to obtain at least two column blocks, and sorting the page element regions according to the column blocks to obtain a sorting identification of each page element region, wherein each column block comprises at least one page element region; for every two adjacent column blocks in the page image, taking the last page element region of the previous column block and the first page element region of the next column block as target element regions according to the sorting identification; and merging the two target element regions according to the element types of the target element regions.
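The later steps of the method above (deriving a vertical demarcation band, dividing into column blocks, sorting, and selecting target element regions) can be illustrated with a minimal Python sketch. It assumes a two-column page and one gap interval per row. `Element`, `common_gap`, `split_columns`, `target_pairs`, `should_merge`, the `score` callable and the 0.5 threshold are hypothetical names and values chosen for illustration. The type check and score threshold follow the variant recited in claim 6, with the semantic scoring itself left to the caller.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Element:
    kind: str  # element type, e.g. "text", "table" or "image"
    x0: float
    y0: float
    x1: float
    y1: float
    sort_id: int = -1  # sorting identification, assigned by split_columns


def common_gap(gaps: list[tuple[float, float]]) -> Optional[tuple[float, float]]:
    """Intersect one gap interval per row into a shared x band, if any."""
    if not gaps:
        return None
    left = max(g[0] for g in gaps)
    right = min(g[1] for g in gaps)
    return (left, right) if left < right else None


def split_columns(elements: list[Element], band: tuple[float, float]) -> list[list[Element]]:
    """Divide elements into column blocks at the band and number them in order."""
    cut = (band[0] + band[1]) / 2
    columns: list[list[Element]] = [[], []]
    for e in elements:
        center = (e.x0 + e.x1) / 2
        columns[0 if center < cut else 1].append(e)
    # Columns are read left to right; elements inside a column top to bottom.
    columns = [sorted(c, key=lambda e: e.y0) for c in columns if c]
    next_id = 0
    for column in columns:
        for e in column:
            e.sort_id = next_id
            next_id += 1
    return columns


def target_pairs(columns: list[list[Element]]) -> list[tuple[Element, Element]]:
    """Last element of each column block paired with the first of the next."""
    return [(prev[-1], nxt[0]) for prev, nxt in zip(columns, columns[1:])]


def should_merge(a: Element, b: Element,
                 score: Callable[[Element, Element], float],
                 threshold: float = 0.5) -> bool:
    """Merge only same-type, non-image pairs whose association score is high enough."""
    if a.kind != b.kind or a.kind == "image":
        return False
    return score(a, b) >= threshold


# Hypothetical page: two text blocks on the left, text then an image on the right.
a = Element("text", 50, 100, 280, 200)
b = Element("text", 320, 105, 550, 195)
c = Element("text", 50, 220, 280, 300)
d = Element("image", 320, 230, 550, 310)
band = common_gap([(280, 320), (285, 318)])
columns = split_columns([a, b, c, d], band)
pairs = target_pairs(columns)
```

Here the only target pair is the bottom-left text block and the top-right text block, which is the pair a paragraph continuing across columns would occupy.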
In some embodiments, the determining, according to all the region interval coordinates, of a horizontal interval coordinate interval in the horizontal direction and a vertical interval coordinate interval in the vertical direction, and the determining of a vertical demarcation region according to the vertical interval coordinate interval and the horizontal interval coordinate interval, comprise: determining a horizontal demarcation region from the page element regions according to the frame coordinates and the element types of the page element regions; dividing the page image along the horizontal direction based on the horizontal demarcation region to obtain at least two reference plates; for each reference plate, determining the horizontal interval coordinate interval according to the region interval coordinates; for each reference plate, determining the vertical interval coordinate interval according to the region interval coordinates; and determining the vertical demarcation region according to the vertical interval coordinate interval and the horizontal interval coordinate interval.
In some embodiments, after the determining of the vertical demarcation region according to the vertical interval coordinate interval and the horizontal interval coordinate interval, the method further comprises: dividing the reference plate into regions based on the vertical demarcation region to obtain at least two column blocks; performing region sorting on each page element region according to the block coordinates of each column block and the frame coordinates; and performing the step in which, for every two adjacent column blocks in the page image