CN-122024261-A - Cross-page table identification method in document
Abstract
The application discloses a method for identifying cross-page tables in a document. The method comprises: detecting a table region in a document page image and cropping the region to obtain a table image; performing table structure recognition on the cropped table image; taking the table at the bottom of a first page as a first table and the table at the top of a second page as a second table, the first page and the second page being consecutive pages; and judging the first table and the second table to be a cross-page table if the header of the first table is identical to the header of the second table, or if the last row of the first table and the first row of the second table have the same table structure and the contents of the cells in each corresponding column belong to the same entity category. The application thereby enables cross-page tables to be identified.
Inventors
- WU YIZI
- TUO SUXING
- CAI JIAXIAO
- LI KE
- KONG BO
- DU WEN
Assignees
- 湖南中烟工业有限责任公司
Dates
- Publication Date
- 20260512
- Application Date
- 20241111
Claims (10)
- 1. A method for identifying cross-page tables in a document, comprising: S1, table position detection: detecting a table region in a document page image, and cropping the table region to obtain a table image; S2, table structure recognition: performing table structure recognition on the cropped table image; S3, cross-page table judgment; S3.1, structure judgment: taking the table at the bottom of a first page as a first table and the table at the top of a second page as a second table; judging whether the second table has a header; if the second table has a header, judging whether the header of the first table is identical to the header of the second table, and if so, judging the first table and the second table to be a cross-page table, otherwise judging them not to be a cross-page table; if the second table has no header, judging whether the last row of the first table and the first row of the second table have the same table structure, and if not, judging the first table and the second table not to be a cross-page table, otherwise proceeding to step S3.2; S3.2, semantic relation judgment: using a trained named entity recognition (NER) model to recognize the cell contents of each corresponding column in the last row of the first table and the first row of the second table and determine their entity categories, and judging the first table and the second table to be a cross-page table if the cell contents of every corresponding column belong to the same entity category.
- 2. The method for identifying cross-page tables in a document of claim 1, wherein the method further comprises: S4, table merging: merging the information of the cross-page tables to obtain merged table information.
- 3. The method for identifying cross-page tables in a document of claim 2, wherein the method further comprises: S5, table format conversion: converting the merged table information into HTML format to generate a standard HTML table tag structure (<table>, <tr>, <th>, <td>), wherein tr represents a row, td represents a cell, and th represents a header cell, the tags being obtained from the table structure recognition.
- 4. The method for identifying cross-page tables in a document of claim 1, wherein the method further comprises: if the first table and the second table are judged to be a cross-page table, the second table occupies the whole of the second page and has no header, and the table at the top of a third page is a third table, the second page and the third page being consecutive pages, then treating the header of the first table as the header of the second table, and judging whether the second table and the third table are a cross-page table.
- 5. A system for identifying cross-page tables in a document, characterized by comprising a table position detection module, a table structure identification module and a cross-page table judgment module; the table position detection module is configured to detect a table region in a document page image and crop the table region to obtain a table image; the table structure identification module is configured to perform table structure recognition on the cropped table image; the cross-page table judgment module is configured to perform structure judgment, comprising: taking the table at the bottom of a first page as a first table and the table at the top of a second page as a second table; judging whether the second table has a header; if the second table has a header, judging whether the header of the first table is identical to the header of the second table, and if so, judging the first table and the second table to be a cross-page table, otherwise judging them not to be a cross-page table; if the second table has no header, judging whether the last row of the first table and the first row of the second table have the same table structure, and if not, judging the first table and the second table not to be a cross-page table, otherwise proceeding to semantic relation judgment; the cross-page table judgment module is further configured to perform semantic relation judgment, comprising: using a trained named entity recognition (NER) model to recognize the cell contents of each corresponding column in the last row of the first table and the first row of the second table and determine their entity categories, and judging the first table and the second table to be a cross-page table if the cell contents of every corresponding column belong to the same entity category.
- 6. The system for identifying cross-page tables in a document of claim 5, wherein the system further comprises: a table merging module, configured to merge the information of the cross-page tables to obtain merged table information.
- 7. The system for identifying cross-page tables in a document of claim 6, wherein the system further comprises: a table format conversion module, configured to convert the merged table information into HTML format to generate a standard HTML table tag structure (<table>, <tr>, <th>, <td>), wherein tr represents a row, td represents a cell, and th represents a header cell, the tags being obtained from the table structure recognition.
- 8. An electronic device, characterized by comprising a memory and a processor; the memory is configured to store a computer program; the processor is configured to invoke the computer program to perform the method of any one of claims 1 to 4.
- 9. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when run on an electronic device, causes the electronic device to implement the method of any of claims 1 to 4.
- 10. A computer program product comprising a computer program which, when run on an electronic device, causes the electronic device to implement the method of any one of claims 1 to 4.
Description
Cross-page table identification method in document
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a method for identifying cross-page tables in a document.
Background
Document understanding, whether for training or for document-grounded dialogue, is an important part of the multi-modal understanding capability of large models such as large language models (LLMs); PDF documents are a typical example. The first step in document understanding is to convert all information in the document into a plain-text format such as HTML5 (h5) or Markdown. In PDF document understanding, table understanding is the most important task, and cross-page tables are especially difficult: because such a table is split across pages, the prior art generally recognizes it as separate tables on separate pages and stores them in different data chunks, which impairs the LLM's ability to answer factual questions from the table. In LLM training, failure to extract cross-page table information accurately likewise impairs the LLM's ability to answer similar questions.
Disclosure of Invention
The technical problem solved by the invention is to provide a method for identifying cross-page tables in a document, which identifies the relationship between the tables on two adjacent pages, judges whether the tables on different pages form a cross-page table and, if so, merges their structures. This addresses the problem that LLMs previously could not effectively read and understand cross-page tables, and improves the accuracy with which an LLM ingests and understands documents.
In a first aspect, the present application provides a method for identifying cross-page tables in a document, including:
S1, table position detection: detecting a table region in a document page image, and cropping the table region to obtain a table image;
S2, table structure recognition: performing table structure recognition on the cropped table image;
S3, cross-page table judgment;
S3.1, structure judgment: taking the table at the bottom of a first page as a first table and the table at the top of a second page as a second table; judging whether the second table has a header; if the second table has a header, judging whether the header of the first table is identical to the header of the second table, and if so, judging the first table and the second table to be a cross-page table, otherwise judging them not to be a cross-page table; if the second table has no header, judging whether the last row of the first table and the first row of the second table have the same table structure, and if not, judging the first table and the second table not to be a cross-page table, otherwise proceeding to step S3.2;
S3.2, semantic relation judgment: using a trained named entity recognition (NER) model to recognize the cell contents of each corresponding column in the last row of the first table and the first row of the second table and determine their entity categories, and judging the first table and the second table to be a cross-page table if the cell contents of every corresponding column belong to the same entity category.
In a possible implementation manner of the first aspect, the method further includes: S4, table merging: merging the information of the cross-page tables to obtain merged table information.
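The judgment in steps S3.1 and S3.2 and the merging in step S4 can be illustrated with a minimal Python sketch. It is not the applicant's implementation and rests on several assumptions: the `Table` class, the function names, and the reading of "same table structure" as "same number of cells" are hypothetical, and `toy_entity_category` is a rule-based stand-in for the trained NER model, which the application does not specify in code.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional

Row = List[str]


@dataclass
class Table:
    """Hypothetical container for one recognized table (output of S1 and S2)."""
    rows: List[Row]               # body rows, each a list of cell texts
    header: Optional[Row] = None  # header cells, or None if no header was found


def same_structure(a: Row, b: Row) -> bool:
    # Assumption: "same table structure" is reduced to "same number of cells".
    return len(a) == len(b)


def toy_entity_category(text: str) -> str:
    # Stand-in for the trained NER model: a few regex rules, illustration only.
    text = text.strip()
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        return "date"
    if re.fullmatch(r"\d+(\.\d+)?%", text):
        return "percentage"
    if re.fullmatch(r"\d+(\.\d+)?", text):
        return "cardinal"
    return "short text" if len(text) <= 20 else "long text"


def is_cross_page(first: Table, second: Table,
                  entity_category: Callable[[str], str]) -> bool:
    # S3.1: if the second table has a header, only the headers are compared.
    if second.header is not None:
        return first.header == second.header
    if not first.rows or not second.rows:
        return False
    last_row, first_row = first.rows[-1], second.rows[0]
    # S3.1: without a header, the boundary rows must share the same structure.
    if not same_structure(last_row, first_row):
        return False
    # S3.2: every corresponding column must fall in the same entity category.
    return all(entity_category(a) == entity_category(b)
               for a, b in zip(last_row, first_row))


def merge_tables(first: Table, second: Table) -> Table:
    # S4: append the second table's rows and keep the first table's header.
    return Table(rows=first.rows + second.rows, header=first.header)


page1 = Table(header=["Date", "Item", "Share"],
              rows=[["2024-01-05", "Widget", "12%"]])
page2 = Table(rows=[["2024-01-06", "Gadget", "15%"]])

if is_cross_page(page1, page2, toy_entity_category):
    merged = merge_tables(page1, page2)
```

Passing the classifier in as a callable keeps the structural checks independent of whichever NER model is actually used. In this sketch the merged table keeps the first table's header.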
In a possible implementation manner of the first aspect, the method further includes: S5, table format conversion: converting the merged table information into HTML format to generate a standard HTML table tag structure (<table>, <tr>, <th>, <td>), wherein tr represents a row, td represents a cell, and th represents a header cell, the tags being obtained from the table structure recognition.
In a possible implementation manner of the first aspect, the method further includes: if the first table and the second table are judged to be a cross-page table, the second table occupies the whole of the second page and has no header, and the table at the top of a third page is a third table, the second page and the third page being consecutive pages, then treating the header of the first table as the header of the second table, and judging whether the second table and the third table are a cross-page table.
In one possible implementation manner of the first aspect, the entity categories include long text, short text, cardinal number, date, event, facility, geopolitical entity, language name, law/act, other location, monetary amount, ethnic/religious or political group, ordinal number, organization or company name, percentage, person name, product name, quantity, time and work.
The application provides a system for identifying cross-page tables in a document, which comprises a table position detection module, a table structure identification module and a cross-page table judgment modul