CN-122022986-A - Data understanding and extracting method for financial data report

Abstract

The invention belongs to the technical field of financial risk control and relates in particular to a data understanding and extraction method for financial data reports. The method comprises: obtaining an original file of a financial data report, the report being stored as an editable PDF file; traversing the table data on each page of the original file and building, based on deep learning, a dynamic mapping between each table header and its corresponding cells; identifying the relevant table headers according to the target to be analyzed; collecting, across pages, the content recorded in the related cells by means of the header-to-cell mapping; splicing all the collected content; and generating standardized structured data and a parsing quality report from the spliced content, so that a large model can call the structured data and its corresponding parsing quality report to perform financial risk control analysis.

Inventors

  • Jing Xuequan
  • LIN SHAOJIE
  • LUO TING
  • ZHU ZHENHUI
  • Ruan Yuxia
  • Zong Zhaoxiu
  • GUAN TONG

Assignees

  • 朗沃格科技(上海)有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-01-23

Claims (10)

  1. A method for data understanding and extraction of financial data reports, comprising: acquiring an original file of a financial data report, wherein the financial data report is saved as an editable PDF file; traversing the table data on each page of the original file, and constructing, for each sub-table, a dynamic mapping between its internal table headers and their corresponding cells; identifying the corresponding sub-tables according to the target to be analyzed, and splicing all contents by using the mapping between each header in the sub-tables and its corresponding cells; and generating standardized structured data and a parsing quality report from the spliced content, so that a large model can call the structured data and its corresponding parsing quality report to perform financial risk control analysis.
  2. The data understanding and extraction method for financial data reports according to claim 1, further comprising, after acquiring the original file of the financial data report: determining whether the original file matches a preset report template; if so, traversing the table data on each page of the original file and constructing, for each sub-table, a dynamic mapping between its internal table headers and their corresponding cells; otherwise, caching the original file in an anomaly list and acquiring the original file of the next financial data report.
  3. The data understanding and extraction method for financial data reports according to claim 2, wherein the step of determining whether the original file matches a preset report template comprises: extracting and parsing the text content of the first page of the original file to identify whether it contains an enterprise name and a unified social credit code; if so, the original file is considered to match the preset report template.
  4. The data understanding and extraction method for financial data reports according to claim 1, wherein the step of constructing a dynamic mapping between the header and the corresponding cells based on deep learning comprises: performing standardized format processing on the table data of each page; identifying and merging all header rows of each sub-table in the processed table data to obtain the actual detailed headers; and, for each detailed header, establishing a mapping between the detailed header and all cells in the corresponding column.
  5. The data understanding and extraction method for financial data reports according to claim 4, wherein the step of performing standardized format processing on the table data of each page comprises: extracting cell information from the table data of each page using a PDF parsing library, and converting all cells into a two-dimensional grid tensor, wherein the text content and position encoding of each cell are embedded in the two-dimensional grid tensor.
  6. The data understanding and extraction method for financial data reports according to claim 4, wherein the step of identifying and merging all header rows comprises: identifying all simple headers in the sub-table from the last of its header rows; and combining each simple header with the corresponding header information in the header rows above it to obtain a detailed header.
  7. The data understanding and extraction method for financial data reports according to claim 4, further comprising, before establishing the mapping between each detailed header and all cells of the corresponding column: aligning the detailed header with the cells of the corresponding column according to the position of the detailed header in the corresponding table data.
  8. The data understanding and extraction method for financial data reports according to claim 1, wherein the step of identifying the corresponding sub-tables according to the target to be analyzed and splicing all contents by using the mapping between each header in the sub-tables and its corresponding cells comprises: for each header in the sub-table, determining the total number of records of the header according to the annotation of the header; traversing the table data of each page, querying the corresponding cells through the mapping path of the header, and splicing the contents recorded in the cells in their front-to-back order until the number of queried cells reaches the total number of records; and integrating the spliced contents of all headers to obtain the spliced content of the sub-table.
  9. The data understanding and extraction method for financial data reports according to claim 8, wherein the step of querying the corresponding cells through the mapping path of the header and splicing the contents recorded in the cells in the order of the cells comprises: filtering out repeated cell content; and filling empty cells with a default value.
  10. The data understanding and extraction method for financial data reports according to claim 1, further comprising, before generating the standardized structured data and the parsing quality report from the spliced content, verifying the spliced content, including: performing compliance verification on the spliced content using a compliance database; and performing logic verification on the spliced content.
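Claims 4 to 9 can be read as a small pipeline: merge multi-row headers into detailed headers, map each detailed header to its column, then splice rows across pages while filtering repeats and filling empty cells. The Python sketch below is only an illustration under stated assumptions and not the patented implementation. It assumes that spanning header cells have already been expanded so that every header row has one label per column, that the total record count (taken from the header annotation in claim 8) is passed in by the caller, and that "repeated cell content" means an identical row repeated at a page break. The function names, the `/` separator, and the `N/A` default are hypothetical.

```python
from typing import Dict, List

Row = List[str]


def merge_header_rows(header_rows: List[Row], sep: str = "/") -> List[str]:
    """Combine each simple header (last header row) with the labels above it."""
    detailed: List[str] = []
    for col in range(len(header_rows[-1])):
        parts: List[str] = []
        for row in header_rows:
            label = row[col].strip()
            # Skip blanks and a label repeated from the row directly above.
            if label and (not parts or parts[-1] != label):
                parts.append(label)
        detailed.append(sep.join(parts))
    return detailed


def splice_pages(
    pages: List[List[Row]],
    headers: List[str],
    total_records: int,
    default: str = "N/A",
) -> List[Dict[str, str]]:
    """Collect rows page by page until total_records rows have been spliced."""
    records: List[Dict[str, str]] = []
    seen = set()
    for page_rows in pages:
        for row in page_rows:
            if len(records) >= total_records:
                return records
            # Fill empty cells with the default value.
            cells = [cell.strip() or default for cell in row]
            key = tuple(cells)
            # Assumption: an identical row is a page-break repeat and is dropped.
            if key in seen:
                continue
            seen.add(key)
            # Positional alignment stands in for the header-to-column mapping.
            records.append(dict(zip(headers, cells)))
    return records


# Hypothetical two-page sub-table with a two-row header.
header_rows = [
    ["Loan", "Loan", "Status"],
    ["Amount", "Balance", ""],
]
pages = [
    [["100", "80", "Normal"], ["200", "", "Overdue"]],
    [["200", "", "Overdue"], ["300", "300", "Normal"]],
]
headers = merge_header_rows(header_rows)
records = splice_pages(pages, headers, total_records=3)
```

Deduplicating on whole rows is a deliberate simplification: a report that legitimately contains two identical records would lose one, so a production system would need a stronger signal, such as a row index or the cell's page position.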

Description

Data understanding and extracting method for financial data report

Technical Field

The invention belongs to the technical field of financial risk control and relates in particular to a data understanding and extraction method for financial data reports.

Background

Financial data reports (such as credit reports) that serve as the basis for risk assessment and compliance review are highly fragmented, unstructured in format, and usually stored as PDF (Portable Document Format) files, whereas downstream financial risk control models and automated approval systems depend on standard, complete, and coherent structured data to work effectively. Existing practice relies heavily on manual extraction and checking, which is time-consuming and error-prone. Efficiency is therefore low and cost is high, and the completeness of cross-page data splicing and the consistency of business logic are even harder to guarantee. This creates a break in the automated processing chain between non-standardized PDF reports and trusted structured data, and hinders the move toward intelligent risk control decisions.

Disclosure of Invention

In view of the above shortcomings of the prior art, the present invention aims to provide a complete method for automatic parsing and structured-data enablement. The method parses a financial data report in PDF format through a four-layer architecture (format adaptation, association analysis, data fusion, verification output), outputs high-quality structured data, and attaches a parsing quality report, so that a large model can call them to perform financial risk control analysis.
The data understanding and extraction method for financial data reports comprises: obtaining an original file of a financial data report, the report being stored as an editable PDF file; traversing the table data on each page of the original file and constructing, for each sub-table, a dynamic mapping between its internal table headers and their corresponding cells; identifying the corresponding sub-table according to the target to be analyzed and splicing all contents by using the mapping between each header in the sub-table and its corresponding cells; and generating standardized structured data and a parsing quality report from the spliced content, so that a large model can call the structured data and its corresponding parsing quality report to perform financial risk control analysis.

In an embodiment of the invention, after the original file of the financial data report is obtained, the method further comprises determining whether the original file matches a preset report template. If so, the table data on each page of the original file is traversed and, for each sub-table, a dynamic mapping between its internal table headers and their corresponding cells is constructed. Otherwise, the original file is cached in an anomaly list and the original file of the next financial data report is obtained.

In an embodiment of the invention, the text content of the first page of the original file is extracted and parsed to identify whether it contains an enterprise name and a unified social credit code; if so, the original file is considered to match the preset report template.
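The template check and anomaly list described above might be sketched as follows. This is a minimal illustration, not the method's actual implementation. It assumes the first-page text has already been extracted by some PDF library, that the enterprise name is signalled by a field label (the labels in `NAME_LABELS` are hypothetical), and that the unified social credit code can be recognized by its 18-character format alone, without validating the check character.

```python
import re
from typing import List, Tuple

# Assumption: format-only match for an 18-character unified social credit code
# (digits and uppercase letters excluding I, O, S, V, Z); no checksum validation.
USCC_PATTERN = re.compile(
    r"(?<![0-9A-Z])[0-9A-HJ-NPQRTUWXY]{2}\d{6}[0-9A-HJ-NPQRTUWXY]{10}(?![0-9A-Z])"
)

# Hypothetical field labels that indicate an enterprise name is present.
NAME_LABELS = ("企业名称", "Enterprise name")


def is_preset_template(first_page_text: str) -> bool:
    """Return True if the first page has both a name label and a code."""
    has_name = any(label in first_page_text for label in NAME_LABELS)
    has_code = USCC_PATTERN.search(first_page_text) is not None
    return has_name and has_code


def triage(reports: List[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Split (filename, first_page_text) pairs into accepted files and anomalies."""
    accepted: List[str] = []
    anomalies: List[str] = []
    for filename, first_page_text in reports:
        if is_preset_template(first_page_text):
            accepted.append(filename)
        else:
            # Non-matching files are set aside and processing moves on.
            anomalies.append(filename)
    return accepted, anomalies


# Hypothetical inputs; the code value below is made up for illustration.
reports = [
    ("report_a.pdf", "Enterprise name: Example Co., Ltd.\n"
                     "Unified social credit code: 91310115MA1K3ABC4D"),
    ("scan_b.pdf", "Quarterly newsletter"),
]
accepted, anomalies = triage(reports)
```

Keeping rejected files in a separate list rather than raising an error matches the described behaviour of moving on to the next report while preserving the anomalies for later review.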
In one embodiment of the invention, the step of establishing a dynamic mapping between the header and the corresponding cells based on deep learning comprises: performing standardized format processing on the table data of each page; identifying and merging all header rows of each sub-table in the processed table data to obtain the actual detailed headers; and establishing a mapping between each detailed header and all cells of the corresponding column.

In one embodiment of the invention, the step of performing standardized format processing on the table data of each page comprises extracting cell information from the table data of each page using a PDF parsing library and converting all cells into a two-dimensional grid tensor, in which the text content and position encoding of each cell are embedded.

In an embodiment of the invention, the step of identifying and merging all header rows comprises identifying all simple headers from the last of the header rows in the sub-table, and merging each simple header with the corresponding header information in the header rows above it to obtain a detailed header.

In an embodiment of the invention, before the mapping between each detailed header and all cells of the corresponding column is established, the method further comprises aligning the detailed header with the cells of the corresponding column according to the position of the detailed header in the corresponding table data.

In an embodiment of the invention, the steps of identifying the corresponding sub-table according to the object to be