CN-122021629-A - Standardized data block processing system supporting multi-source data format
Abstract
The invention discloses a standardized data chunking processing system supporting multi-source data formats, relating to the field of data processing. The system comprises a preprocessing and format unification module, an intelligent pre-analysis and strategy selection module, a hybrid-reasoning deep chunking engine, a chunking result optimization and verification module, and a structured data output module. The system supports compatible processing of multi-source data formats, dynamically divides chapter structures and logical units based on the text content, preserves semantically associated content, and avoids cross-modal semantic splitting, thereby improving the universality of the chunking strategy. It achieves intelligent chunking of multi-source data, markedly reduces the need for manual intervention, and improves processing speed and efficiency. It is particularly suitable for large-scale document processing scenarios, can directly serve downstream applications such as knowledge graph construction and intelligent retrieval, and significantly improves data utilization efficiency.
Inventors
- CUI CONGJUN
- WANG YAN
- ZHU GUANCHEN
Assignees
- 中绍宣科技集团有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20260407
Claims (10)
- 1. A standardized data chunking processing system supporting multi-source data formats, comprising: a preprocessing and format unification module for extracting text information from multi-source documents and converting it into standardized-format text that preserves the structure hierarchy, element relations, and meta-information of the original document; an intelligent pre-analysis and strategy selection module for acquiring the document type of the standardized-format text and obtaining a chunking strategy package according to the document type and a strategy library matching mechanism; a hybrid-reasoning deep chunking engine for performing semantic understanding of the standardized-format text with a pre-trained language model, outputting facts, and performing logical reasoning over the facts in combination with the chunking strategy package to obtain a preliminary chunking result; a chunking result optimization and verification module for adjusting and verifying the preliminary chunking result through an isolated content merging and boundary checking mechanism to obtain a final chunking result; and a structured data output module for outputting the content of the chunking result in a structured data format.
- 2. The standardized data chunking processing system supporting multi-source data formats of claim 1, wherein the preprocessing and format unification module comprises a parsing module, a hierarchy conversion module, and an association labeling module; the parsing module is used for acquiring geometric layout information and page layout information of the document and performing header and footer elimination, multi-column layout reordering, and table reconstruction; the hierarchy conversion module is used for identifying heading levels with heuristic rules or classification models and converting them into heading marks in the standardized format; and the association labeling module is used for identifying associated content pairs in the document and linking them to non-text elements through placeholders.
- 3. The standardized data chunking processing system supporting multi-source data formats of claim 1, wherein acquiring the document type of the standardized-format text comprises: scanning the standardized-format text to obtain key features for classification; and, based on those key features, classifying the document type of the standardized-format text by combining keyword matching, metadata analysis, and statistical analysis of the heading structure with a classification model.
- 4. The standardized data chunking processing system of claim 1, wherein obtaining a chunking strategy package according to the document type and the strategy library matching mechanism comprises: matching corresponding chunking strategies from a strategy library according to the type labels of the document type; and, when the matching succeeds, assembling the chunking strategies into a chunking strategy package and sending the package to the hybrid-reasoning deep chunking engine.
- 5. The standardized data chunking processing system supporting multi-source data formats of claim 4, wherein the strategy library includes document type labels and strategy objects, each strategy object comprising: a set of logic rules for providing semantic integrity criteria for document content; a set of regular expression templates for identifying key structural boundaries in a document; and a domain keyword list for providing standard domain keywords.
- 6. The standardized data chunking processing system supporting multi-source data formats of claim 1, wherein performing logical reasoning over the facts in combination with the chunking strategy package to obtain the preliminary chunking result comprises: matching the facts against rules in a logic rule base through an inference engine, and continuing to infer until all chunk boundaries that satisfy the logic rules, are most stable, and best preserve semantic integrity are found; and obtaining the preliminary chunking result based on the decision path of the chunk boundaries.
- 7. The standardized data chunking processing system supporting multi-source data formats of claim 6, wherein construction of the logic rules comprises: building a corpus containing multi-source formats, uniformly processing the documents in the corpus into the standardized format, extracting structure, content, and relation features, and converting these features into facts to obtain a fact database; and, based on the fact database, inducing a logic rule set that maximizes coverage of all positive instances in the corpus while minimizing coverage of negative instances.
- 8. The standardized data chunking processing system supporting multi-source data formats of claim 1, wherein the hybrid-reasoning deep chunking engine includes a semantic analysis model, and labeling each portion of the text with the semantic analysis model comprises: constructing a label system comprising preconditions, conclusions, method descriptions, and object definitions, and labeling samples extracted from target documents according to the label system to obtain a labeled data set; inputting the labeled data set into a pre-trained language model and, by computing the loss between the probability distribution predicted by the model and the annotated ground-truth labels, updating the classification head and the parameters of the pre-trained language model with the backpropagation algorithm to obtain the semantic analysis model; and outputting the label for each portion of the text with the semantic analysis model.
- 9. The standardized data chunking processing system supporting multi-source data formats of claim 1, wherein the chunking result optimization and verification module comprises an isolated content merging module and a boundary checking module; the isolated content merging module is used for acquiring isolated chunks in the preliminary chunking result and merging them into other chunks according to the semantic similarity between the isolated chunks and their adjacent chunks; and the boundary checking module is used for checking and fine-tuning the chunk boundaries in the preliminary chunking result.
- 10. The standardized data chunking processing system of claim 9, wherein merging isolated chunks into other chunks based on the semantic similarity between isolated chunks and adjacent chunks comprises: traversing the preliminary chunking result, identifying isolated chunks according to a preset length threshold, and acquiring the chunks adjacent to each isolated chunk; converting the text content of the isolated chunks and the adjacent chunks into high-dimensional semantic vectors with a pre-trained sentence embedding model; computing the semantic association degree between each isolated chunk and its adjacent chunks from the high-dimensional semantic vectors using cosine similarity; and merging the isolated chunks with the adjacent chunks according to the semantic association degree and a preset merging threshold, and updating the chunk list.
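The inference step of claim 6, in which facts are matched against rules in a logic rule base until stable chunk boundaries are found, can be sketched as a minimal forward-chaining loop. This is an illustrative reconstruction, not the patented implementation; the fact and rule encodings below are invented for the example.

```python
def forward_chain(facts, rules):
    """Minimal forward-chaining inference sketch: repeatedly fire any
    rule whose premises are all present in the fact set, adding its
    conclusion as a new fact, until a fixpoint is reached (no rule can
    derive anything new)."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in facts and premises <= facts:
                facts.add(conclusion)
                changed = True
    return facts


# Hypothetical encoding: facts are tuples, and a rule states that a
# heading preceded by a blank line implies a chunk boundary there.
facts = {("heading", 10), ("blank_before", 10), ("heading", 25)}
rules = [
    (frozenset({("heading", 10), ("blank_before", 10)}), ("boundary", 10)),
    (frozenset({("heading", 25), ("blank_before", 25)}), ("boundary", 25)),
]
derived = forward_chain(facts, rules)
```

Here only line 10 satisfies both premises, so only `("boundary", 10)` is derived; line 25 lacks the blank-line fact and yields no boundary.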
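The isolated-chunk merging of claim 10 can be sketched in a few lines. This is a hedged illustration only: the patent specifies a pre-trained sentence embedding model, which is replaced here by a toy bag-of-words embedding so the sketch stays self-contained; the sketch merges an isolated chunk only into its preceding neighbour, and the length and similarity thresholds are invented.

```python
import math
import re


def embed(text):
    # Toy stand-in for the patent's pre-trained sentence embedding
    # model: a bag-of-words frequency vector keyed by lowercased token.
    counts = {}
    for token in re.findall(r"\w+", text.lower()):
        counts[token] = counts.get(token, 0) + 1
    return counts


def cosine_similarity(a, b):
    # Cosine similarity between two sparse frequency vectors.
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def merge_isolated_chunks(chunks, min_len=40, merge_threshold=0.2):
    """Traverse the preliminary chunk list, flag chunks shorter than
    min_len as isolated, and merge each into the preceding chunk when
    their similarity clears merge_threshold; otherwise keep it."""
    result = []
    for chunk in chunks:
        if result and len(chunk) < min_len:
            sim = cosine_similarity(embed(chunk), embed(result[-1]))
            if sim >= merge_threshold:
                result[-1] = result[-1] + " " + chunk
                continue
        result.append(chunk)
    return result
```

For example, a short fragment such as "Steel lining of the reactor." following a chunk about the reactor's steel lining clears the threshold and is absorbed, while a long chunk on an unrelated topic is kept as its own entry in the updated chunk list.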
Description
Standardized data chunking processing system supporting multi-source data formats

Technical Field

The invention relates to the field of data processing, and in particular to a standardized data chunking processing system supporting multi-source data formats.

Background

In the development of informatization, data is of great importance. During data preprocessing, dividing data into modules benefits subsequent steps such as data feature extraction. However, most current data chunking algorithms divide modules based on a fixed token count, which can semantically split content that belongs together and leaves subsequent information extraction incomplete. Specifically:

1. Semantic fragmentation and broken key information: cutting text purely by a fixed token count, for example one module per 500 words, can forcibly sever complete semantic units such as a paragraph, a technical term, or an event description.

2. Lost context association and increased risk of semantic ambiguity: once semantically related content is divided into different modules, it loses contextual support, which can cause ambiguity; the attributes of extracted index objects become disconnected from the actual content, making the structured data inaccurate.

3. Poor adaptability to domain knowledge and neglect of industry characteristics: documents in different domains, such as medicine, chemical engineering, and IT, have unique semantic structures and specialized terminology systems, and fixed-token-count division cannot adapt to these industry characteristics.

4. Low data utilization efficiency and wasted computation: irrelevant content is drawn into the same module, for example two paragraphs on unrelated subjects forcibly combined, so redundant information must be processed during extraction and computing resources are wasted.

For these problems in the related art, no effective solution has yet been proposed.

Disclosure of Invention

To solve the problems in the related art, the present invention provides a standardized data chunking processing system supporting multi-source data formats, so as to overcome the above technical problems. To this end, the invention adopts the following specific technical scheme:

A standardized data chunking processing system supporting multi-source data formats comprises: a preprocessing and format unification module for extracting text information from multi-source documents and converting it into standardized-format text that preserves the structure hierarchy, element relations, and meta-information of the original document; an intelligent pre-analysis and strategy selection module for acquiring the document type of the standardized-format text and obtaining a chunking strategy package according to the document type and a strategy library matching mechanism; a hybrid-reasoning deep chunking engine for performing semantic understanding of the standardized-format text with a pre-trained language model and outputting facts; a chunking result optimization and verification module for adjusting and verifying the preliminary chunking result through an isolated content merging and boundary checking mechanism to obtain a final chunking result; and a structured data output module for outputting the content of the chunking result in a structured data format.
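The document-type determination described in the scheme (keyword matching, metadata analysis, and heading-structure statistics feeding a classification model) can be approximated by a simple scoring sketch. The keyword profiles, metadata key, and weight below are invented for illustration and stand in for the trained classification model.

```python
def classify_document(text, metadata, keyword_profiles):
    """Score each candidate document type by counting its profile
    keywords in the text, add a bonus when document metadata declares
    a matching type hint, and return the highest-scoring type."""
    text_lower = text.lower()
    scores = {}
    for doc_type, keywords in keyword_profiles.items():
        score = sum(text_lower.count(kw) for kw in keywords)
        if metadata.get("type_hint") == doc_type:
            score += 5  # metadata agreement bonus (arbitrary weight)
        scores[doc_type] = score
    return max(scores, key=scores.get)


# Hypothetical keyword profiles for two document domains.
profiles = {
    "legal": ["claim", "hereby", "pursuant"],
    "medical": ["patient", "dose", "diagnosis"],
}
```

A real system would combine such scores with heading-structure statistics and a trained classifier; this sketch only shows how the keyword and metadata signals could be fused into a single decision.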
Further, the preprocessing and format unification module comprises a parsing module, a hierarchy conversion module, and an association labeling module; the parsing module is used for acquiring geometric layout information and page layout information of the document and performing header and footer elimination, multi-column layout reordering, and table reconstruction; the hierarchy conversion module is used for identifying heading levels with heuristic rules or classification models and converting them into heading marks in the standardized format; and the association labeling module is used for identifying associated content pairs in the document and linking them to non-text elements through placeholders.

Further, acquiring the document type of the standardized-format text includes: scanning the standardized-format text to obtain key features for classification; and, based on those key features, classifying the document type of the standardized-format text by combining keyword matching, metadata analysis, and statistical analysis of the heading structure with a classification model.

Further, obtaining the chunking strategy package according to the document type and the strategy library matching mechanism includes: matching corresponding chunking strategies from a strategy library according to the typ