CN-122021821-A - Knowledge graph-based data standard document analysis method and device
Abstract
The invention discloses a knowledge-graph-based data standard document parsing method and device, and relates to the technical field of data processing. The method comprises: receiving an electronic data stream of a data standard document in a target field and performing layout structuring processing on it to generate a standardized structured text; performing, in parallel, regular-pattern feature matching and semantic-vector entity extraction on the standardized structured text to generate a first extracted data set and a second extracted data set; performing conflict detection and weighted fusion on the two data sets; generating verification requests for low-confidence objects and receiving manual correction instructions; in response to the effective data elements and the correction instructions, constructing a data standard knowledge graph according to a preset data element model; and generating incremental training samples based on the correction instructions to update the parsing rule base or fine-tune the parameters of the natural language processing model. The method addresses the low efficiency, poor accuracy, missing semantic associations, and lack of adaptive optimization of the prior art.
Inventors
- Yang Junhan
- Liu Fukuo
- Zu Xia
- Duan Linjia
- Liao Guoyuan
Assignees
- Chongqing Digital Resources Group Co., Ltd. (重庆数字资源集团有限公司)
Dates
- Publication Date: 2026-05-12
- Application Date: 2025-12-18
Claims (10)
- 1. A knowledge-graph-based data standard document parsing method, characterized by comprising the following steps: receiving an electronic data stream of a target-field data standard document to be parsed, and performing layout structuring processing on the electronic data stream to generate a standardized structured text containing hierarchical structure labels and content data; invoking a preset parsing rule base and a natural language processing model, and executing in parallel on the standardized structured text a regular-pattern feature matching process to generate a first extracted data set and a semantic-vector entity extraction process to generate a second extracted data set; performing conflict detection and weighted fusion on the first extracted data set and the second extracted data set, generating a preliminary data element object set, and calculating confidence values; comparing each confidence value with a preset threshold, screening effective data elements, generating a verification request for each data element object whose confidence value is lower than the preset threshold, and receiving correction instruction data for those data element objects through an interactive terminal; in response to the effective data elements and the correction instruction data, mapping data element attributes to graph node attributes according to a preset data element model, and establishing association edges between data element nodes to generate a data standard knowledge graph; and generating incremental training samples based on the correction instruction data, and updating rules of the parsing rule base or fine-tuning parameters of the natural language processing model.
- 2. The knowledge-graph-based data standard document parsing method according to claim 1, wherein performing layout structuring processing on the electronic data stream to generate the standardized structured text containing hierarchical structure labels and content data comprises: if the document is in an image format, invoking an optical character recognition (OCR) algorithm to convert image pixel data into character-encoded data, and executing image denoising and geometric correction algorithms; invoking a layout analysis algorithm, extracting geometric features of the document, and identifying and marking the coordinate ranges of chapter title regions, paragraph text regions, table structure regions, list item regions, and annex regions; based on the identified region coordinate ranges, adding hierarchical XML or JSON structure labels to the character-encoded data or the original text data to construct tree-structured intermediate format data, wherein the intermediate format data comprises hierarchical nodes that identify data elements, table nodes that identify data element attributes, and reference nodes that identify code table references; and serializing and outputting the standardized structured text that fuses the plain-text content with the hierarchical structure labels.
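The tree-structured intermediate format of claim 2 can be illustrated with a minimal sketch. All field names (`type`, `bbox`, `text`, `node`, `children`) and the region list are assumptions chosen for illustration, not the patent's actual schema; the claim only requires that recognized layout regions be nested into hierarchical XML or JSON labels.

```python
import json

# Hypothetical sketch: wrap layout-analysis regions into a tree-structured
# JSON intermediate format as described in claim 2. Chapter titles open a new
# hierarchical node; other regions (tables, paragraphs) become its children.
def build_intermediate(regions):
    """regions: list of dicts with 'type', 'bbox', 'text' from layout analysis."""
    root = {"node": "document", "children": []}
    current_chapter = None
    for r in regions:
        if r["type"] == "chapter_title":
            current_chapter = {"node": "chapter", "title": r["text"], "children": []}
            root["children"].append(current_chapter)
        else:
            target = current_chapter["children"] if current_chapter else root["children"]
            target.append({"node": r["type"], "bbox": r["bbox"], "text": r["text"]})
    return root

regions = [
    {"type": "chapter_title", "bbox": [0, 0, 600, 40], "text": "5 Data Elements"},
    {"type": "table", "bbox": [0, 50, 600, 300], "text": "Identifier | Name | Type"},
]
tree = build_intermediate(regions)
print(json.dumps(tree, ensure_ascii=False, indent=2))
```

Serializing this tree (here via `json.dumps`) corresponds to the final "serializing and outputting" step of the claim.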
- 3. The knowledge-graph-based data standard document parsing method according to claim 1, wherein executing the regular-pattern feature matching process to generate the first extracted data set comprises: loading, from the parsing rule base, a parsing rule set matching the domain identifier of the current document, wherein the parsing rule set comprises precompiled regular expressions, a keyword index table, and position anchor logic; traversing the standardized structured text with the regular expressions, matching text fragments that conform to preset format features, and extracting data element name fields, data element identifier fields, and data format fields; calculating the offsets of attribute description keywords in the text using the keyword index table, and, in combination with the position anchor logic, intercepting the text fragments following the keywords as candidate attribute values; and parsing the structure labels of table regions in the standardized structured text, identifying the text content of header cells, establishing a mapping between column indexes and attribute types, extracting the cell contents of table data rows as attribute values of the corresponding data elements, and generating the first extracted data set.
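The precompiled-regular-expression step of claim 3 might look like the following minimal sketch. The patterns, field names, and the `DE\d{4}` identifier convention are illustrative assumptions, not rules taken from the patent's actual rule base.

```python
import re

# Illustrative rule set for the regular-pattern track (claim 3).
# In the patent these would be loaded from the parsing rule base keyed
# by the document's domain identifier; patterns here are assumptions.
RULES = {
    "identifier": re.compile(r"Identifier:\s*(DE\d{4})"),
    "name": re.compile(r"Name:\s*([^\n;]+)"),
    "format": re.compile(r"Format:\s*([A-Za-z0-9.]+)"),
}

def match_fields(text):
    """Extract data element fields from a text fragment via precompiled regexes."""
    out = {}
    for field, pattern in RULES.items():
        m = pattern.search(text)
        if m:
            out[field] = m.group(1).strip()
    return out

snippet = "Name: Person Given Name; Identifier: DE0101; Format: a..50"
print(match_fields(snippet))
# → {'identifier': 'DE0101', 'name': 'Person Given Name', 'format': 'a..50'}
```

The keyword index table and position anchor logic in the claim would refine this by locating keywords first and then slicing the trailing fragment, rather than scanning the whole text per pattern.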
- 4. The knowledge-graph-based data standard document parsing method according to claim 1, wherein the semantic-vector entity extraction process that generates the second extracted data set comprises: segmenting the standardized structured text into a plurality of semantic text sequences and inputting them into a pre-trained large language model, the large language model having acquired vector representations of standard terms through training on a domain-specific corpus; calculating dependency weights between words in each semantic text sequence using the attention layers of the large language model, identifying entity boundaries in long text descriptions, and extracting data element definition, data type constraint, value range, and business constraint entities; performing normalized format conversion on the extracted entity content, mapping natural language descriptions to structured key-value pair data; and outputting the second extracted data set comprising semantic context vectors and attribute key-value pairs.
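The last step of claim 4, normalizing extracted natural-language constraints into structured key-value pairs, can be sketched as below. The phrasing patterns and output keys (`max_length`, `nullable`, `enum`) are assumptions for illustration; the patent does not specify the normalization vocabulary.

```python
import re

# Minimal normalization sketch (claim 4's final step): map a natural-language
# constraint description, as extracted by the language model, to key-value pairs.
def normalize_entity(description):
    kv = {}
    m = re.search(r"at most (\d+) characters", description)
    if m:
        kv["max_length"] = int(m.group(1))
    if "required" in description or "shall not be empty" in description:
        kv["nullable"] = False
    m = re.search(r"one of (.+)", description)
    if m:
        kv["enum"] = [v.strip() for v in m.group(1).split(",")]
    return kv

print(normalize_entity("The value is required and contains at most 50 characters"))
# → {'max_length': 50, 'nullable': False}
```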
- 5. The knowledge-graph-based data standard document parsing method according to claim 1, wherein performing conflict detection and weighted fusion on the first extracted data set and the second extracted data set comprises: aligning objects in the first extracted data set with objects in the second extracted data set based on data element identifiers; comparing the aligned attribute values field by field, and, if the attribute values are consistent, merging the data and increasing the confidence weight of the corresponding attribute field; if the attribute values are inconsistent, selecting according to preset source-weight logic: for strongly structured attributes such as data formats, identifiers, and enumeration values, selecting the value from the first extracted data set, and for unstructured attributes such as data definitions, business descriptions, and complex constraints, selecting the value from the second extracted data set; and calculating a weighted confidence value for each fused data element object based on the number of rule-matching hits, the output probability of the NLP model, and the non-empty rate of the attribute fields.
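The source-weight fusion and confidence computation of claim 5 can be sketched as follows. The field classifications, the 0.4/0.3/0.3 weights, and all names are illustrative assumptions; the claim only fixes the three confidence factors, not their weights.

```python
# Sketch of claim 5's fusion logic. Structured fields prefer the rule track
# (first extracted data set); descriptive fields prefer the NLP track (second).
STRUCTURED_FIELDS = {"data_format", "identifier", "enum_values"}

def fuse(rule_obj, nlp_obj, nlp_prob):
    """Fuse two aligned extraction results and compute a weighted confidence."""
    fused, agree = {}, 0
    fields = set(rule_obj) | set(nlp_obj)
    for f in fields:
        a, b = rule_obj.get(f), nlp_obj.get(f)
        if a == b and a is not None:          # consistent: merge, count agreement
            fused[f] = a
            agree += 1
        elif f in STRUCTURED_FIELDS:          # conflict: prefer rule-track value
            fused[f] = a if a is not None else b
        else:                                 # conflict: prefer NLP-track value
            fused[f] = b if b is not None else a
    non_empty = sum(v is not None for v in fused.values()) / max(len(fields), 1)
    agreement = agree / max(len(fields), 1)
    # Assumed weights over: rule/NLP agreement, model probability, non-empty rate.
    confidence = 0.4 * agreement + 0.3 * nlp_prob + 0.3 * non_empty
    return fused, confidence

rule_obj = {"identifier": "DE0101", "data_format": "a..50"}
nlp_obj = {"identifier": "DE0101", "definition": "Given name of a person"}
fused, conf = fuse(rule_obj, nlp_obj, nlp_prob=0.9)
print(fused, round(conf, 2))
```

A fused object whose confidence falls below the preset threshold would then be routed to the verification queue of claim 9.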
- 6. The knowledge-graph-based data standard document parsing method according to claim 1, wherein mapping data element attributes to graph node attributes according to the preset data element model and establishing association edges between data element nodes comprises: instantiating discrete data element information as objects conforming to a preset metadata specification, filling in the identifier, Chinese name, English name, synonym, definition, data type, representation format, and value domain attribute fields; executing an association analysis algorithm over the data element object set and establishing the following association edges: composition edges, connecting each composite data entity node with the sub-data-element nodes it contains; reference edges, connecting each data element node with the external standard code table node referenced by its value domain constraint; and equivalence edges, connecting different data element nodes that share the same semantic feature vector or mapping relation; and storing the data element objects and association edges in a graph database to form the data standard knowledge graph.
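The node-and-edge model of claim 6 can be sketched with an in-memory stand-in for the graph database. The class, identifiers, and attribute names are illustrative assumptions; a production system would persist to an actual graph database rather than Python dictionaries.

```python
# Sketch of claim 6: instantiate data elements as graph nodes and connect them
# with composition ("contains"), reference ("references"), and equivalence
# ("equivalent") edges. In-memory stand-in for a graph database; all IDs are made up.
class DataStandardGraph:
    def __init__(self):
        self.nodes, self.edges = {}, []

    def add_element(self, ident, **attrs):
        """Instantiate a data element node per the preset metadata specification."""
        self.nodes[ident] = {"identifier": ident, **attrs}

    def add_edge(self, src, dst, kind):
        assert kind in {"contains", "references", "equivalent"}
        self.edges.append((src, dst, kind))

g = DataStandardGraph()
g.add_element("DE0100", chinese_name="姓名", english_name="Person Name",
              data_type="string")
g.add_element("DE0101", english_name="Given Name", data_type="string")
g.add_element("CT001", english_name="Gender Code Table")
g.add_edge("DE0100", "DE0101", "contains")   # composite element -> sub-element
g.add_edge("DE0100", "CT001", "references")  # value domain -> external code table
print(len(g.nodes), len(g.edges))
```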
- 7. The knowledge-graph-based data standard document parsing method according to claim 1, wherein generating incremental training samples based on the correction instruction data and updating rules of the parsing rule base or fine-tuning parameters of the natural language processing model comprises: parsing the correction instruction data, extracting text difference features before and after manual correction, and generating correction sample data; if the correction sample data indicates a rule-matching error, extracting text pattern features from the sample, automatically generating a new regular expression or adjusting the boundary conditions of an existing rule, and writing the result into the parsing rule base; if the correction sample data indicates a semantic extraction error, converting the correction sample data into labeled training data, inputting it to the natural language processing model to execute a backpropagation pass, and updating the weight parameters of the natural language processing model; and counting the invocation frequency and correction rate of each rule in the parsing rule base, and automatically removing rules whose correction rate exceeds a preset threshold.
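The rule-pruning step at the end of claim 7 can be sketched as below. The statistics, rule identifiers, and the 0.3 threshold are illustrative assumptions; the claim only requires removing rules whose correction rate exceeds a preset threshold.

```python
# Sketch of claim 7's self-evolution pruning: drop rules whose correction rate
# (manual corrections / rule invocations) exceeds a threshold. Numbers are made up.
def prune_rules(stats, threshold=0.3):
    """stats: {rule_id: (calls, corrections)} -> list of surviving rule ids."""
    kept = []
    for rule_id, (calls, corrections) in stats.items():
        rate = corrections / calls if calls else 0.0
        if rate <= threshold:
            kept.append(rule_id)
    return kept

stats = {
    "r_identifier": (120, 6),    # 5% correction rate: keep
    "r_date_format": (40, 22),   # 55% correction rate: remove
    "r_enum": (75, 10),          # ~13% correction rate: keep
}
print(prune_rules(stats))
# → ['r_identifier', 'r_enum']
```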
- 8. The knowledge-graph-based data standard document parsing method according to claim 1, further comprising: receiving, through a configuration interface, configuration parameters for a specific industry-standard parsing task, and loading the corresponding industry rule subset and domain knowledge graph schema from the parsing rule base; generating rule-matching logs and model-inference logs during parsing, and storing versioned copies of the intermediate data produced during parsing and of the final data standard knowledge graph; and providing, through a data service interface, standard data element list export services, knowledge graph query response services, and API call services.
- 9. The knowledge-graph-based data standard document parsing method according to claim 1, wherein generating a verification request for each data element object whose confidence value is lower than the preset threshold and receiving correction instruction data for the data element objects through an interactive terminal comprises: generating interface rendering data containing highlighted conflict-field markers and sending it to a display module of the interactive terminal; sorting the preliminary data element object set by confidence value, and preferentially pushing the verification requests of low-confidence data elements to the head of the request queue; and, if a correction instruction causes a logical conflict between data elements, generating an alarm signal and feeding it back to the interactive terminal.
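The confidence-ordered verification queue of claim 9 maps naturally onto a min-heap, since the lowest-confidence objects should be verified first. The element identifiers and confidence values below are illustrative assumptions.

```python
import heapq

# Sketch of claim 9's queue ordering: verification requests for the
# lowest-confidence data elements are served first. heapq is a min-heap,
# so the confidence value itself serves as the priority key.
def build_verification_queue(elements):
    """elements: {element_id: confidence} -> ids ordered lowest confidence first."""
    heap = [(conf, ident) for ident, conf in elements.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

elements = {"DE0101": 0.92, "DE0205": 0.41, "DE0307": 0.63}
print(build_verification_queue(elements))
# → ['DE0205', 'DE0307', 'DE0101']
```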
- 10. A knowledge-graph-based data standard document parsing device, characterized by comprising: a document preprocessing module, configured to receive an electronic data stream of a target-field data standard document to be parsed, perform layout structuring processing on the electronic data stream, and generate a standardized structured text containing hierarchical structure labels and content data; a dual-track feature extraction module, configured to invoke a preset parsing rule base and a natural language processing model, and to execute in parallel on the standardized structured text a regular-pattern feature matching process to generate a first extracted data set and a semantic-vector entity extraction process to generate a second extracted data set; a data fusion module, configured to perform conflict detection and weighted fusion on the first extracted data set and the second extracted data set, generate a preliminary data element object set, and calculate confidence values; a human-machine verification module, configured to compare each confidence value with a preset threshold, screen effective data elements, generate a verification request for each data element object whose confidence value is lower than the preset threshold, and receive correction instruction data for the data element objects through an interactive terminal; a graph construction module, configured to, in response to the effective data elements and the correction instruction data, map data element attributes to graph node attributes according to a preset data element model, establish association edges between data element nodes, and generate a data standard knowledge graph; and a self-evolution module, configured to generate incremental training samples based on the correction instruction data, and to update rules of the parsing rule base or fine-tune parameters of the natural language processing model.
Description
Knowledge graph-based data standard document analysis method and device
Technical Field
The invention relates to the technical field of data processing, and in particular to a method and device for parsing data standard documents based on a knowledge graph.
Background
With the deepening of the digital economy and government informatization, data governance has become a key foundation for improving industry efficiency and cross-department collaboration. To standardize data definitions and exchange formats, national and industry authorities have issued large numbers of data standard documents. These documents are usually unstructured or semi-structured (PDF files, scanned images, Word documents, etc.) and contain a large amount of core information such as data element definitions, value-domain codes, business rules, and logical constraints. Building a high-quality standardized data asset library first requires accurately parsing and structurally extracting these multi-source heterogeneous standard documents. In the prior art, the parsing and management of data standard documents mainly rely on manual curation: data governance experts or business personnel must read the standard documents, understand the complex business logic and layout, extract the names, identifiers, data types, and definitions of data elements one by one, and manually enter them into a spreadsheet or database system. This approach is time-consuming, labor-intensive, and costly; differences in expertise or staff fatigue easily lead to entry errors, omitted attributes, or misinterpretation, making the consistency and accuracy of data standards difficult to guarantee and severely constraining large-scale data governance work.
To relieve this manual burden, partially automated parsing tools have been developed, but the prior art has significant technical limitations. First, most tools offer only basic optical character recognition (OCR) or simple regular expression matching; when facing standard documents with complex and varied layouts (such as cross-page tables, nested lists, and blurry scans), their layout analysis capability is weak, so the extracted content is poorly structured and severely fragmented. Second, existing extraction methods typically follow a single technical path: pure rule-based methods have low recall on non-canonical expressions, while pure deep learning models lack precision and are prone to "hallucinations" when extracting identifiers or codes with strict format constraints. More importantly, most existing methods stop at the "text extraction" level and cannot identify implicit semantic relations between data elements, such as composition, reference, or equivalence, so the parsing result is only a discrete data list rather than a knowledge system capable of reasoning. In addition, conventional systems lack an adaptive optimization mechanism: when facing standard documents from a new domain, code or models often have to be redeveloped, and experience cannot be accumulated from historical verification data to enable continuous capability evolution. In summary, the existing data standard document parsing technology, being over-reliant on manual and brittle automatic means and lacking semantic association construction and self-evolution capabilities, cannot meet the urgent need to build data standard knowledge graphs efficiently and accurately.
Therefore, an intelligent parsing scheme is needed that integrates the advantages of rules and models and provides human-machine collaborative optimization and knowledge graph construction capabilities.
Disclosure of Invention
The invention provides a knowledge-graph-based data standard document parsing method and device, which solve the prior-art problems of low manual efficiency, poor automatic parsing precision, missing semantic associations, and difficulty of system self-evolution. To achieve the above purpose, embodiments of the present invention adopt the following technical scheme. In a first aspect, an embodiment of the present invention provides a method for parsing a data standard document based on a knowledge graph, including: receiving an electronic data stream of a target-field data standard document to be parsed, performing layout structuring processing on the electronic data stream, and generating a standardized structured text containing hierarchical structure labels and content data; invoking a preset parsing rule base and a natural language proce