CN-121981232-A - Multi-source heterogeneous data fusion and knowledge graph automatic construction system
Abstract
The invention belongs to the technical field of knowledge graph construction and relates in particular to a multi-source heterogeneous data fusion and knowledge graph automatic construction system. The system comprises a semantic extraction module, a data purification module, a triple generation module, a disambiguation calibration module, a graph fusion module and an incremental update module. The semantic extraction module extracts minimal semantic fragments from multi-source data and performs cross-source alignment to generate a symbolized semantic framework. The data purification module converts the multi-source data into symbolized predicates according to the framework, filters out low-evidence data and outputs a purified predicate set. The triple generation module inputs the purified predicate set into a minimal rule grammar and generates an initial triple candidate set by extracting entities, relations and attributes. The disambiguation calibration module generates a two-hop topological fingerprint for each entity from the candidate set, performs disambiguation and alignment, corrects conflicts, and outputs unambiguous structured knowledge. The graph fusion module fuses the unambiguous structured knowledge with the minimal connected ontology, then constructs and verifies an initial knowledge graph. The incremental update module processes newly added data and performs incremental merging and consistency maintenance to form a complete knowledge graph. Through fully automated end-to-end construction, the invention achieves efficient fusion and a high-quality knowledge graph.
Inventors
- WANG MIN
- JIANG JING
- ZHU CHUANRUI
Assignees
- 安徽深核信息技术有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20260403
Claims (10)
- 1. A multi-source heterogeneous data fusion and knowledge graph automatic construction system, characterized by comprising: a semantic extraction module, used for automatically extracting minimal semantic fragments from each data source, binding constraints and examples to them, applying symbolic similarity calculation to the standardized semantic set, matching and associating semantic units across data sources, establishing semantic correspondences among different sources, completing cross-source alignment through symbolic similarity and local logical implication, and generating a minimal connected ontology to obtain a symbolized semantic framework; a data purification module, used for converting the multi-source heterogeneous data into symbolized predicates according to the symbolized semantic framework, calculating the occurrence density of the data under isomorphic semantics, marginalizing low-evidence data according to a threshold, and outputting a purified symbolized predicate set; a triple generation module, used for inputting the purified symbolized predicate set into a rule grammar set, extracting entities, relations and attributes, and generating an initial triple candidate set; a disambiguation calibration module, used for sorting out the association relations of each entity according to the initial triple candidate set, expanding the one-hop and two-hop associated entities and relations with each entity as the core, extracting and encoding topological structure features, generating a unique two-hop relation topological fingerprint for each entity, completing entity disambiguation and alignment by fingerprint similarity, and outputting unambiguous structured knowledge; a graph fusion module, used for fusing the unambiguous structured knowledge with the minimal connected ontology and carrying out data semantic unification and knowledge graph construction in symbol space to form an initial knowledge graph; and an incremental update module, used for incrementally merging newly added data according to the initial knowledge graph to form a complete knowledge graph.
- 2. The multi-source heterogeneous data fusion and knowledge graph automatic construction system according to claim 1, wherein the semantic extraction module comprises: a semantic extraction unit, used for cleaning the multi-source data and dividing it at boundaries, extracting minimal semantic fragments that cannot be split further, and forming a basic semantic set; a constraint binding unit, used for binding the corresponding logical constraints and practical application examples to each semantic fragment in the basic semantic set, defining semantic boundaries and usage scenarios, and generating a standardized semantic set; a cross-source alignment unit, used for applying symbolic similarity calculation and local logical implication judgment to the standardized semantic set, matching and associating semantic units across data sources, establishing semantic correspondences among different sources, and obtaining a semantic association result; an ontology simplification unit, used for eliminating redundant semantics and unnecessary association relations according to the semantic association result, retaining the fewest semantic nodes needed to support the core logic, and constructing the minimal connected ontology; and a symbol modeling unit, used for applying symbolic standardization and logical structuring to the minimal connected ontology, unifying the symbol system and the expression of connectivity relations, and generating the symbolized semantic framework.
- 3. The multi-source heterogeneous data fusion and knowledge graph automatic construction system according to claim 2, wherein the cross-source alignment unit obtains the semantic association result through the following steps: representing the cross-source semantics in the standardized semantic set as feature vectors by extracting the symbol attributes, logical structure and constraint information of each semantic fragment, thereby generating semantic feature vectors; calculating the pairwise similarity between semantics from the semantic feature vectors with a symbolic similarity algorithm to obtain a semantic similarity matrix; verifying, according to the semantic similarity matrix and the local logical implication judgment rules, the inference relation between antecedent and consequent for each pair of semantics, and screening candidate matching pairs; and establishing mapping associations among the cross-source semantic units according to the candidate matching pairs, forming stable semantic correspondences, and outputting the semantic association result.
- 4. The multi-source heterogeneous data fusion and knowledge graph automatic construction system according to claim 1, wherein the data purification module comprises: a predicate mapping unit, used for applying, according to the predicate rules defined by the symbolized semantic framework, the predicate conversion logic appropriate to each data type, mapping the multi-source heterogeneous data into standardized predicate expressions, determining the semantic predicate and parameters of each data unit, and outputting an initial symbolized predicate set; a frequency statistics unit, used for counting the occurrences of each symbolized predicate in the initial symbolized predicate set with isomorphic semantics as the grouping dimension, calculating the occurrence density of each predicate relative to the total amount of data participating in the conversion, and outputting a frequency statistics table; a threshold marginalization unit, used for setting a density threshold according to the business scenario, screening out from the frequency statistics table the low-evidence predicates whose density is below the threshold, removing the corresponding entries from the initial symbolized predicate set, and outputting an intermediate predicate set; and a semantic purification unit, used for verifying the semantic consistency of the intermediate predicate set, eliminating predicate logic conflicts, merging predicate entries with duplicate content, and outputting the purified symbolized predicate set.
- 5. The multi-source heterogeneous data fusion and knowledge graph automatic construction system according to claim 4, wherein the threshold marginalization unit outputs the intermediate predicate set through the following steps: setting a density threshold suited to the current semantic environment by combining the knowledge reliability and data distribution characteristics of the business scenario; traversing the density value of each symbolized predicate in the frequency statistics table, screening out the predicates whose density is below the density threshold, and aggregating them into threshold-marginalized low-evidence data; and matching the threshold-marginalized low-evidence data against the initial symbolized predicate set, removing the matching predicate entries, and outputting the intermediate predicate set.
- 6. The multi-source heterogeneous data fusion and knowledge graph automatic construction system according to claim 1, wherein the triple generation module comprises: a grammar analysis unit, used for matching the purified symbolized predicate set against the automatically induced minimal rule grammar set, structurally parsing each predicate expression according to the grammar rules, and outputting a predicate parse structure; a tree positioning unit, used for constructing a grammar parse tree from the predicate parse structure, determining the semantic role of each node in the parse tree, locating the entity nodes, relation nodes and attribute nodes, and outputting an annotated parse tree structure; and a triple generation unit, used for extracting the three core elements (entities, relations and attributes) from the annotated parse tree structure, combining them in the entity-relation-attribute triple format, and outputting the initial triple candidate set.
- 7. The multi-source heterogeneous data fusion and knowledge graph automatic construction system according to claim 1, wherein the disambiguation calibration module comprises: a topology coding unit, used for sorting out the association relations of each entity according to the initial triple candidate set, expanding the one-hop and two-hop associated entities and relations with each entity as the core, extracting and encoding topological structure features, generating a unique two-hop relation topological fingerprint for each entity, and outputting an entity topological fingerprint set; a fingerprint alignment unit, used for calculating the topological fingerprint similarity between different entities from the entity topological fingerprint set, screening the entities whose similarity reaches a preset threshold, judging them to be the same entity, performing disambiguation and alignment, and outputting an entity mapping result and the associated triples; and a conflict calibration unit, used for logically verifying all triples according to the entity mapping result and the associated triples, identifying entries whose entities, relations or attributes conflict, and outputting the unambiguous structured knowledge by correcting the conflicting content, retaining high-credibility information and disambiguating.
- 8. The multi-source heterogeneous data fusion and knowledge graph automatic construction system according to claim 7, wherein the topology coding unit outputs the entity topological fingerprint set through the following steps: traversing all entities and associated edges in the initial triple candidate set, establishing a direct association list for each entity, extracting the one-hop neighboring entities and their relations, and outputting an entity one-hop association set; extending outward from the entity one-hop association set to obtain the indirectly associated entities and relations, forming a complete relation subgraph covering two hops, and outputting an entity two-hop topological subgraph set; and performing structural feature extraction and serialization encoding on the entity two-hop topological subgraph set, generating a hash code that uniquely identifies each structure, and obtaining and aggregating the two-hop relation topological fingerprints.
- 9. The multi-source heterogeneous data fusion and knowledge graph automatic construction system according to claim 1, wherein the graph fusion module comprises: a semantic fusion unit, used for establishing in symbol space a semantic mapping between the unambiguous structured knowledge and the minimal connected ontology, integrating the entities, relations and attributes of the unambiguous structured knowledge with the semantic specification of the minimal connected ontology, carrying out data semantic unification and preliminary knowledge graph construction, and outputting an intermediate graph; a consistency check unit, used for checking the fusion consistency of the intermediate graph, comprehensively detecting conflict information in entity semantics, relation logic and attribute constraints, recording the location and specific content of each conflict, and outputting graph fragments; and a conflict correction unit, used for backtracking the conflict information according to the graph fragments, correcting the conflicting content with reference to the semantic specification of the minimal connected ontology, forming a closed loop of fusion and construction, and outputting the initial knowledge graph.
- 10. The multi-source heterogeneous data fusion and knowledge graph automatic construction system according to claim 1, wherein the incremental update module comprises: an ontology generation unit, used for performing semantic analysis and entity extraction on the newly added data according to the ontology specification and topological structure of the initial knowledge graph, generating lightweight new ontology fragments compatible with the graph, and outputting the new ontology fragments; a fingerprint addition unit, used for constructing a local association subgraph for each entity in the new ontology fragments according to the two-hop relation rule, extracting structural features, generating new entity topological fingerprints consistent with the graph's format, and outputting a new topological fingerprint set; an incremental merging unit, used for matching the new ontology fragments and new topological fingerprints to the corresponding local subgraphs of the initial knowledge graph, merging the incremental nodes and edges, and outputting a preliminarily merged graph; and a consistency maintenance unit, used for checking the consistency constraint chain of the preliminarily merged graph, detecting conflicts among entities, relations and attributes, correcting conflicting entries according to the ontology rules, automatically updating the constraint chain, and outputting the complete knowledge graph.
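For illustration only, the two-hop topological fingerprint of claims 7 and 8 might be sketched in Python as follows. The claims do not fix a feature encoding, hash function or similarity measure, so everything concrete here is an assumption: features are outgoing (relation, neighbor) edges plus two-hop relation paths, the hash is SHA-256 over the sorted features, and similarity is the Jaccard overlap of the feature sets. The function names and sample triples are hypothetical.

```python
import hashlib
from collections import defaultdict


def one_hop_index(triples):
    """Build the direct association list: entity -> {(relation, neighbor)}."""
    index = defaultdict(set)
    for head, relation, tail in triples:
        index[head].add((relation, tail))
    return index


def two_hop_features(entity, index):
    """Collect one-hop edges and two-hop relation paths starting at `entity`."""
    features = set()
    for rel1, nbr in index.get(entity, ()):
        features.add((rel1, nbr))
        for rel2, nbr2 in index.get(nbr, ()):
            if nbr2 != entity:  # skip paths that loop straight back
                features.add((rel1, rel2, nbr2))
    return features


def fingerprint(features):
    """Serialize the features in sorted order and hash them (assumed: SHA-256)."""
    serialized = repr(sorted(features))
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


def jaccard(a, b):
    """Overlap of two feature sets (assumed similarity measure); 0.0 if both empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical initial triple candidate set.
triples = [
    ("Apple Inc.", "founded_by", "Steve Jobs"),
    ("Apple Inc.", "located_in", "Cupertino"),
    ("Cupertino", "part_of", "California"),
    ("Apple", "founded_by", "Steve Jobs"),
    ("Apple", "located_in", "Cupertino"),
    ("Apple", "industry", "Electronics"),
    ("apple (fruit)", "is_a", "Fruit"),
]

index = one_hop_index(triples)
features = {
    name: two_hop_features(name, index)
    for name in ("Apple Inc.", "Apple", "apple (fruit)")
}
similarity = jaccard(features["Apple Inc."], features["Apple"])
```

In this sketch "Apple Inc." and "Apple" share three of four distinct features, so with a hypothetical preset threshold of 0.7 they would be judged the same entity, while "apple (fruit)" shares none. The hash identifies a structure exactly, and the similarity is computed on the underlying feature sets, because a cryptographic hash does not by itself support a graded comparison.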
Description
Multi-source heterogeneous data fusion and knowledge graph automatic construction system
Technical Field
The invention belongs to the technical field of knowledge graph construction, and particularly relates to a multi-source heterogeneous data fusion and knowledge graph automatic construction system.
Background
Traditional systems for multi-source heterogeneous data fusion and automatic knowledge graph construction generally lack unified semantic management and lightweight ontology modeling capability. They have difficulty performing standardized semantic alignment on multi-source heterogeneous data, suffer from pronounced data noise, redundancy and semantic ambiguity, and achieve insufficient precision and stability in cross-source fusion. Entity disambiguation relies on literal matching or simple rules without considering the topological association features of entities, so synonym misjudgments occur easily and the reliability of the resulting knowledge is low. The fusion process lacks end-to-end consistency checking and closed-loop correction, so conflicts among entities, relations and attributes occur frequently, and manual correction is costly and inefficient. Knowledge updates mostly rely on global reconstruction and cannot be applied incrementally; ingesting newly added data takes a long time and consumes substantial resources, making it difficult to support dynamic business expansion and long-term stable evolution.
Disclosure of Invention
To remedy the deficiencies of the prior art, the invention provides a multi-source heterogeneous data fusion and knowledge graph automatic construction system. It mainly addresses the ambiguity, conflicts and inefficient construction that arise in multi-source heterogeneous data fusion.
The invention provides a multi-source heterogeneous data fusion and knowledge graph automatic construction system, comprising:
A semantic extraction module, used for automatically extracting minimal semantic fragments from each data source, binding constraints and examples to them, completing cross-source alignment through symbolic similarity and local logical implication, and generating a minimal connected ontology containing only the necessary semantics, to obtain a symbolized semantic framework.
A data purification module, used for converting the multi-source heterogeneous data into symbolized predicates according to the symbolized semantic framework, calculating the occurrence density of the data under isomorphic semantics, marginalizing low-evidence data according to a threshold, and outputting a purified symbolized predicate set.
A triple generation module, used for inputting the purified symbolized predicate set into an automatically induced minimal rule grammar set, extracting entities, relations and attributes directly through a grammar parse tree, and generating an initial triple candidate set.
A disambiguation calibration module, used for generating a two-hop relation topological fingerprint for each entity from the initial triple candidate set, completing entity disambiguation and alignment by fingerprint similarity, checking and correcting conflicting triples, and outputting unambiguous structured knowledge.
A graph fusion module, used for integrating the unambiguous structured knowledge with the minimal connected ontology, carrying out data semantic unification and knowledge graph construction simultaneously in symbol space, checking fusion consistency, and backtracking to correct conflict information, to form an initial knowledge graph.
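As a rough illustration of the density-based marginalization performed by the data purification module, the following Python sketch counts each symbolized predicate, divides by the total number of converted entries, and sets aside predicates below a threshold. It rests on assumptions the text does not settle: entries are treated as sharing isomorphic semantics when their (name, arguments) tuples are identical, and the threshold of 0.2, the function name `marginalize` and the sample predicates are hypothetical.

```python
from collections import Counter


def marginalize(predicates, threshold):
    """Split predicates into kept and low-evidence groups by occurrence density.

    Assumption: two entries share "isomorphic semantics" when their
    (name, arguments) tuples are identical after symbolization.
    """
    total = len(predicates)
    if total == 0:
        return [], [], {}
    counts = Counter(predicates)  # plays the role of the frequency statistics table
    density = {pred: n / total for pred, n in counts.items()}
    kept = [pred for pred in counts if density[pred] >= threshold]
    low_evidence = [pred for pred in counts if density[pred] < threshold]
    return kept, low_evidence, density


# Hypothetical symbolized predicates: (predicate name, argument tuple).
predicates = (
    [("located_in", ("Hefei", "Anhui"))] * 5
    + [("founded_in", ("CompanyA", "2015"))] * 4
    + [("located_in", ("Hefei", "Jiangsu"))] * 1
)

kept, low_evidence, density = marginalize(predicates, threshold=0.2)
```

Here the single conflicting entry placing Hefei in Jiangsu has density 0.1 and is marginalized, while the two well-supported predicates are kept once each, which also merges their duplicates.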
An incremental update module, used for generating, according to the initial knowledge graph, the new ontology fragments and topological fingerprints corresponding to newly added data, incrementally merging them with the local subgraphs of the knowledge graph, and automatically maintaining the consistency constraint chain, to form a complete knowledge graph.
According to the multi-source heterogeneous data fusion and knowledge graph automatic construction system provided by the invention, the semantic extraction module comprises:
A semantic extraction unit, used for cleaning the multi-source data and dividing it at boundaries, extracting minimal semantic fragments that cannot be split further, and forming a basic semantic set.
A constraint binding unit, used for binding the corresponding logical constraints and practical application examples to each semantic fragment in the basic semantic set, defining semantic boundaries and usage scenarios, and generating a standardized semantic set.
A cross-source alignment unit, used for applying symbolic similarity calculation and local logical implication judgment to the standardized semantic set, matching and associating semantic units across data sources, and establishing semantic correspondences among different sources to obtain a