
CN-121996775-A - Knowledge-graph-enhancement-based pre-training language model retrieval method and system

CN121996775A

Abstract

The invention provides a pre-training language model retrieval method and system based on knowledge graph enhancement. The method comprises: executing sub-graph division processing on a domain knowledge graph to generate a sub-graph index matched with a user retrieval type; updating the domain knowledge graph and the sub-graph index by an incremental training method of a graph neural network; decomposing a user query into an entity set and a relationship path through entity recognition, entity disambiguation, entity linking, and relation extraction; locating the corresponding sub-graph index based on the entity set and the relationship path and extracting a related sub-graph from the domain knowledge graph; encoding the head entity, relation, and tail entity of each related sub-graph triple as basic vectors; and performing semantic fusion on the basic vectors to generate a knowledge prompt vector. The method accurately captures domain structured association logic and responds to dynamically iterating domain knowledge in real time.
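As an illustrative sketch only, not the patented implementation, the triple-encoding and semantic-fusion step summarized above could be realized as a simple relation-attention over the head, relation, and tail basic vectors; the projection matrix `W` and the relation-conditioned scoring are assumptions introduced here for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def knowledge_prompt_vector(h, r, t, W):
    """Fuse head/relation/tail embeddings into one knowledge prompt vector
    via a simple relation attention: the relation vector (projected by a
    hypothetical learnable matrix W) scores each basic vector, and the
    attention-weighted sum is the prompt vector."""
    basics = np.stack([h, r, t])   # (3, d) basic vectors of the triple
    scores = basics @ (W @ r)      # relation-conditioned score per basic vector
    weights = softmax(scores)      # attention weights over h, r, t
    return weights @ basics        # (d,) knowledge prompt vector
```

In a full system the resulting vector would be spliced in front of the query token embeddings as additional context, as the abstract describes.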

Inventors

  • FU JIAHUI
  • WANG KUN
  • CUI HONGYAN
  • MA WENWEN
  • SUN YULONG
  • ZHANG SUPING
  • LIU HAIXIN

Assignees

  • Zhengzhou Xinda Institute of Advanced Technology (郑州信大先进技术研究院)

Dates

Publication Date
2026-05-08
Application Date
2025-12-30

Claims (9)

  1. A pre-training language model retrieval method based on knowledge graph enhancement, characterized by comprising the following steps: executing sub-graph division processing on a domain knowledge graph to generate a sub-graph index matched with a user retrieval type, and updating the domain knowledge graph and the sub-graph index by an incremental training method of a graph neural network, wherein the user retrieval type comprises a query containing semantically similar words, a query containing a definite relation chain, and a multi-condition combined query; decomposing the user query into an entity set and a relationship path through entity recognition, entity disambiguation, entity linking, and relation extraction; locating the corresponding sub-graph index based on the entity set and the relationship path, and extracting a related sub-graph from the domain knowledge graph; encoding the head entity, relation, and tail entity of each related sub-graph triple as basic vectors, and performing semantic fusion on the basic vectors to generate a knowledge prompt vector; and splicing the knowledge prompt vector into the input prompt of a pre-training language model to guide the model to generate a retrieval result.
  2. The knowledge-graph-enhancement-based pre-training language model retrieval method according to claim 1, wherein updating the domain knowledge graph and the sub-graph index by the incremental training method of a graph neural network comprises: identifying the change type of the domain knowledge graph, and analyzing the association propagation range of a changed node through a graph diffusion algorithm to determine the local sub-graph affected by the change, wherein the change type comprises node or edge addition, entity attribute modification, and relation deletion; dividing the parameters of the graph neural network into global parameters and local parameters, wherein the global parameters are used to capture general relation patterns of the field and are frozen after initial training; computing gradients only on the neurons corresponding to the affected sub-graph by the incremental training method, and shielding updates to irrelevant parameters with a gradient mask; and updating the sub-graph index affected by the change according to the local parameters trained by the incremental training method.
  3. The knowledge-graph-enhancement-based pre-training language model retrieval method according to claim 2, further comprising fusing path weights into the self-attention computation of a Transformer layer of the pre-training language model to strengthen attention on multi-hop inference paths while the model generates retrieval results, comprising: screening multi-hop relation paths containing entity association logic from the related sub-graph; vectorizing each relation in a multi-hop relation path, and calculating the weight of each relation by combining the path length, the relation confidence, and the degree of entity association tightness to generate a sparse path weight matrix; and dynamically presetting a path weight coefficient according to the query type, and fusing the path weight coefficient and the path weight matrix into the self-attention computation of the Transformer layer to obtain a path-enhanced attention output; the relation weight is calculated as P_ij = Σ_k α_k · f(r_k), wherein P_ij represents the path weight between the i-th query term and the j-th document term, r_k is the k-th relation in the path, α_k is a learnable parameter, and f(r_k) is a relation type weight; the path-enhanced attention output is calculated as Attention(Q, K, V) = softmax(QKᵀ/√d + λP)V, wherein Q is the query vector, K is the key vector, V is the value vector, λ is the path weight coefficient, P is the path weight matrix, and d is the vector dimension.
  4. The knowledge-graph-enhancement-based pre-training language model retrieval method according to claim 3, further comprising a dual-channel scoring and retrieval result ranking step, comprising: obtaining a text semantic matching score and a knowledge graph path confidence score, wherein the text semantic matching score is output by the pre-training language model based on its contextual semantic understanding of the query and the candidate content, and the path confidence score is calculated from the node degree centrality, relation frequency, and topological compactness found in the knowledge graph; adopting a learnable weighting function as the dual-channel scoring function, the weighting function using a gating network or a linear interpolation structure, and optimizing the weight allocation strategy between the text semantic matching score and the graph path confidence score through training iterations so as to adapt to retrieval scenarios in different fields; and inputting the text semantic matching score and the graph path confidence score into the dual-channel scoring function, dynamically allocating the weights of the two scores to obtain a comprehensive ranking score, ranking and outputting the retrieval results based on the comprehensive ranking score, and simultaneously labeling the corresponding graph evidence chain.
  5. The knowledge-graph-enhancement-based pre-training language model retrieval method according to claim 4, wherein splicing the knowledge prompt vector into the input prompt of the pre-training language model to guide it to generate a retrieval result comprises: embedding the head entity, relation, and tail entity of each related sub-graph triple to generate their respective basic vectors; performing semantic fusion on the basic vectors through a relation attention mechanism to generate knowledge prompt vectors that strengthen the guiding effect of relations on entity association; and splicing the knowledge prompt vector with the input sequence of the user query, feeding it to the pre-training language model as additional context information.
  6. The knowledge-graph-enhancement-based pre-training language model retrieval method according to claim 5, further comprising pruning the sub-graph, the pruning comprising: presetting domain-specific meta-path templates that predefine key meta-path patterns, and automatically generating the corresponding derived meta-paths when a new relation type is added to the knowledge graph; calculating the information contribution entropy of each node and edge participating in a meta-path in the knowledge graph to quantify the importance of nodes and edges; storing the knowledge graph in shards by entity type, managing it with a distributed graph database, traversing the predefined meta-paths in parallel within each shard, and counting the path frequency and the information contribution entropy of the nodes; setting retention thresholds for nodes and edges according to the maximum sub-graph scale allowed by the GPU memory capacity, sorting nodes and edges in ascending order of information contribution entropy, preferentially retaining low-entropy nodes and edges, and pruning high-entropy ones; and accumulating newly added stream data of the knowledge graph into micro-batches at regular intervals, recalculating the information contribution entropy only for the sub-graph region affected by the new data and pruning there, so as to avoid full-graph pruning updates.
  7. A knowledge-graph-enhancement-based pre-training language model retrieval system, characterized by comprising an offline processing module and an online retrieval module, wherein the offline processing module is used for dynamic construction of the domain knowledge graph, lightweight sub-graph index generation, and incremental optimization, providing efficient data support for online retrieval; and the online retrieval module is used for outputting accurate and interpretable retrieval results through structured query parsing, knowledge-enhanced retrieval, and multimodal fusion ranking.
  8. The knowledge-graph-enhancement-based pre-training language model retrieval system according to claim 7, wherein the offline processing module is configured to: integrate multi-modal data sources, eliminating the representation differences between data of different modalities through a graph embedding alignment module to achieve unified semantic fusion of multi-source heterogeneous data; partition and store the graph by entity type using a property graph partitioning technique, and build a real-time event stream processing framework with Apache Kafka for distributed storage of the domain knowledge graph and real-time atomic addition, deletion, and modification of nodes and relations; extract entity, relation, and context features from the domain knowledge graph, and execute sub-graph division with three types of strategies (dynamic division based on historical query logs, automatic semantic community division using a graph clustering algorithm, and rule-driven division according to the domain ontology) to generate a sub-graph index matched with the user retrieval type, wherein the sub-graph index comprises a graph-embedding compressed index, a prefix-tree path index, and a hybrid bitmap index; and, in an incremental update optimization unit, perform targeted repair updates on the sub-graph indexes affected by changes by tracking local changes of the domain knowledge graph and their influence propagation range with an incremental index update mechanism, store hot sub-graphs accessed at high frequency in memory while compressing and persisting cold sub-graphs accessed at low frequency to disk under a hot/cold sub-graph tiered storage strategy so as to balance memory usage against hot-data response speed, and accelerate queries by combining sub-graph federated retrieval with approximate sub-graph matching.
  9. The knowledge-graph-enhancement-based pre-training language model retrieval system according to claim 8, wherein the online retrieval module comprises: a query parsing module for structurally decomposing the natural language query entered by the user, recognizing the entities in the query with a pre-trained sequence labeling model, and accurately linking them to the standard nodes of the domain knowledge graph through entity disambiguation and linking techniques; a knowledge-enhanced retrieval unit for mapping the entities and relation paths obtained by query parsing onto the domain knowledge graph, locating and extracting a 2- to 3-hop related sub-graph through sub-graph federated retrieval and approximate sub-graph matching, encoding the head entity, relation, and tail entity of each related sub-graph triple as basic vectors, performing semantic fusion on the basic vectors to generate knowledge prompt vectors, splicing the knowledge prompt vectors into the input prompt of the pre-training language model, and fusing path weights into the self-attention of the model's Transformer layer to strengthen attention on multi-hop reasoning paths; and a multimodal joint ranking unit for combining the text semantic matching score output by the pre-training language model with the graph path confidence score, dynamically allocating the two score weights through a learnable weighting function, visualizing the contribution proportion of each score with an attribution analysis technique, and ranking the retrieval results based on the comprehensive ranking score while labeling the graph evidence chain.
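The path-enhanced attention in the claims above can be sketched as follows. This is a minimal NumPy illustration assuming single-head attention and a dense path-weight matrix P, not the claimed implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def path_enhanced_attention(Q, K, V, P, lam=0.5):
    """Self-attention biased by a path-weight matrix P.

    Q, K, V: (n, d) query/key/value matrices; P: (n, n) path weights
    derived from the knowledge graph; lam: the path weight coefficient.
    Computes softmax(QK^T / sqrt(d) + lam * P) V.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + lam * P  # additive path bias before softmax
    return softmax(scores, axis=-1) @ V
```

With P set to zero the function reduces to standard scaled dot-product attention, which is the sanity check a dynamically preset coefficient λ would interpolate away from.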

Description

Knowledge-graph-enhancement-based pre-training language model retrieval method and system

Technical Field

The invention relates to the technical field of natural language processing, and in particular to a pre-training language model retrieval method and system based on knowledge graph enhancement.

Background

In real-world scenarios, task requirements are highly diverse and complex. Content generated by general-purpose large language models is difficult to adapt deeply to the specific requirements of a given field, and the internal associations between field scenarios and information cannot be fully captured. When such a model handles retrieval-generation tasks in a professional field, it often hallucinates for lack of targeted knowledge support, which directly affects the accuracy of the output and makes it hard to meet the strict information-reliability requirements of professional fields such as electronic information, medicine, and finance. To address this, the industry has proposed supplementing large language models with structured domain knowledge from knowledge graphs. For example, Ji Zhen's thesis "Research on Knowledge-Graph-Based Retrieval-Augmented Generation for Large Models", submitted on May 31, 2024, discloses a concrete implementation path: at the fusion layer of the knowledge graph and the large model, retrieval augmentation is realized by passing sub-graph triples extracted from the knowledge graph to the large model as additional input or context, or by using the triples for content enrichment and accuracy verification after answer generation. In a medical-field deployment, the medical knowledge graph is stored in a Neo4j graph database, entity recognition is performed on the input question with NLTK, and the recognized keywords are passed to the knowledge graph. Keywords of the query question are then searched in the knowledge graph, a knowledge sub-graph related to the query is obtained with a Cypher sub-graph search, and the enhanced query text finally guides the large model to produce replies that better conform to medical-domain knowledge. Although this line of research effectively alleviates the hallucination problem of large models and improves answer quality in professional fields, a technical bottleneck remains: the knowledge graph can provide rich structured knowledge, but updating its data usually requires retraining the entire graph neural network, so knowledge coverage is static. If the knowledge graph is not updated in time, the large model keeps invoking outdated information and outputs results inconsistent with the current understanding of the field, severely limiting the technique in scenarios that require real-time knowledge updates. An ideal technical solution to these problems has long been sought.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a pre-training language model retrieval method and system based on knowledge graph enhancement, which accurately capture domain structured association logic and respond to dynamically iterating domain knowledge in real time.
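A minimal sketch of the gradient-masking idea behind the incremental training described in this document: gradients are applied only to the local parameters tied to the affected sub-graph, while a mask freezes the remaining global parameters. The flat parameter vector and plain SGD step here are simplifying assumptions for illustration:

```python
import numpy as np

def masked_sgd_step(params, grads, local_mask, lr=0.01):
    """One incremental update: the gradient is applied only where
    local_mask is True (parameters of the change-affected sub-graph);
    masked-out global parameters remain frozen after initial training."""
    return params - lr * (grads * local_mask)
```

In practice the mask would be derived from the graph diffusion step that determines which local sub-graph a change affects; here it is simply a boolean array.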
To achieve the above purpose, the technical scheme adopted by the invention is as follows. In a first aspect, the invention provides a pre-training language model retrieval method based on knowledge graph enhancement, comprising: executing sub-graph division processing on a domain knowledge graph to generate a sub-graph index matched with a user retrieval type, and updating the domain knowledge graph and the sub-graph index by an incremental training method of a graph neural network, wherein the user retrieval type comprises a query containing semantically similar words, a query containing a definite relation chain, and a multi-condition combined query; decomposing the user query into an entity set and a relationship path through entity recognition, entity disambiguation, entity linking, and relation extraction; encoding the head entity, relation, and tail entity of each related sub-graph triple as basic vectors, and performing semantic fusion on the basic vectors to generate knowledge prompt vectors; and splicing the knowledge prompt vector into the input prompt of a pre-training language model to guide the model to generate a retrieval result. By adapting to users' diverse retrieval types, the method achieves accurate sub-graph index construction, and by combining incremental training of the graph neural network it completes dynamic updating of the graph and its indexes, greatly improving retrieval response efficiency and the data timeliness of domain knowledge