CN-121979905-A - Knowledge base content accurate generation method and system based on deep semantic matching

Abstract

The invention provides a method and system for accurately generating knowledge base content based on deep semantic matching. The method introduces a refined semantic verification and fusion network (Fine-Grained Semantic Validation and Fusion Network, FSVFN), enhanced with a knowledge graph, between the traditional retrieval and generation stages. The network performs deep analysis, cross-verification and intelligent recombination of the preliminarily recalled knowledge segments and supplies the final generation model with a purified, high-quality, high-confidence knowledge context, thereby fundamentally improving the quality of the generated content. The invention improves the accuracy, reliability and consistency of automatically generated knowledge content.

Inventors

ZHANG ANHUA
E LILI
WANG WEI
ZHENG JINXING
GAO JINXUAN
FENG CHENGLIANG
ZONG YAN

Assignees

图观(天津)数字科技有限公司

Dates

Publication Date: 20260505
Application Date: 20260407

Claims (10)

1. A method for accurately generating knowledge base content based on deep semantic matching, characterized by comprising the following steps: Step 1, performing semantic chunking on source data to obtain text blocks; converting the text blocks into high-dimensional semantic vectors using a pre-trained language model; linking key entities in the text to the knowledge graph in a knowledge base, and using the embeddings of the knowledge graph nodes as additional representations of the text blocks; constructing, from the linked entities and the extracted semantic relations, knowledge graph subgraph structures that focus on locally relevant information; and matching and aligning entities that are expressed differently but refer to the same thing across different knowledge graph subgraphs or text portions, so as to form a text-graph fused multimodal vector index; Step 2, performing query intent analysis on the knowledge topic query input by the user, performing preliminary recall based on semantic vectors using a vector search engine, and performing supplementary recall using keyword matching, so as to obtain a candidate knowledge segment set; Step 3, constructing a refined semantic verification and fusion network and, in combination with the text-graph fused multimodal vector index, the knowledge graph subgraphs and the additional representations constructed in Step 1, inputting the query vector and the candidate knowledge segment set into the refined semantic verification and fusion network to perform refined semantic verification and fusion on the candidate knowledge; Step 4, constructing a structured, instruction-bearing prompt from the high-confidence knowledge segments screened and re-ranked by the refined semantic verification and fusion network, inputting the prompt into a large language model to generate highly controllable and accurate content, and merging that content into a target knowledge base.
2. The method for accurately generating knowledge base content based on deep semantic matching according to claim 1, wherein in Step 2, performing preliminary recall based on semantic vectors using a vector search engine comprises: calculating the cosine similarity between the query vector and all index vectors, and recalling the Top-K candidate knowledge segments with the highest similarity.
3. The method for accurately generating knowledge base content based on deep semantic matching according to claim 1, wherein in Step 2, performing supplementary recall using keyword matching comprises: recalling, using the BM25 algorithm, the Top-M segments that contain the core keywords of the query.
4. The method for accurately generating knowledge base content based on deep semantic matching according to claim 1, wherein in Step 3, inputting the query vector and the candidate knowledge segment set into the refined semantic verification and fusion network and performing refined semantic verification and fusion on the candidate knowledge comprises: calculating, based on the knowledge graph subgraphs and the additional representations constructed in Step 1, a deep semantic relevance score between the query vector and each candidate knowledge segment; performing multi-hop reasoning on the subgraphs using a graph neural network to mine latent cross-layer associations among entities; calculating a fact consistency score for the entities and relations contained in the candidate knowledge segments; calculating the semantic overlap and entailment relations between candidate knowledge segments, identifying and quantifying highly repetitive or logically conflicting content, and generating a redundancy/conflict penalty term; and pruning the knowledge segments by combining the semantic relevance score, the fact consistency score and the redundancy/conflict penalty term, so as to screen out a core knowledge set with strong semantic relevance, high fact consistency, low redundancy and no logical conflict.
5. The method for accurately generating knowledge base content based on deep semantic matching according to claim 4, wherein performing multi-hop reasoning on the subgraphs using a graph neural network to mine latent cross-layer associations among entities specifically comprises: the graph neural network takes the knowledge graph subgraph as the basis for inference, achieves multi-hop mining of entity relations by stacking graph convolution layers, and captures indirect relations among entities that are not explicitly expressed in the text; the output entity association features are incorporated directly into the calculation of each subsequent scoring stage, so as to dynamically calibrate the semantic relevance and fact consistency scores and to accurately identify redundant content and hidden logical conflicts that do not overlap on the surface but are highly associated at the entity level.
6. The method for accurately generating knowledge base content based on deep semantic matching according to claim 4, wherein calculating the deep semantic relevance score between the query vector and each candidate knowledge segment comprises: the refined semantic verification and fusion network adopts a lightweight dual-encoder plus cross-attention architecture; the query and each candidate segment are separately context-encoded by BERT encoders with shared weights; the encoded representation of the query serves as the Query and the encoded representation of the candidate segment serves as the Key and Value, which are fed into a multi-head cross-attention layer to calculate the attention weight of each token in the query with respect to each token in the candidate segment; and finally a fine-grained semantic relevance score is calculated for each pair of query and candidate segment by a pooling operation over the attention weight matrix.
7. The method for accurately generating knowledge base content based on deep semantic matching according to claim 4, wherein calculating a fact consistency score for the entities and relations contained in the candidate knowledge segments comprises: querying the knowledge graph for the triple facts extracted from each candidate segment; if a triple exists in full in the knowledge graph, its score is 1.0; if the head and tail entities exist but the relation differs, its score is between 0.3 and 0.7 according to the similarity of the relations; if either entity does not exist, its score is 0; and performing a weighted evaluation of the scores of all facts in the segment to obtain the fact consistency score of the segment.
8. The method for accurately generating knowledge base content based on deep semantic matching according to claim 4, wherein calculating the semantic overlap and entailment relations between candidate knowledge segments, identifying and quantifying highly repetitive or logically conflicting content, and generating a redundancy/conflict penalty term comprises: calculating the semantic similarity between every pair of segments in the candidate set and, if the similarity exceeds a high threshold, marking one of the two segments as redundant; and using a natural language inference model to judge whether the relation between a segment pair is 'entailment', 'contradiction' or 'neutral' and, if the relation is 'contradiction', recording a conflict event.
9. The method for accurately generating knowledge base content based on deep semantic matching according to claim 4, wherein pruning the knowledge segments by combining the semantic relevance score, the fact consistency score and the redundancy/conflict penalty term so as to screen out a high-quality core knowledge set comprises: calculating a final knowledge confidence score $S_{final}(c_i)$ for each candidate segment: $S_{final}(c_i) = \alpha \cdot S_{rel}(c_i) + \beta \cdot S_{fact}(c_i) - P(c_i)$; wherein $c_i$ is a candidate segment, $\alpha$ and $\beta$ are adjustable weights with $\alpha + \beta = 1$, $P(c_i)$ is the penalty term, $S_{rel}(c_i)$ is the semantic relevance score, and $S_{fact}(c_i)$ is the fact consistency score; setting a confidence threshold $\tau$, discarding all segments with $S_{final}(c_i) < \tau$, sorting the remaining segments by $S_{final}(c_i)$ from high to low, and selecting the Top-P segments to form the final knowledge context set.
10. A system for accurately generating knowledge base content based on deep semantic matching, characterized by comprising: a data preprocessing and indexing module, configured to perform semantic chunking on source data to obtain text blocks, convert the text blocks into high-dimensional semantic vectors using a pre-trained language model, link key entities in the text to the knowledge graph in a knowledge base, use the embeddings of the knowledge graph nodes as additional representations of the text blocks, construct, from the linked entities and the extracted semantic relations, knowledge graph subgraph structures that focus on locally relevant information, and match and align entities that are expressed differently but refer to the same thing across different knowledge graph subgraphs or text portions, so as to form a text-graph fused multimodal vector index; a coarse-grained recall module, configured to perform query intent analysis on the knowledge topic query input by the user, perform preliminary recall based on semantic vectors using a vector search engine, and perform supplementary recall using keyword matching, so as to obtain a candidate knowledge segment set; an FSVFN module, configured to construct a refined semantic verification and fusion network and, in combination with the text-graph fused multimodal vector index, the knowledge graph subgraphs and the additional representations constructed by the data preprocessing and indexing module, input the query vector and the candidate knowledge segment set into the refined semantic verification and fusion network to perform refined semantic verification and fusion on the candidate knowledge; and a content generation module, configured to construct a structured, instruction-bearing prompt from the high-confidence knowledge segments screened, re-ranked and fused by the refined semantic verification and fusion network, input the prompt into a large language model to generate highly controllable and accurate content, and merge that content into the target knowledge base.
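
The fact consistency scoring of claim 7 and the pruning of claim 9 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes the confidence score takes the linear form alpha * S_rel + beta * S_fact - P, that relation similarity is mapped linearly onto the 0.3 to 0.7 band, that the "weighted evaluation" is a plain mean, and that a segment with no extracted triples gets a fact score of 0. The names `Segment`, `triple_score`, `fact_consistency` and `select_core`, as well as the default weights and threshold, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    text: str
    relevance: float                      # semantic relevance score in [0, 1], assumed precomputed
    triple_scores: list = field(default_factory=list)  # per-triple fact scores
    penalty: float = 0.0                  # redundancy/conflict penalty, assumed precomputed


def triple_score(head_known, tail_known, relation_matches, relation_similarity=0.0):
    """Score one extracted triple against the knowledge graph (claim 7 rules)."""
    if not (head_known and tail_known):
        return 0.0                        # either entity missing from the graph
    if relation_matches:
        return 1.0                        # triple exists in full
    # Entities exist but the relation differs: map similarity in [0, 1]
    # linearly onto [0.3, 0.7]. The linear mapping is an assumption.
    sim = min(max(relation_similarity, 0.0), 1.0)
    return 0.3 + 0.4 * sim


def fact_consistency(scores):
    """Plain mean as a stand-in for the claim's weighted evaluation (assumption)."""
    return sum(scores) / len(scores) if scores else 0.0


def select_core(segments, alpha=0.6, beta=0.4, threshold=0.5, top_p=3):
    """Score, threshold, sort and truncate the candidates (claim 9 steps)."""
    scored = []
    for seg in segments:
        conf = alpha * seg.relevance + beta * fact_consistency(seg.triple_scores) - seg.penalty
        if conf >= threshold:             # segments below the threshold are discarded
            scored.append((conf, seg))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_p]


if __name__ == "__main__":
    candidates = [
        Segment("a", 0.9, [1.0, 1.0]),
        Segment("b", 0.8, [0.0]),
        Segment("c", 0.9, [1.0], penalty=0.5),
        Segment("d", 0.7, [triple_score(True, True, False, 0.5)]),
    ]
    for conf, seg in select_core(candidates):
        print(seg.text, round(conf, 2))
```

In this toy input, segment "b" is dropped because none of its facts are found in the graph, and segment "c" is dropped because its penalty outweighs its otherwise high scores.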

Description

Knowledge base content accurate generation method and system based on deep semantic matching

Technical Field

The invention belongs to the technical field of artificial intelligence, and in particular relates to a method and system for accurately generating knowledge base content based on deep semantic matching.

Background

With the rapid development of information technology, enterprises and organizations have accumulated massive amounts of unstructured and semi-structured data, such as technical documents, reports, emails and customer service records. Efficiently extracting, organizing and using the knowledge in these data to build an accurate and comprehensive enterprise knowledge base has become key to improving organizational efficiency and decision-making capability. Traditional knowledge base construction relies mainly on manual editing and suffers from low efficiency, high cost and delayed knowledge updates.

In recent years, the technical framework represented by retrieval-augmented generation (Retrieval-Augmented Generation, RAG) has markedly improved the accuracy of question-answering systems and content generation tasks by combining retrieval from an external knowledge base with the generative capability of a large language model (LLM). The standard RAG flow is typically "retrieve first, then generate": text fragments related to the user query are first recalled from the knowledge base using techniques such as vector retrieval, and these fragments are then provided to the LLM as context to guide it in generating the final answer.

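
The standard RAG flow described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not any particular system's implementation: the vectors are hand-written two-dimensional stand-ins for real embeddings, the function names `cosine`, `retrieve` and `build_prompt` are hypothetical, and the call to the LLM itself is omitted.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, index, k=2):
    """Return the k fragment texts whose vectors are most similar to the query.

    `index` is a list of (text, vector) pairs; a real system would use a
    vector search engine instead of this exhaustive scan.
    """
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, fragments):
    """Concatenate the recalled fragments directly into the prompt context."""
    context = "\n".join(f"- {frag}" for frag in fragments)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    toy_index = [
        ("pump manual", [1.0, 0.0]),
        ("holiday policy", [0.0, 1.0]),
        ("pump faults", [0.9, 0.1]),
    ]
    fragments = retrieve([1.0, 0.0], toy_index, k=2)
    print(build_prompt("How do I reset the pump?", fragments))
```

Note that `build_prompt` passes the recalled fragments to the model without checking them against each other or against any structured knowledge.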
However, existing RAG and related knowledge base generation techniques still face the following serious challenges and technical bottlenecks:

Insufficient semantic matching accuracy leads to poor quality of the retrieved content. Traditional text matching methods, whether based on sparse keyword representations (such as BM25) or on preliminary vector similarity calculation (such as a basic DSSM model), may run into a semantic gap when handling complex queries or long-tail questions and struggle to capture deep semantic associations accurately. The recalled knowledge segments may be related to the query topic without being the core answer, or may contain a large amount of noise and redundant information, which directly contaminates the input to the subsequent generation stage.

Poor knowledge fusion and coordination make information conflict and redundancy prominent. When multiple relevant knowledge segments are recalled from multiple data sources or documents, the prior art typically concatenates them directly and feeds them into the LLM. This simple approach ignores the inherent links, potential conflicts and factual inconsistencies between knowledge segments. Faced with such mixed and even contradictory information, the LLM has difficulty discriminating and fusing it effectively, and tends to produce content that is logically confused, factually wrong or insufficiently rigorous.

The "hallucination" problem of large language models is difficult to eradicate. Even with reference text, an LLM risks fabricating facts out of thin air (i.e., "hallucinating") when generating content. When the retrieved context is insufficient, of low quality or contradictory, the probability that the model hallucinates increases significantly. This is fatal for knowledge base construction tasks, which demand high factual accuracy, and severely damages the credibility of the knowledge base.

Structured knowledge guidance is lacking, which makes consistency hard to verify. Unstructured text itself lacks explicit entity relationships and logical constraints. Most existing methods rely only on the surface semantics of the text and ignore the structured knowledge hidden behind it (such as entities, relations and attributes). As a result, the system can hardly cross-verify the generated content at the level of facts and cannot guarantee that the generated knowledge is consistent with the accepted facts of the domain.

Therefore, a new knowledge base content generation strategy is needed to overcome the above drawbacks: one that achieves higher-precision semantic matching, intelligently fuses and verifies the retrieved multi-source information, and finally generates accurate and reliable knowledge base content in a highly controllable manner.

Disclosure of Invention

In view of the above, the present invention aims to overcome the shortcomings of the prior art and provides a method and system for accurately generating knowledge base content based on deep semantic matching, so as to significantly improve the accuracy, reliability and consistency of automatically generated knowledge content. To achieve the above purpose, the technical scheme of the invention is realized as follows:

In a first aspect, the present invention provides a knowledge base content accurate generation method based on deep