CN-121996800-A - Causal knowledge graph construction and question-answering system based on retrieval-augmented generation and a large language model

Abstract

A causal knowledge graph construction and question-answering system based on retrieval-augmented generation (RAG) and a large language model (LLM) comprises a multi-source heterogeneous knowledge base construction module, a retrieval-augmented generation module, a named entity recognition and causal triplet extraction module, and a knowledge fusion reasoning module. By combining RAG with an LLM, the invention performs RAG-driven two-round causal triplet extraction, which improves the coverage of causal relations and the confidence of direction judgments. It provides a conflict detection and graph fusion mechanism that keeps new and existing causal knowledge unified and consistent, and it builds a question-answering system on the causal graph that supports multi-hop causal reasoning and outputs reliable, interpretable answers.

Inventors

LIU LI
WU JIAXIN
LIAO JUN
BAI HAOTIAN
XU YUE
WANG YUAN
WANG XUEYING

Assignees

Chongqing University (重庆大学)

Dates

Publication Date: 20260508
Application Date: 20251223

Claims (10)

1. A causal knowledge graph construction and question-answering system based on retrieval-augmented generation and a large language model, characterized by comprising a multi-source heterogeneous knowledge base construction module, a retrieval-augmented generation module, a named entity recognition and causal triplet extraction module, and a knowledge fusion reasoning module; the multi-source heterogeneous knowledge base construction module processes multi-source original text data based on multi-level vector indexes to construct a causal knowledge graph; the retrieval-augmented generation module screens, based on a hybrid retrieval strategy, a retrieval context set related to the user input query from the causal knowledge graph; the named entity recognition and causal triplet extraction module concatenates the retrieval context set with the user input query and inputs the result into a large language model for sequence labeling to obtain an entity set; the named entity recognition and causal triplet extraction module generates candidate causal triplets based on the entity set and the retrieval context set, and extracts from them a candidate causal triplet set whose initial confidence is not less than a threshold; the named entity recognition and causal triplet extraction module performs counterfactual retrieval on the candidate causal triplets whose initial confidence is less than the threshold to obtain a final causal triplet set; the knowledge fusion reasoning module performs conflict detection and knowledge graph fusion on the final causal triplet set to obtain a fused causal knowledge graph; and the knowledge fusion reasoning module performs reasoning on the user input query based on the fused causal knowledge graph and generates a corresponding answer.
2. The causal knowledge graph construction and question-answering system based on retrieval-augmented generation and a large language model according to claim 1, wherein the multi-source heterogeneous knowledge base construction module constructs the causal knowledge graph by the following steps: A1, acquiring multi-source original text data and preprocessing it, the multi-source original text data comprising named entity annotations, causal pairs, timestamps, sources and domain labels; A2, constructing a named entity annotation data set from the preprocessed multi-source original text data, as given by formula (1), wherein A denotes the named entity annotation data set, m denotes the total number of texts, and the remaining symbols denote the preprocessed multi-source original text data and the named entity annotations; A3, constructing a causal relation data set from the preprocessed multi-source original text data, as given by formula (2), wherein C denotes the causal relation data set, and the remaining symbols denote the causal pair index, the total number of causal pairs, a causal pair consisting of a cause and an effect, the contextual evidence, and the confidence; A4, segmenting the preprocessed multi-source original text data with a sliding window, as given by formula (3), wherein the symbols denote a text in the causal knowledge graph, the set of segments of that text, the segment index, the total number of segments, an individual segment, and the sliding window length, o denotes the overlap rate, overlap denotes the overlap length between adjacent segments, and t denotes the timestamp of a segment; A5, generating a dense vector and a sparse vector for each segment, as given by formula (4), wherein the symbols denote the dense vector and the sparse vector, and the pre-trained Transformer model and the sparse embedding model, respectively; A6, constructing a vector index and an inverted index, as given by formula (5), wherein the symbols denote the vector index and the inverted index, the vector index function, the inverted index function, and the index parameters; A7, assigning a topic slice and a time-decay weight to each segment, as given by formula (6).
3. The causal knowledge graph construction and question-answering system based on retrieval-augmented generation and a large language model according to claim 2, wherein the sources of the multi-source original text data comprise public corpora, domain documents, user interaction logs and experimental records; the domain documents comprise standards, reports and papers; and the preprocessing comprises noise filtering, data merging and term normalization.
4. The causal knowledge graph construction and question-answering system based on retrieval-augmented generation and a large language model according to claim 3, wherein the noise filtering is given by formula (7); the data merging merges original text data whose similarity is greater than a similarity threshold, the similarity being given by formula (8); and the term normalization is given by formula (9), the terms matched by it being recorded as synonyms.
5. The causal knowledge graph construction and question-answering system based on retrieval-augmented generation and a large language model according to claim 1, wherein the retrieval context set related to the user input query is screened from the causal knowledge graph based on the hybrid retrieval strategy by the following steps: B1, retrieving texts related to the user input query from the causal knowledge graph based on the hybrid retrieval strategy and constructing a candidate set, the hybrid retrieval strategy comprising keyword retrieval, dense vector retrieval and sparse embedding retrieval; B2, calculating a hybrid retrieval score for every text in the candidate set, as given by formula (10); B3, calculating the maximal marginal relevance of every text in the candidate set based on the hybrid retrieval score, as given by formula (11), and removing from the candidate set the texts whose maximal marginal relevance is less than a marginal relevance threshold; B4, selecting texts from the candidate set under a context budget constraint to construct the retrieval context set, as given by formulas (12) and (13).
6. The causal knowledge graph construction and question-answering system based on retrieval-augmented generation and a large language model according to claim 1, wherein the entity set is obtained by the following steps: C1, concatenating the retrieval context set with the user input query and inputting the result into the large language model for sequence labeling, as given by formula (14); C2, normalizing the label sequence and the entities, as given by formula (15); C3, constructing the entity set based on the normalized label sequence, wherein each entity comprises a text, a type, a position and the normalized label sequence; the entity types include, but are not limited to, persons, organizations, places, events, times, physical objects and abstract concepts.
7. The causal knowledge graph construction and question-answering system based on retrieval-augmented generation and a large language model according to claim 1, wherein the candidate causal triplet set is extracted by the following steps: D1, generating candidate causal triplets based on the entity set and the retrieval context set, as given by formula (16); D2, calculating the initial confidence of every candidate causal triplet, as given by formulas (17) and (18); D3, constructing and initializing the candidate causal triplet set; D4, determining, for each candidate causal triplet, whether its initial confidence is not less than an initial confidence threshold, and if so, putting that candidate causal triplet into the candidate causal triplet set.
8. The causal knowledge graph construction and question-answering system based on retrieval-augmented generation and a large language model according to claim 7, wherein the final causal triplet set is obtained by the following steps: E1, retrieving a counter-evidence set by means of a counterfactual query, as given by formulas (19) and (20); E2, determining whether the counter-evidence set is empty; if not, proceeding to step E3, and if so, proceeding to step E5; E3, updating the initial confidence, as given by formula (21); E4, determining, for each candidate causal triplet, whether its updated confidence is not less than a counterfactual confidence threshold, and if so, putting that candidate causal triplet into the candidate causal triplet set; E5, determining whether an incomplete causal pair exists, and if so, completing the missing entity based on a template and the context to form a completion set; E6, constructing the final causal triplet set based on the candidate causal triplet set and the completion set, as given by formula (22), wherein the symbols denote the final causal triplet set and the candidate causal triplet set, causes denotes the causal relation, and a further symbol denotes the evidence chain record, which includes the source, segment, timestamp and retrieval path; and wherein the confidence of a final causal triplet is given by formula (23).
9. The causal knowledge graph construction and question-answering system based on retrieval-augmented generation and a large language model according to claim 1, wherein the fused causal knowledge graph is obtained by the following steps: F1, performing entity alignment between the final causal triplet set and the causal knowledge graph, as given by formula (24); F2, performing conflict detection between the entity-aligned final causal triplet set and the causal knowledge graph, the conflict detection being used to identify and eliminate causal triplets that are logically contradictory; F3, calculating the credibility of the final causal triplets that satisfy the conflict detection condition, as given by formula (26); F4, fusing the final causal triplets whose credibility is greater than a credibility threshold into the causal knowledge graph to obtain the fused causal knowledge graph.
10. The causal knowledge graph construction and question-answering system based on retrieval-augmented generation and a large language model according to claim 1, wherein the corresponding answer is generated by the following steps: G1, compiling the user input query into a graph query, and acquiring supporting texts based on the hybrid retrieval strategy; G2, performing multi-hop reasoning for the graph query on the fused causal knowledge graph, as given by formula (27); G3, calculating the reasoning confidence and generating the answer corresponding to the user input query, as given by formulas (28) and (29), wherein the symbols denote: the overall reasoning confidence of the generated answer, such that when it is less than a preset reasoning threshold, no answer to the user input query is generated and an insufficient-information prompt is returned; the weight coefficients of the path reasoning score, the text question-answer score and the evidence consistency score, respectively; the generated answer corresponding to the user input query q; LLM, the large language model; docs, the set of supporting texts; a document in the set docs; the answer matching score calculated for the query q and a document; the evidence consistency evaluation function; the set of all supporting evidence; K, a constant; the K most probable reasoning paths; and the K documents most relevant to the user input query q; the answer includes a natural language answer, reasoning paths, textual evidence and a confidence.
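The answer-generation steps of claim 10 can be illustrated with a minimal Python sketch. Formulas (27) to (29) are not reproduced in this text, so everything below is an assumption made for illustration only: a path is scored by the product of its edge confidences, the overall confidence is a weighted sum whose weights add up to 1, and the names `top_k_paths`, `overall_confidence` and `answer_or_abstain`, the weight values and the threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    path: float         # path reasoning score, assumed to lie in [0, 1]
    qa: float           # text question-answer score, assumed to lie in [0, 1]
    consistency: float  # evidence consistency score, assumed to lie in [0, 1]


def top_k_paths(graph, source, target, k=3, max_hops=3):
    """Return up to k simple paths source -> target, best score first.

    graph maps a cause to {effect: edge_confidence}. Scoring a path by the
    product of its edge confidences is an assumption, not formula (27).
    """
    results = []

    def dfs(node, path, score):
        if node == target and len(path) > 1:
            results.append((score, path))
            return
        if len(path) - 1 >= max_hops:
            return  # hop limit reached without finding the target
        for nxt, conf in graph.get(node, {}).items():
            if nxt not in path:  # keep paths simple (no cycles)
                dfs(nxt, path + [nxt], score * conf)

    dfs(source, [source], 1.0)
    results.sort(key=lambda item: item[0], reverse=True)
    return results[:k]


def overall_confidence(scores, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the three scores (assumed form, hypothetical weights)."""
    a, b, c = weights
    if abs(a + b + c - 1.0) > 1e-9:
        raise ValueError("weights are assumed to sum to 1")
    return a * scores.path + b * scores.qa + c * scores.consistency


def answer_or_abstain(answer, scores, threshold=0.6):
    """Return the answer, or an insufficient-information prompt below threshold."""
    conf = overall_confidence(scores)
    if conf < threshold:
        return {"answer": None, "message": "insufficient information",
                "confidence": conf}
    return {"answer": answer, "confidence": conf}


# Toy causal graph with abstract nodes; edge values are made-up confidences.
graph = {"A": {"B": 0.9, "C": 0.5}, "B": {"D": 0.8}, "C": {"D": 0.4}}
best_score, best_path = top_k_paths(graph, "A", "D", k=1)[0]
result = answer_or_abstain("A causes D via B", Scores(best_score, 0.8, 0.7))
```

In this sketch the reasoning path is returned next to the confidence, so a caller could surface both with the answer, in line with the claim's requirement that the answer include reasoning paths, textual evidence and a confidence.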

Description

Causal knowledge graph construction and question-answering system based on retrieval-augmented generation and a large language model

Technical Field

The invention relates to the technical field of artificial intelligence and natural language processing, and in particular to a causal knowledge graph construction and question-answering system based on retrieval-augmented generation and a large language model.

Background

As large language models have grown in scale, they have made significant progress in general question answering and information extraction. However, the knowledge fixed in their parameters suffers from limited timeliness and from hallucination, so verifiability and traceability are hard to guarantee in professional scenarios. Retrieval-augmented generation (RAG) is commonly introduced to address this, but existing RAG focuses on relevance-driven generation and does not model causality explicitly enough. As a result, structures such as "because ... so" and "cause ... result" are difficult to distill systematically into operable knowledge triplets, which restricts multi-hop reasoning and trustworthy question answering and tends to produce answers that are plausible but not verifiable.

Traditional sequence labeling methods (e.g., BiLSTM-CRF, BERT-CRF) perform well on short texts and in stable domains, but in long-text, cross-domain scenarios with variable domain terminology, entity boundaries, aliases, abbreviations and morphological variation reduce recall and accuracy. Introducing external evidence into the labeling stage can improve robustness, but a unified evidence selection and scheduling mechanism is lacking.
Existing causal extraction methods depend on syntactic or trigger-word templates or on unsupervised statistics such as pointwise mutual information (PMI). They handle long-distance dependencies and temporal order poorly across sentences and across documents, which easily leads to misjudged causal direction, insufficient evidence and missing consistency. In addition, single-round extraction is often limited by the evidence retrieved in the first round, so cross-document and cross-paragraph causal information is missed and the generated triplets are incomplete. Causal triplets extracted from different data sources can also contradict one another in direction, expression or context, and a unified conflict detection and fusion mechanism is lacking.

Disclosure of Invention

The invention aims to provide a causal knowledge graph construction and question-answering system based on retrieval-augmented generation and a large language model, which comprises a multi-source heterogeneous knowledge base construction module, a retrieval-augmented generation module, a named entity recognition and causal triplet extraction module, and a knowledge fusion reasoning module.

The multi-source heterogeneous knowledge base construction module processes multi-source original text data based on multi-level vector indexes to construct a causal knowledge graph. The retrieval-augmented generation module screens, based on a hybrid retrieval strategy, a retrieval context set related to the user input query from the causal knowledge graph. The named entity recognition and causal triplet extraction module concatenates the retrieval context set with the user input query and inputs the result into the large language model for sequence labeling to obtain an entity set.
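The screening performed by the retrieval module can be sketched in Python under stated assumptions. The scoring formulas are not reproduced in this text, so the weighted-sum hybrid score, the standard maximal-marginal-relevance form (relevance minus redundancy against already selected texts), the greedy selection under a token budget, and all names and parameter values (`hybrid_score`, `select_context`, `lam`, `mmr_min`) are hypothetical illustrations rather than the patented formulas.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def hybrid_score(keyword, dense, sparse, weights=(0.2, 0.5, 0.3)):
    """Combine keyword, dense and sparse scores (assumed weighted sum)."""
    wk, wd, ws = weights
    return wk * keyword + wd * dense + ws * sparse


def mmr(cand, selected, lam):
    """Relevance minus redundancy against the texts selected so far."""
    redundancy = max((cosine(cand["vec"], s["vec"]) for s in selected),
                     default=0.0)
    return lam * cand["score"] - (1 - lam) * redundancy


def select_context(cands, budget, lam=0.7, mmr_min=0.0):
    """Greedy MMR selection under a token budget.

    cands: list of dicts with keys "id", "score", "vec" and "tokens".
    Selection stops once the best remaining MMR value falls below mmr_min;
    a text that would exceed the budget is skipped.
    """
    selected, remaining, used = [], list(cands), 0
    while remaining:
        best = max(remaining, key=lambda c: mmr(c, selected, lam))
        if mmr(best, selected, lam) < mmr_min:
            break
        remaining.remove(best)
        if used + best["tokens"] <= budget:
            selected.append(best)
            used += best["tokens"]
    return selected


# Toy candidates: d2 duplicates d1, d3 covers a different direction.
cands = [
    {"id": "d1", "score": 0.90, "vec": [1.0, 0.0], "tokens": 50},
    {"id": "d2", "score": 0.85, "vec": [1.0, 0.0], "tokens": 50},
    {"id": "d3", "score": 0.60, "vec": [0.0, 1.0], "tokens": 50},
]
chosen = [c["id"] for c in select_context(cands, budget=100, lam=0.5)]
```

The near-duplicate candidate loses to the lower-scored but non-redundant one, which is the effect the marginal relevance threshold and the context budget are meant to achieve together.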
The named entity recognition and causal triplet extraction module generates candidate causal triplets based on the entity set and the retrieval context set, and extracts from them the candidate causal triplet set whose initial confidence is not less than a threshold. The named entity recognition and causal triplet extraction module performs counterfactual retrieval on the candidate causal triplets whose initial confidence is less than the threshold to obtain a final causal triplet set. The knowledge fusion reasoning module performs conflict detection and knowledge graph fusion on the final causal triplet set to obtain a fused causal knowledge graph. The knowledge fusion reasoning module performs reasoning on the user input query based on the fused causal knowledge graph and generates a corresponding answer.

Further, the multi-source heterogeneous knowledge base construction module constructs the causal knowledge graph by the following steps:

A1, acquiring multi-source original text data and preprocessing it. The multi-source original text data comprises named entity annotations, causal pairs, timestamps, sources and domain labels.

A2, constructing a named entity annotation data set based on the preprocessed multi-source original text data.

A3, constructing a causal relation data set based on the preprocessed multi-source original text data.

A4, segmenting