CN-122019746-A - Knowledge enhancement retrieval and generation method and system based on semantic association graph


Abstract

The application provides a knowledge enhancement retrieval and generation method and system based on a semantic association graph, and relates to the technical field of information retrieval. The method comprises: obtaining an initial text unit sequence of a document and performing iterative semantic merging based on a first threshold to generate a group of knowledge slices; calculating association degrees among the slices to construct a semantic association graph; and, when responding to a user query, obtaining initial slices through preliminary retrieval, performing expanded retrieval based on the semantic association graph and a second threshold to obtain expansion slices, and combining the initial and expansion slices to generate the final context information. By constructing a semantic knowledge network, the application addresses the knowledge-island problem, can provide a more comprehensive and deeper context for a large language model, and improves retrieval and generation quality.

Inventors

LI MING
YUAN YE
KONG FEI

Assignees

北京中绿讯科科技有限公司

Dates

Publication Date: 20260512
Application Date: 20251222

Claims (9)

1. A knowledge enhancement retrieval and generation method based on a semantic association graph, characterized by comprising the following steps: step one, adaptive semantic slice generation: acquiring an initial text unit sequence of an original document, and performing iterative semantic merging on the initial text unit sequence based on a preset first semantic similarity threshold to generate a group of knowledge slices; step two, semantic association graph construction: calculating, based on the group of knowledge slices, the semantic association degree between any two knowledge slices in the group, and constructing, according to the semantic association degree, a semantic association graph representing the association relations between the knowledge slices; step three, enhanced retrieval and context generation: in response to a user query, performing preliminary retrieval in the group of knowledge slices to obtain an initial slice set semantically related to the user query; obtaining from the group of knowledge slices, based on the semantic association graph and a preset second semantic similarity threshold, expansion slices whose association degree with the knowledge slices in the initial slice set is higher than the second semantic similarity threshold, to form an expansion slice set; and combining the initial slice set and the expansion slice set to generate context information for responding to the user query.
2. The method according to claim 1, wherein in step one, the iterative semantic merging specifically includes: sequentially processing each text unit in the initial text unit sequence, and calculating the semantic similarity between the text unit currently being processed and the knowledge slice currently under construction; when the semantic similarity is greater than the first semantic similarity threshold, appending the content of the text unit currently being processed to the end of the current knowledge slice; otherwise, starting a new knowledge slice with the text unit currently being processed.
3. The method of claim 1, wherein the calculating of the semantic similarity and the calculating of the semantic association degree each comprise: converting each text into a high-dimensional vector using a deep-learning-based sentence transformer model; and quantifying the similarity by computing the cosine similarity between the high-dimensional vectors corresponding to the two texts.
4. The method of claim 1, wherein the calculating of the semantic similarity and the calculating of the semantic association degree each comprise: constructing a term frequency-inverse document frequency (TF-IDF) vector for each text; and quantifying the similarity by computing the cosine similarity between the TF-IDF vectors corresponding to the two texts.
5. The method of claim 1, wherein the obtaining of the initial text unit sequence of the original document comprises: physically splitting the original document at preset sentence delimiters to generate the initial text unit sequence.
6. The method of claim 1, wherein the obtaining of the initial text unit sequence of the original document comprises: identifying the structural elements in the original document through a document parser, and taking each structural element as an initial text unit to form the initial text unit sequence.
7. The method according to claim 1, wherein in step two, the constructing of the semantic association graph includes: for each knowledge slice, calculating the semantic association degree only between that knowledge slice and a preset number of other knowledge slices to construct the semantic association graph.
8. The method of claim 1, wherein in step three, the combining of the initial slice set and the expansion slice set comprises: arranging the knowledge slices in the initial slice set and the expansion slice set in the order in which they appear in the original document, so as to form the context information.
9. A knowledge enhancement retrieval and generation system based on a semantic association graph, comprising: a slice generation module, configured to acquire an initial text unit sequence of an original document and to perform iterative semantic merging on the initial text unit sequence based on a preset first semantic similarity threshold so as to generate a group of knowledge slices; an association construction module, configured to calculate, based on the group of knowledge slices, the semantic association degree between any two knowledge slices in the group, and to construct, according to the semantic association degree, a semantic association graph representing the association relations between the knowledge slices; and an enhanced retrieval module, configured to, in response to a user query, perform preliminary retrieval in the group of knowledge slices to obtain an initial slice set, obtain, based on the semantic association graph and a preset second semantic similarity threshold, expansion slices whose association degree with the knowledge slices in the initial slice set is higher than the second semantic similarity threshold to form an expansion slice set, and combine the initial slice set and the expansion slice set to generate context information for responding to the user query.
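
Steps two and three of claim 1, together with the ordering rule of claim 8, can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the applicant's implementation: the function names (`tfidf_vectors`, `build_graph`, `expand`, `build_context`), the smoothed IDF formula, the sample slices and the threshold value 0.3 are all hypothetical, TF-IDF cosine similarity is used as in claim 4, and the initial slice set is assumed to come from a separate preliminary retrieval step that is not shown.

```python
import math
import re
from collections import Counter


def tfidf_vectors(texts):
    """Build one sparse TF-IDF vector (a dict) per text, with a smoothed IDF."""
    tokenized = [re.findall(r"\w+", t.lower()) for t in texts]
    n = len(texts)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in Counter(tokens).items()}
            for tokens in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def build_graph(vectors):
    """Weighted adjacency map: graph[i][j] is the association degree of slices i and j."""
    graph = {i: {} for i in range(len(vectors))}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            score = cosine(vectors[i], vectors[j])
            graph[i][j] = score
            graph[j][i] = score
    return graph


def expand(graph, initial, second_threshold):
    """Slices linked to an initial slice with association degree above the threshold."""
    return {j for i in initial for j, score in graph[i].items()
            if score > second_threshold and j not in initial}


def build_context(slices, initial, expanded):
    """Combine both sets and restore the original document order (claim 8)."""
    return "\n".join(slices[i] for i in sorted(set(initial) | set(expanded)))


# Hypothetical knowledge slices, listed in document order.
slices = [
    "Heavy rainfall raised the river level.",
    "The museum opens at nine on weekdays.",
    "The raised river level flooded the valley road.",
]
graph = build_graph(tfidf_vectors(slices))
initial = {0}  # assumed result of the preliminary retrieval for some user query
expanded = expand(graph, initial, second_threshold=0.3)
context = build_context(slices, initial, expanded)
```

In this sketch every pair of slices is scored, which is quadratic in the number of slices; the variant of claim 7 would instead score each slice against only a preset number of other slices.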

Description

Knowledge enhancement retrieval and generation method and system based on semantic association graph

Technical Field

The application relates to the technical field of computers, in particular to the fields of natural language processing, information retrieval and artificial intelligence, and specifically relates to a knowledge enhancement retrieval and generation method and system based on a semantic association graph.

Background

Retrieval-augmented generation (RAG) is a key technology in current large language model applications. Its basic flow is as follows: the original document is divided into a plurality of knowledge segments, which are stored in a vector database; when a user query is received, the most relevant segments are retrieved as context and provided to the large language model together with the original query, so as to generate more accurate and reliable answers.
However, in the document segmentation stage, the prior art mainly adopts fixed-size segmentation or recursive segmentation based on preset separators. Such methods essentially cut the text at physical boundaries and cannot truly understand the semantic structure of the content. As a result, a complete semantic unit is often forcibly split across different segments, which destroys the inherent integrity of the knowledge, impairs the large language model's understanding of the context, and degrades the fidelity of the generated results.
To solve this problem, one technical scheme proposes an adaptive text chunking method: it calculates the semantic association degree of adjacent text units, compares it with a preset threshold, and dynamically decides whether to merge the adjacent units, so as to generate knowledge segments with more coherent internal semantics.
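
The adaptive chunking scheme described above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the cited scheme's actual implementation: the names `bow_vector`, `cosine` and `merge_adjacent`, the sample sentences and the threshold 0.3 are hypothetical, and a plain bag-of-words count vector stands in for whatever semantic representation a real system would use.

```python
import math
import re
from collections import Counter


def bow_vector(text):
    """Bag-of-words term counts (a simple stand-in for a sentence embedding)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def merge_adjacent(units, threshold):
    """Merge a unit into the current segment when it is similar enough to the previous unit."""
    segments = []
    previous = None
    for unit in units:
        if previous is not None and cosine(bow_vector(previous), bow_vector(unit)) > threshold:
            segments[-1] += " " + unit  # extend the segment under construction
        else:
            segments.append(unit)       # start a new segment
        previous = unit
    return segments


# Hypothetical text units, e.g. sentences split at sentence delimiters.
units = [
    "Solar panels convert sunlight into electricity.",
    "Solar panels lose efficiency when the panels overheat.",
    "The cafeteria opens at nine.",
]
segments = merge_adjacent(units, threshold=0.3)
```

Here the first two units are merged and the third starts a new segment. The variant recited in claim 2 differs in one respect: it compares the incoming unit with the whole knowledge slice under construction rather than with the preceding unit alone.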
However, even if high-quality knowledge segments are generated in this way, the segments remain isolated from one another at the macroscopic level and are organized only linearly, so the deep logical relationships that span multiple segments in the document, such as causal, contrastive, and general-to-specific relationships, are completely lost. This information islanding means that, when responding to complex queries that require synthesizing information from several aspects, the system cannot provide a comprehensive and logically deep context, and the retrieval results remain one-sided.
In addition, another technical scheme uses a knowledge graph for associative retrieval in knowledge question answering. However, such schemes generally rely on huge general-purpose knowledge graphs built in advance from massive structured or semi-structured data, or require complex entity-relationship extraction procedures, and are therefore difficult to apply in a lightweight manner to scenarios that require instant, dynamic construction and retrieval of knowledge associations for a single document or a small number of specific documents.

Disclosure of Invention

The application aims to provide a knowledge enhancement retrieval method and system based on adaptive slicing and a semantic association graph. It addresses the technical problem that, in the prior art, it is difficult to construct and exploit the macroscopic association relations among slices while also preserving the internal semantic integrity of each knowledge slice; as a result, in retrieval-augmented generation applications, the context information provided to the large language model is incomplete and lacks depth, which ultimately impairs the accuracy, comprehensiveness and logical coherence of the generated answers.
The method comprises the following steps: step one, adaptive semantic slice generation: obtaining an initial text unit sequence of an original document, and performing iterative semantic merging on the initial text unit sequence based on a preset first semantic similarity threshold to generate a group of knowledge slices; step two, semantic association graph construction: calculating, based on the group of knowledge slices, the semantic association degree between any two knowledge slices in the group, and constructing, according to the semantic association degree, a semantic association graph representing the association relations between the knowledge slices; step three, enhanced retrieval and context generation: in response to a user query, performing preliminary retrieval in the group of knowledge slices to obtain an initial slice set related to the user query; obtaining from the group of knowledge slices, based on the semantic association graph and a preset second semantic similarity threshold, expansion slices whose association degree with the knowledge slices in the initial slice set is higher than the second semantic similarity threshold, to form an expansion slice set; and combining the initial slice set and the expansion slice set to generate context information for responding to the user query. Optionally, in step one, the iterative semantic merging specifically includes sequentially processing each text unit in the initial text unit sequence, calculatin