CN-122021654-A - Library knowledge discovery method and system based on semantic enhancement and large model collaboration


Abstract

The invention provides a library knowledge discovery method and system based on semantic enhancement and large-model collaboration. The method comprises: extracting entities from scientific documents and mapping the extracted entities to unified subject terms; constructing a continuous-time bipartite graph whose node set comprises document nodes and subject-term nodes and whose edges each represent an interaction event between a document node and a subject-term node; enumerating all unordered subject-term pairs in the subject-term set of each scientific document; when a target scientific document is newly added to the library's collection, generating a semantic message and updating, according to the semantic message, the explicit memory state of the nodes related to the target scientific document; for the nodes in each unordered subject-term pair, aggregating the information of the current node within its spatio-temporal neighborhood to obtain a temporal embedding; predicting, from the temporal embeddings of an unordered subject-term pair, the probability that the pair forms an interaction event; selecting a knowledge candidate set according to the probability; and inputting the candidate associations in the candidate set, together with their confidence scores and corresponding contexts, into a large language model to generate an information analysis report.
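
The first stages of the pipeline described above (bipartite edges, enumeration of unordered subject-term pairs, and selection of a candidate set by probability) can be sketched as follows. This is a minimal illustration, not the patented implementation: the document dictionary layout, the names `build_bipartite_edges`, `co_occurrence_stream` and `select_candidates`, and the threshold/top-k selection rule are all assumptions made for the example; the abstract only states that candidates are selected according to the probability.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class CoOccurrenceEvent:
    """One timestamped co-occurrence of two subject terms in a document."""
    term_a: str
    term_b: str
    timestamp: float
    doc_id: str


def build_bipartite_edges(docs):
    """Return (doc_id, term, timestamp) edges of the continuous-time bipartite graph."""
    return [(d["id"], term, d["time"]) for d in docs for term in sorted(set(d["terms"]))]


def co_occurrence_stream(docs):
    """Enumerate every unordered subject-term pair of each document, in time order."""
    events = []
    for d in sorted(docs, key=lambda d: d["time"]):
        # Sorting the terms gives each unordered pair one canonical orientation.
        for a, b in combinations(sorted(set(d["terms"])), 2):
            events.append(CoOccurrenceEvent(a, b, d["time"], d["id"]))
    return events


def select_candidates(scored_pairs, threshold=0.5, top_k=10):
    """Keep the highest-probability pairs (assumed rule: threshold, then top-k)."""
    kept = [(pair, p) for pair, p in scored_pairs.items() if p >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)[:top_k]


# Hypothetical toy collection: two documents with normalized subject terms.
docs = [
    {"id": "d1", "time": 1.0, "terms": ["graph neural network", "link prediction", "library"]},
    {"id": "d2", "time": 2.0, "terms": ["library", "link prediction"]},
]
edges = build_bipartite_edges(docs)
events = co_occurrence_stream(docs)
```

A document with n subject terms contributes n(n-1)/2 events, so the stream grows quadratically in the number of terms per document; the probabilities passed to `select_candidates` would come from the trained prediction model.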

Inventors

WANG QINGQING
CHEN QIUJU

Assignees

University of Science and Technology of China (中国科学技术大学)

Dates

Publication Date: 20260512
Application Date: 20260415

Claims (10)

1. A library knowledge discovery method based on semantic enhancement and large-model collaboration, the method comprising: extracting subject terms from each scientific document in heterogeneous collection resources; constructing a continuous-time bipartite graph G = (V, E, T) based on the occurrence time of each scientific document, wherein the node set V comprises document nodes and subject-term nodes, and each edge in the edge set E represents an interaction event, occurring at a timestamp in T, between a document node and a subject-term node; enumerating, for the subject-term set of each scientific document in the continuous-time bipartite graph, all unordered subject-term pairs to form a temporal co-occurrence event stream graph; when a target scientific document is newly added to the library's collection and triggers an unordered subject-term pair in the temporal co-occurrence event stream graph, generating a semantic message using a multi-layer perceptron and updating, according to the semantic message, the explicit memory state of the subject-term nodes related to the target scientific document, wherein the semantic message fuses the historical explicit memory states of the related subject-term nodes, the Fourier time-feature encoding of the interval between the current occurrence time and the previous occurrence time of the unordered subject-term pair, and the semantic features of the target scientific document; for each node in the temporal co-occurrence event stream graph, aggregating the information of the current node within its spatio-temporal neighborhood to obtain a temporal embedding vector of the current node; inputting the temporal embedding vectors of the nodes in an unordered subject-term pair into a pre-trained semantically enhanced temporal graph network model to predict the probability that an interaction event forms between the subject-term nodes, and constructing a knowledge candidate set according to the probability; and inputting the candidate associations in the knowledge candidate set, together with their confidence scores and corresponding contexts, into a preset large language model to generate an information analysis report.
2. The method of claim 1, wherein extracting subject terms from each scientific document in the heterogeneous collection resources comprises: extracting entities from each scientific document in the heterogeneous collection resources; and normalizing the entities extracted from each scientific document against a preset domain thesaurus, so that different expressions of the same term in the scientific documents are normalized into a single uniformly expressed subject term.
3. The method of claim 1, wherein, prior to constructing the continuous-time bipartite graph based on the occurrence time of each scientific document, the method further comprises: extracting the title, abstract, keywords, classification number and/or citation relations of each scientific document, concatenating them into a long text sequence, and performing deep semantic feature encoding on the long text sequence to obtain the corresponding deep semantic features.
4. The method of claim 1, wherein generating a semantic message using a multi-layer perceptron comprises: extracting the deep semantic features of the target scientific document; calculating the Fourier time-feature encoding of the interval between the current occurrence time and the previous occurrence time of the unordered subject-term pair triggered by the target scientific document; and concatenating the historical explicit memory states of the subject-term nodes related to the target scientific document, the Fourier time-feature encoding of the related unordered subject-term pair, and the deep semantic features of the target scientific document, and inputting the resulting concatenated vector into a preset multi-layer perceptron to generate the semantic message $m_{ij}(t)$: $m_{ij}(t) = \mathrm{MLP}\left(s_i(t^-) \,\|\, s_j(t^-) \,\|\, \Phi(t - t') \,\|\, x_d\right)$; wherein $\|$ denotes the vector concatenation operation, $\mathrm{MLP}$ is the multi-layer perceptron, $s_i(t^-)$ is the historical explicit memory state of node $i$, $s_j(t^-)$ is the historical explicit memory state of node $j$, $\Phi(t - t')$ is the Fourier time-feature encoding of the interval between the current time $t$ and the previous occurrence time $t'$ of the unordered subject-term pair, and $x_d$ is the deep semantic feature of the target scientific document.
5. The method of claim 4, wherein calculating the Fourier time-feature encoding of the interval between the current occurrence time and the previous occurrence time of the unordered subject-term pair triggered by the target scientific document comprises: for the time interval $\Delta t$ between the current occurrence time and the previous occurrence time of the unordered subject-term pair in the temporal co-occurrence event stream graph, mapping the time interval $\Delta t$ to a high-dimensional vector using Fourier features, yielding a Fourier time-feature encoding that characterizes the periodic patterns of disciplinary knowledge propagation, as follows: [equation missing in source]; wherein $\omega$ denotes the learnable frequency parameters and $d$ is the preset time-encoding dimension.
6. The method of claim 1, wherein updating the explicit memory state of the subject-term nodes related to the target scientific document according to the semantic message comprises: aggregating all semantic messages received at time $t$ by a subject-term node $i$ related to the target scientific document to obtain an aggregated semantic message $\bar{m}_i(t)$; and using a gated recurrent unit (GRU) as the memory updater to update the explicit memory state of the node with the aggregated semantic message: $s_i(t) = \mathrm{GRU}\left(\bar{m}_i(t),\, s_i(t^-)\right)$, wherein $s_i(t^-)$ is the explicit memory state of node $i$ before the update.
7. The method of claim 1, wherein aggregating, for the nodes in the temporal co-occurrence event stream graph, the information of the current node within its spatio-temporal neighborhood to obtain the temporal embedding vector of the current node comprises: adopting a graph attention mechanism to dynamically assign weights $\alpha_{ij}$ according to the semantic similarity between the current node and its neighboring nodes and the recency of their interaction times; and, according to the weights $\alpha_{ij}$, aggregating the explicit memory states, Fourier time-feature encodings and deep semantic features of the current node and its neighboring nodes within the spatio-temporal neighborhood to obtain the temporal embedding vector $z_i(t)$ of the current node: [equation missing in source]; wherein $x_j$ is the text semantic vector of neighboring node $j$ of node $i$, and $\Phi(t - t_j)$ is the Fourier time-feature encoding of the time difference between the current node and neighboring node $j$ within the spatio-temporal neighborhood.
8. The method of claim 1, wherein, for any unordered subject-term pair, inputting the temporal embedding vectors of its nodes into the pre-trained semantically enhanced temporal graph network model to predict the probability that an interaction event forms between the nodes of the pair comprises: computing the inner product of the temporal embedding vectors of nodes $u$ and $v$ in the unordered subject-term pair to measure the similarity of the two vectors, and mapping the inner product to a probability $\hat{p}_{uv}$ between 0 and 1 through a sigmoid function: $\hat{p}_{uv} = \sigma\left(z_u(T)^{\top} z_v(T)\right)$, wherein $z_u(T)$ and $z_v(T)$ are the temporal embedding vectors of nodes $u$ and $v$ at the latest time $T$, $\top$ denotes the matrix/vector transpose, and $\sigma$ is the sigmoid function; and constructing the loss function of the semantically enhanced temporal graph network model using a binary cross-entropy loss, and strengthening the model's ability to identify real associations through a negative sampling strategy, wherein the loss function is: $\mathcal{L} = -\sum_{(u,v) \in E^{+}} \log \sigma\left(z_u^{\top} z_v\right) - \sum_{(u,v') \in E^{-}} \log\left(1 - \sigma\left(z_u^{\top} z_{v'}\right)\right)$, wherein $E^{+}$ is the positive-sample edge set of unordered subject-term pairs between which an interaction event actually exists in the training data, $E^{-}$ is the negative-sample edge set drawn from unordered subject-term pairs between which no real interaction event exists in the training data, $v'$ denotes a negative-sample node in the negative-sample edge set, $(u, v') \in E^{-}$ indicates that no real co-occurrence relation exists between subject-term node $u$ and subject-term node $v'$, and $z_u$, $z_v$ and $z_{v'}$ denote the temporal embedding vectors of subject-term nodes $u$, $v$ and $v'$ at the current time, respectively.
9. A library knowledge discovery system based on semantic enhancement and large-model collaboration, the system comprising: a multi-source data fusion processing module, configured to extract subject terms from each scientific document in heterogeneous collection resources; a dynamic graph construction module, configured to construct a continuous-time bipartite graph G = (V, E, T) based on the occurrence time of each scientific document, wherein the node set V comprises document nodes and subject-term nodes, and each edge in the edge set E represents an interaction event, occurring at a timestamp in T, between a document node and a subject-term node; a co-occurrence event stream generation module, configured to enumerate, for the subject-term set of each scientific document in the continuous-time bipartite graph, all unordered subject-term pairs to form a temporal co-occurrence event stream graph; a graph structure evolution update module, configured to, when a target scientific document is newly added to the library's collection and triggers an unordered subject-term pair in the temporal co-occurrence event stream graph, generate a semantic message using a multi-layer perceptron and update, according to the semantic message, the explicit memory state of the subject-term nodes related to the target scientific document, wherein the semantic message fuses the historical explicit memory states of the related subject-term nodes, the Fourier time-feature encoding of the interval between the current occurrence time and the previous occurrence time of the unordered subject-term pair, and the semantic features of the target scientific document; a temporal aggregation module, configured to aggregate, for each node in the temporal co-occurrence event stream graph, the information of the current node within its spatio-temporal neighborhood to obtain a temporal embedding vector of the current node; a candidate set prediction module, configured to input the temporal embedding vectors of the nodes in an unordered subject-term pair into a pre-trained semantically enhanced temporal graph network model to predict the probability that an interaction event forms between the subject-term nodes, and to construct a knowledge candidate set according to the probability; and a report generation module, configured to input the candidate associations in the knowledge candidate set, together with their confidence scores and corresponding contexts, into a preset large language model to generate an information analysis report.
10. The system of claim 9, wherein the system further comprises: a deep semantic feature extraction module, configured to, before the dynamic graph construction module constructs the continuous-time bipartite graph based on the occurrence time of each scientific document, extract the title, abstract, keywords, classification number and/or citation relations of each scientific document, concatenate them into a long text sequence, and perform deep semantic feature encoding on the long text sequence to obtain the corresponding deep semantic features.
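
The computations referenced in claims 4 to 6 and 8 can be illustrated with a small pure-Python sketch. It is not the patented implementation: the cos/sin form and scaling of the Fourier encoding, the single ReLU hidden layer, the random untrained weights, the averaging in the loss, and every function name are assumptions chosen for illustration. The attention aggregation of claim 7 is omitted, so the embeddings passed to `link_probability` are stand-ins.

```python
import math
import random


def fourier_time_encoding(delta_t, omegas):
    """Encode a time interval as scaled cos/sin features, one pair per frequency."""
    scale = math.sqrt(1.0 / (2 * len(omegas)))
    return [scale * f(w * delta_t) for w in omegas for f in (math.cos, math.sin)]


def linear(weights, bias, x):
    """Dense layer: weights is a list of rows, each as long as x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def semantic_message(s_i, s_j, phi, x_doc, layer1, layer2):
    """MLP over the concatenation [s_i || s_j || phi || x_doc] (claim 4)."""
    hidden = [max(0.0, h) for h in linear(*layer1, s_i + s_j + phi + x_doc)]
    return linear(*layer2, hidden)


def gru_update(message, state, gates):
    """Standard GRU cell: aggregated message as input, memory as hidden state (claim 6)."""
    (Wz, Uz, bz), (Wr, Ur, br), (Wh, Uh, bh) = gates
    zero = [0.0] * len(state)
    z = [sigmoid(a + b) for a, b in zip(linear(Wz, bz, message), linear(Uz, zero, state))]
    r = [sigmoid(a + b) for a, b in zip(linear(Wr, br, message), linear(Ur, zero, state))]
    gated = [ri * si for ri, si in zip(r, state)]
    cand = [math.tanh(a + b) for a, b in zip(linear(Wh, bh, message), linear(Uh, zero, gated))]
    return [(1 - zi) * si + zi * ci for zi, si, ci in zip(z, state, cand)]


def link_probability(z_u, z_v):
    """Sigmoid of the inner product of two temporal embeddings (claim 8)."""
    return sigmoid(sum(a * b for a, b in zip(z_u, z_v)))


def bce_loss(positives, negatives, emb, eps=1e-12):
    """Binary cross-entropy over positive and sampled negative pairs (mean is assumed)."""
    loss = -sum(math.log(link_probability(emb[u], emb[v]) + eps) for u, v in positives)
    loss -= sum(math.log(1.0 - link_probability(emb[u], emb[v]) + eps) for u, v in negatives)
    return loss / (len(positives) + len(negatives))


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Toy run with hypothetical sizes; weights are random, not trained.
rng = random.Random(0)
dim, doc_dim, hidden_dim = 4, 3, 8
omegas = [1.0, 0.1]                      # learnable in the claims, fixed here
phi = fourier_time_encoding(5.0, omegas)
s_i, s_j = [0.0] * dim, [0.0] * dim      # memories start at zero
x_doc = [0.2, -0.1, 0.4]                 # stand-in for the deep semantic feature
in_dim = 2 * dim + len(phi) + doc_dim
layer1 = (rand_matrix(rng, hidden_dim, in_dim), [0.0] * hidden_dim)
layer2 = (rand_matrix(rng, dim, hidden_dim), [0.0] * dim)
msg = semantic_message(s_i, s_j, phi, x_doc, layer1, layer2)
gates = [(rand_matrix(rng, dim, dim), rand_matrix(rng, dim, dim), [0.0] * dim) for _ in range(3)]
s_i_new = gru_update(msg, s_i, gates)
```

In practice these pieces would be tensor operations in a deep learning framework with trained parameters; plain lists are used here only to keep the example dependency-free.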

Description

Library knowledge discovery method and system based on semantic enhancement and large model collaboration

Technical Field

The invention relates to the technical field of artificial intelligence and smart libraries, and in particular to a library knowledge discovery method and system based on semantic enhancement and large-model collaboration.

Background

With the advent of the digital publishing era, the role of libraries is shifting from traditional document storage centers to intelligent knowledge service centers. Faced with exponentially growing electronic journals, academic papers and industry reports, library users, especially researchers, find it difficult to quickly identify the evolution of disciplines and frontier hot spots in vast collection resources. Traditional information services mostly rely on manual literature reviews or static keyword-based bibliometric analysis, which is time-consuming and laborious and often lags behind the real-time evolution of scientific discovery. What is needed is an intelligent system that deeply understands the semantics of collection resources, dynamically captures the flow of disciplinary knowledge, and provides readers with forward-looking knowledge navigation services. However, in current intelligent information mining technology, how to effectively integrate the evolution information of the structured citation network with the deep semantics of unstructured text remains a serious challenge in this technical field.

One class of prior art computes distances with a dynamic time warping algorithm, based on the shape similarity of the word-frequency time series of keywords in scientific literature, and thereby automatically clusters and identifies the evolution trends of keyword frequency.
This approach focuses on mining explicit statistical regularities and can effectively capture changes in the volume of hot terms. However, because it ignores the deep semantic associations behind keywords and the citation structure among documents, it has difficulty finding implicit disciplinary knowledge whose word frequency has not yet surged but which already has latent logical associations.

A second class of prior art constructs a multi-source heterogeneous scientific knowledge graph, manages scientific literature through a workflow of knowledge modeling, extraction, fusion, verification and storage, and uses knowledge graph technology to mine associations between scientific entities. Although this approach organizes scientific knowledge in a structured form, its graph construction often depends on complex rules or predefined schemas and usually yields a static knowledge snapshot. It is therefore difficult to capture, at fine granularity, how disciplinary concepts evolve continuously over time, and its predictions of emerging research trends lag behind.

With the development of large-model technology, the prior art has also proposed realizing open-domain knowledge discovery by constructing an input template comprising a head entity, a prompt and a mask, and using a pre-trained language model to predict the missing entity in a triple. This approach exploits the strong semantic generalization ability of deep learning models and reduces the dependence on manual feature engineering. However, generation based on a language model is prone to hallucination and lacks the constraint of the underlying citation network structure, so the generated disciplinary hypotheses lack conclusive evidentiary support from the collection literature.
In summary, existing statistical mining methods remain largely at the level of surface word-frequency analysis and struggle to reach deep semantic logic; static knowledge-graph methods struggle to capture fine-grained evolution in continuous time; and pure large-model generation methods, although semantically rich, lack awareness of the global academic ecosystem. These problems limit the practical application of the prior art to deep knowledge services in smart libraries.

Disclosure of Invention

In view of the foregoing, the present invention provides a library knowledge discovery method and system based on semantic enhancement and large-model collaboration that overcomes the problems described above. The invention provides a library knowledge discovery method based on semantic enhancement and large-model collaboration, comprising: extracting subject terms from each scientific document in heterogeneous collection resources; constructing a continuous-time bipartite graph G = (V, E, T) based on the occurrence time of each scientific document, wherein the node set V comprises document nodes and subject-term nodes, and edges in the edge set E rep