CN-121998053-A - Knowledge graph construction method based on word vector similarity and entity promotion
Abstract
The invention provides a knowledge graph construction method based on word vector similarity and entity promotion, comprising the following steps: slicing an input text based on a SeqLab-paradigm model to obtain a set of semantic paragraph blocks; inputting the set of semantic paragraph blocks into a pre-trained large language model for entity recognition and extraction to obtain an entity list; concatenating the obtained global semantic vector, local semantic vector and type vector to obtain the word vector representation of the corresponding entity; calculating the cosine similarity between that word vector and the existing entity vectors in the knowledge graph; promoting the best entity and adding it to the best-entity set according to its context consistency score; and generating triples and writing the generated triples into the knowledge graph.
Inventors
- FU JIAHUI
- WANG KUN
- CUI HONGYAN
- SU XIAODAN
- CHAO YUNYUN
- SONG YUXIANG
- JIA WEIDONG
Assignees
- Zhengzhou Xinda Institute of Advanced Technology (郑州信大先进技术研究院)
Dates
- Publication Date: 2026-05-08
- Application Date: 2025-12-26
Claims (8)
- 1. A knowledge graph construction method based on word vector similarity and entity promotion, characterized by comprising the following steps: slicing the input text based on a SeqLab-paradigm model to obtain a set of semantic paragraph blocks; inputting the set of semantic paragraph blocks into a pre-trained large language model for entity recognition and extraction to obtain an entity list; inputting the semantic paragraph blocks and the entity list into a semantic coding model pre-trained on a domain training set, extracting the global semantic vector, local semantic vector and type vector of each entity, and concatenating them to obtain the word vector representation of the corresponding entity; calculating the cosine similarity between the word vector representation of each entity and the existing entity vectors in the knowledge graph, and adding the entity to a candidate entity set if the cosine similarity exceeds a preset threshold; calculating the context consistency score of each candidate entity in the candidate entity set within the input text and, if the score exceeds a preset context consistency threshold, promoting the candidate to a best entity and adding it to the best-entity set; and performing dependency syntax analysis on the best-entity set obtained from the same semantic paragraph block together with the existing entities, extracting the relation types among the entities, generating triples, and writing the generated triples into the knowledge graph.
- 2. The knowledge graph construction method based on word vector similarity and entity promotion of claim 1, wherein obtaining the global semantic vector, local semantic vector and type vector representations comprises: extracting the whole-paragraph semantics output by the semantic coding model as the global semantic vector; extracting, from the output of the semantic coding model, the vector representations of the word sequence corresponding to each entity in the entity list, and applying average pooling over the word sequence to obtain the local semantic vector of that entity; and querying a trainable type embedding matrix, according to the fine-grained type identifier predefined in the knowledge base for each entity in the entity list, to obtain the type vector representation of each entity.
- 3. The knowledge graph construction method based on word vector similarity and entity promotion of claim 1, wherein obtaining the set of semantic paragraph blocks comprises the following steps: performing coarse-grained breakpoint detection on the input text with a SeqLab-paradigm model, and predicting a breakpoint probability P_break at each sentence end, segment end and title position; if the input text contains a table, LaTeX content or a list, identifying and marking the forced breakpoints of that structured content by regular-expression rule matching to obtain forced breakpoint marks; and, based on the probability values P_break and the forced breakpoint marks, computing the optimal segmentation with a dynamic programming algorithm to obtain the set of semantic paragraph blocks, wherein the objective function of the dynamic programming is: J = Σᵢ [log P_break(i) − λ·(len(i) − L₀)²], where i denotes a breakpoint position, len(i) is the length of the corresponding block, L₀ denotes the target block length, and λ denotes a preset length-penalty coefficient.
- 4. The knowledge graph construction method based on word vector similarity and entity promotion of claim 1, wherein obtaining the context consistency score comprises: acquiring context information of a candidate entity e in the candidate entity set; and calculating the sentence-level consistency SentSim, the topic consistency TopicSim and the graph-neighbor consistency NeighborSim, wherein: the representation vector Hₑ of the sentence containing the candidate entity e is computed, and its cosine similarity to the representation vector H(e) of the definition or abstract sentence of the candidate entity e in the knowledge base is taken as the sentence-level consistency, the formula being: SentSim = cos(Hₑ, H(e)); the cosine similarity between the average vector Tₑ of the top-N keywords with the highest TF-IDF weights in the semantic paragraph block and the encyclopedia-entry keyword vector T(e) of the candidate entity e is taken as the topic consistency, a unit vector being substituted when the candidate entity e has no keyword vector, the formula being: TopicSim = cos(Tₑ, T(e)); the average cosine similarity between the vector representation vₑ of the candidate entity e and the vector representations {v(eᵢ)} of its neighbor entity set {eᵢ} in the knowledge graph is taken as the graph-neighbor consistency, the formula being: NeighborSim = (1/|{eᵢ}|) · Σᵢ cos(vₑ, v(eᵢ)); and the context consistency score Cₑ of the candidate entity e is computed by the weighted formula: Cₑ = w₁·SentSim + w₂·TopicSim + w₃·NeighborSim, where w₁, w₂ and w₃ are weight coefficients dynamically configured according to the text type.
- 5. The knowledge graph construction method based on word vector similarity and entity promotion of claim 2, wherein training the trainable type embedding matrix comprises: extracting the fine-grained type labels corresponding to the entities from the knowledge base to construct a type set; allocating a trainable embedding vector to each type in the type set to form the type embedding matrix, and initializing it randomly; and, during model training, fine-tuning the word vector representations of the entities based on a contrastive loss function and optimizing the parameters of the trainable type embedding matrix.
- 6. A knowledge graph construction system based on word vector similarity and entity promotion, comprising a segmentation module, an entity extraction module, a word vector generation module, an entity promotion module and a relation construction module, wherein: the segmentation module is used for slicing the input text based on a SeqLab-paradigm model to obtain a set of semantic paragraph blocks; the entity extraction module is used for inputting the set of semantic paragraph blocks into a pre-trained large language model for entity recognition and extraction to obtain an entity list; the word vector generation module is used for inputting the semantic paragraph blocks and the entity list into a semantic coding model pre-trained on a domain training set, extracting the global semantic vector, local semantic vector and type vector of each entity, and concatenating them to obtain the word vector representation of the corresponding entity; the entity promotion module is used for calculating the cosine similarity between the word vector representation of each entity and the existing entity vectors in the knowledge graph, adding the entity to a candidate entity set if the cosine similarity exceeds a preset threshold, calculating the context consistency score of each candidate entity in the candidate entity set within the input text and, if the score exceeds a preset context consistency threshold, promoting the candidate to a best entity and adding it to the best-entity set; and the relation construction module is used for performing dependency syntax analysis on the best-entity set obtained from the same semantic paragraph block together with the existing entities, extracting the relation types among the entities, generating triples, and writing the generated triples into the knowledge graph.
- 7. A computer device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus; and the processor, when executing a program stored in the memory, implements the knowledge graph construction method based on word vector similarity and entity promotion according to any one of claims 1 to 5.
- 8. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the knowledge graph construction method based on word vector similarity and entity promotion according to any one of claims 1 to 5.
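The dynamic-programming segmentation of claim 3 can be sketched as follows. This is an illustrative implementation under assumptions the claims do not fix: the objective is taken as maximizing the sum of log P_break(i) minus the quadratic length penalty λ·(len(i) − L₀)², lengths are measured in characters, all breakpoint probabilities are nonzero, and the function name `segment` and its signature are hypothetical.

```python
import math

def segment(p_break, sent_lens, forced, L0=200, lam=1e-4):
    """Choose breakpoints over n sentences by dynamic programming.

    p_break[i]  : predicted probability (> 0) that a block may end after sentence i
    sent_lens[i]: length of sentence i (e.g. in characters)
    forced[i]   : True if a structural rule (table/LaTeX/list) forces a break after i
    Maximizes  sum over chosen breakpoints j of
        log p_break(j-1) - lam * (block_length - L0)**2
    """
    n = len(p_break)
    # best[j] = best score for segmenting sentences 0..j-1; bp[j] = previous cut
    best = [0.0] + [-math.inf] * n
    bp = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            # a forced breakpoint strictly inside the block (i, j) forbids it
            if any(forced[k] for k in range(i, j - 1)):
                continue
            blk_len = sum(sent_lens[i:j])
            score = best[i] + math.log(p_break[j - 1]) - lam * (blk_len - L0) ** 2
            if score > best[j]:
                best[j], bp[j] = score, i
    # walk back through the backpointers to recover the cut positions
    cuts, j = [], n
    while j > 0:
        cuts.append(j)
        j = bp[j]
    return sorted(cuts)
```

With four 100-character sentences, a target length L₀ = 200 and high breakpoint probability after sentences 2 and 4, the sketch cuts the text into two blocks of two sentences each, while a forced breakpoint overrides the probabilistic preference.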
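The word-vector construction of claim 2 (global vector, mean-pooled local vector, type-embedding lookup, concatenation) can be sketched in a few lines. This assumes NumPy arrays for all vectors and a plain dict standing in for the trainable type embedding matrix; the function name and signature are hypothetical.

```python
import numpy as np

def entity_word_vector(global_vec, token_vecs, type_emb, type_id):
    """Claim-2 style word vector: [global ; mean-pooled local ; type]."""
    # average pooling over the token vectors of the entity's word sequence
    local_vec = np.mean(np.asarray(token_vecs, dtype=float), axis=0)
    # lookup in the (trainable) type embedding matrix by fine-grained type id
    type_vec = np.asarray(type_emb[type_id], dtype=float)
    return np.concatenate([np.asarray(global_vec, dtype=float),
                           local_vec, type_vec])
```

For example, a 4-dimensional global vector, two 2-dimensional token vectors and a 2-dimensional type embedding yield an 8-dimensional entity representation.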
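The weighted context-consistency score of claim 4 can be sketched as below, assuming all inputs are precomputed NumPy vectors. The treatment of a missing keyword vector (substituting a normalized all-ones unit vector for T(e)) is one possible reading of the claim, and the default weights are illustrative, not specified by the source.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity of two nonzero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def consistency_score(h_e, h_kb, t_para, t_entry, v_e, neighbor_vecs,
                      w=(0.4, 0.3, 0.3)):
    """C_e = w1*SentSim + w2*TopicSim + w3*NeighborSim (claim-4 sketch)."""
    sent_sim = cos_sim(h_e, h_kb)            # SentSim = cos(H_e, H(e))
    if t_entry is None:
        # claim-4 fallback: substitute a unit vector when the entity has no
        # encyclopedia keyword vector (interpretation assumed here)
        t_entry = np.ones_like(t_para) / np.sqrt(len(t_para))
    topic_sim = cos_sim(t_para, t_entry)     # TopicSim = cos(T_e, T(e))
    # NeighborSim = average of cos(v_e, v(e_i)) over the neighbor set
    neigh_sim = (sum(cos_sim(v_e, v) for v in neighbor_vecs) / len(neighbor_vecs)
                 if neighbor_vecs else 0.0)
    w1, w2, w3 = w
    return w1 * sent_sim + w2 * topic_sim + w3 * neigh_sim
```

When every component similarity is 1 and the weights sum to 1, the score is exactly 1, which makes the weighting easy to sanity-check.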
Description
Knowledge graph construction method based on word vector similarity and entity promotion
Technical Field
The invention relates to the technical field of knowledge graph construction, and in particular to a knowledge graph construction method based on word vector similarity and entity promotion.
Background
With the deep integration of big data and artificial intelligence technology, the knowledge graph, as a core carrier of structured semantic knowledge, has shown key value in fields such as intelligent search, question-answering systems and decision reasoning, and has become an inevitable product of information technology evolving from data interconnection to intelligent interconnection. At the enterprise level, it can connect customer, product and supply-chain data scattered across different systems into a unified knowledge view, enabling intelligent risk control, personalized recommendation and precision marketing. At the industry level, medical knowledge graphs assist clinical decisions, financial knowledge graphs reveal risk-transmission paths, and smart-city knowledge graphs optimize the scheduling of public resources.
The patent with grant publication No. CN119204182B provides a method, system and storage medium for constructing a knowledge graph in the civil aviation service field: a BERT-BiLSTM-CRF model performs entity extraction to obtain mutually related entity vector, feature vector and label sequences; sentence vectors and the entity vectors they contain are extracted by a convolutional neural network model; an entity-relation-entity triple database is obtained through identification and extraction with n filters; label information is integrated and stored in the entities as attribute values through a conditional-random-field entity-node integration model; and the triple database and the integrated entity attribute values are linked and fused to construct the civil aviation knowledge graph. The knowledge graph built by this existing scheme, however, lacks the capability of continuous evolution and automatic updating and adapts poorly to dynamically changing environments. Moreover, its semantic similarity calculation relies on the single index of cosine similarity, which characterizes subtle semantic differences and complex relations insufficiently, limiting the accuracy and completeness of the graph. The invention therefore provides a knowledge graph construction method based on word vector similarity and entity promotion.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a knowledge graph construction method based on word vector similarity and entity promotion.
The method first performs word segmentation and vectorized representation of the text and inputs the processed text into a pre-trained large language model to generate an entity candidate pool; target entities are then promoted from the candidate pool based on word-vector semantic similarity, and the deep semantic relations among the entities are calculated. The method captures contextual semantic information in natural language through a word vector model, automatically identifies potential entities and their association relations in the text in combination with an entity promotion mechanism, and finally realizes efficient construction and dynamic updating of the knowledge graph. To achieve the above object, a first aspect of the invention provides a knowledge graph construction method based on word vector similarity and entity promotion, comprising the steps of: slicing the input text based on a SeqLab-paradigm model to obtain a set of semantic paragraph blocks; inputting the set of semantic paragraph blocks into a pre-trained large language model for entity recognition and extraction to obtain an entity list; inputting the semantic paragraph blocks and the entity list into a semantic coding model pre-trained on a domain training set, extracting the global semantic vector, local semantic vector and type vector of each entity, and concatenating them to obtain the word vector representation of the corresponding entity; calculating the cosine similarity between the word vector representation of each entity and the existing entity vectors in the knowledge graph, and adding the entity to a candidate entity set if the cosine similarity exceeds a preset threshold; and calculating the context consistency score of each candidate entity in the candidate entity set within the input text and, if the score exceeds the preset context consistency threshold, promoting the candidate to a best entity and adding it to the best-entity set.
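The two-stage screening described above, a cosine-similarity threshold followed by a context-consistency threshold, can be sketched as follows. This is an illustrative reading, not the patented implementation: the thresholds, the `promote` name and the assumption that context scores are precomputed per candidate are all hypothetical.

```python
import numpy as np

def promote(entity_vec, kg_vecs, context_scores, sim_th=0.8, ctx_th=0.6):
    """Two-stage entity promotion sketch.

    kg_vecs        : {entity_id: vector} for entities already in the graph
    context_scores : {entity_id: precomputed context-consistency score}
    Stage 1: cosine similarity above sim_th  -> candidate entity set.
    Stage 2: context score above ctx_th      -> promoted (best) entity set.
    """
    e = entity_vec / np.linalg.norm(entity_vec)
    candidates = [eid for eid, v in kg_vecs.items()
                  if float(e @ (v / np.linalg.norm(v))) > sim_th]
    return [eid for eid in candidates if context_scores.get(eid, 0.0) > ctx_th]
```

Note that an entity with a high context score but low vector similarity never reaches stage 2, which is why the similarity threshold acts as a cheap pre-filter before the more expensive consistency scoring.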