CN-121981243-A - Method and system for extracting knowledge from text

Abstract

The application provides a method and a system for extracting knowledge from text, and relates to the technical field of knowledge graphs. The method comprises: coarsely extracting a target text through a knowledge extraction model to obtain candidate entities and their confidences; constructing an entity co-occurrence matrix from the candidate entities in the target text, wherein the element in the i-th row and j-th column of the matrix represents the co-occurrence frequency of the i-th candidate entity and the j-th candidate entity in the target text; finely extracting the target text through the knowledge extraction model, according to the entity co-occurrence matrix and the candidate entities in the target text, to obtain the candidate relations in the target text and the confidence of each candidate relation; determining each candidate relation whose confidence is greater than a confidence threshold as a relation extracted from the target text; and determining a plurality of entities extracted from the target text according to the head entity and the tail entity corresponding to each such candidate relation. The accuracy of knowledge extraction is thereby improved.

Inventors

YANG PEIYING
PENG BAOYUN
SUN XIAO
GUO PENGFEI
NIE XIAONING
ZHU LIHUA

Assignees

北京大数据先进技术研究院

Dates

Publication Date: 20260505
Application Date: 20260121

Claims (9)

1. A method for knowledge extraction of text, the method comprising: coarsely extracting a target text through a knowledge extraction model to obtain each candidate entity in the target text and a confidence of each candidate entity; constructing an entity co-occurrence matrix corresponding to the target text according to the candidate entities in the target text, wherein the element in the i-th row and j-th column of the entity co-occurrence matrix represents the co-occurrence frequency of the i-th candidate entity and the j-th candidate entity in the target text; finely extracting the target text through the knowledge extraction model, according to the entity co-occurrence matrix corresponding to the target text and the candidate entities in the target text, to obtain each candidate relation in the target text and a confidence of each candidate relation; and determining each candidate relation whose confidence is greater than a confidence threshold as a relation extracted from the target text, and determining a plurality of entities extracted from the target text according to the head entity and the tail entity corresponding to each candidate relation whose confidence is greater than the confidence threshold.
2. The method for knowledge extraction of text according to claim 1, further comprising: determining all positions of each candidate entity in the target text; when the number of characters between a position of the i-th candidate entity and a position of the j-th candidate entity is smaller than a first number of characters, determining that the i-th candidate entity and the j-th candidate entity co-occur once in the target text, so as to obtain the number of times the i-th candidate entity and the j-th candidate entity co-occur in the target text; determining an original co-occurrence frequency of the i-th candidate entity and the j-th candidate entity in the target text according to the number of co-occurrences and the total number of possible co-occurrences of the i-th candidate entity and the j-th candidate entity in the target text; when the original co-occurrence frequency of the i-th candidate entity and the j-th candidate entity in the target text is greater than a co-occurrence threshold, increasing the original co-occurrence frequency based on a first coefficient to obtain the co-occurrence frequency of the i-th candidate entity and the j-th candidate entity in the target text; and when the original co-occurrence frequency of the i-th candidate entity and the j-th candidate entity in the target text is not greater than the co-occurrence threshold, determining the original co-occurrence frequency as the co-occurrence frequency of the i-th candidate entity and the j-th candidate entity in the target text.
3. The method for knowledge extraction of text according to claim 1, wherein finely extracting the target text through the knowledge extraction model, according to the entity co-occurrence matrix corresponding to the target text and the candidate entities in the target text, to obtain each candidate relation in the target text and the confidence of each candidate relation comprises: performing relation extraction on the target text through the knowledge extraction model to obtain each candidate relation in the target text and an original confidence of each candidate relation; when the co-occurrence frequency, in the target text, of the head entity and the tail entity corresponding to a candidate relation is smaller than a first co-occurrence frequency, reducing the original confidence of the candidate relation based on a second coefficient to obtain the confidence of the candidate relation; when the co-occurrence frequency, in the target text, of the head entity and the tail entity corresponding to a candidate relation is greater than a second co-occurrence frequency, increasing the original confidence of the candidate relation based on a third coefficient to obtain the confidence of the candidate relation; and when the co-occurrence frequency, in the target text, of the head entity and the tail entity corresponding to a candidate relation is between the first co-occurrence frequency and the second co-occurrence frequency, determining the original confidence of the candidate relation as the confidence of the candidate relation.
4. The method for knowledge extraction of text according to claim 1, wherein determining a plurality of entities extracted from the target text according to the head entity and the tail entity corresponding to each candidate relation whose confidence is greater than the confidence threshold comprises: performing semantic equivalence detection on the head entities and tail entities corresponding to the candidate relations whose confidence is greater than the confidence threshold; and retaining only one entity in each group of semantically equivalent entities, to obtain a plurality of entities that are extracted from the target text and have no semantic redundancy; wherein the semantic equivalence detection comprises: taking the head entities and tail entities corresponding to the candidate relations whose confidence is greater than the confidence threshold as an entity set, taking one entity in the entity set as a first entity and each remaining entity in the entity set as a second entity, determining a vector representation of the first entity and a vector representation of the second entity, and determining the vector representation similarity of the first entity and the second entity; determining a context of the first entity according to the position of its first occurrence in the original text, determining a context of the second entity according to the position of its first occurrence in the original text, and determining the context similarity of the first entity and the second entity; determining a fusion score of the second entity according to the vector representation similarity and the context similarity of the first entity and the second entity; and determining the second entity with the highest fusion score as the entity semantically equivalent to the first entity.
5. The method for knowledge extraction of text according to claim 1, wherein determining a plurality of entities extracted from the target text according to the head entity and the tail entity corresponding to each candidate relation whose confidence is greater than the confidence threshold further comprises: performing content repetition detection on the head entities and tail entities corresponding to the candidate relations whose confidence is greater than the confidence threshold; and retaining only one entity in each group of entities with repeated content, to obtain a plurality of entities that are extracted from the target text and have no repeated content; wherein the content repetition detection comprises: taking the head entities and tail entities corresponding to the candidate relations whose confidence is greater than the confidence threshold as an entity set, and taking each entity in the entity set as a node; determining the similarity between a first entity and a second entity in the entity set according to multi-dimensional information of the first entity and the second entity, wherein the multi-dimensional information comprises names, types, aliases and descriptions; adding an edge between a first node corresponding to the first entity and a second node corresponding to the second entity when the similarity between the first entity and the second entity is greater than a similarity threshold, so as to obtain a similarity graph corresponding to the entity set; and extracting a plurality of connected subgraphs from the similarity graph, wherein the entities corresponding to the nodes in each connected subgraph form a group of entities with repeated content; and wherein retaining only one entity in a group of entities with repeated content comprises: determining the entity with the highest confidence in the group as a standard entity and retaining it.
6. The method for knowledge extraction of text according to claim 1, further comprising: after the 1st knowledge extraction is completed, performing a first reflection on the knowledge extraction model used in the 1st knowledge extraction, to obtain a knowledge extraction model for the 2nd knowledge extraction; in the case that a knowledge extraction model for the t-th knowledge extraction has been obtained through reflection, after the t-th knowledge extraction is completed, determining the average confidence of the t-th knowledge extraction result, comparing the t-th knowledge extraction result with the (t-1)-th knowledge extraction result, and determining the improvement amplitude of the t-th knowledge extraction result, wherein t is an integer greater than 1; stopping reflection, and determining the knowledge extraction model used for the t-th knowledge extraction as the knowledge extraction model for every subsequent knowledge extraction, when the improvement amplitude of the t-th knowledge extraction result is smaller than a first improvement amplitude threshold and the average confidence of the t-th knowledge extraction result is greater than a first confidence threshold; performing one reflection when the average confidence of the t-th knowledge extraction result is smaller than a second confidence threshold and the current total number of reflections has not reached a maximum number of reflections, to obtain a knowledge extraction model for the (t+1)-th knowledge extraction; and performing one reflection when the improvement amplitude of the t-th knowledge extraction result is greater than a second improvement amplitude threshold and the current total number of reflections has not reached the maximum number of reflections, to obtain a knowledge extraction model for the (t+1)-th knowledge extraction.
7. The method according to claim 6, wherein determining the average confidence of the t-th knowledge extraction result comprises: determining the confidence of each entity extracted in the t-th extraction and the confidence of each relation extracted in the t-th extraction; and determining the average confidence of the t-th knowledge extraction result according to the confidences of the entities and relations extracted in the t-th extraction; and wherein comparing the t-th knowledge extraction result with the (t-1)-th knowledge extraction result to determine the improvement amplitude of the t-th knowledge extraction result comprises: determining a quantity improvement value of the t-th knowledge extraction according to the difference between the total number of entities and relations extracted in the t-th extraction and the total number of entities and relations extracted in the (t-1)-th extraction; when the average confidence of the t-th knowledge extraction result is greater than the average confidence of the (t-1)-th knowledge extraction result, determining the difference between the two average confidences as a confidence improvement value of the t-th knowledge extraction; evaluating the richness of the entities and relations extracted in the t-th extraction to obtain the richness of the t-th knowledge extraction, and evaluating the richness of the entities and relations extracted in the (t-1)-th extraction to obtain the richness of the (t-1)-th knowledge extraction; when the richness of the t-th knowledge extraction result is greater than the richness of the (t-1)-th knowledge extraction result, determining the difference between the two richness values as a richness improvement value of the t-th knowledge extraction; and fusing the quantity improvement value, the confidence improvement value and the richness improvement value of the t-th knowledge extraction to obtain the improvement amplitude of the t-th knowledge extraction result.
8. The method for knowledge extraction of text according to claim 5, wherein the similarity threshold is determined according to the following steps: increasing an original similarity threshold based on a type adjustment amount when the type of the first entity and the type of the second entity both belong to a target type; reducing the original similarity threshold based on a first information density adjustment amount when the information density of the context of the first entity and the information density of the context of the second entity are both greater than a first information density threshold, and increasing the original similarity threshold based on a second information density adjustment amount when the information density of the context of the first entity and the information density of the context of the second entity are both less than a second information density threshold; and determining, according to the similarity threshold obtained by adjusting the original similarity threshold, whether to add an edge between the first node corresponding to the first entity and the second node corresponding to the second entity.
9. A system for knowledge extraction of text, the system comprising: a coarse extraction module, configured to coarsely extract a target text through a knowledge extraction model to obtain each candidate entity in the target text and a confidence of each candidate entity; a matrix construction module, configured to construct an entity co-occurrence matrix corresponding to the target text according to the candidate entities in the target text, wherein the element in the i-th row and j-th column of the entity co-occurrence matrix represents the co-occurrence frequency of the i-th candidate entity and the j-th candidate entity in the target text; a fine extraction module, configured to finely extract the target text through the knowledge extraction model, according to the entity co-occurrence matrix corresponding to the target text and the candidate entities in the target text, to obtain each candidate relation in the target text and a confidence of each candidate relation; and an entity and relation determining module, configured to determine each candidate relation whose confidence is greater than a confidence threshold as a relation extracted from the target text, and to determine a plurality of entities extracted from the target text according to the head entity and the tail entity corresponding to each candidate relation whose confidence is greater than the confidence threshold.
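
The content repetition detection of claim 5 can be illustrated with a minimal Python sketch. This is not the applicant's implementation: the `Entity` dataclass, the `similarity` function (a plain average of per-field string ratios computed with `difflib`) and the threshold value 0.8 are assumptions made for the example, since the claims do not specify how the multi-dimensional similarity is computed. Connected subgraphs are found here with a union-find structure, which is one of several equivalent ways to do it.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from itertools import combinations


@dataclass
class Entity:
    # Hypothetical record holding the "multi-dimensional information" of claim 5.
    name: str
    type: str
    confidence: float
    aliases: list = field(default_factory=list)
    description: str = ""


def _ratio(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similarity(a: Entity, b: Entity) -> float:
    """Assumed similarity: mean over name, type, alias and description scores."""
    name_sim = _ratio(a.name, b.name)
    type_sim = 1.0 if a.type == b.type else 0.0
    # Best match between any surface form (name or alias) of the two entities.
    forms_a = [a.name, *a.aliases]
    forms_b = [b.name, *b.aliases]
    alias_sim = max(_ratio(x, y) for x in forms_a for y in forms_b)
    desc_sim = _ratio(a.description, b.description)
    return (name_sim + type_sim + alias_sim + desc_sim) / 4


def deduplicate(entities, threshold=0.8):
    """Keep the highest-confidence entity of each connected subgraph."""
    parent = list(range(len(entities)))  # union-find forest over node indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Adding an edge between two nodes is equivalent to merging their components.
    for i, j in combinations(range(len(entities)), 2):
        if similarity(entities[i], entities[j]) > threshold:
            parent[find(i)] = find(j)

    # Each component is one group of entities with repeated content.
    groups = {}
    for idx, ent in enumerate(entities):
        groups.setdefault(find(idx), []).append(ent)
    return [max(group, key=lambda e: e.confidence) for group in groups.values()]
```

Because similarity is only tested pairwise, two entities that are not directly similar can still land in the same group through a chain of intermediate entities; that follows from the connected-subgraph formulation in the claim.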

Description

Method and system for extracting knowledge from text

Technical Field

The application relates to the technical field of knowledge graphs, and in particular to a method and a system for extracting knowledge from text.

Background

In the current era of information explosion, how to efficiently and automatically extract structured knowledge from massive unstructured text data and construct large-scale knowledge graphs has become a core research topic in the field of artificial intelligence. Building a knowledge graph generally involves three key stages: data annotation, knowledge extraction and knowledge fusion. Knowledge extraction, as a key technology for constructing a knowledge graph, is responsible for identifying and extracting knowledge elements such as entities, relations and events from unstructured text, and provides structured knowledge input for the subsequent knowledge fusion stage.
The development of knowledge extraction has long been constrained by the limitations of traditional methods. Methods based on rules or on early pre-trained models often show low accuracy, limited generalization capability and difficulty in capturing deep semantic relations when processing large-scale, highly complex, unstructured text.

Disclosure of Invention

In view of the above, the present application provides a method and a system for knowledge extraction of text, aiming to solve or partially solve the problems described in the background.
In a first aspect, the present application provides a method for knowledge extraction of text, the method comprising: coarsely extracting a target text through a knowledge extraction model to obtain each candidate entity in the target text and a confidence of each candidate entity; constructing an entity co-occurrence matrix corresponding to the target text according to the candidate entities in the target text, wherein the element in the i-th row and j-th column of the entity co-occurrence matrix represents the co-occurrence frequency of the i-th candidate entity and the j-th candidate entity in the target text; finely extracting the target text through the knowledge extraction model, according to the entity co-occurrence matrix corresponding to the target text and the candidate entities in the target text, to obtain each candidate relation in the target text and a confidence of each candidate relation; and determining each candidate relation whose confidence is greater than a confidence threshold as a relation extracted from the target text, and determining a plurality of entities extracted from the target text according to the head entity and the tail entity corresponding to each candidate relation whose confidence is greater than the confidence threshold.
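
The matrix construction and final selection steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the method as implemented by the applicant: the distance between two mentions is approximated by the difference of their start offsets, the "total number of possible co-occurrences" is taken to be the number of mention pairs, the first coefficient is applied as a multiplier capped at 1.0, and all numeric defaults (`window`, `co_threshold`, `boost`, `conf_threshold`) are hypothetical. The knowledge extraction model itself is not shown; candidate entities and relations are passed in as plain data.

```python
import re
from itertools import product


def find_positions(text, entity):
    """Return the start offset of every occurrence of `entity` in `text`."""
    return [m.start() for m in re.finditer(re.escape(entity), text)]


def cooccurrence_matrix(text, entities, window=50, co_threshold=0.5, boost=1.2):
    """Build the entity co-occurrence matrix; element [i][j] is a frequency in [0, 1]."""
    positions = [find_positions(text, e) for e in entities]
    n = len(entities)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j or not positions[i] or not positions[j]:
                continue
            # Every (mention of i, mention of j) pair is a possible co-occurrence.
            pairs = list(product(positions[i], positions[j]))
            # A pair counts as one co-occurrence when the mentions are close enough.
            hits = sum(1 for p, q in pairs if abs(p - q) < window)
            raw = hits / len(pairs)  # original co-occurrence frequency
            # Frequencies above the co-occurrence threshold are increased.
            matrix[i][j] = min(1.0, raw * boost) if raw > co_threshold else raw
    return matrix


def select_relations(candidates, conf_threshold=0.6):
    """Keep relations above the threshold and derive entities from their heads and tails.

    `candidates` is an iterable of (head, relation, tail, confidence) tuples.
    """
    relations = [c for c in candidates if c[3] > conf_threshold]
    entities = sorted({e for head, _, tail, _ in relations for e in (head, tail)})
    return relations, entities
```

Note that the final entity list is derived from the surviving relations, not from the coarse extraction directly, so a candidate entity that takes part in no accepted relation is dropped.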
In a second aspect, the present application provides a system for knowledge extraction of text, the system comprising: a coarse extraction module, configured to coarsely extract a target text through a knowledge extraction model to obtain each candidate entity in the target text and a confidence of each candidate entity; a matrix construction module, configured to construct an entity co-occurrence matrix corresponding to the target text according to the candidate entities in the target text, wherein the element in the i-th row and j-th column of the entity co-occurrence matrix represents the co-occurrence frequency of the i-th candidate entity and the j-th candidate entity in the target text; a fine extraction module, configured to finely extract the target text through the knowledge extraction model, according to the entity co-occurrence matrix corresponding to the target text and the candidate entities in the target text, to obtain each candidate relation in the target text and a confidence of each candidate relation; and an entity and relation determining module, configured to determine each candidate relation whose confidence is greater than a confidence threshold as a relation extracted from the target text, and to determine a plurality of entities extracted from the target text according to the head entity and the tail entity corresponding to each candidate relation whose confidence is greater than the confidence threshold.
The method for knowledge extraction of text provided by the present application has the following advantages: Entity co-occurrence information is used as prior knowledge to effectively filter out low-quality and possibly erroneous relations. By boosting the relation confidence of entity pairs with a high co-occurrence frequency, the accuracy of relation extraction is improved.
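
How co-occurrence can act as a prior on relation confidence may be sketched as below, following the three cases of claim 3. The sketch is hedged: the claims only say the original confidence is reduced or increased "based on" a coefficient, so treating the second and third coefficients as multipliers (`penalty`, `boost`), capping the result at 1.0, and every numeric default are assumptions for illustration. The function names are hypothetical.

```python
def adjust_confidence(raw_conf, co_freq, low=0.2, high=0.6, penalty=0.8, boost=1.2):
    """Adjust a relation's original confidence using the co-occurrence frequency."""
    if co_freq < low:
        # Head and tail rarely co-occur: lower the confidence.
        return raw_conf * penalty
    if co_freq > high:
        # Head and tail co-occur often: raise the confidence, capped at 1.0.
        return min(1.0, raw_conf * boost)
    # In between: keep the original confidence unchanged.
    return raw_conf


def fine_filter(candidates, entities, matrix, conf_threshold=0.6):
    """Apply the co-occurrence prior, then keep relations above the threshold.

    `candidates` holds (head, relation, tail, original_confidence) tuples and
    `matrix[i][j]` is the co-occurrence frequency of entities[i] and entities[j].
    """
    index = {entity: k for k, entity in enumerate(entities)}
    kept = []
    for head, relation, tail, raw_conf in candidates:
        co_freq = matrix[index[head]][index[tail]]
        conf = adjust_confidence(raw_conf, co_freq)
        if conf > conf_threshold:
            kept.append((head, relation, tail, conf))
    return kept
```

With these assumed settings, a borderline relation between two entities that frequently appear together can be promoted above the threshold, while a relation between entities that almost never appear together can be demoted below it.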
The coarse extraction-fine extraction mode ensures the reca