CN-121998062-A - Semi-open type multi-language entity relation extraction method and system without marking data

Abstract

The invention belongs to the technical field of natural language information extraction and relates to a semi-open multilingual entity relation extraction method and system that require no labeled data. The method comprises: extracting the semantic connection relations among the words in a sentence with a dependency analysis tool; building a semantic undirected graph from those relations; performing a graph search of specified depth on the semantic undirected graph from the head entity and from the tail entity; taking the intersection of the search results as a relation word set; and obtaining the relation between the entities from the relation word set. Most existing relation extraction methods rely on a predetermined relation type system, which is complex and time-consuming. Because the invention builds on existing, mature natural language processing tools, it avoids a complex model training process, greatly reduces time cost, and is applicable to English, Chinese and other languages.

Inventors

WANG WENJIA
WEI YU
GUO JIANING
ZHANG TAO

Assignees

天津天士力数智中医药科技有限公司

Dates

Publication Date: 20260508
Application Date: 20241104

Claims (10)

1. A semi-open multilingual entity relation extraction method requiring no labeled data, characterized by comprising the following steps: extracting semantic connection relations among the words in a sentence using a dependency analysis tool, and building a semantic undirected graph from the semantic connection relations; performing a graph search of specified depth on the semantic undirected graph from the head entity and from the tail entity, and taking the intersection of the search results as a relation word set; and obtaining the relation between the entities from the relation word set.
2. The method of claim 1, wherein extracting semantic connection relations among the words in a sentence using a dependency analysis tool comprises: performing word segmentation, part-of-speech tagging, named entity recognition and dependency analysis on each single sentence to obtain single sentences annotated with word segmentation, part-of-speech and dependency information; defining a sentence as consisting of a set of head entities, a set of tail entities, predicate verbs, and any other words; and using the predicate verbs as relation words that connect head entities and tail entities to form triples.
3. The method of claim 2, wherein the part-of-speech tags are defined as a set F = {verb, no, adv, conj, auxiliary}, in which verb denotes verbs, conj denotes conjunctions, adv denotes adverbs, and auxiliary denotes auxiliary words; and the dependency relations are defined as a set L = {SBV, VOB, ATT, RAD, COO}, in which SBV denotes the subject-verb relation, VOB denotes the verb-object relation, ATT denotes the attributive (modifier) relation, RAD denotes the right-adjunct relation, and COO denotes the coordinate relation.
4. The method of claim 1, wherein the establishing a semantic undirected graph based on semantic connection relationships comprises: A semantic undirected graph G is built for each sentence, wherein each word is used as a node to form a node set W, the part-of-speech labeling result of each word forms the characteristic of each node v, the dependency analysis result L forms an edge set E, each edge represents the semantic connection relation among the words, and the distance among the nodes reflects the semantic association strength among the nodes.
5. The method of claim 1, wherein performing a graph search of specified depth on the semantic undirected graph from the head entity and the tail entity comprises: starting from the head entity node h, performing a depth-first search whose depth is controlled by a preset parameter N, to obtain a set R_H of words strongly associated with the head entity; starting from the tail entity node t, performing a depth-first search whose depth is controlled by a preset parameter M, to obtain a set R_T of words strongly associated with the tail entity; and taking the intersection of R_H and R_T to obtain a relation word set R whose members have a strong semantic association with both the head entity and the tail entity.
6. The method of claim 1, wherein obtaining the relationships between the entities according to the set of relationship words comprises recording path lengths between the relationship words and the target entity in the graph searching process, calculating the confidence level of each word in the set of relationship words according to the path lengths after obtaining the set of relationship words, and selecting the word with the largest confidence level as the final relationship word result.
7. The method of claim 6, wherein the confidence level is calculated as: wherein n is the number of nodes in the undirected graph; r_degree represents the degree of the relation word node r; the degree centrality of the relation word node r measures the influence of the node in the graph, a larger value indicating greater influence; d(h, r) represents the node distance from the head entity node h to the relation word node r; d(t, r) represents the node distance from the tail entity node t to the relation word node r; and the smaller the node distance, the stronger the semantic association and the higher the confidence.
8. A semi-open multilingual entity relation extraction system requiring no labeled data, comprising: a semantic undirected graph construction module for extracting semantic connection relations among the words in a sentence using a dependency analysis tool and building a semantic undirected graph from the semantic connection relations; a relation word set construction module for performing a graph search of specified depth on the semantic undirected graph from the head entity and from the tail entity, and taking the intersection of the search results as a relation word set; and a relation extraction module for obtaining the relation between the entities from the relation word set.
9. A computer device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-7.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1-7.
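
The procedure recited in claims 4 to 7 can be illustrated with a minimal Python sketch. It is not the applicant's implementation, and several parts are assumptions made for illustration only: the function names (build_graph, reachable_within, rank_relation_words) are hypothetical; the dependency arcs are written by hand rather than produced by a real parser; a depth-bounded breadth-first traversal is used in place of the depth-first search named in claim 5, so that the recorded distances are shortest paths; and, because the confidence formula of claim 7 is not reproduced in this text, the score used here (degree centrality divided by the summed distances to the head and tail) is only a stand-in that follows the stated tendencies (higher centrality and shorter distance give higher confidence).

```python
from collections import deque


def build_graph(arcs):
    """Build an undirected adjacency map from (word, governor) dependency arcs."""
    graph = {}
    for a, b in arcs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def reachable_within(graph, start, max_depth):
    """Return {node: distance} for every node within max_depth edges of start.

    Assumption: breadth-first traversal, so each distance is a shortest path.
    """
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_depth:
            continue  # do not expand beyond the depth limit
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def rank_relation_words(graph, head, tail, n_depth, m_depth):
    """Intersect the two search results and rank candidates by an assumed score."""
    d_h = reachable_within(graph, head, n_depth)  # R_H with distances d(h, r)
    d_t = reachable_within(graph, tail, m_depth)  # R_T with distances d(t, r)
    candidates = (set(d_h) & set(d_t)) - {head, tail}
    n = len(graph)

    def score(word):
        # Degree centrality deg(r) / (n - 1); the combination with distance
        # below is an illustrative assumption, not the formula of claim 7.
        centrality = len(graph[word]) / (n - 1) if n > 1 else 0.0
        return centrality / (d_h[word] + d_t[word])

    return sorted(candidates, key=lambda w: (-score(w), w))


# Hand-written arcs for "Aspirin effectively treats headache" (hypothetical parse).
arcs = [("Aspirin", "treats"), ("headache", "treats"), ("effectively", "treats")]
graph = build_graph(arcs)
ranked = rank_relation_words(graph, "Aspirin", "headache", n_depth=2, m_depth=2)
print(ranked)
```

With both depths set to 2, the candidates are "treats" and "effectively"; "treats" ranks first because it is adjacent to both entities and has the highest degree. Setting both depths to 1 leaves "treats" as the only candidate, which shows how the parameters N and M trade recall against precision.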

Description

Semi-open type multi-language entity relation extraction method and system without marking data

Technical Field

The invention belongs to the technical field of natural language information extraction, and particularly relates to a semi-open multilingual entity relation extraction method and system that use dependency analysis and require no labeled data.

Background

Continual advances in network technology have made it easy for people to access vast amounts of text, such as news, books, and scientific literature. How to acquire the information a user cares about efficiently and accurately has become an important research topic. The information extraction (Information Extraction, IE) task arose in response to this problem. Its main objective is to extract specific event or fact information from natural language text, and to classify, extract and reconstruct massive content automatically. Such information typically includes entities, relations, and events.

Relation extraction (Relation Extraction, RE) is one of the key tasks of information extraction. It identifies and classifies semantic relations between entities, for example binary relations such as parent-child, spouse, and location. The task is widely applied in fields such as text mining, search engines, and knowledge graph construction.

Currently, with the continuous development of natural language processing technology, relation extraction methods based on large language models (Large Language Model, LLM) are becoming mainstream and show good performance. However, such methods are typical data-driven methods and need a large amount of initial data to get started, so relation extraction in a brand-new domain always faces the problem of missing data.
Therefore, it is necessary to construct a widely applicable relation extraction method that starts quickly and can be launched without labeled data.

1. Semi-open entity relation extraction

Relation extraction can be divided, according to whether entity categories and relation categories are restricted, into defined-domain, open, and semi-open relation extraction. Semi-open entity relation extraction lies between defined-domain and open relation extraction.

Defined-domain relation extraction means that the relation types and entity categories are defined in advance, and the algorithm performs relation extraction only within the given domain. This approach suits tasks in a specific field, such as drug-disease relation extraction from medical literature. Because the relation types and entity categories are predetermined, defined-domain relation extraction can extract relation information in a specific domain more accurately, but it cannot extract relations outside the definition.

Open relation extraction means that neither the relation types nor the entity categories are limited, and the algorithm automatically discovers various relations between different entities in the text. It generally uses unsupervised or semi-supervised learning and can adapt to relation extraction tasks across different domains and corpora. Its advantage is that it can discover new relation types and transfer between domains; its disadvantage is that, because there are so few restrictions, only part of the extracted results may be of interest to the user.

Semi-open entity relation extraction is a compromise between defined-domain and open relation extraction.
In semi-open relation extraction, the entity categories are usually defined in advance, but the relation types can be open, i.e. unknown relation types can be discovered automatically. This approach balances accuracy and adaptability in the relation extraction task. It suits cases where the various relations between specific entities in a specific domain must be extracted while some flexibility is retained to handle unknown relation types. At present, semi-open entity relation extraction is implemented mainly with deep learning and has received little research attention; as noted above, such methods face the problem of missing data.

2. Relation extraction method based on dependency analysis

Dependency analysis (Dependency Parsing, DP) is an important subtask of natural language processing that aims to identify the semantic dependencies between the words in a sentence. Relation extraction methods based on dependency analysis use these dependencies to extract association information between entities from text. Mature dependency analysis tools currently exist for both Chinese and English, including the LTP Language Technology Platform from Harbin Institute of Technology, the NLTK toolkit from the University of Pennsylvania, the CoreNLP toolkit