
CN-115879453-B - Entity identification and relation extraction method integrating vocabulary boundary and semantic information

CN 115879453 B

Abstract

The invention relates to the technical field of natural language processing, and in particular to an entity recognition and relation extraction method that fuses vocabulary boundaries and semantic information. The method comprises: constructing the sample input and labels for the preprocessing language model; feeding the sample into a pretrained BERT model and taking the feature vectors output by its last layer; constructing the task feature vector of the entity recognition task and computing its loss; constructing the task feature vector of the relation extraction task and computing its loss; adding the two losses according to coefficients to obtain the total loss; and combining the entity recognition and relation extraction results to obtain the final triples. The method alleviates the error accumulation, entity redundancy, and missing interaction of pipeline deep-learning methods, as well as the cost of explicitly extracting all possible span arrangements in existing span-based nested-entity extraction.

Inventors

  • Zhou Huanyue
  • Xu Shoukun
  • Yuan Yang
  • Shi Lin
  • Zhang Huajun
  • Zhuang Jia

Assignees

  • Changzhou University (常州大学)

Dates

Publication Date
2026-05-05
Application Date
2022-11-17

Claims (5)

  1. A method for entity recognition and relation extraction fusing vocabulary boundary and semantic information, characterized by comprising the following steps:
Step 1, constructing the sample input and labels for the preprocessing language model;
Step 2, inputting the sample into a pretrained BERT model and outputting the feature vectors of the last layer through the BERT model;
Step 3, constructing the task feature vector H_ner of the entity recognition task, which is obtained by splicing the max-pooled sentence feature, the boundary token features of the predicted segment, and the boundary features spliced at the tail of the sample, sending H_ner to the NER classifier to obtain the classification result, and computing the loss L_ner; step 3 specifically comprises:
Step 31, max-pooling the word vector features to obtain the sentence feature h_max, computed as h_max = MaxPooling(h_1, h_2, ..., h_n);
Step 32, splicing part of the feature vectors to obtain the task feature vector H_ner, computed as H_ner = cat(h_max, h_head, h_tail, s_head, s_tail), wherein cat denotes the concatenate operation, h_head represents the head-position feature of the predicted entity segment, h_tail represents the tail-position feature of the predicted entity segment, s_head is the head-position feature of the segment to be predicted added at the tail of the sample, and s_tail is the tail-position feature of the segment to be predicted added at the tail of the sample; m candidate segments are recognized at once;
Step 33, sending H_ner to the NER classifier to obtain the entity type, where the prediction y_ner is computed as y_ner = softmax(W_ner * H_ner + b_ner), wherein W_ner and b_ner are trainable parameters of the entity extraction and relation extraction task model, y_ner is an entity type, and y_ner ∈ E with E the set of entity types;
Step 34, computing the cross-entropy loss of the NER portion, L_ner = -Σ_i y_i log(ŷ_i), wherein y_i indicates whether the i-th sample belongs to the current category;
Step 4, constructing the task feature vector H_re related to the relation extraction task, which is obtained by splicing the sentence vector, the boundary features of the subject segment to be predicted, and the boundary features of the object segment to be predicted, sending H_re to the RE classifier to obtain the classification result, and computing the loss L_re;
Step 5, adding the loss L_ner and the loss L_re according to coefficients to obtain the total loss L; step 4 specifically comprises:
Step 41, splicing part of the feature vectors to obtain the task feature vector H_re, computed as H_re = cat(h_cls, h_a_head, h_a_tail, h_b_head, h_b_tail), wherein cat denotes the concatenate operation, h_cls is the [CLS] sentence feature vector output by the pretrained language model, h_a_head is the head-position feature of the subject segment a, h_a_tail is the tail-position feature of the subject segment a, h_b_head is the head-position feature of the candidate object segment b, and h_b_tail is the tail-position feature of the candidate object segment b;
Step 42, sending H_re into the RE classifier to obtain the relation type between the object segment b and the subject segment a, where the prediction y_re is computed as y_re = softmax(W_re * H_re + b_re), wherein W_re and b_re are trainable parameters of the model, y_re is a relation type, and y_re ∈ R with R the set of relation types;
Step 43, computing the cross-entropy loss of the RE portion, L_re = -Σ_i y_i log(ŷ_i), wherein the segmented text sentence contains n tokens in total and y_i indicates whether the i-th sample belongs to the current category;
Step 6, combining the entity recognition and relation extraction results to obtain the triples.
  2. The method for entity recognition and relation extraction fusing vocabulary boundary and semantic information according to claim 1, wherein step 1 specifically comprises: Step 11, performing word segmentation on the text sentence and adding a [CLS] symbol to the segmented sentence to obtain the sequence {[CLS], T1, T2, T3, ..., Ti, ..., Tn}, wherein Ti denotes a token obtained after word segmentation of the text sentence; Step 12, combining m segments to be predicted at the tail of the text sentence, represented as {[CLS]; T1, T2, T3, ..., Tn; S1, S1, ..., S1; S1, S2, S3, ..., Sm}, wherein {S1, S1, ..., S1} represents the head-position information of the added segments to be predicted and {S1, S2, S3, ..., Sm} represents the tail-position information of the added segments to be predicted; this is repeated until all segment position information S1 to Sn has been traversed, and the position information of the added segments to be predicted shares position information with the corresponding tokens in the text, yielding Z spliced segments to be predicted in total; Step 13, constructing entity labels and relation labels, wherein an entity label consists of entity boundary information and entity-type label information, and a relation label consists of the boundary information of a subject-object entity pair and the relation-type label.
  3. The method for entity recognition and relation extraction fusing vocabulary boundary and semantic information according to claim 2, wherein the Z spliced segments to be predicted are computed as Z = Σ_{l=1}^{L} (N - l + 1), wherein L represents the maximum length of a segment to be predicted.
  4. The method of claim 1, wherein the feature vectors in step 2 comprise word vectors and a sentence vector.
  5. The method for entity recognition and relation extraction fusing vocabulary boundary and semantic information according to claim 1, wherein the total loss L in step 5 is computed as L = α·L_ner + β·L_re, wherein α and β are dynamic weights and L_ner and L_re are the losses of the entity recognition and relation extraction tasks.
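The count Z of spliced candidate segments referenced in claims 2 and 3 can be sanity-checked with a short sketch. The closed form below (all contiguous spans of length 1 to L in an N-token sentence) is an assumption inferred from the surrounding description, since the patent's own formula is not reproduced in this extract; the function name is illustrative.

```python
def num_candidate_spans(n, max_len):
    """Count contiguous spans of length 1..max_len in an n-token sentence.

    Assumed closed form: Z = sum over l = 1..L of (n - l + 1), i.e. the
    standard span-enumeration count; not the patent's literal formula.
    """
    return sum(n - l + 1 for l in range(1, min(max_len, n) + 1))

# 5 tokens, spans up to length 2: 5 spans of length 1 + 4 of length 2 = 9
print(num_candidate_spans(5, 2))
```

For L equal to N this reduces to N(N + 1)/2, the full enumeration that span-based nested-entity methods must score.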

Description

Entity identification and relation extraction method integrating vocabulary boundary and semantic information

Technical Field

The invention relates to the technical field of natural language processing, and in particular to an entity recognition and relation extraction method that fuses vocabulary boundaries and semantic information.

Background

Entity recognition and relation extraction are important tasks in natural language processing, responsible for recognizing entities in natural language text and extracting the semantic relations between those entities. In pipeline approaches based on deep learning, the different entities in a sentence are identified first, and the identified entities are then paired for relation-type classification, with the two stages completely separated; such methods suffer from error accumulation, entity redundancy, and missing interaction, problems that joint extraction can effectively alleviate. For nested entities, span-based approaches explicitly extract all possible span arrangements; since each selected span is independent, span-level features can be extracted directly to handle the nesting problem.
Disclosure of the Invention

Aiming at the defects of existing algorithms, the invention observes that certain constraints exist between the outputs of the entity recognition model and the relation extraction model, and adopts a joint extraction approach to recognize entities and the semantic relations between entity pairs. It fuses vocabulary boundaries with related semantic information and makes full use of the positional relations among the different words in a sentence. Before input to the pretrained language model, the sample sentence is combined with the entity segments to be predicted, so that multiple candidate segments can be recognized in a single pass, improving computational efficiency; the segment-classification formulation also handles nested entities effectively. Using this entity recognition and relation extraction method, entities and their types can be recognized effectively and the semantic relations between entities revealed accurately, providing effective assistance for constructing knowledge graphs, intelligent question-answering systems, and the like.
The technical scheme adopted by the invention is an entity recognition and relation extraction method fusing vocabulary boundaries and semantic information, comprising the following steps.

Step 1, constructing the sample input and labels for the preprocessing language model. Specifically:

Step 11, performing word segmentation on the text sentence and adding a [CLS] symbol to the segmented sentence to obtain the sequence {[CLS], T1, T2, T3, ..., Ti, ..., Tn}, wherein Ti denotes a token obtained after word segmentation of the text sentence;

Step 12, combining m segments to be predicted at the tail of the text sentence, represented as {[CLS]; T1, T2, T3, ..., Tn; S1, S1, ..., S1; S1, S2, S3, ..., Sm}, wherein {S1, S1, ..., S1} represents the head-position information of the added segments to be predicted and {S1, S2, S3, ..., Sm} represents their tail-position information; this is repeated until all segment position information S1 to Sn has been traversed, and the position information of the added segments to be predicted shares position information with the corresponding tokens in the text, yielding Z spliced segments to be predicted in total. The number Z of spliced segments to be predicted is computed as Z = Σ_{l=1}^{L} (N - l + 1), wherein L represents the maximum length of a segment to be predicted and N represents the total number of tokens in the segmented text sentence.
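A minimal sketch of the Step-12 sample layout, assuming the markers behave as described above (the function and marker names are illustrative, not the patent's): the token sequence is followed by repeated copies of the shared head-position marker and then by the tail-position markers, so one forward pass can score several candidate segments at once.

```python
def build_sample(tokens, start, num_spans):
    """Lay out one sample per Step 12 (illustrative sketch).

    [CLS], the tokens, num_spans copies of the shared head marker for
    position `start`, then the num_spans tail markers - one per candidate
    segment of length 1..num_spans starting at `start`.
    """
    seq = ["[CLS]"] + list(tokens)
    seq += [f"S{start}"] * num_spans                    # shared head positions
    seq += [f"S{start + i}" for i in range(num_spans)]  # tail positions
    return seq

print(build_sample(["T1", "T2", "T3"], 1, 2))
# ['[CLS]', 'T1', 'T2', 'T3', 'S1', 'S1', 'S1', 'S2']
```

In the patent's scheme the appended S markers share position information with the corresponding in-text tokens, which is what lets the boundary features at the sample tail stand in for segment boundaries.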
Step 13, constructing entity labels and relation labels, wherein an entity label consists of entity boundary information and entity-type label information, and a relation label consists of the boundary information of a subject-object entity pair and the relation-type label.

Step 2, inputting the sample into a pretrained BERT model and outputting the feature vectors of the last layer through the BERT model; the feature vectors include word vectors and a sentence vector.

Step 3, constructing the task feature vector of the entity recognition task, which is obtained by splicing the max-pooled sentence feature, the boundary token features of the predicted segment, and the boundary features spliced at the tail of the sample, sending it to the NER classifier to obtain the classification result, and computing the loss. Specifically: Step 31, max-pooling the word vector features to obtain the sentence-level feature.
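The pooling and splicing of Steps 31 to 34 can be sketched as follows. The dimensions, symbol names, and the linear-plus-softmax classifier are illustrative assumptions, since this extract omits the patent's exact formulas; only the overall shape (max-pool, concatenate five features, classify, cross-entropy) follows the claim text.

```python
import numpy as np

rng = np.random.default_rng(0)

def ner_features(token_vecs, head, tail, pred_head_vec, pred_tail_vec):
    """Sketch of Steps 31-32: max-pool the token vectors over the sentence,
    then concatenate the pooled feature with the predicted segment's head and
    tail token features and the boundary features appended at the sample tail.
    """
    h_max = token_vecs.max(axis=0)  # Step 31: element-wise max pooling
    return np.concatenate(          # Step 32: cat(h_max, h_head, h_tail, s_head, s_tail)
        [h_max, token_vecs[head], token_vecs[tail], pred_head_vec, pred_tail_vec]
    )

def softmax_cross_entropy(logits, gold):
    """Steps 33-34: softmax over entity types, cross-entropy for the gold type."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return -np.log(p[gold])

d, n, num_types = 4, 3, 3            # illustrative sizes
toks = rng.normal(size=(n, d))       # stand-in for BERT last-layer word vectors
feat = ner_features(toks, 0, 1, rng.normal(size=d), rng.normal(size=d))
W = rng.normal(size=(5 * d, num_types))  # stand-in trainable classifier weights
loss = softmax_cross_entropy(feat @ W, gold=0)
```

The RE head of Step 4 has the same structure, with the [CLS] sentence vector plus the subject and object boundary features in place of the pooled feature and segment boundaries.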