CN-113886571-B - Entity identification method, entity identification device, electronic equipment and computer readable storage medium


Abstract

The application provides an entity identification method, an entity identification device, electronic equipment, and a computer readable storage medium. The method comprises: obtaining at least one entity boundary word corresponding to a text sequence to be identified; obtaining at least one entity candidate region in the text sequence to be identified based on the at least one entity boundary word; and obtaining an entity identification result of the text sequence to be identified based on the entity candidate region. The steps of this scheme may be performed by an artificial intelligence model. Compared with the prior art, the scheme improves how well the entity candidate regions cover the entities in the text sequence to be identified without increasing the number of entity candidate regions, and reduces computational complexity.

Inventors

WANG HUADONG
CHEN TING

Assignees

Beijing Samsung Telecommunications Technology Research Co., Ltd. (北京三星通信技术研究有限公司)
Samsung Electronics Co., Ltd. (三星电子株式会社)

Dates

Publication Date: 2026-05-08
Application Date: 2021-06-04
Priority Date: 2020-07-01

Claims (19)

1. A method of entity identification, comprising: acquiring at least one entity boundary word corresponding to a text sequence; using a corresponding entity boundary word of the at least one entity boundary word as an anchor word, and determining an entity suggestion region corresponding to the text sequence based on the at least one entity boundary word; adjusting the boundary of the entity suggestion region based on the association relation between the entity suggestion region and each word in the text sequence; determining at least one entity candidate region in the text sequence based on the adjusted boundary of the entity suggestion region; and performing entity recognition on the text sequence based on the at least one entity candidate region to acquire at least one entity in the text sequence.
2. The method of claim 1, wherein the acquiring at least one entity boundary word corresponding to the text sequence comprises: using each word in the text sequence as an entity boundary word; or obtaining, based on the background representation vector of the words in the text sequence, the probability that each word in the text sequence is an entity boundary word, and determining the entity boundary words of the text sequence based on the probability.
3. The method of claim 1, wherein the determining the entity suggestion region corresponding to the text sequence comprises: determining that the entity suggestion region has at least one preset width relative to the anchor word.
4. The method of claim 1, wherein the adjusting the boundary of the entity suggestion region comprises: acquiring a corresponding combined vector based on the background representation vector of the word covered by the entity suggestion region and the background representation vector of the corresponding anchor word; obtaining the similarity between the background representation vector of the entity boundary word in the text sequence and the combined vector; and adjusting the boundary of the entity suggestion region based on the similarity.
5. The method of claim 4, further comprising: determining a corresponding entity candidate region based on the adjusted boundary of the entity suggestion region, wherein the start boundary word and the end boundary word of the entity candidate region are determined based on the adjusted boundary of the entity suggestion region.
6. The method of claim 4, wherein the obtaining the similarity between the background representation vector of the entity boundary word in the text sequence and the combined vector comprises: acquiring, in Euclidean space or hyperbolic space, the similarity between the background representation vector of the entity boundary word in the text sequence and the combined vector.
7. The method of claim 5, wherein the determining the corresponding entity candidate region comprises: determining, based on the similarity, a start boundary word of a corresponding entity candidate region from among at least one anchor word of an entity suggestion region in the text sequence and an entity boundary word located to the left of the anchor word, and determining an end boundary word of the corresponding entity candidate region from among at least one anchor word of the entity suggestion region and an entity boundary word located to the right of the anchor word in the text sequence; and determining the corresponding entity candidate region based on the start boundary word and the end boundary word.
8. The method of claim 4, wherein the acquiring the corresponding combined vector based on the background representation vector of the word covered by the entity suggestion region and the background representation vector of the corresponding anchor word comprises: taking the width of the entity suggestion region as the width of a convolution kernel, and performing convolution processing on the background representation vector of the word covered by the entity suggestion region to obtain a corresponding feature vector; and acquiring the corresponding combined vector based on the feature vector corresponding to the word covered by the entity suggestion region and the background representation vector of the corresponding anchor word.
9. The method of claim 5, wherein the determining the corresponding entity candidate region comprises: determining at least one start boundary word candidate and at least one end boundary word candidate for the anchor word of the entity suggestion region; determining a start boundary word of the entity suggestion region from the at least one start boundary word candidate, and determining an end boundary word of the entity suggestion region from the at least one end boundary word candidate; and determining the corresponding entity candidate region according to the obtained start boundary word and end boundary word.
10. The method of claim 9, wherein the determining at least one start boundary word candidate and at least one end boundary word candidate for the anchor word of the entity suggestion region comprises: determining the anchor word of the entity suggestion region and an entity boundary word located to the left of the anchor word as the at least one start boundary word candidate of the anchor word; and determining the anchor word of the entity suggestion region and an entity boundary word located to the right of the anchor word as the at least one end boundary word candidate of the anchor word.
11. The method of claim 9, wherein the determining a start boundary word of the entity suggestion region from the at least one start boundary word candidate and determining an end boundary word of the entity suggestion region from the at least one end boundary word candidate comprises: determining a first probability that each start boundary word candidate of the at least one start boundary word candidate is the start boundary word of the entity suggestion region, and a second probability that each end boundary word candidate of the at least one end boundary word candidate is the end boundary word of the entity suggestion region; and determining the start boundary word of the entity suggestion region based on the first probability, and determining the end boundary word of the entity suggestion region based on the second probability.
12. The method according to any one of claims 1-11, wherein the performing entity recognition on the text sequence based on the at least one entity candidate region to acquire at least one entity in the text sequence comprises: screening the at least one entity candidate region to obtain at least one screened entity candidate region; and determining the category of each screened entity candidate region to obtain an entity recognition result of the text sequence.
13. The method of claim 12, wherein the screening the at least one entity candidate region to obtain at least one screened entity candidate region comprises: acquiring a corresponding first classification feature vector based on a background representation vector of a word covered by the at least one entity candidate region; obtaining, based on the corresponding first classification feature vector, the probability that the entity candidate region belongs to an entity; and acquiring the at least one screened entity candidate region based on the probability that the entity candidate region belongs to an entity.
14. The method according to claim 12 or 13, wherein the determining the category of each screened entity candidate region to obtain the entity recognition result of the text sequence comprises: acquiring a corresponding second classification feature vector based on the background representation vectors of the start boundary word and the end boundary word corresponding to each screened entity candidate region; and performing category discrimination on the at least one screened entity candidate region based on the second classification feature vector to obtain the entity recognition result of the text sequence.
15. The method according to any one of claims 1-11, wherein the performing entity recognition on the text sequence based on the at least one entity candidate region to acquire at least one entity in the text sequence comprises: acquiring a corresponding third classification feature vector based on the background representation vectors of the start boundary word and the end boundary word corresponding to the at least one entity candidate region; and performing category discrimination on the at least one entity candidate region based on the third classification feature vector to obtain an entity recognition result of the text sequence.
16. The method according to claim 1 or 2, wherein the obtaining at least one entity candidate region in the text sequence based on the at least one entity boundary word comprises: acquiring, from the text sequence, a preset number of entity boundary words adjacent to the at least one entity boundary word; obtaining the similarity between the background representation vector of the entity boundary word and the background representation vector of each of the preset number of adjacent entity boundary words; and obtaining a corresponding entity candidate region based on the similarity.
17. The method of claim 16, wherein the obtaining the corresponding entity candidate region based on the similarity comprises: determining, based on the similarity, a start boundary word and an end boundary word of the corresponding entity candidate region from among the entity boundary word of the text sequence and the preset number of entity boundary words adjacent to it; and determining the corresponding entity candidate region based on the start boundary word and the end boundary word.
18. An electronic device comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the computer program to implement the method of any one of claims 1 to 17.
19. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of claims 1 to 17.
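
Claim 6 allows the similarity to be measured in Euclidean or hyperbolic space but does not give a formula. The following is a minimal sketch, assuming that similarity is taken as negated distance and that the hyperbolic case uses the Poincaré ball model; both choices, and the function names `euclidean_similarity` and `poincare_similarity`, are illustrative assumptions rather than anything stated in the claims.

```python
import math
from typing import Sequence


def euclidean_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Negated Euclidean distance: larger means more similar (assumed convention)."""
    return -math.dist(u, v)


def poincare_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Negated distance in the Poincare ball model (an assumed choice of model).

    Both points must lie strictly inside the unit ball.
    """
    norm_u = sum(c * c for c in u)
    norm_v = sum(c * c for c in v)
    if norm_u >= 1.0 or norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    squared_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    distance = math.acosh(1.0 + 2.0 * squared_diff / ((1.0 - norm_u) * (1.0 - norm_v)))
    return -distance
```

In this sketch the two functions are interchangeable, so a boundary-adjustment step could take either one as a parameter. Real background representation vectors would first have to be mapped into the unit ball before the hyperbolic variant applies.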

Description

Entity identification method, entity identification device, electronic equipment and computer readable storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular to an entity identification method, an entity identification device, an electronic device, and a computer readable storage medium.

Background

Entity recognition is mainly used to extract all candidate entities from a text sequence to be recognized and to determine their entity types. In nested entity recognition, entities in the text sequence to be recognized may be nested inside one another, and all candidate entities in the input text sequence must be recognized, not only the outermost ones. Traditional methods based on sequence labeling can assign only one label to each word, so the traditional entity recognition methods need to be optimized.

Disclosure of Invention

The application aims to solve at least one of the above technical defects. The technical solutions provided by the embodiments of the application are as follows.

In a first aspect, an embodiment of the present application provides an entity identification method, including: acquiring at least one entity boundary word corresponding to a text sequence to be identified; acquiring at least one entity candidate region in the text sequence to be identified based on the at least one entity boundary word; and acquiring an entity recognition result of the text sequence to be identified based on the entity candidate region.
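
To make the motivation concrete: a region-based result, of the kind the first aspect produces, can hold nested entities, whereas one label per word cannot. The following is a minimal sketch using the common BIO tagging convention as a stand-in for "sequence labeling"; the example phrase, the entity types, and the helper names `spans_to_bio` and `bio_to_spans` are illustrative assumptions and do not come from the application.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start index, end index inclusive, entity type)


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Encode spans as one BIO tag per token; a span is dropped if its tokens are taken."""
    tags = ["O"] * n_tokens
    # Longest spans first, so outer entities claim their tokens before inner ones.
    for start, end, label in sorted(spans, key=lambda s: s[0] - s[1]):
        if all(tags[i] == "O" for i in range(start, end + 1)):
            tags[start] = "B-" + label
            for i in range(start + 1, end + 1):
                tags[i] = "I-" + label
    return tags


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Decode BIO tags back into spans."""
    spans: List[Span] = []
    start, label = None, ""
    for i, tag in enumerate(tags + ["O"]):  # trailing sentinel flushes the last span
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i - 1, label))
            start = None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


tokens = ["Bank", "of", "China"]
# Nested entities as regions: the whole phrase, and "China" inside it.
nested = [(0, 2, "ORG"), (2, 2, "LOC")]
recovered = bio_to_spans(spans_to_bio(len(tokens), nested))
```

Here `recovered` contains only the outer region, because the token "China" can carry just one tag. A list of (start, end) regions has no such limit, which is why the method works with entity candidate regions.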
In an optional embodiment of the application, obtaining at least one entity boundary word corresponding to the text sequence to be identified includes: using each word in the text sequence to be identified as an entity boundary word; or obtaining, based on the background representation vectors of the words in the text sequence to be identified, the probability that each word is an entity boundary word, and determining the entity boundary words of the text sequence to be identified based on the probability.

In an optional embodiment of the application, obtaining at least one entity candidate region in the text sequence to be identified based on the at least one entity boundary word includes: obtaining, based on the entity boundary words, entity suggestion regions corresponding to the text sequence to be identified; and obtaining a corresponding entity candidate region based on each entity suggestion region.

In an optional embodiment of the application, obtaining the entity suggestion regions corresponding to the text sequence to be identified based on the entity boundary words includes: taking each entity boundary word as an anchor word and, based on at least one preset width, obtaining corresponding entity suggestion regions having the at least one preset width.

In an optional embodiment of the application, obtaining a corresponding entity candidate region based on the entity suggestion region includes: obtaining a corresponding combined vector based on the background representation vectors of the words covered by the entity suggestion region and the background representation vector of the corresponding anchor word; obtaining the similarity between the background representation vectors of the entity boundary words in the text sequence to be identified and the combined vector; and obtaining a corresponding entity candidate region based on the similarity.
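
The anchor-word and combined-vector steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's implementation: regions are assumed to be centered on the anchor word and clipped to the sequence, the combined vector is assumed to be the average of the anchor vector and the mean of the covered vectors, and similarity is assumed to be negated Euclidean distance. The names `suggestion_regions`, `combine` and `adjust_boundary` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Region = Tuple[int, int, int]  # (anchor index, left index, right index), inclusive


def suggestion_regions(n_words: int, anchors: Sequence[int],
                       widths: Sequence[int]) -> List[Region]:
    """One region per (anchor word, preset width), centered on the anchor (assumed)."""
    regions = []
    for anchor in anchors:
        for width in widths:
            left = max(0, anchor - (width - 1) // 2)
            right = min(n_words - 1, anchor + width // 2)
            regions.append((anchor, left, right))
    return regions


def combine(vectors: Sequence[Sequence[float]], region: Region) -> List[float]:
    """Average the anchor vector with the mean of the covered vectors (assumed rule)."""
    anchor, left, right = region
    covered = vectors[left:right + 1]
    mean = [sum(v[d] for v in covered) / len(covered) for d in range(len(vectors[anchor]))]
    return [(m + a) / 2 for m, a in zip(mean, vectors[anchor])]


def adjust_boundary(vectors: Sequence[Sequence[float]], boundary_words: Sequence[int],
                    region: Region) -> Tuple[int, int]:
    """Pick start and end boundary words by similarity to the combined vector.

    Start candidates are the anchor and boundary words to its left; end candidates
    are the anchor and boundary words to its right. The anchor is itself a boundary word.
    """
    anchor = region[0]
    combined = combine(vectors, region)

    def similarity(i: int) -> float:
        return -math.dist(vectors[i], combined)

    start = max((i for i in boundary_words if i <= anchor), key=similarity)
    end = max((i for i in boundary_words if i >= anchor), key=similarity)
    return start, end
```

Because start candidates lie at or left of the anchor and end candidates at or right of it, the adjusted region always satisfies start <= end, and its width is no longer tied to the preset widths.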
In an optional embodiment of the application, obtaining the similarity between the background representation vectors of the entity boundary words in the text sequence to be identified and the combined vector includes: obtaining, in Euclidean space or hyperbolic space, the similarity between the background representation vectors of the entity boundary words in the text sequence to be identified and the combined vector.

In an optional embodiment of the application, obtaining a corresponding entity candidate region based on the similarity includes: determining, based on the similarity, a start boundary word of the corresponding entity candidate region from among the anchor word of the entity suggestion region and the entity boundary words located to the left of the anchor word in the text sequence to be identified, and determining an end boundary word of the corresponding entity candidate region from among the anchor word of the entity suggestion region and the entity boundary words located to the right of the anchor word in the text sequence to be identified; and determining the corresponding entity candidate region based on the start boundary word and the end boundary word.

In an alternative embodiment of the present application, obtaining a corresponding combined vector based on a background representation vector of a word covered by an entity suggestion region and a backgroun