CN-122021632-A - Universal entity extraction method, medium and equipment

Abstract

The invention relates to the technical field of entity extraction, and in particular to a general entity extraction method, medium and device. Corresponding semantic attributes are added to the name of each initial entity, converting an isolated entity name into a structured refined name of the form "attribute + name" and providing richer semantic information for the entity. For each initial entity that fails semantic uniqueness verification, the corresponding context is expanded and iteratively updated to supplement more semantic details, so that the newly generated refined entity name is more semantically distinctive; this resolves semantic conflicts caused by insufficient information, ensures that the finally output target entity names are strictly semantically unique, and thoroughly resolves the inherent defects of the initial entities. A corresponding target entity extraction model is constructed and trained for each entity name set, so that each model focuses on only one class of entities with specific semantics, greatly improving entity extraction for the specific semantic attributes.

Inventors

Dai Shaohang
Zheng Meizan
Yang Can

Assignees

江阴云深科技有限公司

Dates

Publication Date: 20260512
Application Date: 20260126

Claims (10)

1. A general entity extraction method, characterized in that the method comprises the following steps: S1, extracting the context text of each initial entity from an original text according to the context range corresponding to each initial entity extracted from the original text, wherein the initial entities suffer from entity name homogenization or entity semantic ambiguity; S2, adding corresponding semantic attributes to the name of each initial entity according to the context text corresponding to that initial entity, to obtain a refined entity name corresponding to each initial entity; S3, performing semantic uniqueness verification on all the refined entity names, expanding the context range corresponding to each initial entity that fails the verification, and updating the corresponding refined entity name according to the context text corresponding to the expanded context range, until the refined entity names corresponding to all the initial entities pass the verification, to obtain a semantically unique target entity name corresponding to each initial entity, wherein the target entity names are free of entity name homogenization and entity semantic ambiguity; S4, grouping the target entity names according to the semantic attributes corresponding to all the target entity names to obtain a plurality of entity name sets; S5, constructing and training a corresponding initial entity extraction model for each entity name set to obtain a target entity extraction model corresponding to each entity name set, wherein the target entity extraction model is used for extracting entities with specific semantic attributes from text.
2. The general entity extraction method according to claim 1, wherein S1 comprises the steps of: S11, determining a text interception rule corresponding to each initial entity according to the position coordinates of each initial entity in the original text and the structural characteristics of the original text, wherein the structural characteristics comprise paragraph boundary marks and punctuation marks; S12, obtaining the context range corresponding to each initial entity according to the text interception rule; S13, intercepting and extracting the context text corresponding to each initial entity from the original text according to the context range.
3. The general entity extraction method according to claim 2, wherein obtaining the context range corresponding to each initial entity according to the text interception rule includes at least one of the following acquisition modes: a first acquisition mode, in which, for any initial entity, if the current initial entity is located within a sentence, the sentence containing the current initial entity, together with the M sentences preceding and the M sentences following that sentence, is taken as the context range corresponding to the current initial entity, wherein M is an integer greater than 0; a second acquisition mode, in which, for any initial entity, if the current initial entity is located at the beginning or the end of a paragraph, the paragraph containing the current initial entity is taken as the context range corresponding to the current initial entity; a third acquisition mode, in which, for any initial entity, if the original text contains a domain identifier related to the current initial entity, the sentence containing the domain identifier is taken as a core sentence and the sentence group range is initialized to the core sentence; taking the core sentence as the starting point, a first adjacent sentence is selected in each of the preceding and following directions of the core sentence, and a first semantic similarity between each adjacent sentence and the core sentence is calculated; if the first semantic similarity is greater than or equal to a first preset similarity threshold, the corresponding adjacent sentence is included in the sentence group range and taken as a new starting point, the next adjacent sentence is selected in the same preceding or following direction, and the semantic similarity calculation and inclusion judgment are repeated until the first semantic similarity is smaller than the first preset similarity threshold, whereupon expansion in that direction stops; all sentences included in the sentence group range after expansion in the preceding and following directions are integrated to form a sentence group containing the domain identifier, and that sentence group is taken as the context range corresponding to the current initial entity.
4. The general entity extraction method according to claim 3, wherein S3 comprises the steps of: S31, converting each refined entity name together with its corresponding context text into a semantic vector; S32, calculating a second semantic similarity between the semantic vectors corresponding to any two refined entity names; S33, if the second semantic similarity is greater than or equal to a second preset similarity threshold, determining that the verification is not passed, and marking the two corresponding refined entity names as entities to be optimized; S34, expanding the context range of each entity to be optimized, and returning to step S2 based on the expanded context text to obtain a new refined entity name corresponding to each entity to be optimized; S35, if the second semantic similarity between every pair of refined entity names is lower than the second preset similarity threshold, determining that the refined entity names corresponding to all the initial entities pass the verification, and taking the refined entity name of each initial entity at the time the verification is passed as the corresponding target entity name; otherwise, returning to step S31.
5. The general entity extraction method according to claim 4, wherein expanding the context range of the entity to be optimized comprises at least one of the following expansion modes: a first expansion mode, in which, for any initial entity, if the current initial entity is located within a sentence, the N sentences immediately preceding and the N sentences immediately following the context range corresponding to the current initial entity are included in a new context range corresponding to the current initial entity, wherein N is an integer greater than 0; a second expansion mode, in which, for any initial entity, if the current initial entity is located at the beginning or the end of a paragraph, the P paragraphs immediately preceding and the P paragraphs immediately following the context range corresponding to the current initial entity are included in a new context range corresponding to the current initial entity, wherein P is an integer greater than 0; a third expansion mode, in which, for any initial entity, if the original text contains a domain identifier related to the current initial entity, a preset value is subtracted from the first preset similarity threshold to obtain a new first preset similarity threshold, and the inclusion judgment is performed according to the new first preset similarity threshold to obtain a new context range corresponding to the current initial entity.
6. The general entity extraction method according to claim 1, wherein S5 comprises the steps of: S51, for any entity name set, collecting domain scenario text data corresponding to the current entity name set, with the target entity names in the current entity name set as the labeling types; S52, performing entity labeling on the domain scenario text data, wherein the labeling content includes the number of texts in the domain scenario text data corresponding to each target entity name in the current entity name set; S53, for any target entity name in the current entity name set, if the number of texts corresponding to the current target entity name is lower than a preset sample threshold, generating supplementary texts containing the current target entity name through a preset large model, wherein the semantic style of the supplementary texts is consistent with the domain scenario text data, and the sum of the number of texts corresponding to the current target entity name and the number of supplementary texts is not lower than the preset sample threshold; S54, combining the domain scenario text data with the supplementary texts corresponding to all the target entity names to form a dedicated training data set corresponding to the current entity name set; S55, training the corresponding initial entity extraction model according to the dedicated training data set corresponding to each entity name set to obtain the target entity extraction model corresponding to each entity name set.
7. The general entity extraction method according to claim 6, wherein S55 comprises the steps of: S551, for any entity name set, dividing the dedicated training data set corresponding to the current entity name set into a training set and a verification set according to a preset proportion; S552, training the initial entity extraction model corresponding to the current entity name set on the training set for a preset number of training rounds, with entity extraction accuracy, recall and/or F1 value as evaluation indexes; S553, after the preset number of training rounds is completed, evaluating the model performance on the verification set; S554, if, over a preset number of consecutive rounds, the relative improvement in the model performance value is lower than a preset proportion threshold and the model performance value is higher than a preset performance threshold, stopping training to obtain the target entity extraction model corresponding to the current entity name set; otherwise, repeating step S552.
8. The general entity extraction method according to claim 1, wherein S2 comprises the steps of: S21, extracting the role attribute, domain attribute and scene attribute corresponding to each initial entity according to the context text corresponding to each initial entity; S22, splicing the non-empty semantic attribute values with a predefined connector according to a preset priority rule to form an attribute combination string corresponding to each initial entity, wherein the priority rule is that the role attribute has higher priority than the scene attribute, the scene attribute has higher priority than the domain attribute, and a missing higher-priority attribute is not filled in by a lower-priority attribute; S23, adding the attribute combination string corresponding to each initial entity to the name of the corresponding initial entity as a prefix or suffix, to form the refined entity name corresponding to each initial entity.
9. A non-transitory computer readable storage medium having at least one instruction or at least one program stored therein, wherein the at least one instruction or the at least one program is loaded and executed by a processor to implement the general entity extraction method of any one of claims 1-8.
10. An electronic device comprising a processor and the non-transitory computer readable storage medium of claim 9.
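
The attribute splicing recited in claim 8 (S22 and S23) can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name `refine_name`, the default `-` connector, the `as_prefix` flag and the sample attribute values are all hypothetical, and attribute extraction itself (S21) is assumed to have already produced a dictionary keyed by `role`, `scene` and `domain`.

```python
from typing import Optional

# Priority order stated in claim 8: role > scene > domain.
PRIORITY = ("role", "scene", "domain")


def refine_name(name: str, attrs: dict[str, Optional[str]],
                connector: str = "-", as_prefix: bool = True) -> str:
    """Splice non-empty attribute values in priority order and attach them to the name.

    Hypothetical helper: the connector and prefix/suffix choice are assumptions.
    """
    values = [attrs[key].strip() for key in PRIORITY
              if attrs.get(key) and attrs[key].strip()]
    if not values:
        return name  # no attribute was extracted: keep the initial name unchanged
    combo = connector.join(values)
    return f"{combo}{connector}{name}" if as_prefix else f"{name}{connector}{combo}"


# Illustrative values only; they do not come from the patent.
print(refine_name("name", {"role": "insured", "scene": "claim report", "domain": "insurance"}))
# insured-claim report-insurance-name
print(refine_name("name", {"role": None, "scene": "loan application", "domain": "banking"}))
# loan application-banking-name
```

Under this reading of the priority rule, empty values are simply skipped: a missing role is not replaced by the scene or domain value, and the remaining attributes keep their relative order.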

Description

Universal entity extraction method, medium and equipment

Technical Field

The present invention relates to the technical field of entity extraction, and in particular to a general entity extraction method, medium, and device.

Background

In the field of natural language processing, entity extraction is a core task of information extraction. It aims to identify and classify named entities, such as person names, place names, and organization names, in unstructured text. Conventional entity extraction methods, including rule-based, statistical machine learning, and deep learning methods, typically rely on a predefined, flat set of entity types. However, the prior art suffers from an inherent defect: entity name homogenization and semantic ambiguity. Specifically, existing methods classify entities only by their literal names and cannot effectively distinguish the deeper semantic differences of the same name in different contexts. For example, in cross-domain business text, the generic type "name" may refer to different categories of people in different reports. An existing entity extraction model classifies the names of all these different semantic roles uniformly as PER (person), so the extraction result is literally correct but loses key business semantic information, and cannot meet the precise requirements of downstream business systems, such as case analysis and risk management and control, for fine-grained entity semantics. Therefore, how to solve entity name homogenization and semantic ambiguity in entity extraction tasks is an urgent problem.

Disclosure of Invention

To address the above technical problems, the technical solution adopted by the invention is a general entity extraction method comprising the following steps:
S1, extracting the context text of each initial entity from an original text according to the context range corresponding to each initial entity extracted from the original text, wherein the initial entities suffer from entity name homogenization or entity semantic ambiguity.
S2, adding corresponding semantic attributes to the name of each initial entity according to the context text corresponding to that initial entity, to obtain a refined entity name corresponding to each initial entity.
S3, performing semantic uniqueness verification on all the refined entity names, expanding the context range corresponding to each initial entity that fails the verification, and updating the corresponding refined entity name according to the context text corresponding to the expanded context range, until the refined entity names corresponding to all the initial entities pass the verification, to obtain a semantically unique target entity name corresponding to each initial entity, wherein the target entity names are free of entity name homogenization and entity semantic ambiguity.
S4, grouping the target entity names according to the semantic attributes corresponding to all the target entity names to obtain a plurality of entity name sets.
S5, constructing and training a corresponding initial entity extraction model for each entity name set to obtain a target entity extraction model corresponding to each entity name set, wherein the target entity extraction model is used for extracting entities with specific semantic attributes from text.
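
The verify-and-expand loop of step S3 can be sketched in Python under several stated assumptions. The bag-of-words `embed` function is only a toy stand-in for whatever semantic encoder an implementation would use; the `refine` callback, the `max_rounds` safety bound, the 0.8 threshold and the canned sample data are all hypothetical and do not come from the patent.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real sentence encoder (assumption)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def find_conflicts(items: dict[str, str], threshold: float) -> set[str]:
    """Return ids whose 'refined name + context' vector is too close to another one."""
    vectors = {eid: embed(text) for eid, text in items.items()}
    ids = sorted(vectors)
    conflicts: set[str] = set()
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                conflicts.update((a, b))  # both names are marked for optimization
    return conflicts


def verify_until_unique(entity_ids, refine, threshold=0.8, max_rounds=5):
    """refine(entity_id, level) -> (refined_name, context_text) at a given expansion level."""
    levels = {eid: 0 for eid in entity_ids}
    for _ in range(max_rounds):  # max_rounds is an added safety bound, not in the source
        refined = {eid: refine(eid, levels[eid]) for eid in entity_ids}
        items = {eid: f"{name} {ctx}" for eid, (name, ctx) in refined.items()}
        conflicts = find_conflicts(items, threshold)
        if not conflicts:
            return {eid: name for eid, (name, _) in refined.items()}
        for eid in conflicts:
            levels[eid] += 1  # widen the context only for entities that failed
    raise RuntimeError("no unique naming found within max_rounds")


# Canned, purely illustrative refinement results per expansion level.
CANNED = {
    "e1": [("name", "the name was recorded"),
           ("insured-name", "the insured signed the policy form")],
    "e2": [("name", "the name was recorded"),
           ("driver-name", "the driver reported a collision yesterday")],
}


def refine(eid, level):
    options = CANNED[eid]
    return options[min(level, len(options) - 1)]


print(verify_until_unique(["e1", "e2"], refine))
# {'e1': 'insured-name', 'e2': 'driver-name'}
```

In the first round both entities share the same name and context, so they conflict and their contexts are widened; in the second round the refined names and contexts diverge and the loop returns them as target entity names.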

The invention also provides a non-transitory computer readable storage medium in which at least one instruction or at least one program is stored, the at least one instruction or the at least one program being loaded and executed by a processor to implement the general entity extraction method described above.

The invention also provides an electronic device comprising a processor and the non-transitory computer readable storage medium described above.

The invention has the following advantages. By extracting semantic attributes from the context text and adding the corresponding semantic attributes to the name of each initial entity, the method converts isolated entity names into structured refined names of the form "attribute + name" and provides richer semantic information for the entities, thereby providing a semantic basis for solving entity name homogenization and semantic ambiguity. By expanding and iteratively updating the context corresponding to each initial entity that fails semantic uniqueness verification, more semantic details are supplemented for those entities, so that the newly generated refined entity names have stronger semantic distinctiveness, solve the problem of semantic conflict caused by insufficient information, ensure that the finally output target entity names have strict semantic uniqueness, thoroughly solve the inherent