CN-122021631-A - Entity disambiguation method, device, equipment and storage medium based on rule screening and large model assistance


Abstract

The invention discloses an entity disambiguation method, apparatus, device and storage medium based on rule screening and large model assistance. Because the throughput of a large model is limited in practice, a rule screening mechanism is introduced: candidate entities are first filtered by a lightweight algorithm, and the large model then makes a refined judgment on the screening result. This improves overall disambiguation efficiency and accuracy, and addresses the insufficient efficiency and accuracy of entity identification in mixed-language scenarios. The method achieves consistent cross-language semantic processing of names, pinyin, translated names and institution fields, and provides a stable, extensible basis for academic talent profiling and intelligent recommendation.

Inventors

WANG YUKAI
LIN TIANSHU
SUN YUTAO

Assignees

Dalian University of Technology (大连理工大学)

Dates

Publication Date: 20260512
Application Date: 20260109

Claims (10)

1. An entity disambiguation method based on rule screening and large model assistance, the method comprising: extracting, from text or bibliographic items, non-disambiguated entity names together with the entity attributes associated with them as original entities, a plurality of original entities forming an original entity set; standardizing the original entity set to obtain a standardized entity set; performing similarity calculation on any two standardized entities in the standardized entity set, and classifying the standardized entities according to the similarity and a preset similarity threshold to obtain a suspected ambiguity set and a first non-ambiguity set; extracting entity commonality features from the standardized entities in the suspected ambiguity set, and summarizing them into prior constraints for entity disambiguation; encoding the prior constraints into respective prompt templates, and adjusting the generation control parameters of the inference stage to form a plurality of prompt engineering models; selecting an optimal large model from the plurality of prompt engineering models based on a verification set, wherein the verification set is a subset of the suspected ambiguity set; and performing final disambiguation on the suspected ambiguity set with the optimal large model to obtain a non-ambiguity set, wherein the non-ambiguity set comprises disambiguated entity names.
2. The entity disambiguation method based on rule screening and large model assistance of claim 1, wherein said extracting, from text or bibliographic items, non-disambiguated entity names together with the entity attributes associated with them as original entities, a plurality of original entities forming an original entity set $E$, comprises: obtaining text information and extracting scholar information from it, wherein the text information comprises data of a scientific research project system such as a research paper database, project application résumés, patent bibliographic information and research awards; the scholar information comprises non-disambiguated entity names and the entity attributes associated with them; the non-disambiguated entity names comprise scholar names; and the entity attributes comprise the original text of information such as affiliated institution, research direction and geographic location; the non-disambiguated entity names and entity attributes form original entities, and a plurality of original entities form the original entity set $E = \{e_1, e_2, \dots, e_n\}$, wherein each original entity $e_i$ comprises a non-disambiguated entity name and entity attributes, the entity attributes comprising an affiliated institution, a research direction, a geographic location and the like, and the original entity is represented as $e_i = (\mathrm{name}_i, \mathrm{org}_i, \mathrm{field}_i, \mathrm{loc}_i)$, wherein $\mathrm{name}_i$ is the non-disambiguated entity name of the $i$-th scholar, $\mathrm{org}_i$ is the institution to which the scholar belongs, $\mathrm{field}_i$ is the scholar's research direction, and $\mathrm{loc}_i$ is the scholar's geographic location.
3. The entity disambiguation method based on rule screening and large model assistance of claim 1, wherein said standardizing the original entity set $E$ to obtain a standardized entity set $E'$ comprises: applying uniform formatting to the non-disambiguated entity names and entity attributes in the original entity set $E$, including removing spaces, symbols and control characters, and unifying full-width and half-width characters, letter case, and traditional and simplified Chinese characters; and, for text that mixes Chinese, English and pinyin, using a Chinese-English-pinyin conversion tool to convert non-disambiguated entity names in Chinese form into pinyin form and to convert affiliated institutions, research directions and geographic locations in Chinese form into English, and cleaning non-disambiguated entity names in English or pinyin form by removing symbols and text unrelated to the name, thereby obtaining the standardized entity set $E' = \{e'_1, e'_2, \dots, e'_n\}$, wherein $E'$ is the standardized entity set and $e'_i$ is a standardized entity.
4. The entity disambiguation method based on rule screening and large model assistance of claim 3, wherein said performing similarity calculation on any two standardized entities in the standardized entity set $E'$, and classifying the standardized entities according to the similarity and a preset similarity threshold to obtain a suspected ambiguity set $A_s$ and a first non-ambiguity set $U_1$, comprises: for any two standardized entities $e'_i$ and $e'_j$ in the standardized entity set $E'$, calculating the composite similarity $S(e'_i, e'_j) = \lambda_1 S_{char} + \lambda_2 S_{stat} + \lambda_3 S_{sem} + \lambda_4 S_{pron} + \lambda_5 S_{struct}$, wherein: $S(e'_i, e'_j)$ is the composite similarity of the two standardized entities; $S_{char}$ is the character-level similarity, the value obtained by normalizing the edit distance; $S_{stat}$ is the statistical feature similarity, a text similarity calculated with BM25; $S_{sem}$ is the semantic vector similarity, the cosine similarity between embedding vectors generated by a cross-language semantic model; $S_{pron}$ is the normalized pronunciation similarity, based on the matching distance between pinyin or phonetic codes; $S_{struct}$ is the structured information consistency score, calculated from the institution, research direction, geographic location and the like; and $\lambda_1, \dots, \lambda_5$ are weighting coefficients satisfying $\sum_{k=1}^{5} \lambda_k = 1$; the standardized entities whose composite similarity is greater than or equal to the similarity threshold $\tau$ form the suspected ambiguity set $A_s$; and the standardized entities whose composite similarities are all less than the similarity threshold $\tau$ form the first non-ambiguity set $U_1$.
5. The entity disambiguation method based on rule screening and large model assistance of claim 1, wherein the prior constraints are defined over the following quantities: $C$, the entity commonality features, expressed as text or as a form; $f(\cdot)$, the prediction function of the large model; $y$, the true label, annotated manually or semi-automatically; $e'_i$ and $e'_j$, two standardized entities in the suspected ambiguity set $A_s$; and $N$, the total number of standardized entities in the suspected ambiguity set $A_s$; and the optimal large model is selected from the prompt engineering models as $M^* = \arg\max_{m \in \mathcal{M}} \mathrm{Eval}(m, D_{val})$, wherein $M^*$ is the optimal large model; $D_{val}$ is the verification set; $\mathrm{Eval}(\cdot)$ is the performance evaluation index on the verification set $D_{val}$; $m$ is a prompt engineering model; and $\mathcal{M}$ is the prompt engineering model set formed by the plurality of prompt engineering models.
6. The entity disambiguation method based on rule screening and large model assistance of claim 5, wherein said performing final disambiguation on the suspected ambiguity set $A_s$ with the optimal large model $M^*$ to obtain a non-ambiguity set, the non-ambiguity set comprising disambiguated entity names, comprises: $U_2 = M^*(A_s)$, wherein $U_2$ is the second non-ambiguity set obtained by disambiguating the suspected ambiguity set; $A_s$ is the suspected ambiguity set; the first non-ambiguity set $U_1$ and the second non-ambiguity set $U_2$ together make up the non-ambiguity set corresponding to the standardized entity set $E'$; and the non-ambiguity set comprises disambiguated entity names.
7. The entity disambiguation method based on rule screening and large model assistance of claim 4, wherein: the character-level similarity $S_{char}$ is expressed as $S_{char} = 1 - \frac{ED(e'_i, e'_j)}{\max(|e'_i|, |e'_j|)}$, wherein $ED(e'_i, e'_j)$ is the minimum edit distance between the two standardized entities, that is, the number of insertion, deletion and substitution operations required to convert $e'_i$ into $e'_j$; $|e'_i|$ and $|e'_j|$ are the string lengths of the two standardized entities; and $\max(\cdot)$ is the maximum operation; the statistical feature similarity $S_{stat}$ is computed from the two one-way relevance scores $R(e'_i \to e'_j)$ and $R(e'_j \to e'_i)$, wherein $T(e'_i)$ is the de-duplicated set of terms extracted from all text fields of entity $e'_i$ and matched against entity $e'_j$; $t$ is a word or token extracted from the entity; and the BM25 term score is the accumulated matching result of term $t$ in the entity, for example in its entity name and entity attributes; the semantic vector cosine similarity $S_{sem}$ is expressed as $S_{sem} = \frac{v_i \cdot v_j}{\|v_i\| \, \|v_j\|}$, wherein $v_i = \mathrm{Enc}(e'_i)$ and $v_j = \mathrm{Enc}(e'_j)$; $S_{sem}$ is the cosine similarity of the semantic vectors of the two standardized entities, with a value range of $[-1, 1]$; $\mathrm{Enc}(\cdot)$ is the cross-language semantic encoding model; and $v_i$, $v_j$ are the semantic embedding vectors generated by the model; the normalized pronunciation similarity $S_{pron}$ is computed from the matching distance between the phonetic sequences $p_i = \mathrm{Code}(\mathrm{name}_i)$ and $p_j = \mathrm{Code}(\mathrm{name}_j)$, wherein $\mathrm{Code}(\cdot)$ denotes the process of mapping a name into a phonetic sequence; and the structured information consistency score is expressed as $S_{struct} = \sum_{k} w_k \, f_k(a_k^{i}, a_k^{j})$, wherein $w_k$ is the importance weight of entity attribute $a_k$, satisfying $\sum_k w_k = 1$; $a_k$ denotes an entity attribute; and $f_k(\cdot)$ is the consistency function of entity attribute $a_k$, which measures the similarity of $e'_i$ and $e'_j$ in that attribute and has a value range of $[0, 1]$.
8. An entity disambiguation apparatus based on rule screening and large model assistance, the apparatus comprising: an extraction module, configured to extract, from text or bibliographic items, non-disambiguated entity names together with the entity attributes associated with them as original entities, a plurality of original entities forming an original entity set; a standardization module, configured to standardize the original entity set to obtain a standardized entity set; a classification module, configured to perform similarity calculation on any two standardized entities in the standardized entity set, and to classify the standardized entities according to the similarity and a preset similarity threshold to obtain a suspected ambiguity set and a first non-ambiguity set; a feature extraction module, configured to extract entity commonality features from the standardized entities in the suspected ambiguity set and summarize them into prior constraints for entity disambiguation; a generation module, configured to encode the prior constraints into respective prompt templates and adjust the generation control parameters of the inference stage to form a plurality of prompt engineering models; a selection module, configured to select an optimal large model from the plurality of prompt engineering models based on a verification set, wherein the verification set is a subset of the suspected ambiguity set; and a disambiguation module, configured to perform final disambiguation on the suspected ambiguity set with the optimal large model to obtain a non-ambiguity set, wherein the non-ambiguity set comprises disambiguated entity names.
9. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1 to 7.
10. A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method of any one of claims 1 to 7.
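The standardization and rule-screening steps recited in claims 3, 4 and 7 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: only the character-level and structured-consistency terms of the composite similarity are included (the BM25, semantic and pronunciation terms are omitted), attribute consistency is reduced to exact matching, and the field names (`name`, `org`), the weights (0.7 / 0.3) and the threshold (0.8) are hypothetical values chosen for the example. Pinyin conversion and traditional/simplified unification would need an external tool and are not shown.

```python
import unicodedata
from itertools import combinations


def normalize_text(value: str) -> str:
    """Unify full/half width and case, then keep only letters and digits."""
    folded = unicodedata.normalize("NFKC", value).casefold()
    return "".join(ch for ch in folded if unicodedata.category(ch)[0] in "LN")


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def char_similarity(a: str, b: str) -> float:
    """Edit distance normalized by the longer string, turned into a similarity."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def attribute_consistency(e1: dict, e2: dict, attr_weights: dict) -> float:
    """Weighted agreement over attributes (assumption: exact match scores 1, else 0)."""
    return sum(w for k, w in attr_weights.items() if e1.get(k) and e1.get(k) == e2.get(k))


def composite_similarity(e1: dict, e2: dict, weights: dict, attr_weights: dict) -> float:
    """Weighted sum of the component similarities included in this sketch."""
    parts = {
        "char": char_similarity(e1["name"], e2["name"]),
        "struct": attribute_consistency(e1, e2, attr_weights),
    }
    return sum(weights[k] * parts[k] for k in parts)


def screen(entities: list, threshold: float, weights: dict, attr_weights: dict):
    """Split entity indices into a suspected-ambiguity set and a first non-ambiguity set."""
    suspected = set()
    for i, j in combinations(range(len(entities)), 2):
        if composite_similarity(entities[i], entities[j], weights, attr_weights) >= threshold:
            suspected.update((i, j))
    unambiguous = [i for i in range(len(entities)) if i not in suspected]
    return sorted(suspected), unambiguous


raw = [
    {"name": "Ｗａｎｇ　Ｙｕ-Kai", "org": "DUT"},
    {"name": "wang yukai", "org": "dut"},
    {"name": "Lin Tianshu", "org": "DUT"},
]
entities = [{k: normalize_text(v) for k, v in e.items()} for e in raw]
suspected, unambiguous = screen(entities, 0.8, {"char": 0.7, "struct": 0.3}, {"org": 1.0})
print(suspected, unambiguous)
```

In this sketch only the pairs that reach the threshold are passed on to the large model, which is the point of the screening stage: the expensive model sees the suspected ambiguity set rather than every pair.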

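The selection of an optimal prompt engineering model on a verification set, as recited in claims 1 and 5, can be sketched as an arg-max over candidate configurations. The sketch below is a hedged illustration only: `PromptConfig`, `accuracy`, `select_best` and `stub_predict` are hypothetical names, accuracy is assumed as the performance evaluation index, and the stub predictor stands in for a real large-model call, which the source does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PromptConfig:
    """One candidate prompt engineering model: a template plus a generation parameter."""
    template: str
    temperature: float


Example = tuple[tuple[dict, dict], bool]  # ((entity_a, entity_b), same_entity_label)
Predictor = Callable[[PromptConfig, dict, dict], bool]


def accuracy(config: PromptConfig, predict: Predictor, validation: Sequence[Example]) -> float:
    """Fraction of validation pairs on which the prediction equals the label."""
    correct = sum(predict(config, a, b) == label for (a, b), label in validation)
    return correct / len(validation)


def select_best(configs: Sequence[PromptConfig], predict: Predictor,
                validation: Sequence[Example]) -> PromptConfig:
    """Return the configuration with the highest validation accuracy."""
    return max(configs, key=lambda c: accuracy(c, predict, validation))


def stub_predict(config: PromptConfig, a: dict, b: dict) -> bool:
    """Placeholder for a large-model call (assumption): compares names, and also
    institutions when the template mentions 'org'."""
    same = a["name"] == b["name"]
    if "org" in config.template:
        same = same and a["org"] == b["org"]
    return same


validation = [
    (({"name": "wangyukai", "org": "dut"}, {"name": "wangyukai", "org": "dut"}), True),
    (({"name": "wangyukai", "org": "dut"}, {"name": "wangyukai", "org": "pku"}), False),
]
configs = [
    PromptConfig("Do these records share a name?", 0.0),
    PromptConfig("Do these records share a name and org?", 0.0),
]
best = select_best(configs, stub_predict, validation)
print(best.template)
```

Here the template that encodes the institution constraint scores higher on the verification pairs and is selected, mirroring how a prior constraint written into the prompt can change which candidate model is chosen.
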
Description

Entity disambiguation method, device, equipment and storage medium based on rule screening and large model assistance

Technical Field

The present invention relates to the field of entity disambiguation, and in particular to an entity disambiguation method, apparatus, device and storage medium based on rule screening and large model assistance.

Background

Named entity disambiguation is an important research direction in semantic understanding for artificial intelligence, and is widely applied in knowledge graph construction, semantic retrieval, recommendation systems, information extraction and similar tasks. Its goal is to accurately identify the real-world object that an entity name in a text refers to when several entities with the same or similar names exist, thereby improving the accuracy and consistency of information identification and knowledge reasoning. Named entity ambiguity shows up mainly in two ways: variety of reference and ambiguity of reference. Variety of reference means that the same entity may have multiple expressions, such as a full name, an abbreviation, an English name, pinyin, a mixed form or other aliases; ambiguity of reference means that different entities may use the same or similar names. Both are common in real corpora. In scientific research, enterprise and cross-language scenarios in particular, the mixing of Chinese, English and pinyin is more pronounced, which makes entities harder to identify and distinguish. Existing entity disambiguation methods rely on keyword matching, string similarity or rule-based templates, and identify entities by comparing candidate entity names with the input text.
When applied to scenarios in which mixed Chinese and English, pinyin and English abbreviations coexist, such methods struggle with diverse expressions and inconsistent language conversion, and they make insufficient use of external information about entities (such as institution, field, time and region), so disambiguation accuracy is low. With the rapid development of artificial intelligence techniques such as large-scale pre-trained language models (Large Language Model, LLM), such models show significant advantages in cross-language feature representation and unified semantic modeling. Through their multilingual encoding capability, large models can recognize semantic correspondences in text that mixes Chinese, English, pinyin and other forms, and can distinguish entities more accurately by drawing on auxiliary information such as institution name, research field and geographic location. However, large models still consume substantial computing resources and have limited throughput in the inference stage, so applying them directly to large-scale text processing is inefficient. In view of these problems, the prior art still lacks an entity name disambiguation method that combines the semantic recognition capability of large models with entity-related information while balancing recognition accuracy and processing efficiency.

Disclosure of Invention

In view of this, it is necessary to provide an entity disambiguation method, device and storage medium based on rule screening and large model assistance.
An entity disambiguation method based on rule screening and large model assistance, the method comprising: extracting, from text or bibliographic items, non-disambiguated entity names together with the entity attributes associated with them as original entities; standardizing the original entity set to obtain a standardized entity set; performing similarity calculation on any two standardized entities in the standardized entity set, and classifying the standardized entities according to the similarity and a preset similarity threshold to obtain a suspected ambiguity set and a first non-ambiguity set; extracting entity commonality features from the standardized entities in the suspected ambiguity set, and summarizing them into prior constraints for entity disambiguation; encoding the prior constraints into respective prompt templates, and adjusting the generation control parameters of the inference stage to form a plurality of prompt engineering models; selecting an optimal large model from the plurality of prompt engineering models based on a verification set, wherein the verification set is a subset of the suspected ambiguity set; and performing final disambiguation on the suspected ambiguity set with the optimal large model to obtain a non-ambiguity set, wherein the non-ambiguity set comprises disambiguated entity names. In one embodiment, extracting the non-disambiguated entity names and their associated entity attributes from text or bibliographic items as original entities, the plurality of original e