US-12619825-B2 - List-based entity name detection
Abstract
List-based entity name detection implementations are described that detect entity names in electronic textual documents. In one implementation, unknown entity names are detected. In another implementation, ambiguous entity names are detected and disambiguated. In yet another implementation, generic entity names are detected and associated with an applicable species entity name.
Inventors
- Robert J. Fox
Assignees
- HG Insights, Inc.
Dates
- Publication Date
- 20260505
- Application Date
- 20231212
Claims (12)
- 1. A system for disambiguating ambiguous entity names in documents, comprising: an ambiguous entity name detector comprising one or more computing devices, and an ambiguous entity name detection computer program having a plurality of sub-programs executable by said computing device or devices, wherein the sub-programs configure said computing device or devices to, access document data, said document data comprising at least one list structure that has entity names as elements and wherein one or more of the elements in the at least one list structure is an ambiguous entity name, identify each list structure in the document data, for each list structure, separate the words in the list structure under consideration into candidate entity names, compare each candidate entity name to a known entity name listing, wherein the known entity name listing comprises known non-ambiguous entity names each of which is assigned a single entity type and a single category, and further comprises known ambiguous entity names each of which is assigned a single entity type and a single category, and wherein an entity name in the known entity name listing is an ambiguous entity name if that entity name corresponds to a known entity name and is additionally known to correspond to a name comprising a noun or verb or both unrelated to the known entity name, determine if there is a match between at least one candidate entity name and an ambiguous entity name in the known entity name listing, and whenever there is a match found between at least one candidate entity name and an ambiguous entity name in the known entity name listing, for each candidate entity name in the list structure under consideration matching such an ambiguous entity name, determine if there is a match between another candidate entity name in the list structure under consideration and a non-ambiguous entity name in the known entity name listing whose assigned entity type and category matches the entity type and category assigned to the ambiguous entity name in the known entity name listing that matched the candidate entity name under consideration, and whenever such an entity type and category match is found, tag the candidate entity name under consideration with information indicative of the matching entity type and category to produce a disambiguated list structure in the accessed document data.
- 2. The system of claim 1, wherein the known entity name listing comprises known entity names that are assigned one or more positive qualifiers, or one or more negative qualifiers, or both, and which can only be matched to a candidate entity name if each of the assigned positive qualifiers, if any, are satisfied and if each of the assigned negative qualifiers, if any, are not satisfied.
- 3. The system of claim 1, wherein the sub-program for separating the words in each list structure into candidate entity names is preceded by a sub-program for cleaning each identified list structure by removing extraneous words and symbols that are not likely to represent a potential entity name.
- 4. The system of claim 1, wherein each candidate entity name comprises a single word or a multiple-word phrase.
- 5. The system of claim 1, further comprising a sub-program for tagging each candidate entity name in the document data found to match an entity name in the known entity name listing with the entity type and category assigned to that entity name.
- 6. The system of claim 1, further comprising a sub-program for disregarding the list structure under consideration whenever no match is found between a candidate entity name and an ambiguous entity name in the known entity name listing, or whenever such a match is found but no match is found between another candidate entity name and a non-ambiguous entity name in the known entity name listing whose assigned entity type and category matches the entity type and category assigned to the ambiguous entity name in the known entity name listing that matched the candidate entity name under consideration.
- 7. A system for disambiguating generic entity names in documents by associating and tagging a detected generic entity name with an applicable species entity name, comprising: a generic entity name detector comprising one or more computing devices, and a generic entity name detection computer program having a plurality of sub-programs executable by said computing device or devices, wherein the sub-programs configure said computing device or devices to, access document data, said document data comprising at least one list structure that has entity names as elements and wherein one or more of the elements in the at least one list structure is a generic entity name, identify each list structure in the document data, for each list structure, separate the words in the list structure under consideration into candidate entity names, wherein each candidate entity name comprises a single word or a multiple-word phrase, compare each candidate entity name to a known entity name listing, wherein the known entity name listing comprises known non-generic entity names each of which is assigned a single entity type and a single category, and further comprises known generic entity names each of which is associated with a separate sub-list of species entity names applicable to the generic entity name wherein each of the species entity names is assigned a single entity type and a single category, determine if there is a match between at least one candidate entity name and a non-generic entity name in the known entity name listing, determine if there is a match between at least one candidate entity name in the list structure under consideration and a generic entity name in the known entity name listing, and whenever a match is found between at least one candidate entity name in the list structure under consideration and a non-generic entity name in the known entity name listing, as well as a match between at least one candidate entity name in the list structure under consideration and a generic entity name in the known entity name listing, for each candidate entity name in the list structure under consideration found to match a non-generic entity name in the known entity name listing and each candidate entity name in the list structure under consideration found to match a generic entity name in the known entity name listing, determine if a species entity name associated with the matching generic entity name is assigned the same entity type and category as the non-generic entity name in the known entity name listing that matched a candidate entity name, and if a match is found, associate the identified species entity name to the candidate entity name found to match the generic entity name in the known entity name listing as a candidate species entity name, and tag the candidate entity name found to match the generic entity name in the known entity name listing with information indicative of the identified species entity name and its assigned entity type and category to produce a disambiguated list structure in the accessed document data.
- 8. The system of claim 7, wherein the known non-generic entity names, or known generic entity names, or both, of the known entity name listing are each assigned one or more positive qualifiers, or one or more negative qualifiers, or both, and which can only be matched to a candidate entity name if each of the assigned positive qualifiers, if any, are satisfied and if each of the assigned negative qualifiers, if any, are not satisfied.
- 9. The system of claim 7, wherein the sub-program for separating the words in each list structure into candidate entity names is preceded by a sub-program for cleaning each identified list structure by removing extraneous words and symbols that are not likely to represent a potential entity name.
- 10. The system of claim 7, further comprising sub-programs for: for each candidate entity name associated with more than one candidate species entity name, receiving an instruction that designates the species entity name as well as the entity type and category that is to be assigned to the candidate entity name, assigning the designated species entity name as well as the designated entity type and category to the candidate entity name and eliminating each other candidate species entity name and its assigned entity type and category previously associated with and tagged to the candidate entity name, and tagging the candidate entity name in the document data with the assigned species entity name along with the assigned entity type and category.
- 11. The system of claim 7, further comprising a sub-program for tagging each candidate entity name in the document found to match a non-generic entity name in the known entity name listing with the entity type and category assigned to that entity name.
- 12. The system of claim 7, further comprising a sub-program for disregarding the list structure under consideration whenever no match is found between one or more candidate entity names in the list structure and non-generic entity names in the known entity name listing.
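As an illustration only, the matching logic recited in claims 1 and 7 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the names KnownEntity, KNOWN, GENERIC, split_candidates, anchor_pairs, disambiguate, and resolve_generic are hypothetical, the sample entity names, types, and categories are invented for the example, and a list structure is assumed to be a simple comma- or semicolon-delimited string. Qualifiers (claims 2 and 8) and list cleaning (claims 3 and 9) are omitted.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownEntity:
    name: str
    entity_type: str
    category: str
    ambiguous: bool = False


# Hypothetical known entity name listing, keyed by lowercased name.
KNOWN = {
    e.name.lower(): e
    for e in [
        KnownEntity("Python", "technology", "programming language", ambiguous=True),
        KnownEntity("Go", "technology", "programming language", ambiguous=True),
        KnownEntity("JavaScript", "technology", "programming language"),
        KnownEntity("MySQL", "product", "database"),
    ]
}

# Hypothetical generic names, each with its sub-list of species names.
GENERIC = {
    "oracle": [
        KnownEntity("Oracle Database", "product", "database"),
        KnownEntity("Oracle E-Business Suite", "product", "ERP"),
    ],
}


def split_candidates(list_text):
    """Split a comma- or semicolon-delimited list into candidate names."""
    return [part.strip() for part in re.split(r"[,;]", list_text) if part.strip()]


def anchor_pairs(candidates, known):
    """(entity type, category) pairs of candidates matching non-ambiguous known names."""
    return {
        (known[c.lower()].entity_type, known[c.lower()].category)
        for c in candidates
        if c.lower() in known and not known[c.lower()].ambiguous
    }


def disambiguate(list_text, known=KNOWN):
    """Tag ambiguous candidates corroborated by a non-ambiguous list neighbor."""
    candidates = split_candidates(list_text)
    anchors = anchor_pairs(candidates, known)
    tags = {}
    for cand in candidates:
        entry = known.get(cand.lower())
        if entry and entry.ambiguous and (entry.entity_type, entry.category) in anchors:
            tags[cand] = (entry.entity_type, entry.category)
    return tags


def resolve_generic(list_text, known=KNOWN, generic=GENERIC):
    """Map generic candidates to species names sharing a neighbor's type and category."""
    candidates = split_candidates(list_text)
    anchors = anchor_pairs(candidates, known)
    tags = {}
    for cand in candidates:
        species = [
            s for s in generic.get(cand.lower(), [])
            if (s.entity_type, s.category) in anchors
        ]
        if species:
            tags[cand] = [(s.name, s.entity_type, s.category) for s in species]
    return tags
```

In this sketch, disambiguate("Python, JavaScript, Go") tags both "Python" and "Go" because "JavaScript" is a non-ambiguous neighbor with the same entity type and category, while disambiguate("Python, cobra, Go") returns an empty result and the list would be disregarded. Similarly, resolve_generic("Oracle, MySQL") associates "Oracle" with the hypothetical species "Oracle Database" only, since that species shares the entity type and category of the neighbor "MySQL".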
Description
CROSS REFERENCE TO RELATED APPLICATION
This application claims the benefit of and priority to U.S. patent application Ser. No. 16/550,684 filed Aug. 26, 2019.
BACKGROUND
Named entity recognition is widely used to detect an instance of a named entity in electronic textual documents such as web pages, Portable Document Format (PDF) documents, word processor documents, and so on. Once detected, the knowledge that a named entity is mentioned in a document can be put to a myriad of uses. For example, the document containing an instance of a named entity is sometimes flagged as applicable to that entity, and then stored for future reference. The instance of a named entity in a document is also often tagged with information about the entity or a link pointing to such information. Tagging named entities also allows for indexing, which is used for quicker retrieval of documents based on a search query directed toward the tagged entity.
Entity names could be product names (e.g., a brand name), which includes the names of both tangible and intangible products as well as services. Entity names could also be the name of a person, or a person's title, or a movie, or a book title, or a song title, or the name of a business or government office, or a location. Entity names could also refer to technologies. For example, a document might include a list of electronic entertainment technologies such as computer-generated imagery, immersive virtual reality and ultra-high-definition television. Still further, entity names could refer to a type of product (such as a car), or equipment (such as trenchers, chippers, mini-excavators, skid steers, aerial work platforms, tractor loader backhoes, and other types of equipment used in construction). In general, the entity names can be just about anything.
SUMMARY
List-based entity name detection implementations (entity name detection implementations for short) described herein generally identify entity names in documents.
One exemplary implementation takes the form of a system for detecting unknown entity names in documents. This system includes an unknown entity name detector having one or more computing devices, and an unknown entity name detection computer program having a plurality of sub-programs executable by the computing device or devices. The sub-programs configure the computing device or devices to first access document data and identify each list structure in the document data. For each list structure, a sub-program then separates the words in each list structure into candidate entity names. Another sub-program then compares each candidate entity name to a known entity name listing. In general, the known entity name listing includes known entity names, each of which is assigned a single entity type and a single category. Next, for each candidate entity name found to match an entity name in the known entity name listing, a sub-program takes the entity type and category assigned to that known entity name and assigns them, as a candidate entity type and category, to each candidate entity name in the list structure that does not match any entity name in the known entity name listing.
Another exemplary implementation takes the form of a system for detecting ambiguous entity names in documents. This system includes an ambiguous entity name detector having one or more computing devices, and an ambiguous entity name detection computer program having a plurality of sub-programs executable by the computing device or devices. The sub-programs configure the computing device or devices to first access document data and identify each list structure in the document data. For each list structure, a sub-program then separates the words in each list structure into candidate entity names. Another sub-program then compares each candidate entity name to a known entity name listing.
This known entity name listing includes known non-ambiguous entity names, each of which is assigned a single entity type and a single category, and further includes known ambiguous entity names, each of which is assigned a single entity type and a single category. An entity name in the known entity name listing is an ambiguous entity name if that entity name can correspond to a known entity name or to an unrelated item. When there is a match found between at least one candidate entity name and an ambiguous entity name in the known entity name listing, for each candidate entity name matching such an ambiguous entity name, it is determined if there is a match between another candidate entity name and a non-ambiguous entity name in the known entity name listing whose assigned entity type and category matches the entity type and category assigned to the ambiguous entity name in the known entity name listing that matched the candidate entity name under consideration. When such an entity type and category match exists, the candidate entity name under consideration is designated as corresponding to the matching ambi