CN-115516441-B - Method, system and program product for identifying entities in a database system

Abstract

A computer-implemented method for explicitly identifying entities in a database system may be provided. The method includes storing data items in a table of a database as records having different attributes, storing naming rules for selected combinations of attributes of the data items, and prioritizing the naming rules. The method further includes determining a hash value for each of the selected combinations of attributes of the data items, and identifying duplicate data items using the determined hash values and the prioritized naming rules.
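The steps named in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the rule list `NAMING_RULES`, the helper names `rule_hash` and `find_duplicates`, the choice of SHA-256, and the trimming and case folding of values are all assumptions made for illustration, since the abstract does not specify a hash function or a normalization step.

```python
import hashlib
from collections import defaultdict

# Hypothetical naming rules: each rule is a combination of attribute names.
# List order encodes priority (first entry = highest priority).
NAMING_RULES = [
    ("name", "ein"),  # customer name together with EIN
    ("name",),        # customer name alone
]


def rule_hash(record, rule):
    """Hash the attribute combination of one rule; None if an attribute is missing."""
    values = []
    for attr in rule:
        value = record.get(attr)
        if not value:
            return None
        # Assumed normalization: trim and case-fold before hashing.
        values.append(str(value).strip().casefold())
    # Attribute names are part of the payload so different rules never collide.
    payload = "\x1f".join((*rule, *values))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def find_duplicates(records, rules=NAMING_RULES):
    """Return (rule, record indices) for every shared hash, highest-priority rule first."""
    matches = []
    for rule in rules:
        buckets = defaultdict(list)
        for index, record in enumerate(records):
            digest = rule_hash(record, rule)
            if digest is not None:
                buckets[digest].append(index)
        matches.extend((rule, idxs) for idxs in buckets.values() if len(idxs) > 1)
    return matches
```

Because the rules are evaluated in priority order, a match on name plus EIN is reported before a match on name alone, so a caller can treat the earlier match as the stronger evidence of identity.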

Inventors

  • M. Borgioni
  • M. Philip
  • M. Luczynski
  • T. Zatoski
  • A. Raskawik
  • M. Piatek
  • 50. Studenny

Assignees

  • 国际商业机器公司

Dates

Publication Date
2026-05-05
Application Date
2021-03-05
Priority Date
2020-04-03

Claims (20)

  1. A computer-implemented method for explicitly identifying entities in a database system, the method comprising: storing data items in a table of a database, wherein the data items are stored as records comprising a plurality of attributes; storing naming rules for a selected combination of attributes of the data items, wherein the naming rules are based on the plurality of attributes, and wherein a combination includes at least a customer name and a customer-associated employer identification number (EIN); prioritizing the naming rules by defining a sequence in which the naming rules are applied or defining an order in which the naming rules are applied, according to their importance to entity identification, wherein the combination of the customer name and the customer-associated EIN has a relatively higher priority than the customer name alone; determining a hash value for each of the selected combinations of attributes of the data items; identifying duplicate data items using the determined hash values and the prioritized naming rules; and merging the identified duplicate data items into one merged data item and defining the attribute source of the merged data item using a higher-priority naming rule.
  2. The method of claim 1, wherein the database system is a relational database system, and wherein an entry in the relational database system automatically triggers a database engine to create a naming rule associated with the entry.
  3. The method of claim 1, wherein the database system is a configuration management database that is the basis of a specific internal organization and is used to manage technical equipment and applications in a plurality of data centers.
  4. The method of claim 1, further comprising merging the identified duplicate data items by maintaining the determined hash values as a multi-valued key for the merged data item.
  5. The method of claim 4, further comprising merging other data items in a composite relationship with the identified data item.
  6. The method of claim 4, further comprising maintaining, for the determined hash values, a pointer to the same row identifier of one of the merged data items.
  7. The method of claim 1, further comprising: maintaining an index of the table; and maintaining a pointer in a search tree associated with the index such that the pointer points to the same record identifier of a combined data item, wherein the combined data item is determined based on splitting an upper-level name into two or more separate aliases, and each of the two or more separate aliases is used as a single value in the index.
  8. The method of claim 1, further comprising using a CREATE SQL statement that is adapted to create naming rules and their associated priorities.
  9. The method of claim 1, further comprising ordering records in the database table using the multi-valued primary key.
  10. The method of claim 1, wherein the multi-valued primary key is used to cluster data on a multi-node database engine.
  11. The method of claim 9, wherein the multi-valued primary key is comparable to a single-valued column data item.
  12. The method of claim 1, further comprising collecting statistical database data for data blocks of single-valued primary keys and multi-valued primary keys.
  13. A computer system for explicitly identifying entities in a database system, the computer system comprising: one or more computer processors, one or more computer-readable storage media, and program instructions stored on the one or more computer-readable storage media for execution by at least one of the one or more processors, the program instructions being capable of performing a method comprising: storing data items in a table of a database, wherein the data items are stored as records comprising a plurality of attributes; storing naming rules for a selected combination of attributes of the data items, wherein the naming rules are based on the plurality of attributes, and wherein a combination includes at least a customer name and a customer-associated employer identification number (EIN); prioritizing the naming rules by defining a sequence in which the naming rules are applied or defining an order in which the naming rules are applied, according to their importance to entity identification, wherein the combination of the customer name and the customer-associated EIN has a relatively higher priority than the customer name alone; determining a hash value for each of the selected combinations of attributes of the data items; identifying duplicate data items using the determined hash values and the prioritized naming rules; and merging the identified duplicate data items into one merged data item and defining the attribute source of the merged data item using a higher-priority naming rule.
  14. The computer system of claim 13, wherein the database system is a relational database system, and wherein an entry in the relational database system automatically triggers a database engine to create a naming rule associated with the entry.
  15. The computer system of claim 13, wherein the database system is a configuration management database that is the basis of a particular internal organization and is used to manage technical equipment and applications in a plurality of data centers.
  16. The computer system of claim 13, wherein the method further comprises merging the identified duplicate data items by maintaining the determined hash values as a multi-valued key for the merged data item.
  17. The computer system of claim 16, wherein the method further comprises merging other data items in a composite relationship with the identified data item.
  18. The computer system of claim 16, wherein the method further comprises maintaining, for the determined hash values, a pointer to the same row identifier of one of the merged data items.
  19. The computer system of claim 13, wherein the method further comprises: maintaining an index of the table; and maintaining a pointer in a search tree associated with the index such that the pointer points to the same record identifier of a combined data item, wherein the combined data item is determined based on splitting an upper-level name into two or more separate aliases, and each of the two or more separate aliases is used as a single value in the index.
  20. A computer program product for explicitly identifying entities in a database system, the computer program product comprising: one or more non-transitory computer-readable storage media and program instructions stored on the one or more non-transitory computer-readable storage media, the program instructions being capable of performing a method comprising: storing data items in a table of a database, wherein the data items are stored as records comprising a plurality of attributes; storing naming rules for a selected combination of attributes of the data items, wherein the naming rules are based on the plurality of attributes, and wherein a combination includes at least a customer name and a customer-associated employer identification number (EIN); prioritizing the naming rules by defining a sequence in which the naming rules are applied or defining an order in which the naming rules are applied, according to their importance to entity identification, wherein the combination of the customer name and the customer-associated EIN has a relatively higher priority than the customer name alone; determining a hash value for each of the selected combinations of attributes of the data items; identifying duplicate data items using the determined hash values and the prioritized naming rules; and merging the identified duplicate data items into one merged data item and defining the attribute source of the merged data item using a higher-priority naming rule.
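
The merge step of the independent claims, together with the multi-valued key and shared row identifier of claims 4 and 6, might be sketched as follows in Python. This is one possible reading and not the claimed implementation: the `MergedItem` class, the `merge_duplicates` and `build_index` helpers, the tuple layout of the input, and the convention that a lower priority number means a higher-priority rule are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MergedItem:
    row_id: int                                   # row identifier kept for the merged item
    attributes: dict = field(default_factory=dict)
    keys: set = field(default_factory=set)        # multi-valued key: every contributing hash
    sources: dict = field(default_factory=dict)   # attribute name -> priority that supplied it


def merge_duplicates(items):
    """Merge duplicates given as (row_id, priority, hash_value, attributes) tuples.

    Assumption: priority 0 is the highest-priority naming rule.
    """
    ordered = sorted(items, key=lambda item: item[1])
    merged = MergedItem(row_id=ordered[0][0])
    for _row_id, priority, hash_value, attributes in ordered:
        merged.keys.add(hash_value)
        for name, value in attributes.items():
            # The first writer wins, i.e. the record matched by the
            # higher-priority rule defines the attribute's source.
            if name not in merged.attributes:
                merged.attributes[name] = value
                merged.sources[name] = priority
    return merged


def build_index(merged):
    """Point every hash of the multi-valued key at the same row identifier."""
    return {hash_value: merged.row_id for hash_value in merged.keys}
```

A lookup by any of the original hashes then resolves to a single row, which is the property a pointer "to the same row identifier" would provide in an index.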

Description

Method, system and program product for identifying entities in a database system

Background

The present invention relates generally to database systems and, more particularly, to a computer-implemented method for explicitly identifying entities in a database system. The invention further relates to a related database system for explicitly identifying entities in a database system, and to a computer program product adapted to perform the method.

Enterprise information management remains one of the key topics for enterprise IT (information technology) organizations. This applies not only to large Global 2000 companies but also to small and medium-sized enterprises. The reason is simple: in the information age, both the number of sources used for data management and the absolute amount of data that must be managed successfully are growing. One way to address this difficulty is to create enterprise data directories in the context of data warehouse projects and to apply data management concepts. However, reality has shown that this approach is quite difficult because new data sources and new types of data impact IT organizations in ever shorter periods of time.

Thus, there is a need for a more dynamic way to address the common problem of duplicate data objects within and across many enterprise applications. For example, the same customer may be entered multiple times into the enterprise resource planning (ERP) system, with slightly different names or with employer identification numbers (e.g., tax numbers) in different formats. Creating unique constraints is not enough, because the same customer name can be saved in uppercase or mixed case, or as a full company name or a short name. Such problems may be discovered over time, but merging the other relevant data is often difficult, time-consuming, and sometimes impossible.
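
The limitation of a plain uniqueness check described above can be made concrete with a short Python sketch. The sample customer values and the helpers `normalize_name` and `normalize_ein` are hypothetical, and the normalization shown (collapsing whitespace, case folding, keeping only the digits of the EIN) is an assumption for illustration rather than something the description prescribes.

```python
import re


def normalize_name(name):
    """Collapse whitespace and ignore letter case (assumed normalization)."""
    return " ".join(name.split()).casefold()


def normalize_ein(ein):
    """Keep only the digits so that formatting differences disappear."""
    return re.sub(r"\D", "", ein)


# Three spellings of what is presumably the same customer (made-up sample data).
raw = [
    ("Acme Corp", "12-3456789"),
    ("ACME  CORP", "123456789"),
    ("acme corp", "12 3456789"),
]

# A uniqueness check on the raw values treats all three rows as different.
distinct_raw = set(raw)

# After normalization the three rows collapse into a single entity.
distinct_normalized = {(normalize_name(n), normalize_ein(e)) for n, e in raw}
```

Note that this kind of normalization still does not reconcile a full company name with a short name, which is one reason a simple constraint is described as insufficient.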
Over time, invoices, orders, and many other related data objects may be created in the ERP system and in related systems, such as customer relationship management (CRM) systems, supply chain management (SCM) systems, logistics systems, and the like. Moreover, all of these systems use the duplicate customer records. For example, if a customer discount is to be granted based on the accumulated order value, the orders saved under the different customer records need to be aggregated, a function that may not be available in today's ERP systems.

However, the described problems exist not only in databases storing customer data and the like, but also in databases for highly technical applications, such as configuration management databases (CMDBs) for controlling, prioritizing, and allowing or denying access to computing resources.

There are several disclosures of computer-implemented methods for deduplicating an item storage system. Document US 2017/0308557 A1 discloses a method and system for cleansing and deduplicating data in a database. The method includes filtering garbage records from a plurality of records based on the data fields, and applying cleansing rules to create a cleansed database. Similarity vectors are generated, where each vector corresponds to a pairwise comparison of remote data entries in the cleansed database. A matching rule is applied to label each vector as one of matched, unmatched, and unclassified.

Furthermore, document US 2017/0011088 A1 discloses a method for finding duplicates in a database, comprising calculating hash values for at least two field groups of a record in the database, wherein a field group comprises at least two fields of a record, and the hash value of a field group of a record is based on the values stored in the at least two fields of the corresponding field group in the corresponding record.
However, as noted above, these known methods do not address the difficulties that enterprise information management faces in seamlessly and effortlessly handling multiple entries in a database supporting enterprise and/or technical applications. It may therefore be desirable to overcome the above-described technical problems and provide an advanced solution to correctly store and explicitly identify entities involving the same and/or different data objects.

Disclosure of Invention

According to one aspect of the present invention, a computer-implemented method for explicitly identifying entities in a database system may be provided. The method may include storing data items in a table of a database, the data items being stored as records comprising a plurality of attributes, storing naming rules for selected combinations of attributes of the data items, and prioritizing the naming rules. The method may further include determining a hash value for each of the selected combinations of attributes of the data items, and identifying duplicate data items using the determined hash values and the prioritized naming rules.

According to another aspect of the present invention, a database system for explicitly identifying entities in a database system may be provided. The database s