DE-112021006042-B4 - FINDING THE STORAGE LOCATIONS OF TABLE DATA ACROSS SYSTEMS

Abstract

A computer-implemented method for finding data storage locations, wherein the method comprises: given first synopses of first columns of a corpus (12) of first table data (14) stored in one or more reference data storage systems (1), searching (S15) an additional data storage system (2) to identify second table data (24) stored in the additional data storage system, wherein the second table data comprises second columns; obtaining second synopses (261-266) of the second columns of the second table data by computing (S40), for each second column of the second columns, a synopsis according to a numerical representation (251-256) of the content of cells of said each second column, wherein the synopsis comprises a vector of m descriptors (16) with m ≥ 1, each of which is a measure of the numerical representation; identifying (S50) a subset of descriptors as the descriptors having the most unusual values with respect to corresponding descriptors of the first synopses of the corpus; and comparing (S70) two sets of one or more descriptors each, wherein only the subset of descriptors of the second synopses is compared with corresponding descriptors of the first synopses to identify matches between the second table data and the corpus of first table data.
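
The steps summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the four descriptors (mean, standard deviation, minimum, maximum), the scaling of the distance by the corpus spread, the matching tolerance, and all function names are hypothetical choices made for this example.

```python
import math
import statistics
from typing import Dict, List, Sequence


def synopsis(values: Sequence[float]) -> List[float]:
    """Return a vector of m = 4 descriptors of a numeric column (assumed choice)."""
    return [
        statistics.fmean(values),   # location
        statistics.pstdev(values),  # spread
        min(values),
        max(values),
    ]


def most_unusual(syn: Sequence[float],
                 corpus: Sequence[Sequence[float]],
                 k: int) -> Dict[int, float]:
    """Keep the k descriptors farthest from the corpus mean of the same descriptor.

    The distance is divided by the corpus standard deviation so that descriptors
    on different scales are comparable; this scaling is an assumption.
    Returns {descriptor index: value}, i.e. a compressed synopsis.
    """
    scored = []
    for j, value in enumerate(syn):
        reference = [s[j] for s in corpus]
        mean = statistics.fmean(reference)
        spread = statistics.pstdev(reference) or 1.0  # avoid division by zero
        scored.append((abs(value - mean) / spread, j))
    top = sorted(scored, reverse=True)[:k]
    return {j: syn[j] for _, j in top}


def matches(compressed: Dict[int, float],
            first_synopsis: Sequence[float],
            rel_tol: float = 0.05) -> bool:
    """Compare only the retained descriptors with the corresponding corpus descriptors."""
    return all(
        math.isclose(first_synopsis[j], value, rel_tol=rel_tol, abs_tol=1e-9)
        for j, value in compressed.items()
    )


# First synopses of three corpus columns (toy data).
corpus_columns = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50], [100, 100, 100, 100, 900]]
first_synopses = [synopsis(c) for c in corpus_columns]

# A second column found elsewhere: the third corpus column, reordered.
second = synopsis([900, 100, 100, 100, 100])
compressed = most_unusual(second, first_synopses, k=2)
hits = [i for i, s in enumerate(first_synopses) if matches(compressed, s)]
print(hits)  # [2]
```

Because only the retained descriptors and their indices take part in the comparison, the amount of data stored and compared per column shrinks from m values to k values.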

Inventors

John Rooney
Luis Garcés Erice

Assignees

INTERNATIONAL BUSINESS MACHINES CORPORATION

Dates

Publication Date: 20260513
Application Date: 20211121
Priority Date: 20201211

Claims (20)

A computer-implemented method for finding data storage locations, comprising: given first synopses of first columns of a corpus (12) of first tabular data (14) stored in one or more reference data storage systems (1), searching (S15) an auxiliary data storage system (2) to identify second tabular data (24) stored in the auxiliary data storage system, wherein the second tabular data comprises second columns; obtaining second synopses (261-266) of the second columns of the second tabular data by computing (S40), for each second column of the second columns, a synopsis according to a numerical representation (251-256) of the contents of cells of said each second column, wherein the synopsis comprises a vector of m descriptors (16) with m ≥ 1, each of which is a measure of the numerical representation; identifying (S50) a subset of descriptors as the descriptors having the most unusual values with respect to corresponding descriptors of the first synopses of the corpus; and comparing (S70) two sets of one or more descriptors each, wherein only the subset of descriptors of the second synopses is compared with corresponding descriptors of the first synopses in order to identify matches between the second tabular data and the corpus of first tabular data.
A computer-implemented method according to claim 1, wherein the degree of unusualness of a descriptor value is determined with respect to the distance between that value and a reference value.
A computer-implemented method according to claim 2, wherein the reference value is an arithmetic mean for a descriptor of a type of columns in the corpus, and wherein the descriptor whose degree of unusualness is being determined is of the same type.
A computer-implemented method according to one of the preceding claims, wherein the method further comprises: storing (S55) the second synopses, which are obtained as compressed data structures, the latter containing the descriptors with the most unusual values, and discarding remaining descriptors.
A computer-implemented method according to any one of claims 1 to 3, wherein the method further comprises: storing the second synopses, which are obtained as compressed data structures, the latter containing the descriptors with the most unusual values along with their indices, and discarding any remaining descriptors of the second synopses; and/or storing the first synopses, which are obtained as compressed data structures, the latter containing the descriptors with the most unusual values along with their indices, and discarding any remaining descriptors of the first synopses; and/or using a stream-processing, distributed messaging system to send the subset of descriptors of the second synopses to a remotely located system for comparison of the two sets.
A computer-implemented method according to claim 4 or 5, wherein the method further comprises: updating the first synopses of the corpus according to the second synopses, which are stored as compressed data structures.
A computer-implemented method according to one of the preceding claims, wherein: when comparing the two sets, all descriptors of the second synopses are compared with corresponding descriptors of the first synopses in order to identify the matches.
A computer-implemented method according to claim 7, wherein: when comparing the two sets, the matches are identified according to a metric that counts a number of matching synopses for each pair of tables.
A computer-implemented method according to any one of the preceding claims 1 to 6, wherein: comparing the two sets of one or more descriptors comprises calculating a correlation measure, a distance measure and a similarity measure between the two sets of descriptors.
A computer-implemented method according to one of the preceding claims, wherein: computing the synopsis further comprises converting the contents of the cells of each second column into corresponding numerical values in order to obtain the numerical representation, unless the cells already consist of values suitable for obtaining the synopsis.
A computer-implemented method according to claim 10, wherein: computing (S40) the synopsis further comprises, prior to converting (S34) the contents, identifying a data type value that represents a type of data contained in the cells of each second column; and converting the contents into the corresponding numerical values on the basis of a conversion method selected (S32) according to the data type value.
A computer-implemented method according to claim 11, wherein: the conversion method is selected from one or more of: a hashing method, a locality-sensitive hashing method, a locality-preserving hashing method and a feature extraction method.
A computer-implemented method according to any one of the preceding claims, wherein the method further comprises, prior to searching the auxiliary data storage system: searching the one or more reference data storage systems to identify the corpus of first tabular data; and obtaining the first synopses by computing, for each first column of the first columns, a synopsis according to a numerical representation of the contents of cells of said each first column, wherein this synopsis comprises a vector of m descriptors with m ≥ 1, each of which is a measure of the numerical representation.
A computer-implemented method according to claim 13, wherein: the method further comprises updating the first synopses of the corpus according to descriptors of the obtained second synopses.
A computer-implemented method according to one of the preceding claims, wherein: the synopsis calculated for each second column contains a vector of m ≥ 2 descriptors, wherein these contain one or more statistical descriptors, each of which is a statistical measure of the numerical representation.
A computer-implemented method according to claim 15, wherein: each of the statistical descriptors is calculated according to a measure of one of a location, a spread and a shape of data values of the numerical representation.
A computer-implemented method according to any one of the preceding claims, wherein: the method further comprises performing an action with respect to the second table data that are found to match any of the first table data in the corpus.
A computer-implemented method according to claim 17, wherein the action includes one or more of the following: logging the detected match in a database, relocation, deletion, or modification of permissions.
Computer program product for finding data storage locations, wherein the computer program product comprises a computer-readable storage medium containing program instructions, wherein the program instructions are executable by a processing means to cause the processing means to execute the computer-implemented method according to one of the preceding claims.
Computer system for finding data storage locations, the system comprising: a memory (110); and a processor (105) that communicates with the memory, the processor being configured to cause the system to execute the method according to any one of the preceding claims 1 to 18.
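
The type-dependent conversion of cell contents into a numerical representation, as recited in the claims on conversion above, can be sketched as follows. This is a hedged illustration only: the three type labels, the detection rule, the use of SHA-256 as the hashing method, and all function names are hypothetical and are not taken from the patent.

```python
import hashlib
from datetime import date
from typing import List, Sequence


def hash_to_number(cell: object) -> float:
    """Map arbitrary cell content to a number in [0, 1] via a plain hash (assumed method)."""
    digest = hashlib.sha256(str(cell).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def date_to_number(cell: date) -> float:
    """Map a date to its proleptic Gregorian ordinal, which preserves ordering."""
    return float(cell.toordinal())


def detect_type(cells: Sequence[object]) -> str:
    """Return a data type value for the column (hypothetical labels)."""
    if all(isinstance(c, (int, float)) and not isinstance(c, bool) for c in cells):
        return "number"
    if all(isinstance(c, date) for c in cells):
        return "date"
    return "text"


# Conversion method selected according to the data type value.
CONVERTERS = {"date": date_to_number, "text": hash_to_number}


def numerical_representation(cells: Sequence[object]) -> List[float]:
    kind = detect_type(cells)
    if kind == "number":
        # Cells already hold values suitable for computing a synopsis.
        return [float(c) for c in cells]
    convert = CONVERTERS[kind]
    return [convert(c) for c in cells]


print(numerical_representation([3, 4.5]))
print(numerical_representation([date(2020, 1, 1), date(2020, 1, 2)]))
print(numerical_representation(["Alice", "Bob", "Alice"]))
```

Identical text cells map to identical numbers, so a column copied to another system yields the same numerical representation regardless of the file format in which it is stored.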

Description

BACKGROUND

The invention relates generally to computer-implemented methods for finding data storage locations. In particular, it relates to a computer-implemented method and a computer program product for locating tabular data, which rely on statistical fingerprinting of table columns to reduce the computational effort required to identify potential matches between tabular data stored in different storage systems.

A company typically has many transactional and record-keeping systems that record data related to its business. This data is often stored in relational database tables. Such tables are described as "rectangular" because the data is arranged in columns; they are typically displayed to users as rows and columns of a table. The data from these core relational systems is frequently copied to other systems, for example for analysis or other routine business purposes. While the data retains its rectangular shape, it may be stored in different formats, under different file names, and in a variety of storage systems.

As data is copied, copied again, and reshaped, it becomes increasingly difficult for a company to know where that data has gone. The problem is compounded by the fact that users who have gained access to a restricted data source through a carefully controlled and managed process can then make the data available in other contexts without authorization or control. As a result, data records containing sensitive information may end up in a storage system without access control and/or encryption. This is often unintentional and more the result of negligence than malicious intent, as illustrated by the following examples. A user might unknowingly move data from a suitable company system to an unsuitable one. Furthermore, the user might reshape the data, using only a subset of it or removing some rows or columns. Additionally, the user might rename the data file so that the name under which the data is stored bears no relation to the original data file.
A sales representative might take a random sample of 10% of a company's most important customers, remove the address field, and unknowingly save the sample to a file stored on a third-party storage service.

To address such problems, solutions have been developed, most of which involve data fingerprinting and data watermarking. Data fingerprinting generates a small digest from a dataset such that two identical datasets have the same digest. The common way to do this is by generating hash values (also called fingerprints). However, hash values use every bit of information as input, so changing a single bit can result in a completely different digest; that is, hash values are affected by even the smallest changes. An improvement is achieved by using a sliding window of unique hash values and measuring the most representative bit patterns within the dataset. However, this type of fingerprinting operates at the bit level. Since the same relational data can be stored in a variety of storage systems and in different data formats, bit-level fingerprinting is insufficient or unsuitable in most cases.

Alternatively, a watermark can be inserted into the data so that the dataset can be identified even if it is moved to a different storage system and modified. Such methods have the significant disadvantage that all rectangular datasets generated within the organization must be systematically watermarked, using a prescribed algorithm that is typically not available in the tools used to generate the data. This is impractical for the (hundreds of) thousands of datasets that are typically generated and updated across a large number of data storage systems in a large organization, many of which are not even managed by the organization itself (but, for example, by a cloud provider).

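
The sensitivity of hash-based fingerprints to single-bit changes, as described above, can be illustrated with a short sketch. The sample bytes and the choice of SHA-256 are assumptions made for this example; the text does not name a specific hash function.

```python
import hashlib

original = b"id,name\n1,Alice\n2,Bob\n"
# Flip the lowest bit of the first byte: a one-bit change to the data.
modified = bytes([original[0] ^ 0x01]) + original[1:]

digest_a = hashlib.sha256(original).digest()
digest_b = hashlib.sha256(modified).digest()

# Count how many bits of the two digests differ.
differing_bits = sum(bin(a ^ b).count("1") for a, b in zip(digest_a, digest_b))

print(digest_a.hex())
print(digest_b.hex())
print(f"{differing_bits} of {len(digest_a) * 8} digest bits differ")
```

The two digests share no usable similarity, which is why such fingerprints cannot recognize a dataset that has been sampled, reshaped, or stored in another format.
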
Document US 2016/0275150 A1 concerns a method for comparing tabular data, which comprises: identifying a block containing a subset of data rows from a source database table and a corresponding block containing a subset of data rows from a second database table; obtaining a statistical value associated with data contained in the identified block of the source table, and obtaining another statistical value of the data contained in the corresponding block of the target table; comparing the statistical values to determine a match result; and determining, based on a result of the comparison, whether the blocks of the source and target database tables are consistent, wherein a programmed processor device performs the identifying, obtaining, comparing, and determining operations.

SUMMARY

The invention relates to a computer-implemented method, a computer program product, and a computer system for finding data storage locations, the features of which are specified in the corresponding independent claims. Embodiments of the invention are specified in the dependent claims. A computer-implemented procedure for locating data sto