US-20260127318-A1 - System, Method, and Device for Data Anonymization
Abstract
A device, method, and system for anonymizing data from legacy infrastructure are disclosed. Illustratively, the device memory stores computer executable instructions that, when executed by the processor, cause the processor to provide a dataset comprising a plurality of characters and provide a token table for tokenizing datasets. The token table includes mappings that define replacement tokens for characters in datasets. The instructions cause the processor to generate a tokenized dataset based on the dataset by (1) determining, for each contiguous sequence of letter characters of the plurality of characters, a respective letter token having the same length as the respective contiguous sequence, and (2) replacing each contiguous sequence of letter characters of the plurality of characters with the determined respective letter token.
Inventors
- Tejinder Pal Singh DHALIWAL
Assignees
- THE TORONTO-DOMINION BANK
Dates
- Publication Date
- 20260507
- Application Date
- 20260105
Claims (20)
- 1 . A device for tokenizing data, the device comprising: a processor; and a memory coupled to the processor, the memory storing computer executable instructions that when executed by the processor cause the device to: for each contiguous sequence of letter characters of a plurality of characters comprised in a dataset, determine from a token table a respective letter token having the same length as the respective contiguous sequence; and generate a tokenized dataset by replacing each contiguous sequence of letter characters of the plurality of characters with the determined respective letter token.
- 2 . The device of claim 1 , wherein the dataset is one of a plurality of datasets, and at least two of the plurality of datasets have data in different formats and the computer executable instructions cause the device to: with the token table, generate tokenized datasets for each dataset of the plurality of datasets, each tokenized dataset preserving the format of the corresponding dataset.
- 3 . The device of claim 2 , wherein the computer executable instructions cause the processor to: distribute the token table to a plurality of nodes; and provide subsets of the plurality of datasets to the plurality of nodes to distribute generation of the tokenized datasets.
- 4 . The device of claim 1 , wherein the computer executable instructions cause the device to: refresh the token table; and refresh at least one tokenized dataset generated with the token table that preceded the refreshed token table.
- 5 . The device of claim 1 , wherein the token table and the dataset are in a first environment having a first level of access, and wherein the computer executable instructions cause the device to: transmit the tokenized dataset to a second environment having a second level of access, the second environment not having access to the token table or the dataset.
- 6 . The device of claim 1 , wherein, to generate the tokenized dataset, the computer executable instructions cause the device to: identify occurrences of temporal data within the plurality of characters based on one or more preconfigured reference character sequences; and for each identified occurrence, replace the corresponding characters of the plurality of characters with a replacement sequence defined in the token table, the replacement sequence preserving a format of the occurrence.
- 7 . The device of claim 6 , wherein the token table defines the replacement sequence by randomly adding one or more temporal units to the identified occurrence.
- 8 . The device of claim 1 , wherein, to generate the tokenized dataset, the computer executable instructions cause the device to: for each contiguous sequence of number characters of the plurality of characters, determine a respective number token having the same length as the respective contiguous sequence; and generate the tokenized dataset by replacing each contiguous sequence of number characters of the plurality of characters with the determined respective number token.
- 9 . The device of claim 8 , wherein the token table is prepopulated with a plurality of tokens for a plurality of contiguous number sequences.
- 10 . The device of claim 1 , wherein, to generate the tokenized dataset, the computer executable instructions cause the device to: identify subsets of the dataset that contain only numbers; determine if the identified subsets satisfy criteria for implementing additional anonymization features; in response to determining the criteria are satisfied, apply one or more additional measures to the identified subsets to generate replacement tokens for the identified subsets; and generate the tokenized dataset by replacing each identified subset with the generated replacement tokens for the identified subsets.
- 11 . The device of claim 1 , wherein, to generate the tokenized dataset, the computer executable instructions cause the device to: identify subsets of the dataset that contain only numbers; determine if the identified subsets satisfy criteria for implementing additional safety features; in response to determining the criteria are unsatisfied, divide the identified subsets into further subsets that have the same length as a maximum size number only token value; based on the further subsets, determine the corresponding number only token in the token table; and generate the tokenized dataset by replacing each identified subset by combining the corresponding number only tokens for each further subset.
- 12 . A method for tokenizing data, the method comprising: determining, for a dataset comprising a plurality of characters, whether subsets of the dataset comprise one of: alphanumerical strings, only numbers, or temporal entry; for each alphanumerical string subset, applying a first set of mapping constraints of a token table to replace the respective subset; for each only numbers subset, applying a second set of mapping constraints of the token table to replace the respective subset; for each temporal entry subset, applying a third set of mapping constraints of the token table to replace the respective subset; and generating a tokenized dataset by replacing each determined subset with the respective replacement subset.
- 13 . The method of claim 12 , wherein the dataset is one of a plurality of datasets, and at least two of the plurality of datasets have data in different formats, the method comprising: with the token table, generating tokenized datasets for each dataset of the plurality of datasets, each tokenized dataset preserving the format of the corresponding dataset.
- 14 . The method of claim 13 , further comprising: distributing the token table to a plurality of nodes; and providing subsets of the plurality of datasets to the plurality of nodes to distribute generation of the tokenized datasets.
- 15 . The method of claim 12 , further comprising: refreshing the token table; and refreshing at least one tokenized dataset generated with the token table that preceded the refreshed token table.
- 16 . The method of claim 12 , wherein the token table and the dataset are in a first environment having a first level of access, the method comprising: transmitting the tokenized dataset to a second environment having a second level of access, the second environment not having access to the token table or the dataset.
- 17 . The method of claim 12 , further comprising: prepopulating the token table with a plurality of tokens for a plurality of contiguous number sequences to replace only numbers subsets.
- 18 . The method of claim 12 , wherein: the first set of mapping constraints of the token table generate tokenized subsets by replacing contiguous letter characters with contiguous letter characters having a corresponding length; the second set of mapping constraints of the token table generate tokenized subsets by replacing contiguous number characters with randomized number tokens having a corresponding length; and the third set of mapping constraints of the token table generate tokenized subsets by randomly incrementing or decrementing the determined temporal entry subset.
- 19 . The method of claim 18 , wherein: the first set of mapping constraints of the token table generate tokenized subsets by replacing contiguous number characters with the second set of mapping constraints.
- 20 . A non-transitory computer readable medium for tokenizing data, the computer readable medium comprising computer executable instructions for: determining, for a dataset comprising a plurality of characters, whether subsets of the dataset comprise one of: alphanumerical strings, only numbers, or temporal entry; for each alphanumerical string subset, applying a first set of mapping constraints of a token table to replace the respective subset; for each only numbers subset, applying a second set of mapping constraints of the token table to replace the respective subset; for each temporal entry subset, applying a third set of mapping constraints of the token table to replace the respective subset; and generating a tokenized dataset by replacing each determined subset with the respective replacement subset.
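The method of claims 12 and 18 can be illustrated with a short sketch. The following Python code is a hypothetical illustration, not the patented implementation: letter runs are replaced with random letters of the same length and case pattern, digit runs with random digits of the same length, and temporal entries matching a preconfigured reference pattern (here, an assumed ISO `YYYY-MM-DD` pattern) are shifted by a random number of days. The `DATE_PATTERN` regex, the 1 to 30 day shift range, and all function names are illustrative assumptions.

```python
import random
import re
import string
from datetime import datetime, timedelta

# Hypothetical reference pattern for temporal entries; a real tokenizer would
# carry a set of preconfigured reference character sequences (cf. claim 6).
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")

def tokenize_record(record: str, rng: random.Random) -> str:
    """Apply three mapping-constraint sets: letters -> random letters of the
    same length and case, digit runs -> random digits of the same length,
    dates -> the same date shifted forward by a random number of days.
    Separators and punctuation pass through, preserving the record format."""

    def shift_date(match: re.Match) -> str:
        # Randomly add temporal units to the identified occurrence (cf. claim 7).
        date = datetime.strptime(match.group(), "%Y-%m-%d")
        return (date + timedelta(days=rng.randint(1, 30))).strftime("%Y-%m-%d")

    def scramble(text: str) -> str:
        def repl(m: re.Match) -> str:
            run = m.group()
            if run.isdigit():
                return "".join(rng.choice(string.digits) for _ in run)
            # Preserve the case of each position in the letter run.
            return "".join(
                rng.choice(string.ascii_uppercase if c.isupper()
                           else string.ascii_lowercase)
                for c in run)
        return re.sub(r"[A-Za-z]+|\d+", repl, text)

    # Handle temporal entries first so the shifted dates survive intact,
    # then scramble the letter and number runs between them.
    out, last = [], 0
    for m in DATE_PATTERN.finditer(record):
        out.append(scramble(record[last:m.start()]))
        out.append(shift_date(m))
        last = m.end()
    out.append(scramble(record[last:]))
    return "".join(out)
```

Because every replacement has the same length and character class as the original run, the tokenized record keeps the exact layout of the source record, which is the format-preservation property the claims emphasize.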
Description
CROSS-REFERENCE TO RELATED APPLICATION(S)
This application is a Continuation of U.S. patent application Ser. No. 18/596,929 filed on Mar. 6, 2024, the contents of which are incorporated herein by reference in their entirety.
TECHNICAL FIELD
The following relates generally to methods of anonymizing data.
BACKGROUND
Existing digital architectures may impose constraints on data access, and these constraints can result in burdensome processes. In some approaches, constraints on data access have led to the application of anonymization processes that generate anonymized data for users lacking access to the underlying data. Existing anonymization approaches can be poorly implemented; for example, they can prevent a data engineer without access to the sensitive data (alternatively referred to as "production" data) from understanding how that data is formatted and its other characteristics. Because knowing the production data characteristics can be a prerequisite for performing certain tasks, users can waste considerable time navigating data access processes (e.g., receiving permission to access the production data to generate test data) without performing any substantive tasks, or, in time-sensitive applications, data stewards work around the existing constraints and show the other user the access-controlled data (e.g., in a joint in-person debug session). Anonymization techniques can also be counterproductive when they result in missing or poor-quality anonymized data. The anonymization can fail to preserve, or only partially preserve, the format of the production data, making it difficult to rely on the anonymized data. For users that rely on the anonymized data, existing anonymization techniques can lead to poor or missing data (e.g., testing data), decrease timeliness, increase development costs, and increase the risk of application failure or unintended performance.
Some existing approaches require preliminary or preceding work to gain access to the production data (or to produce adequate substitutes) that can exceed the substantive work itself. In addition, the lack of, or poor quality of, testing data can place unnecessary stress and difficulty on users developing applications (e.g., developers, quality engineers, performance engineers, etc.).
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments will now be described with reference to the appended drawings wherein:
FIG. 1 is a schematic diagram of an example computing environment.
FIG. 2A is a diagram illustrating data moving through an existing framework for managing data within an enterprise.
FIG. 2B is a diagram illustrating an example framework for anonymizing data.
FIG. 2C is a diagram illustrating example data anonymization.
FIGS. 3A and 3B are each a flow diagram of an example embodiment of computer executable instructions for implementing a method for anonymizing data.
FIG. 4 is a block diagram of an example configuration of an example device.
FIG. 5 is a block diagram of an example server.
DETAILED DESCRIPTION
It will be appreciated that for simplicity and clarity of illustration, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements. In addition, numerous specific details are set forth to provide a thorough understanding of the example embodiments described herein. However, it will be understood by those of ordinary skill in the art that the example embodiments described herein may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the example embodiments described herein. Also, the description is not to be considered as limiting the scope of the example embodiments described herein.
This disclosure includes a tokenizer for anonymizing sensitive data (e.g., production data), the anonymization preserving the format of the production data. The anonymized data can be relied upon by users (e.g., ETL engineers, performance engineers), including employees of the enterprise, or downstream applications, etc. The anonymization process involves a tokenizer, which can be access-restricted in a manner similar to the production data, generating a token table that has a plurality of replacement token mappings. The token table is used to process sensitive data (e.g., a subset used to create a test sample) and to replace the sensitive data with replacement tokens. The token table can include a plurality of mappings or mapping sets. For example, a mapping can replace contiguous number sequences (or contiguous letter sequences) with unique, randomly generated number sequences (or letter sequences), creating replacement data of the same length and same format. The token table can include mapping sets for sensitive data (e.g., credit card numbers) and mapping sets for date/time data, with all mappings preserving the format of the data.
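The token table described above can be sketched as a memoized mapping, so that repeated occurrences of the same character sequence receive the same replacement token while length and character class are preserved. The Python sketch below is an assumption-laden illustration, not the disclosed implementation: the `TokenTable` class name, the `MAX_NUM_LEN` chunk size, and the chunking of long digit runs (in the spirit of claim 11) are hypothetical.

```python
import random
import string

class TokenTable:
    """Minimal token-table sketch: maps each distinct character sequence to a
    stable, randomly generated replacement of the same length and character
    class. Digit runs longer than MAX_NUM_LEN are split into chunks of at most
    that size, each chunk is mapped through the table, and the chunk tokens
    are concatenated, preserving total length."""

    MAX_NUM_LEN = 4  # hypothetical maximum size of a number-only token

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._map = {}  # sequence -> replacement token

    def _token_for(self, seq: str) -> str:
        # Memoize so the same input sequence always yields the same token.
        if seq not in self._map:
            alphabet = string.digits if seq.isdigit() else string.ascii_uppercase
            self._map[seq] = "".join(self._rng.choice(alphabet) for _ in seq)
        return self._map[seq]

    def number_token(self, digits: str) -> str:
        # Split long digit runs into chunks no longer than MAX_NUM_LEN and
        # combine the per-chunk tokens into one replacement of equal length.
        chunks = [digits[i:i + self.MAX_NUM_LEN]
                  for i in range(0, len(digits), self.MAX_NUM_LEN)]
        return "".join(self._token_for(chunk) for chunk in chunks)

    def letter_token(self, letters: str) -> str:
        return self._token_for(letters)
```

Memoizing the mapping is what makes the table distributable: as described for claims 3 and 14, the same table can be shipped to a plurality of nodes so that each node produces consistent tokens for the same source sequences.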