US-12619778-B2 - Token-based data security systems and methods with cross-referencing tokens in freeform text within structured document


Abstract

Multiple types of tokens can be generated and utilized in a highly structured document with freeform text. For example, a tokenization system may receive a request for tokenizing a document with a first portion having structured content and a second portion having unstructured or semi-structured content. In response, the tokenization system identifies sensitive information in the first portion of the document, generates format-preserving tokens for the sensitive information in the first portion of the document, identifies sensitive information in the second portion of the document, and generates self-describing tokens for the sensitive information in the second portion of the document. The self-describing tokens reference the sensitive information in the first portion of the document. The tokenization system may then communicate the format-preserving tokens and the self-describing tokens to the first client computing system or to a second client computing system.
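
As a rough illustration of the flow summarized above, the following Python sketch tokenizes a document that has one structured field and one freeform field. Everything in it is an assumption made for illustration: the function names, the `[[TOK:...]]` wrapper, the in-memory `vault` dictionary, and the random digit substitution are hypothetical stand-ins, not the technique of the disclosure. Sensitive information in the freeform text is found here by exact match against the structured values, which is also a simplification.

```python
import secrets

# Hypothetical wrapper that marks a self-describing token in freeform text.
OPEN, CLOSE = "[[TOK:", "]]"


def format_preserving_token(value: str) -> str:
    """Swap every digit for a random digit; keep length and separators."""
    if not any(ch.isdigit() for ch in value):
        raise ValueError("this sketch only handles values containing digits")
    while True:
        token = "".join(
            secrets.choice("0123456789") if ch.isdigit() else ch for ch in value
        )
        if token != value:  # never hand back the original value
            return token


def tokenize_document(doc: dict, sensitive_fields: list, text_field: str):
    """Return (tokenized copy of doc, vault mapping token -> original value)."""
    vault = {}     # token -> original value (assumed storage, for detokenizing)
    by_value = {}  # original value -> token, so equal values share one token
    out = dict(doc)

    # Structured portion: replace each sensitive field value with a
    # format-preserving token.
    for name in sensitive_fields:
        value = doc[name]
        if value not in by_value:
            token = format_preserving_token(value)
            by_value[value] = token
            vault[token] = value
        out[name] = by_value[value]

    # Freeform portion: replace each occurrence of a sensitive value with a
    # self-describing token whose body is the format-preserving token, so the
    # text cross-references the structured field.
    text = doc[text_field]
    for value, token in by_value.items():
        text = text.replace(value, OPEN + token + CLOSE)
    out[text_field] = text
    return out, vault
```

For example, calling `tokenize_document({"ssn": "123-45-6789", "notes": "Caller confirmed SSN 123-45-6789."}, ["ssn"], "notes")` would return a copy in which the `ssn` field keeps its NNN-NN-NNNN shape and the note carries the same token inside the wrapper.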

Inventors

Walter Hughes Lindsay

Assignees

OPEN TEXT INC.

Dates

Publication Date: 20260505
Application Date: 20231222

Claims (20)

1. A method for securing data, the method comprising: receiving, by a server computer from a first client computer, a request for tokenizing a document, the document having structured content and less-structured content; identifying, by the server computer, sensitive information in the structured content and the less-structured content; generating, by the server computer, a format-preserving token for the sensitive information identified in the structured content; generating, by the server computer, a self-describing token for the sensitive information identified in the less-structured content, wherein generating the self-describing token comprises embedding the format-preserving token as a body within the self-describing token, thereby creating a reference to the sensitive information in the structured content; replacing, by the server computer, the sensitive information in the structured content with the format-preserving token and the sensitive information in the less-structured content with the self-describing token, wherein the self-describing token differs from the format-preserving token; and communicating, by the server computer, the document with the format-preserving token and the self-describing token to the first client computer or to a second client computer.
2. The method according to claim 1, wherein the format-preserving token has a one-to-one relationship with the sensitive information in the structured content.
3. The method according to claim 1, wherein the self-describing token embeds a protection strategy that specifies a technique for generating or formatting a surrogate for the sensitive information and for mapping between the surrogate and the sensitive information.
4. The method according to claim 1, wherein the self-describing token has a preconfigured pattern and a token value.
5. The method according to claim 1, wherein the self-describing token has a preconfigured pattern, a protection strategy indicator, and a token value.
6. The method according to claim 1, further comprising: marking the self-describing token with a visual marker in a human-readable form.
7. The method according to claim 1, wherein the structured content comprises a data field and wherein the less-structured content comprises freeform text in the data field.
8. A tokenization system for securing data, the tokenization system comprising: a processor; a non-transitory computer-readable medium; and instructions stored on the non-transitory computer-readable medium and translatable by the processor for: receiving, from a first client computer, a request for tokenizing a document, the document having structured content and less-structured content; identifying sensitive information in the structured content and the less-structured content; generating a format-preserving token for the sensitive information identified in the structured content: generating, by the server computer, a self-describing token for the sensitive information identified in the less-structured content, wherein generating the self-describing token comprises embedding the format-preserving token as a body within the self-describing token, thereby creating a reference to the sensitive information in the structured content; replacing the sensitive information in the structured content with the format-preserving token and the sensitive information in the less-structured content with the self-describing token, wherein the self-describing token differs from the format-preserving token; and communicating the document with the format-preserving token and the self-describing token to the first client computer or to a second client computer.
9. The tokenization system of claim 8, wherein the format-preserving token has a one-to-one relationship with the sensitive information in the structured content.
10. The tokenization system of claim 8, wherein the self-describing token embeds a protection strategy that specifies a technique for generating or formatting a surrogate for the sensitive information and for mapping between the surrogate and the sensitive information.
11. The tokenization system of claim 8, wherein the self-describing token has a preconfigured pattern and a token value.
12. The tokenization system of claim 8, wherein the self-describing token has a preconfigured pattern, a protection strategy indicator, and a token value.
13. The tokenization system of claim 8, wherein the instructions are further translatable by the processor for: marking the self-describing token with a visual marker in a human-readable form.
14. The tokenization system of claim 8, wherein the structured content comprises a data field and wherein the less-structured content comprises freeform text in the data field.
15. A computer program product for tokenization, the computer program product comprising a non-transitory computer-readable medium storing instructions translatable by a processor for: receiving, from a first client computer, a request for tokenizing a document, the document having structured content and less-structured content; identifying sensitive information in the structured content and the less-structured content; generating a format-preserving token for the sensitive information identified in the structured content: generating, by the server computer, a self-describing token for the sensitive information identified in the less-structured content, wherein generating the self-describing token comprises embedding the format-preserving token as a body within the self-describing token, thereby creating a reference to the sensitive information in the structured content; replacing the sensitive information in the structured content with the format-preserving token and the sensitive information in the less-structured content with the self-describing token, wherein the self-describing token differs from the format preserving token; and communicating the document with the format-preserving token and the self-describing token to the first client computer or to a second client computer.
16. The computer program product of claim 15, wherein the self-describing token embeds a protection strategy that specifies a technique for generating or formatting a surrogate for the sensitive information and for mapping between the surrogate and the sensitive information.
17. The computer program product of claim 15, wherein the self-describing token has a preconfigured pattern and a token value.
18. The computer program product of claim 15, wherein the self-describing token has a preconfigured pattern, a protection strategy indicator, and a token value.
19. The computer program product of claim 15, wherein the instructions are further translatable by the processor for: marking the self-describing token with a visual marker in a human-readable form.
20. The computer program product of claim 15, wherein the structured content comprises a data field and wherein the less-structured content comprises freeform text in the data field.
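
The token layout recited in the dependent claims (a preconfigured pattern, a protection strategy indicator, and a token value) can be pictured with a short Python sketch. The `<<STRATEGY:value>>` layout, the `FPT` indicator, and the names `SelfDescribingToken` and `find_tokens` are all hypothetical; the claims do not fix a concrete syntax, so this is only one plausible reading in which the angle-bracket delimiters double as the human-readable visual marker.

```python
import re
from dataclasses import dataclass

# Hypothetical preconfigured pattern: "<<", an uppercase strategy indicator,
# ":", the token value, then ">>". The real pattern is not given in the claims.
TOKEN_RE = re.compile(r"<<(?P<strategy>[A-Z]{2,8}):(?P<value>[^<>]+)>>")


@dataclass(frozen=True)
class SelfDescribingToken:
    strategy: str  # protection strategy indicator, e.g. "FPT" (assumed label)
    value: str     # token value; here, an embedded format-preserving token

    def render(self) -> str:
        """Serialize the token so it can be dropped into freeform text."""
        return f"<<{self.strategy}:{self.value}>>"


def find_tokens(text: str) -> list:
    """Locate every self-describing token in a piece of freeform text."""
    return [
        SelfDescribingToken(m["strategy"], m["value"])
        for m in TOKEN_RE.finditer(text)
    ]
```

Because the pattern is fixed in advance, a downstream system could recognize such tokens in freeform text without any side channel, and the embedded value could be compared directly with the format-preserving token that replaced the same information in a structured field.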

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of, and claims a benefit of priority under 35 U.S.C. § 120 from, U.S. patent application Ser. No. 17/460,092, filed Aug. 27, 2021, issued as U.S. Pat. No. 11,893,136, entitled “TOKEN-BASED DATA SECURITY SYSTEMS AND METHODS WITH CROSS-REFERENCING TOKENS IN FREEFORM TEXT WITHIN STRUCTURED DOCUMENT,” which claims a benefit of priority under 35 U.S.C. § 119(e) from U.S. Provisional Application No. 63/071,618, filed Aug. 28, 2020, entitled “TOKEN-BASED DATA SECURITY SYSTEMS AND METHODS,” both of which are fully incorporated by reference herein for all purposes.

This application relates to a co-pending U.S. patent application Ser. No. 17/460,007, filed Aug. 10, 2021, entitled “TOKEN-BASED DATA SECURITY SYSTEMS AND METHODS FOR STRUCTURED DATA,” a co-pending U.S. patent application Ser. No. 17/460,040, filed Aug. 10, 2021, entitled “TOKEN-BASED DATA SECURITY SYSTEMS AND METHODS WITH EMBEDDABLE MARKERS IN UNSTRUCTURED DATA,” and a co-pending U.S. patent application Ser. No. 17/460,094, filed Aug. 10, 2021, entitled “TOKENIZATION SYSTEMS AND METHODS FOR REDACTION.” All applications listed in this paragraph are hereby incorporated by reference herein.

TECHNICAL FIELD

This disclosure relates generally to data security in data processing. More particularly, this disclosure relates to data tokenization for protecting sensitive data. Even more particularly, this disclosure relates to data security systems, methods, and computer program products for creating and utilizing various types of tokens, including format-preserving, self-describing, and patterned tokens, to protect sensitive data in content, including structured content and unstructured content.

BACKGROUND OF THE RELATED ART

In data security, the term “token” refers to a non-sensitive data element that can be used as a surrogate in place of a sensitive data element. In general, a token has no extrinsic or exploitable meaning or value, other than serving as a reference to the sensitive data element when processed through a tokenization system. Generally, a tokenization system is a computing system that is responsible for creating a token, using methods such as a random number generation method that cannot be reverse-engineered, and for detokenizing the token back to the sensitive data element.

A data processing application communicatively connected to the tokenization system may, in processing a data file, a document, or a data record, request the tokenization system to generate tokens and replace sensitive data values in the data file, the document, or the data record with the tokens before producing a processed output. This approach has generally been used in Payment Card Industry (PCI) and electronic medical record (EMR) applications.

As an example, sensitive data can be sent, via an application programming interface (API) call or batch file, from a data processing application to a tokenization provider's system. The tokenization provider's system then generates tokens, stores the original data in a secure token vault, and returns desensitized data in which the original sensitive data is replaced with an unrelated value of the same length and format. The tokens can retain elements of the original data. However, unlike encrypted data, tokenized data is undecipherable and irreversible. Because there is no mathematical relationship between a token and the original data that it replaces, the token cannot be transformed back to its original form.

SUMMARY OF THE DISCLOSURE

Since a token traditionally has no extrinsic or exploitable meaning or value, its use across various types of data security applications is generally limited. Embodiments disclosed herein are directed to new types of data security tokens that can be used in various data security systems, methods, and computer program products. The tokens can be created and utilized for protecting sensitive data in structured content as well as unstructured content.

This disclosure describes example embodiments of data security through data tokenization from the following aspects.

According to a first aspect, format-preserving tokens can be generated and utilized in tokenizing sensitive data values in structured data, and the sensitive data values can be manipulated and later revealed in an anonymizing mapping revealing (“AMR”) process.

In some embodiments, a method for securing data can include receiving, by a tokenization system from a first client computing system, a request for data anonymization, the request referencing structured data containing values of interest. The tokenization system can perform a tokenization operation on the structured data which can include generating, for a value of interest in the structured data, a corresponding token and replacing the value of interest in the structured data with the corresponding token, thereby producing an anonymized version of the structured data. The tokenization system can sto