EP-4070219-B1 - NATURAL PSEUDONYMIZATION AND DOWNSTREAM PROCESSING
Inventors
- YAROWSKY, DAVID E.
- HAMBURGER, Marc
- Weiser, Octavian
Dates
- Publication Date
- 20260506
- Application Date
- 20201202
Claims (9)
- A computer system comprising: one or more computer processors; one or more computer readable memories in communication with the one or more computer processors and having instructions stored thereon that, when executed by the one or more computer processors, cause the one or more processors to perform operations comprising: receive a data stream of processed text data; reidentify (270B), by the one or more processors, a piece of sensitive text information in the processed text data using a pseudonym table (145; 222B), wherein the pseudonym table (145; 222B) contains a mapping of the piece of sensitive text information with a natural pseudonym, wherein the natural pseudonym has at least one information attribute that is the same as a corresponding information attribute of the piece of sensitive text information such that the natural pseudonym is difficult to distinguish from the sensitive text information in the data stream, wherein the reidentifying step (270B) comprises modifying the processed text data by replacing the natural pseudonym with the piece of sensitive text information, characterised in that the sensitive text information includes a first date range, wherein the natural pseudonym has a second date range having a same duration as the first date range.
- The computer system of claim 1, wherein the at least one information attribute comprises at least one of a gender, an age, an ethnicity, an information type, a number of letters, a capitalization pattern, a geographic origin, and street address characteristics of a location.
- The computer system of claim 1 or 2, wherein the natural pseudonym has at least two information attributes that are the same as corresponding information attributes of the piece of sensitive text information.
- The computer system of any one of claims 1-3, wherein the piece of sensitive text information is a piece of personal identifiable information.
- The computer system of any one of claims 1-4, wherein the piece of sensitive text information is a piece of sensitive health information.
- The computer system of any of claims 1-5, wherein the natural pseudonym has a same number of letters as the piece of sensitive text information.
- The computer system of any of claims 1-6, wherein the sensitive text information is a person's name, and wherein the natural pseudonym is a person's name that is different from the sensitive text information.
- The computer system of any of claims 1-7, wherein the natural pseudonym enables a same downstream processing behavior as the sensitive text information.
- The computer system of any of claims 1-8, wherein the natural pseudonym cannot be distinguished from the sensitive text information by a reader.
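The reidentification step and the duration-preserving date range recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (`shift_date_range`, `format_range`, `reidentify`), the ISO date format, the 23-day offset, and the example names are all assumptions introduced here for illustration. The sketch shifts both endpoints of a date range by the same offset, so the duration is unchanged, and restores the originals in a single pass over the processed text.

```python
import re
from datetime import date, timedelta


def shift_date_range(start: date, end: date, offset_days: int) -> tuple[date, date]:
    """Shift both endpoints by the same offset, so the duration is unchanged."""
    delta = timedelta(days=offset_days)
    return start + delta, end + delta


def format_range(start: date, end: date) -> str:
    """Render a date range in one assumed textual form: 'YYYY-MM-DD to YYYY-MM-DD'."""
    return f"{start.isoformat()} to {end.isoformat()}"


def reidentify(processed_text: str, pseudonym_table: dict[str, str]) -> str:
    """Replace every natural pseudonym in the text with its sensitive original.

    pseudonym_table maps sensitive text -> natural pseudonym.
    """
    reverse = {pseudonym: sensitive for sensitive, pseudonym in pseudonym_table.items()}
    if not reverse:
        return processed_text
    # One pass over the text, longest pseudonym first, so that restored
    # text is never substituted a second time.
    pattern = re.compile(
        "|".join(re.escape(p) for p in sorted(reverse, key=len, reverse=True))
    )
    return pattern.sub(lambda m: reverse[m.group(0)], processed_text)


# Hypothetical example data (not taken from the patent).
original_range = (date(2020, 3, 10), date(2020, 3, 17))
pseudonym_range = shift_date_range(*original_range, offset_days=23)

pseudonym_table = {
    "Anna Schmidt": "Lena Fischer",  # same information type and number of letters
    format_range(*original_range): format_range(*pseudonym_range),
}

processed = "Lena Fischer was admitted from 2020-04-02 to 2020-04-09."
restored = reidentify(processed, pseudonym_table)
print(restored)
```

With this example data, the restored sentence reads "Anna Schmidt was admitted from 2020-03-10 to 2020-03-17."; both the original and the pseudonym range span seven days. A single regular-expression pass is used instead of repeated string replacement so that one entry's restored text cannot be mistaken for another entry's pseudonym.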
Description
Technical Field
The present disclosure relates to systems and methods for data pseudonymization and obfuscation, and to using data pseudonymization and obfuscation.
Brief Description of Drawings
The accompanying drawings are incorporated in and constitute a part of this specification and, together with the description, explain the advantages and principles of the invention. In the drawings,
- Figure 1A illustrates a system diagram of one embodiment of a natural pseudonymization system 100, and Figures 1B-1E provide some example implementations of components illustrated in Figure 1A;
- Figure 2A illustrates a flowchart of one embodiment of a natural pseudonymization system;
- Figure 2B illustrates a flowchart of one example of a downstream processing system;
- Figures 3A-3D illustrate various examples of how natural pseudonymization systems are used for downstream processing; and
- Figures 4A and 4B illustrate one example of input and output of a natural pseudonymization system.
In the drawings, like reference numerals indicate like elements. While the above-identified drawings, which may not be drawn to scale, set forth various embodiments of the present disclosure, other embodiments are also contemplated, as noted in the Detailed Description. In all cases, this disclosure describes the presently disclosed subject matter by way of representation of exemplary embodiments and not by express limitations. It should be understood that numerous other modifications and embodiments can be devised by those skilled in the art, which fall within the scope of this disclosure.
Detailed Description
More and more sensitive and personal data is collected and used. Under various data laws and regulations, data collection, processing, and storage require special care. For example, handling and exchanging documents with personal and sensitive health data are subject to data regulations, such as HIPAA ("Health Insurance Portability and Accountability Act") and GDPR ("General Data Protection Regulation").
In some cases, personal data is required to be deidentified. In some cases, sensitive data and documents that clinicians use for training, external or internal audits, testing module functionality, or developing new features in hospital information systems need pseudonymization. In some cases, secure storage of patient information in the cloud or in a hospital database needs pseudonymization and encryption of data.
US 2016/063269 A1 discloses an outsourcing environment by which an outsourcing entity may delegate document-transformation tasks to at least one worker entity, while preventing the worker entity from gaining knowledge of sensitive items that may be contained within a non-obfuscated original document (NOD). The environment may transform the NOD into an obfuscated original document (OOD) by removing sensitive items from the NOD. The worker entity may perform formatting and/or other document-transformation tasks on the OOD, without gaining knowledge of the sensitive items in the NOD, to produce an obfuscated transformed document (OTD). The environment may then allow the outsourcing entity to view a content-restored version of the OTD.
US 2014/280261 A1 discloses a system which includes a software program capable of performing an aliasing function on the personally identifiable information ("PII") of a subject. The software can associate the alias with the PII and output the alias rather than the PII.
Various embodiments of the present disclosure are directed to natural pseudonymization. In some cases, deidentification of free text is done in English and various other European languages (e.g., German, French, Flemish, Spanish, Italian) using natural pseudonymization techniques. Pseudonymization refers to the separation of data from direct identifiers (such as first name or social security number) so that the data cannot be linked to the original identity without additional information.
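As a rough illustration of this separation, the following Python sketch replaces direct identifiers in a text and returns the mapping as a separate object, so that the pseudonymized text alone does not reveal the originals. Everything named here is hypothetical: the `CANDIDATES` pool, the `pseudonymize` function, the attribute labels, and the assumption that an upstream detector has already supplied each identifier with its information type and gender. Choosing a replacement with the same attributes is one possible way to keep the pseudonym natural.

```python
import re

# Hypothetical candidate pool keyed by (information type, gender); illustrative only.
CANDIDATES = {
    ("first_name", "female"): ["Lena", "Maria", "Clara"],
    ("first_name", "male"): ["Jonas", "Peter", "Lukas"],
}


def pseudonymize(
    text: str, identifiers: dict[str, tuple[str, str]]
) -> tuple[str, dict[str, str]]:
    """Replace direct identifiers and return (pseudonymized text, pseudonym table).

    identifiers maps each sensitive string to its (information type, gender),
    assumed to come from an upstream detection step that is not shown here.
    """
    table: dict[str, str] = {}
    for sensitive, attributes in identifiers.items():
        used = set(table.values())
        # Pick the first unused candidate with the same attributes that differs
        # from the original. Raises StopIteration if the pool is exhausted.
        table[sensitive] = next(
            c for c in CANDIDATES[attributes] if c != sensitive and c not in used
        )
    if not table:
        return text, table
    # Whole-word match in a single pass, so "Anna" inside "Annabel" is left alone.
    pattern = re.compile(
        r"\b(?:"
        + "|".join(re.escape(s) for s in sorted(table, key=len, reverse=True))
        + r")\b"
    )
    return pattern.sub(lambda m: table[m.group(0)], text), table


# Hypothetical example input.
note = "Anna saw Dr. Meyer. Anna returns on Monday."
pseudonymized, table = pseudonymize(note, {"Anna": ("first_name", "female")})
print(pseudonymized)  # shareable text
print(table)          # "additional information": to be stored separately
```

In this sketch the returned table plays the role of the additional information needed for linkage; a deployment would keep it apart from the pseudonymized text and restrict access to it.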
In some cases, a pseudonymization table is generated and stored separately on highly secure servers for real-time re-identification of patient documents. In some cases, the method includes natural pseudonymization, where sensitive or personal data is replaced by data with the same type, gender and language/region data. For example, a female name is replaced by another female name common in the local culture, or the street Gritzenweg is replaced with Rosengasse, throughout the entire document, naturally preserving both context and the type of data. In some embodiments, the method discerns between sensitive health and personal information and the general or medical concepts needed for clinical/medical text analysis applications such as clinical decision support, medical study recruiting, clinical utilization management or medical coding, by a mixture of word-context, phrase-context, word/phrase-internal, region-context and document-wide statistical models, which effectively handle natural language processing challenges such as complex whitespace, inter-w