
US-12626249-B1 - System and method for enforcing PII segregation in a distributed data processing system for privacy-preserving AI corpus generation

US 12626249 B1

Abstract

A system and method are disclosed for generating a privacy-preserving data corpus for Artificial Intelligence (AI) training. The system comprises a relying partner (RP) computing environment and a trusted, independent identity provider (IdP) computing system. Upon a user authentication request, the IdP provides the RP with only a PII-free, persistent pseudonymous identifier (gUserID) for the user. Any authentication artifacts containing Personally Identifiable Information (PII), such as an OAuth token, are programmatically neutralized by the IdP. This is achieved by generating a transient public-private encryption key pair, immediately destroying the private key, and encrypting the PII-laden artifact with the remaining public key, rendering the PII therein permanently irrecoverable. This enforcement of “PII unknowability” at the RP enables the aggregation of pseudonymous user data, linked by the persistent gUserID, from multiple independent RPs into a rich, cross-organizational corpus for AI training, without ever exposing user PII to the RP.
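The neutralization mechanism summarized above can be sketched in a few lines of code. The following is a minimal, illustrative Python sketch (using the `cryptography` package), not the patented implementation: the function name `neutralize_artifact`, the choice of RSA-OAEP wrapped around a one-time AES-GCM key (a hybrid step, since RSA alone cannot encrypt an arbitrarily long OAuth token), and the use of a UUID as the gUserID are all assumptions made for the example; the disclosure only requires a transient key pair whose private key is made permanently unavailable.

```python
# Illustrative sketch only -- not the patented implementation.
import os
import uuid

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives.ciphers.aead import AESGCM


def neutralize_artifact(artifact: bytes) -> tuple[str, bytes, bytes, bytes]:
    """Render the PII in `artifact` permanently irrecoverable and mint a
    PII-free pseudonymous identifier (gUserID) to hand to the relying partner."""
    # Transient key pair: the private key exists only inside this function.
    private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    public_key = private_key.public_key()

    # Hybrid step (assumption): seal the artifact with a one-time AES-GCM key,
    # then encrypt that key under the transient RSA public key.
    aes_key = AESGCM.generate_key(bit_length=256)
    nonce = os.urandom(12)
    sealed_artifact = AESGCM(aes_key).encrypt(nonce, artifact, None)
    sealed_key = public_key.encrypt(
        aes_key,
        padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                     algorithm=hashes.SHA256(), label=None),
    )

    # "Destroy" the private key and the one-time AES key by dropping every
    # reference to them; nothing capable of decrypting the artifact is kept.
    del private_key, aes_key

    # In practice the gUserID would be looked up via the IdP's stored mapping
    # from a stable, non-PII identifier; a fresh UUID is shown here for brevity.
    g_user_id = str(uuid.uuid4())
    return g_user_id, sealed_artifact, sealed_key, nonce
```

Under these assumptions, only the gUserID is handed to the RP; the sealed artifact and sealed key may be retained or discarded, but in either case the PII inside them can no longer be recovered by any party.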

Inventors

  • Waleed S Nema

Assignees

  • Waleed S Nema

Dates

Publication Date
2026-05-12
Application Date
2025-09-19

Claims (20)

  1. A system for generating a privacy-preserving data corpus, the system comprising: a relying partner (RP) computing system comprising one or more processors and a memory; and a third-party identity provider (IdP) computing system, operated by a legal entity separate from an operator of the RP computing system, the third-party IdP computing system comprising one or more processors and a memory; wherein the third-party IdP computing system is configured to store or have access to a public key of a first cryptographic key pair for which a corresponding private key is permanently unavailable; wherein the third-party IdP computing system is further configured by instructions stored in its memory to, in response to receiving an authentication request for a user from the RP computing system: perform an authentication process that results in an authentication artifact containing personally identifiable information (PII) of the user; generate or retrieve a persistent, PII-free globally unique user identifier (gUserID) associated with the user, wherein the gUserID is configured to identify the user across a plurality of distinct RP computing systems; extract a stable, non-PII unique identifier from the authentication artifact; encrypt the authentication artifact containing the PII using the public key of the first cryptographic key pair, thereby rendering the PII therein permanently irrecoverable; and transmit only the gUserID to the RP computing system as a sole identifier for the user resulting from the authentication process; wherein the RP computing system is configured to: receive the gUserID from the third-party IdP computing system; and store user-generated data in the memory in association with the received gUserID.
  2. The system of claim 1, wherein the third-party IdP computing system is further configured to: store, in the memory of the third-party IdP computing system, a mapping between the stable, non-PII unique identifier and the gUserID, wherein no other PII from the authentication artifact is stored in the memory of the third-party IdP computing system.
  3. The system of claim 1, wherein the third-party IdP computing system is further configured to: generate a second cryptographic key pair comprising a public key and a private key, wherein the private key of the second cryptographic key pair is securely stored by the third-party IdP computing system; and prior to said transmitting the gUserID, digitally sign the gUserID using the private key of the second cryptographic key pair; and wherein the RP computing system is further configured to verify a digital signature of the gUserID using the public key of the second cryptographic key pair.
  4. The system of claim 1, wherein the RP computing system further comprises a personalization engine configured to maintain a user profile associated with the gUserID, the user profile comprising user-generated data and interaction history providing a personalized user experience, wherein the user profile is stored without any PII of the user.
  5. The system of claim 1, further comprising a third-party payment processor (PP) computing system, operated by a legal entity separate from the operator of the RP computing system and separate from an operator of the third-party IdP computing system, wherein the RP computing system is further configured to generate a third cryptographic key pair comprising a public key and a private key and to transmit the public key of the third cryptographic key pair to the third-party PP computing system; and wherein the third-party PP computing system is configured to: receive, from the RP computing system, a request to process a financial transaction for the user; receive financial PII from the user; upon successful verification of the financial PII, generate a non-PII transaction code; encrypt the non-PII transaction code using the public key of the third cryptographic key pair; and transmit only the encrypted non-PII transaction code to the RP computing system as confirmation of the financial transaction.
  6. The system of claim 1, wherein the gUserID is a Universally Unique Identifier (UUID).
  7. The system of claim 1, wherein the RP computing system is further configured to replicate at least a portion of the user-generated data from the memory of the RP computing system to a separate data corpus computing system.
  8. The system of claim 7, wherein the replication of the at least a portion of the user-generated data is contingent upon receiving an opt-in selection from the user via a user interface.
  9. The system of claim 8, wherein the opt-in selection comprises a user-selectable choice between contributing raw user-generated data or aggregated user-generated data to the separate data corpus computing system.
  10. The system of claim 9, wherein the opt-in selection further comprises a user-selectable choice between contributing the user-generated data to a paid-data corpus or an open-data corpus.
  11. The system of claim 10, wherein the opt-in selection further comprises a user-selectable time duration for the opt-in selection, and wherein the system is configured to programmatically enforce the time duration by automatically revoking access to the user-generated data upon expiration of the time duration.
  12. The system of claim 7, wherein the data corpus computing system is a cross-organizational data corpus configured to store user-generated data associated with a plurality of gUserIDs received from a plurality of distinct RP computing systems.
  13. The system of claim 12, further comprising an artificial intelligence (AI) engine computing system communicatively coupled to the cross-organizational data corpus, wherein the AI engine computing system is configured to train a machine learning model using the user-generated data stored within the cross-organizational data corpus.
  14. The system of claim 13, further comprising a payment processing system and a third-party payment processor (PP) computing system, wherein the third-party IdP computing system is further configured to store a non-PII payment token associated with the gUserID, the non-PII payment token having been previously generated by the third-party PP computing system in response to receiving financial PII from the user; and wherein the payment processing system is configured to: track usage of the user-generated data by the AI engine computing system on a per-gUserID basis; calculate an incentive payment associated with the gUserID based on tracked usage; and instruct the third-party IdP computing system to facilitate transferring the incentive payment to the user; wherein the third-party IdP computing system is further configured to, in response to the instruction, instruct the third-party PP computing system to execute a financial transaction to a financial account of the user corresponding to the non-PII payment token.
  15. The system of claim 13, wherein the cross-organizational data corpus is structured as a data commons governed by a fiduciary trustee, and wherein the AI engine computing system is granted access to the cross-organizational data corpus only upon programmatic execution of a data use agreement comprising a non-exclusivity clause.
  16. A computer-implemented method for enforcing segregation of personally identifiable information (PII) in a distributed data processing environment, the method comprising: providing a third-party identity provider (IdP) computing system that stores or has access to a public key of a first cryptographic key pair for which a corresponding private key is permanently unavailable; receiving, at the third-party IdP computing system, an authentication request for a user, the authentication request originating from a relying partner (RP) computing system that is operated by a legal entity separate from an operator of the third-party IdP computing system; performing, by the third-party IdP computing system, an authentication process that generates an authentication artifact containing PII of the user; generating or retrieving, by the third-party IdP computing system, a persistent, PII-free globally unique user identifier (gUserID) associated with the user, the gUserID being configured for use across multiple, distinct RP computing systems; encrypting, by the third-party IdP computing system, the authentication artifact containing the PII with the public key of the first cryptographic key pair, wherein said encrypting renders the PII within the authentication artifact permanently irrecoverable; transmitting, from the third-party IdP computing system to the RP computing system, the gUserID as a sole identifier for the user provided in response to the authentication request; and storing, by the RP computing system, user-generated data in association with the gUserID transmitted by the third-party IdP computing system.
  17. The method of claim 16, further comprising: prior to said encrypting the authentication artifact, extracting, by the third-party IdP computing system, a stable, non-PII unique identifier from the authentication artifact; and storing, by the third-party IdP computing system, a mapping between the extracted stable, non-PII unique identifier and the gUserID in a persistent memory of the third-party IdP computing system, wherein no other PII from the authentication artifact is stored in the persistent memory of the third-party IdP computing system.
  18. The method of claim 16, further comprising: generating, by the third-party IdP computing system, a second cryptographic key pair comprising a public key and a private key, and securely storing the private key of the second cryptographic key pair; prior to said transmitting the gUserID, digitally signing, by the third-party IdP computing system, the gUserID using the private key of the second cryptographic key pair; and verifying, by the RP computing system, a digital signature of the gUserID using the public key of the second cryptographic key pair.
  19. The method of claim 16, further comprising: maintaining, by the RP computing system, a user profile associated with the gUserID; and utilizing the user profile to provide a personalized user experience to the user, wherein the user profile is stored and utilized without any PII of the user.
  20. The method of claim 16, further comprising: receiving, at a third-party payment processor (PP) computing system, a request from the RP computing system to process a financial transaction for the user, wherein the third-party PP computing system is operated by a legal entity separate from an operator of the RP computing system; receiving, at the third-party PP computing system, financial PII from the user; generating, by the third-party PP computing system, a non-PII transaction code upon successful verification of the financial PII; encrypting, by the third-party PP computing system, the non-PII transaction code using a public key of a cryptographic key pair previously provided by the RP computing system; and transmitting, from the third-party PP computing system to the RP computing system, only the encrypted non-PII transaction code as confirmation of the financial transaction.
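The signing and verification arrangement recited in claims 3 and 18, and the payment-processor exchange recited in claims 5 and 20, can likewise be illustrated. Both sketches below are non-authoritative examples in Python using the `cryptography` package; the choice of Ed25519 for signing, RSA-OAEP for sealing the transaction code, and every variable name are assumptions made for illustration, as the claims do not fix particular algorithms.

```python
# Claims 3 and 18 (illustrative sketch): the IdP signs the gUserID with a
# long-lived second key pair, and the RP verifies the signature before
# associating any user-generated data with that identifier.
import uuid

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# IdP side: the private key of this second key pair is securely retained.
idp_signing_key = Ed25519PrivateKey.generate()
idp_verify_key = idp_signing_key.public_key()        # distributed to RPs

g_user_id = str(uuid.uuid4()).encode()               # PII-free identifier
signature = idp_signing_key.sign(g_user_id)          # sent alongside the gUserID

# RP side: accept the gUserID only if the signature checks out.
try:
    idp_verify_key.verify(signature, g_user_id)
    accepted = True
except InvalidSignature:
    accepted = False
```

```python
# Claims 5 and 20 (illustrative sketch): the RP provisions a public key to a
# third-party payment processor (PP); the PP verifies the user's financial PII
# itself and returns only an encrypted, non-PII transaction code to the RP.
import secrets

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# RP side: third key pair; only the public half is transmitted to the PP.
rp_private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
rp_public_key = rp_private_key.public_key()

# PP side: after verifying the financial PII out of band, mint a non-PII
# transaction code and seal it for the RP.
transaction_code = secrets.token_urlsafe(16).encode()   # contains no PII
sealed_code = rp_public_key.encrypt(transaction_code, oaep)

# RP side: the sealed code is the only confirmation artifact received; no
# financial PII ever reaches the RP.
assert rp_private_key.decrypt(sealed_code, oaep) == transaction_code
```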

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

1. This application claims the benefit of the following two provisional patent applications:
  a. Provisional Application No. 63/723,691, filed on Nov. 22, 2024, titled “PSEUDONYMOUS USER AI CORPORA, BUSINESS MODEL AND METHODS”.
  b. Provisional Application No. 63/872,467, filed on Aug. 29, 2025, titled “A System and Method for Enforcing PII Segregation in a Distributed Data Processing System for Privacy-Preserving AI Corpus Generation”.
2. This application is a continuation-in-part of U.S. patent application Ser. No. 19/315,444, filed on Aug. 29, 2025, the disclosure of which is incorporated herein by reference in its entirety.

FEDERALLY SPONSORED RESEARCH AND DEVELOPMENT

Not Applicable

REFERENCE TO A “SEQUENCE LISTING”, A TABLE, OR A COMPUTER PROGRAM LISTING APPENDIX SUBMITTED ON A COMPACT DISC AND AN INCORPORATION-BY-REFERENCE OF THE MATERIAL ON THE COMPACT DISC

Not Applicable

REFERENCES CITED

U.S. Pat. No. 8,281,149 B2: “Private user-controlled pseudonymous access to a relying party”
U.S. Patent Application Publication No. US 2022/0147654 A1: “Data anonymization”

The present invention relates generally to the field of computer data security and privacy. More specifically, it pertains to systems and methods for creating large-scale datasets for training artificial intelligence models while architecturally preventing the disclosure of personally identifiable information to the service providers collecting the data.

The advancement of digital services, particularly in areas like artificial intelligence (AI) and user personalization, is fundamentally dependent on the availability of vast and detailed datasets. AI models require massive, diverse, and often longitudinal data corpora for effective training, while personalized user experiences rely on the continuous tracking and analysis of individual user behavior over time. This dependency has created an inherent and unresolved conflict between technological utility and user privacy.

Conventional system architectures address this need by directly linking user-generated data and behavioral analytics to a user's Personally Identifiable Information (PII). A stable, PII-linked identifier (such as an email address or account name) is used as the primary key for aggregating a user's history, preferences, and interactions. While this approach enables powerful personalization and data aggregation, it comes at a severe cost to privacy. It exposes users to significant risks, including data breaches that can reveal sensitive personal information, the potential for unauthorized surveillance, and the chilling effect on user expression that comes with the knowledge of being constantly monitored and profiled.

The core of the problem lies in a fundamental architectural limitation of these conventional systems: the tight and often inseparable coupling of a persistent user identifier (which is essential for personalization) with the user's real-world identity (PII). This architecture forces users and service providers into a false dichotomy. Users must choose between surrendering their privacy to receive a rich, personalized experience or protecting their identity at the cost of receiving a generic, impersonal, and less useful service.
Consequently, there exists a significant and unmet need for a new technical framework that can fundamentally decouple these two concepts: a system that can support deep, persistent user modeling for advanced personalization and data aggregation while simultaneously making the user's real-world identity technically unknowable to the service provider.

PRIOR ART

The defense of this invention's novelty and non-obviousness rests on a unique combination of technical features that, when viewed as a whole, represent a significant departure from prior art solutions. The primary arguments against the most relevant fields of art are as follows:

1. Distinction from Federated Learning (FL). An examiner may cite prior art related to Federated Learning, as it is a known method for privacy-preserving AI.
  a. Argument: While both the present invention and Federated Learning (FL) address the problem of training AI models on sensitive data, they do so through fundamentally different and mutually exclusive technical architectures.
    i. FL follows a “model-to-the-data” approach. In FL, a global model is sent out to be trained on decentralized client devices where the raw data resides. Only the model updates (e.g., gradients) are sent back to a central server; the raw data is never centralized.
    ii. The present invention follows a “data-to-the-model” approach. It creates a centralized, cross-organizational corpus of pseudonymous data, which is then used to train an AI model in a more traditional manner.
  b. Inventive Distinction: This architectural divergence is not an obvious design choice but a specific technical trade-off that yields a different and