US-20260127415-A1 - SYSTEMS AND METHODS FOR UTILITY-PRESERVING PRIVATE ATTRIBUTE SUPPRESSION BASED ON STOCHASTIC DATA SUBSTITUTION

US 20260127415 A1

Abstract

A method may include: receiving, by a computer program executed by an electronic device, an original sample to process; extracting, by the computer program, a feature from the original sample using a trained neural network, wherein the neural network may be trained to extract features from samples; calculating, by the computer program, a probability of substituting the original sample with each sample of a plurality of samples in a dataset; substituting, by the computer program, the original sample with a sample in the dataset based on the calculated probability; and returning, by the computer program, the substituted sample, wherein sensitive attributes of the original sample cannot be inferred from the substituted sample, while useful attributes of the original sample may be inferred from the substituted sample.

Inventors

  • Yizhuo CHEN
  • Richard Chen
  • Shaohan Hu
  • Hsiang HSU

Assignees

  • JPMORGAN CHASE BANK, N.A.

Dates

Publication Date
2026-05-07
Application Date
2024-11-06

Claims (20)

  1. A method, comprising: receiving, by a computer program executed by an electronic device, a training dataset comprising a plurality of samples, wherein the training dataset comprises sensitive attributes and useful attributes, and each sample comprises a plurality of samples; drawing, by the computer program, a subset of the plurality of samples from the training dataset as a substitute dataset; simultaneously training, by the computer program, a learnable embedding for each sample in the substitute dataset and a neural network to extract a feature from each sample in the training dataset, wherein the neural network and the learnable embedding are trained using a loss function; and calculating, by the computer program, a probability distribution that is parameterized by the trained neural network using a cosine similarity between the feature for each sample and the learnable embedding for a substitute sample for that sample.
  2. The method of claim 1, wherein the plurality of samples comprises images.
  3. The method of claim 1, wherein the plurality of samples comprises audio.
  4. The method of claim 1, wherein the loss function comprises a first loss term associated with suppressing each sensitive attribute in the training dataset, a second loss term associated with protecting each useful attribute in the training dataset, and a third loss term associated with preserving unannotated useful attributes in the training dataset.
  5. The method of claim 4, wherein the first loss term maximizes a conditional entropy of a substitute sample given a sensitive attribute, the second loss term minimizes a cross-entropy between one of the useful attributes and a substitute useful attribute, and the third loss term minimizes a conditional entropy of a substitution probability distribution.
  6. A method, comprising: receiving, by a computer program executed by an electronic device, an original sample to process; extracting, by the computer program, a feature from the original sample using a trained neural network, wherein the neural network is trained to extract features from samples; calculating, by the computer program, a probability of substituting the original sample with each sample of a plurality of samples in a dataset; substituting, by the computer program, the original sample with a sample in the dataset based on the calculated probability; and returning, by the computer program, the substituted sample, wherein sensitive attributes of the original sample cannot be inferred from the substituted sample, while useful attributes of the original sample are inferred from the substituted sample.
  7. The method of claim 6, wherein the plurality of samples comprises images.
  8. The method of claim 6, wherein the plurality of samples comprises audio.
  9. The method of claim 6, wherein unannotated useful attributes of the original sample are inferred from the substituted sample.
  10. The method of claim 6, wherein the step of calculating, by the computer program, a probability of substituting the original sample with each sample in the dataset uses a substitution probability distribution.
  11. The method of claim 6, wherein the neural network is trained with a loss function.
  12. The method of claim 11, wherein the loss function comprises a first loss term associated with suppressing each sensitive attribute in the dataset, a second loss term associated with protecting each useful attribute in the dataset, and a third loss term associated with preserving unannotated useful attributes in the dataset.
  13. The method of claim 12, wherein the first loss term maximizes a conditional entropy of a substitute sample given a sensitive attribute, the second loss term minimizes a cross-entropy between one of the useful attributes and a substitute useful attribute, and the third loss term minimizes a conditional entropy of a substitution probability distribution.
  14. The method of claim 7, wherein the dataset is a subset of a training dataset on which the neural network is trained.
  15. A non-transitory computer readable storage medium, including instructions stored thereon, which when read and executed by one or more computer processors, cause the one or more computer processors to perform steps comprising: receiving a training dataset comprising a plurality of samples, wherein the training dataset comprises sensitive attributes and useful attributes, and each sample comprises a plurality of samples; drawing a subset of the plurality of samples from the training dataset as a substitute dataset; training a learnable embedding for each sample in the substitute dataset, and a neural network to extract a feature from each sample in the training dataset, wherein the neural network and the learnable embedding are trained using a loss function; calculating a probability distribution that is parameterized by the trained neural network using a cosine similarity between the feature for each sample and the learnable embedding for a substitute sample for that sample; receiving an original sample to process; extracting a feature from the original sample using the trained neural network; calculating a probability of substituting the original sample with each sample in the substitute dataset; substituting the original sample with a sample in the substitute dataset based on the calculated probability; and returning the substituted sample, wherein the sensitive attributes of the original sample cannot be inferred from the substituted sample, while the useful attributes of the original sample are inferred from the substituted sample.
  16. The non-transitory computer readable storage medium of claim 15, wherein the plurality of samples comprises images.
  17. The non-transitory computer readable storage medium of claim 15, wherein the plurality of samples comprises audio.
  18. The non-transitory computer readable storage medium of claim 15, wherein the calculating uses a substitution probability distribution.
  19. The non-transitory computer readable storage medium of claim 15, wherein unannotated useful attributes of the original sample are inferred from the substituted sample.
  20. The non-transitory computer readable storage medium of claim 15, wherein the loss function comprises a first loss term associated with suppressing each sensitive attribute in the training dataset, a second loss term associated with protecting each useful attribute in the training dataset, and a third loss term associated with preserving unannotated useful attributes in the training dataset, wherein the first loss term maximizes a conditional entropy of a substitute sample given a sensitive attribute, the second loss term minimizes a cross-entropy between one of the useful attributes and a substitute useful attribute, and the third loss term minimizes a conditional entropy of a substitution probability distribution.

Description

BACKGROUND OF THE INVENTION

1. Field of the Invention

Embodiments generally relate to systems and methods for utility-preserving private attribute suppression based on stochastic data substitution.

2. Description of the Related Art

The growth of modern machine learning (ML) services has made data sharing increasingly common. Typically, ML service providers first collect data from users through various sensors and then analyze the data with a model to offer specific services to the user. The collected data, however, often contains sensitive or private information that users do not want to share with the service providers. For instance, a human voice recognition system may necessitate the collection of users' voice recordings, which could inadvertently expose sensitive information such as the users' gender or accent.

SUMMARY OF THE INVENTION

Systems and methods for protecting private attributes using data-level grouping and randomization are disclosed. According to an embodiment, a method may include: receiving, by a computer program executed by an electronic device, a training dataset comprising a plurality of samples, wherein the training dataset may include sensitive attributes and useful attributes, and each sample may include a plurality of samples; drawing, by the computer program, a subset of the plurality of samples from the training dataset as a substitute dataset; training, by the computer program, a learnable embedding for each sample in the substitute dataset, and a neural network to extract a feature from each sample in the training dataset, wherein the neural network and the learnable embedding may be trained using a loss function; and calculating, by the computer program, a probability distribution that may be parameterized by the trained neural network using a cosine similarity between the feature for each sample and the learnable embedding for a substitute sample for that sample.
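The substitution distribution described above may be sketched as a softmax over cosine similarities between an extracted feature and the learnable embeddings of the substitute dataset. The function and parameter names below are illustrative assumptions, not part of the specification, and the softmax form is one plausible way to parameterize a probability distribution from similarity scores:

```python
import numpy as np

def substitution_distribution(feature, embeddings, temperature=1.0):
    # feature: extracted feature for one sample, shape (d,)
    # embeddings: learnable embeddings for the substitute dataset, shape (m, d)
    # Names and the temperature parameter are illustrative assumptions.
    f = feature / np.linalg.norm(feature)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    logits = (e @ f) / temperature   # one cosine similarity per substitute sample
    logits -= logits.max()           # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()               # valid probability distribution over substitutes
```

A lower temperature would concentrate the distribution on the most similar substitute, while a higher temperature would spread probability mass more evenly.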
In one embodiment, the plurality of samples may include images and/or audio. In one embodiment, the loss function may include a first loss term associated with suppressing each sensitive attribute in the training dataset, a second loss term associated with protecting each useful attribute in the training dataset, and a third loss term associated with preserving unannotated useful attributes in the training dataset. The first loss term maximizes a conditional entropy of a substitute sample given a sensitive attribute, the second loss term minimizes a cross-entropy between one of the useful attributes and a substitute useful attribute, and the third loss term minimizes a conditional entropy of a substitution probability distribution. According to another embodiment, a method may include: receiving, by a computer program executed by an electronic device, an original sample to process; extracting, by the computer program, a feature from the original sample using a trained neural network, wherein the neural network may be trained to extract features from samples; calculating, by the computer program, a probability of substituting the original sample with each sample of a plurality of samples in a dataset; substituting, by the computer program, the original sample with a sample in the dataset based on the calculated probability; and returning, by the computer program, the substituted sample, wherein sensitive attributes of the original sample cannot be inferred from the substituted sample, while useful attributes of the original sample may be inferred from the substituted sample. In one embodiment, the plurality of samples may include images and/or audio. In one embodiment, unannotated useful attributes of the original sample may be inferred from the substituted sample. In one embodiment, the step of calculating, by the computer program, a probability of substituting the original sample with each sample in the dataset uses a substitution probability distribution.
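The inference-time steps of this embodiment (extract a feature, compute substitution probabilities, draw a replacement) may be sketched as follows. All names are illustrative assumptions; the exponentiated cosine similarity is one plausible realization of the substitution probability distribution, and the feature extractor is passed in as a plain function standing in for the trained neural network:

```python
import numpy as np

def substitute_sample(original_sample, extract_feature, substitute_dataset,
                      embeddings, rng):
    # extract_feature: stand-in for the trained neural network (assumption).
    # substitute_dataset: array of candidate replacement samples, shape (m, d).
    # embeddings: learnable embedding per substitute sample, shape (m, k).
    f = extract_feature(original_sample)
    f = f / np.linalg.norm(f)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    p = np.exp(e @ f)                 # unnormalized substitution probabilities
    p /= p.sum()
    idx = rng.choice(len(substitute_dataset), p=p)  # stochastic substitution
    return substitute_dataset[idx]    # returned in place of the original
```

Because the returned sample is drawn from a fixed substitute pool rather than transformed from the original, attributes of the original that do not influence the substitution probabilities are, by construction, not recoverable from the output.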
In one embodiment, the neural network may be trained with a loss function. The loss function may include a first loss term associated with suppressing each sensitive attribute in the dataset, a second loss term associated with protecting each useful attribute in the dataset, and a third loss term associated with preserving unannotated useful attributes in the dataset. The first loss term maximizes a conditional entropy of a substitute sample given a sensitive attribute, the second loss term minimizes a cross-entropy between one of the useful attributes and a substitute useful attribute, and the third loss term minimizes a conditional entropy of a substitution probability distribution. In one embodiment, the dataset may be a subset of a training dataset on which the neural network may be trained. According to another embodiment, a non-transitory computer readable storage medium may include instructions stored thereon, which when read and executed by one or more computer processors, cause the one or more computer processors to perform step
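The three-term loss may be sketched as below. The weights, the group-averaged approximation of the conditional entropy given the sensitive attribute, and all names are illustrative assumptions rather than the specification's exact formulation; the sketch only mirrors the stated directions (maximize the first entropy, minimize the cross-entropy and the substitution-distribution entropy):

```python
import numpy as np

def entropy(p, eps=1e-12):
    # Shannon entropy along the last axis; eps guards log(0).
    return -np.sum(p * np.log(p + eps), axis=-1)

def three_term_loss(sub_probs, sensitive, useful_logits, useful_labels,
                    alpha=1.0, beta=1.0, gamma=1.0):
    # sub_probs: (n, m) substitution distribution per training sample.
    # sensitive: (n,) sensitive-attribute labels; useful_labels: (n,) useful labels.
    # alpha/beta/gamma are hypothetical trade-off weights.

    # Term 1 (maximized, hence negated): conditional entropy of the substitute
    # given the sensitive attribute, approximated by the entropy of the
    # group-averaged substitution distribution within each sensitive class.
    h_cond = np.mean([entropy(sub_probs[sensitive == s].mean(axis=0))
                      for s in np.unique(sensitive)])

    # Term 2: cross-entropy between useful-attribute predictions for the
    # substitute and the original useful labels (log-softmax of logits).
    logp = useful_logits - np.log(np.exp(useful_logits).sum(axis=1, keepdims=True))
    ce = -logp[np.arange(len(useful_labels)), useful_labels].mean()

    # Term 3 (minimized): entropy of each sample's own substitution
    # distribution, encouraging confident substitutions that retain
    # unannotated useful attributes.
    h_sub = entropy(sub_probs).mean()

    return -alpha * h_cond + beta * ce + gamma * h_sub
```

Minimizing this combined objective pushes substitutions to look alike across sensitive groups (term 1) while keeping the useful attribute predictable (term 2) and each sample's substitution nearly deterministic (term 3).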