EP-4238099-B1 - METHOD OF ANONYMIZING GENOMIC DATA
Inventors
- HULSEN, TIM
- PLETEA, Daniel
Dates
- Publication Date
- 20260506
- Application Date
- 20211022
Claims (17)
- A computer-implemented method for anonymizing a genomic data set, the genomic data set comprising a plurality of alleles arranged in a plurality of single nucleotide polymorphisms (SNPs), the plurality of SNPs comprising one or more phenotype informative SNPs, a phenotype informative SNP being an SNP relating to a phenotypic trait, the genomic data set corresponding to a genome of a person, the method comprising: - receiving (410) the genomic data set; - obtaining (420) a phenotypic probability for at least one phenotype informative SNP, a phenotypic probability being a probability of the phenotypic trait being expressed as a result of the at least one allele corresponding to the at least one phenotype informative SNP, and a proportion of a population which exhibits said phenotypic trait; - computing (430) a re-identification risk score based on the genomic data set, the re-identification risk score indicating a risk of re-identifying the person associated with the genomic data set from the genomic data set, the re-identification risk score being computed from the obtained phenotypic probability and the obtained proportion of the population which exhibits said phenotypic trait; - comparing (440) the re-identification risk score to a threshold risk criterion; - if the re-identification risk score does not meet the threshold risk criterion: - anonymizing the genomic data set by: - selecting (450) a phenotype informative SNP corresponding to the phenotypic traits considered in the calculation of the re-identification risk score, and - masking (460) the selected phenotype informative SNP; and - re-computing (430) the re-identification risk score; - if the re-identification risk score meets the threshold risk criterion: - outputting (470) the anonymized genomic data set.
- The method of claim 1, wherein: - comparing (440) the re-identification risk score to the threshold risk criterion, - anonymizing (450, 460) the genomic data set, and - re-computing (430) the re-identification risk score, are repeated until the re-identification risk score meets the threshold risk criterion.
- The method of claim 1 or claim 2, further comprising encrypting the anonymized genomic data set.
- The method of any preceding claim, wherein computing (430) the re-identification risk score comprises: - for each of at least one phenotypic trait: - calculating a risk term of a phenotype informative SNP, the phenotype informative SNP relating to said phenotypic trait, the risk term being calculated from a genotypic frequency of the phenotype informative SNP and the phenotypic probability of said phenotypic trait associated with the at least one allele of the phenotype informative SNP, the genotypic frequency indicating a frequency of the at least one allele of the phenotype informative SNP in the population, and - obtaining a proportion of the population which exhibits said phenotypic trait; - computing the re-identification risk score from the calculated risk term of each of the at least one phenotypic trait and the proportion of the population obtained for each of the at least one phenotypic trait.
- The method of claim 4, wherein computing the re-identification risk score comprises: - for each of a plurality of phenotypic traits: - obtaining a proportion of the population exhibiting said phenotypic trait; - identifying at least one phenotype informative SNP relating to said phenotypic trait; - calculating a risk term for each of the identified at least one phenotypic SNP; - selecting the SNP having the largest risk term for said phenotypic trait; and - determining a contribution term for said phenotypic trait from the obtained proportion of the population exhibiting said phenotypic trait and the risk term of the selected SNP; and - determining an applicable population value from the contribution term for each of the plurality of phenotypic traits and the population; and - computing the re-identification risk score based on the applicable population value.
- The method of claim 4 or claim 5, wherein selecting (450) the phenotype informative SNP comprises selecting the SNP whose risk term is used to calculate a smallest contribution term, wherein the smallest contribution term of a phenotypic trait is defined as a proportion of the population exhibiting such phenotypic trait combined with a largest risk term (PT_r_max).
- The method of any preceding claim, wherein the one or more phenotype informative SNPs comprises a subset of SNPs having a priority indication, and wherein selecting (450) the SNP comprises selecting a phenotype informative SNP without a priority indication, wherein the priority indication indicates SNPs to be preserved in the anonymized genomic data set.
- The method of claim 7, wherein the subset of SNPs having the priority indication is identified by: - for each SNP of the one or more phenotype informative SNPs: - determining a distance using a genetic pathways network between the SNP and a prespecified SNP of interest; - if the determined distance is within a threshold distance, adding said SNP to the subset of SNPs having the priority indication.
- The method of any preceding claim, wherein masking the selected SNP comprises deleting a data entry in the genomic data set, the data entry representing the selected SNP.
- The method of any preceding claim, further comprising outputting the re-identification risk score.
- The method of any preceding claim, wherein computing the re-identification risk score comprises obtaining, from a database, statistical information regarding a dependency between multiple phenotypic traits, and applying a correction factor to the proportion of the population (PT_Pop) derived from the statistical information.
- The method of any preceding claim, further comprising: - identifying at least one direct identifier, a direct identifier being a SNP which independently identifies the person; and - masking the identified at least one direct identifier in the genomic data set.
- The method of any preceding claim, wherein the phenotypic trait comprises an exterior phenotypic trait.
- A computer-readable medium comprising transitory or non-transitory data representing instructions which, when executed by a processor system, cause the processor system to perform the computer-implemented method according to any one of claims 1 to 13.
- A system for anonymizing a genomic data set, the genomic data set comprising a plurality of alleles arranged in a plurality of single nucleotide polymorphisms (SNPs), the plurality of SNPs comprising one or more phenotype informative SNPs, a phenotype informative SNP being an SNP relating to a phenotypic trait, the genomic data set corresponding to a genome of a person, the system comprising: - an input/output subsystem (130) configured to: - receive the genomic data set; - obtain a phenotypic probability for at least one phenotype informative SNP, a phenotypic probability being a probability of the phenotypic trait being expressed as a result of the at least one allele corresponding to the at least one phenotype informative SNP, and a proportion of a population which exhibits said phenotypic trait; - a processor subsystem (110) configured to: - compute a re-identification risk score based on the genomic data set, the re-identification risk score indicating a risk of re-identifying the person associated with the genomic data set from the genomic data set, the re-identification risk score being computed from the obtained phenotypic probability and the obtained proportion of the population which exhibits said phenotypic trait; - compare the re-identification risk score to a threshold risk criterion; - if the re-identification risk score does not meet the threshold risk criterion: - anonymize the genomic data set by: - selecting a phenotype informative SNP corresponding to the phenotypic traits considered in the calculation of the re-identification risk score, and - masking the selected phenotype informative SNP; and - re-computing the re-identification risk score; - if the re-identification risk score meets the threshold risk criterion: - output, via the input/output subsystem (130), the anonymized genomic data set.
- The system for anonymizing a genomic data set as claimed in claim 15, wherein the processor subsystem (110) is configured to select (450) the phenotype informative SNP by selecting the SNP whose risk term yields a smallest contribution term, wherein the smallest contribution term of a phenotypic trait is defined as a proportion of the population exhibiting such phenotypic trait combined with a largest risk term (PT_r_max).
- The system for anonymizing a genomic data set as claimed in claim 15, wherein the processor subsystem (110) is configured to select the one or more phenotype informative SNPs as SNPs not having a priority indication, wherein the priority indication indicates SNPs to be preserved in the anonymized genomic data set.
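The loop recited in claims 1, 2, 5, 6, 7 and 9 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the claims do not fix any formula, so the contribution term `1 - r * (1 - PT_Pop)`, the score `1 / applicable population`, the use of the phenotypic probability alone as the risk term (the genotypic frequency of claim 4 is omitted), and the rule that the criterion is met when the score is at or below the threshold are all assumptions made here. The names `Snp`, `contribution_terms`, `risk_score` and `anonymize`, and all numeric values, are hypothetical.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Snp:
    snp_id: str             # hypothetical identifier
    trait: str              # phenotypic trait this SNP informs
    phenotypic_prob: float  # P(trait expressed | allele(s) at this SNP)
    priority: bool = False  # True = preserve in the output (claim 7)


def contribution_terms(snps, trait_proportion):
    """Per trait, return (contribution term, SNP with the largest risk term)."""
    best = {}
    for snp in snps:
        current = best.get(snp.trait)
        if current is None or snp.phenotypic_prob > current.phenotypic_prob:
            best[snp.trait] = snp
    # ASSUMED formula: with probability r the trait is revealed and the
    # candidate pool shrinks to PT_Pop; otherwise it does not shrink.
    return {
        trait: (1.0 - snp.phenotypic_prob * (1.0 - trait_proportion[trait]), snp)
        for trait, snp in best.items()
    }


def risk_score(snps, trait_proportion, population):
    """ASSUMED score: 1 / number of people still matching all revealed traits."""
    terms = contribution_terms(snps, trait_proportion)
    applicable = population * prod(c for c, _ in terms.values())
    return 1.0 / max(applicable, 1.0)


def anonymize(snps, trait_proportion, population, threshold):
    """Mask SNPs one at a time until the score is at or below the threshold.

    Returns (remaining SNPs, final score). If only priority SNPs are left,
    the loop stops and the caller must check the returned score.
    """
    remaining = list(snps)
    while True:
        score = risk_score(remaining, trait_proportion, population)
        if score <= threshold:
            return remaining, score
        maskable = [s for s in remaining if not s.priority]
        terms = contribution_terms(maskable, trait_proportion)
        if not terms:
            return remaining, score
        # Claim 6: pick the SNP behind the smallest contribution term.
        _, target = min(terms.values(), key=lambda pair: pair[0])
        remaining.remove(target)  # claim 9: masking by deleting the entry


# Hypothetical data: two eye-color SNPs and one hair-color SNP.
snps = [
    Snp("snp_a", "eye", 0.9),
    Snp("snp_b", "eye", 0.5),
    Snp("snp_c", "hair", 0.8),
]
proportions = {"eye": 0.1, "hair": 0.02}
kept, score = anonymize(snps, proportions, population=1_000_000, threshold=1e-5)
print([s.snp_id for s in kept])  # ['snp_b', 'snp_c']
```

In this example the initial applicable population is about 41,040 people, which fails the assumed threshold of one in 100,000. Masking `snp_a` raises the applicable population to about 118,800, so the loop stops after one iteration. Marking `snp_a` as a priority SNP would instead cause `snp_c` to be masked.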
Description
FIELD
The presently disclosed subject matter relates to a method for anonymizing a genomic data set and a corresponding system for anonymizing a genomic data set. The presently disclosed subject matter further relates to a computer-readable medium.
BACKGROUND
Whole genome sequencing is becoming steadily cheaper, and services like 23andMe and AncestryDNA offer to sequence hundreds of thousands of SNPs for prices around $100. However, as so much genomic information becomes available, concerns about privacy and security grow. Adversaries are increasingly able to combine genotypic and phenotypic information in a variety of ways to de-anonymize genomic databases. An identification attack, for example, is an attack in which the adversary attempts to identify the genotype (among multiple genotypes) that corresponds to a given phenotype. A further type of de-anonymization attack is the perfect matching attack, where the adversary attempts to match multiple phenotypes to their corresponding genotypes. Statistical models may also be used by an adversary to predict phenotypic traits based on whole-genome sequencing data. Because of current advancements in genomics, the risk of identification of a subject using their genomic data is growing rapidly.
Quasi-identifiers, also known as indirect identifiers, are fields in a dataset that can be used in combination with one another to identify individuals. Examples include gender, ZIP code, birth date, profession and income. While many people share the same gender, birth date or ZIP code, the combination of these for any one person may be unique, particularly if that person resides in a rural area with a small population. Indirect identifiers also include phenotypic traits, such as hair color and eye color, among many others.
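The way a combination of quasi-identifiers can single out one person may be illustrated with a short calculation. The sketch below is not taken from the disclosure: the function name `expected_matches`, the population size and the attribute shares are hypothetical, and the attributes are assumed to be statistically independent.

```python
from math import prod


def expected_matches(population, shares):
    """Expected number of people sharing every attribute value,
    assuming the attributes are independent (an assumption)."""
    return population * prod(shares)


# Hypothetical rural town of 2,000 people and hypothetical shares:
# one gender (0.5), one birth month (1/12), one profession (0.01).
print(expected_matches(2_000, [0.5, 1 / 12, 0.01]))
```

Each attribute on its own matches many residents, but the combination is expected to match fewer than one person, so a record carrying all three values is likely to be unique.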
Currently, whole genome sequences can be easily connected to phenotypic traits, making it possible to find out eye color, hair color, skin color, blood type, and the like, and subsequently identify the subject. As progress is made in genomic research, this problem will worsen. Often, users and researchers choose one of two options: keep all genomic information intact, thereby risking a privacy breach, or remove all potentially identifiable information from the dataset, which limits the usefulness of the data.
Published US patent application US 2020/0035332 A1 describes methods and systems for anonymizing genetic data. The methods and systems described therein identify ancestry identification marker (AIM) regions in the genetic data. The AIM regions of the genetic data include single nucleotide polymorphism (SNP) alleles associated with a population of patients belonging to a certain ancestry. AIM regions which do not contain gene variants associated with a specific disease may then be masked or removed from the genetic data.
A problem of the prior art is that there is no guarantee that the resulting genetic data is sufficiently anonymized. Merely masking or removing AIM regions without clinically relevant data may, in some cases, still result in a genetic data set from which the person can be re-identified. Moreover, the approach of the prior art involves removing data which may contribute in some as-yet unknown way to a particular disease, meaning that useful information may be lost. Removing more data from the genetic data set increases the risk of losing valuable and relevant information, thereby reducing the usefulness of the data, but preserving more data in the genetic data set increases the risk of the individual being re-identified from their genetic data set.
There is therefore an advantage to being able to ensure that the genetic data set is sufficiently anonymized whilst preserving as much information as possible for applications such as research. Quantifying a risk of re-identification, and ensuring that the risk that a person can be re-identified from an anonymized genomic data set is sufficiently low, can therefore improve patient privacy, security, and the amount of information available to researchers in an anonymized genomic data set.
Kale, Gulce et al.: "A utility maximizing and privacy preserving approach for protecting kinship in genomic databases", Bioinformatics, vol. 34, no. 2, 2018, teaches how a kinship metric can be used to remove some genomic information of individuals added to a large genomics data set if family members (having similar genomes) are already present in the database.
Humbert, Mathias et al.: "De-anonymizing Genomic Databases Using Phenotypic Traits", Proceedings on Privacy Enhancing Technologies, vol. 2015, no. 2, 1 June 2015, teaches re-identification attacks on genomic information based on externally visible phenotypic traits, information about which can be collected from other data sources such as social networks.
SUMMARY
It would be advantageous to preserve as much genomic data as possible for researchers to access, whilst also protecting the privacy and security of the individuals whose data