US-12626783-B2 - Computer-implemented method and apparatus for analysing genetic data

Abstract

The disclosure relates to analysing genetic data. In one arrangement, a method operates on input data comprising strengths of association between one or more phenotypes including a target phenotype and a plurality of genetic variants. A fine-mapping algorithm is applied to all or a subset of the input data to identify one or more independent phenotype-variant associations. A set of one or more fine-mapped variants is identified for each association. A fine-mapping predictive model is calculated on the basis of the input data and the set of fine-mapped variants. The effect on the target phenotype of the set of fine-mapped variants is subtracted from the input data to obtain residual association data. A machine learning algorithm is applied to the residual association data to identify further predictive correlations between the target phenotype and the plurality of genetic variants.

Inventors

Vincent Yann Marie PLAGNOL
Rachel Moore
Eva Maria Laura KRAPOHL
Christopher Charles Alan SPENCER

Assignees

GENOMICS LIMITED

Dates

Publication Date: 20260512
Application Date: 20200828
Priority Date: 20190828

Claims (20)

1 . A computer-implemented method of analysing genetic data about an organism to obtain information about the organism, the method comprising: receiving input data comprising strengths of association between one or more phenotypes including a target phenotype and a plurality of genetic variants in a region of interest of the genome of the organism; applying a fine-mapping algorithm to all or a subset of the input data to identify one or more independent phenotype-variant associations within the region of interest, comprising identifying for each association a set of one or more fine-mapped variants from the plurality of genetic variants, and determining for each fine-mapped variant an estimated probability of being causal for the phenotype-variant association, the sum of the probabilities for the fine-mapped variants within the set adding to one; generating, on the basis of the input data and the set of one or more fine-mapped variants, a fine-mapping predictive model quantifying an effect on the target phenotype of the set of one or more fine-mapped variants; subtracting from the input data, using the fine-mapping predictive model, the effect on the target phenotype of the set of one or more fine-mapped variants to obtain residual association data, wherein the subtracting comprises subtracting a weighted sum of effect sizes from an estimated effect size of each of the plurality of genetic variants on the target phenotype to obtain a residual effect size for each of the plurality of genetic variants, and wherein the residual association data comprises the residual effect size for each of the plurality of genetic variants; inputting, into a machine learning algorithm, at least the residual association data; outputting, by the machine learning algorithm, predicted weight values for non-fine mapped variants, wherein the predicted weight values indicate a significance assigned to the non-fine mapped variants based on residual signals, while accounting for correlation 
between the non-fine mapped variants, wherein the non-fine mapped variants are variants included in the plurality of genetic variants but are not identified by the fine-mapping algorithm as the one or more fine-mapped variants, and wherein the outputting comprises iterating through multiple selections of variants from the plurality of genetic variants and, as the variants are selected, estimating the residual signal for each of the variants based on the residual association data; generating a polygenic risk score model based on the fine-mapping predictive model and the predicted weight values for the non-fine mapped variants; and applying the polygenic risk score model to genetic data from an individual to determine a polygenic risk score for the individual for the target phenotype.
2 . The method of claim 1 , wherein the strengths of association comprise an estimated effect size of each of the plurality of genetic variants on the target phenotype, and a standard error of each of the estimated effect sizes.
3 . The method of claim 1 , wherein the step of receiving input data comprises: receiving individual level data comprising genotypes and corresponding phenotypes for each of a plurality of individuals; and determining using the individual level data an estimated effect size of each of the plurality of genetic variants on the target phenotype and a standard error of each of the estimated effect sizes.
4 . The method of claim 1 , wherein the identifying of the set of one or more fine-mapped variants is performed using an iterative method, wherein each iteration comprises: identifying, on the basis of the input data, a fine-mapped variant within the region of the genome different from any previously identified fine-mapped variant; updating the input data to account for the effect on the target phenotype of the fine-mapped variants already identified, using a matrix of correlations between the genetic variants within the region of the genome; and determining whether to perform a further iteration on the basis of the updated input data.
5 . The method of claim 1 , wherein the identifying of the set of one or more fine-mapped variants comprises using a plurality of instrument traits known to affect the target phenotype, the use of the instrument traits comprising: determining an initial set of fine-mapped variants for the instrument traits; and determining whether to include each fine-mapped variant of the initial set of fine-mapped variants for the instrument traits in the set of one or more fine-mapped variants for the target phenotype on the basis of a relationship between the plurality of instrument traits and the target phenotype.
6 . The method of claim 5 , wherein the generating of the fine-mapping predictive model comprises: determining effect sizes on the one or more instrument traits of the initial set of fine-mapped variants for the one or more directly causal instrument traits, and determining an effect size for the target phenotype of each fine-mapped variant of the initial set of fine-mapped variants for the instrument traits included in the set of one or more fine-mapped variants for the target phenotype on the basis of a predetermined relationship between effect sizes for the instrument traits and effect sizes for the target phenotype.
7 . The method of claim 1 , wherein the identifying of the set of one or more fine-mapped variants comprises identifying an initial set of fine-mapped variants for one or more directly causal instrument traits known to affect the target phenotype.
8 . The method of claim 1 , wherein: the strengths of association comprise an estimated effect size of each of the plurality of genetic variants on the target phenotype, and a standard error of each of the estimated effect sizes; and the fine-mapping predictive model comprises a fine-mapped effect size on the target phenotype for each of the fine-mapped variants, the fine-mapped effect size being calculated from the estimated effect size of the fine-mapped variants taking account of the estimated probability of the fine-mapped variants being causal for the phenotype-variant association.
9 . The method of claim 1 , wherein the effect on the target phenotype of the set of one or more fine-mapped variants is inferred using a machine learning algorithm.
10 . The method of claim 9 , wherein the set of one or more fine-mapped variants further comprises one or more variants known to have a high likelihood of being causal for the target phenotype.
11 . The method of claim 1 , wherein: the strengths of association comprise an estimated effect size of each of the plurality of genetic variants on the target phenotype, and a standard error of each of the estimated effect sizes; and the step of subtracting from the input data the effect on the target phenotype of the set of one or more fine-mapped variants comprises obtaining the residual effect size for each of a plurality of the genetic variants in the input data, the residual association data comprising the residual effect sizes, wherein, after appropriate renormalisation of the effect sizes to ensure equal variance, the residual effect size $\hat{\beta}_i$ for genetic variant $i$ is given by:
$$\hat{\beta}_i = \beta_i - \sum_{j=1}^{N} p_j \, r_{ij} \, \tilde{\beta}_j$$
where $\beta_i$ is the estimated marginal effect size of genetic variant $i$, $N$ is the number of fine-mapped variants, $p_j$ is the probability that variant $j$ is causal, $\tilde{\beta}_j$ is the fine-mapped effect size of the $j$th fine-mapped variant on the target phenotype, and $r_{ij}$ is a correlation between the $j$th fine-mapped variant and genetic variant $i$.
12 . The method of claim 1 , wherein the input data are derived from a plurality of different genetic studies, and the step of inputting into the machine learning algorithm comprises using a prior probability for each of the plurality of genetic variants of being causal for the target phenotype that is dependent on the consistency of the strength of association between each genetic variant and the target phenotype between the different genetic studies.
13 . The method of claim 1 , wherein the step of inputting into the machine learning algorithm comprises using a prior probability for each of the plurality of genetic variants of being causal for the target phenotype that is dependent on genomic annotations of the plurality of genetic variants in the region of interest.
14 . The method of claim 1 , wherein the step of applying the polygenic risk score model to genetic data from the individual further comprises applying the fine-mapping predictive model and the non-fine mapped variants identified by the machine learning algorithm.
15 . The method of claim 14 , wherein the polygenic risk score is given by the following weighted sum:
$$\mathrm{PRS} = \sum_{l=1}^{L} \alpha_l \, x_l$$
where $L$ is the number of variants that contribute to the PRS, each variant being included either in the fine-mapping predictive model or in the non-fine mapped variants from the machine learning algorithm, $\alpha_l$ quantifies a strength of association of variant $l$ on the target phenotype, the strength of association being specified by the fine-mapping predictive model or by the non-fine mapped variants from the machine learning algorithm, and $x_l$ is the genotype for variant $l$.
16 . The method of claim 14 , wherein the polygenic risk score for the individual is derived from a combination of a first partial polygenic risk score provided by applying the fine-mapping predictive model to genetic data from the individual and a second partial polygenic risk score provided by applying the non-fine mapped variants of the machine learning algorithm to the genetic data from the individual.
17 . The method of claim 1 , wherein the input data are derived from a plurality of different populations of the organism, and either or both of the following is satisfied: the generating of the fine-mapping predictive model is performed separately for portions of the input data corresponding to different populations to obtain multiple respective population-matched fine-mapping predictive models; and the inputting into the machine learning algorithm at least the residual association data is performed separately for portions of the input data corresponding to different populations to obtain multiple respective sets of population-matched further predictive correlations.
18 . The method of claim 17 , further comprising: receiving input data from an individual having genes from a mixture of the different populations; and generating a polygenic risk score for the individual by performing either or both of: matching each of multiple population-matched fine-mapping predictive models to a corresponding portion of the input data that matches the population of the population-matched fine-mapping predictive model and applying each matched fine-mapping predictive model to the corresponding portion of the input data; and matching each of multiple sets of population-matched further predictive correlations to a corresponding portion of the input data that matches the population of the set of population-matched further predictive correlations and applying each population-matched set of further predictive correlations to the corresponding portion of the input data.
19 . The method of claim 18 , wherein the matching of each of multiple sets of population-matched further predictive correlations is performed and the matching of each of multiple population-matched fine-mapping predictive models is not performed, the generation of the polygenic risk score further comprising applying a shared population-consistent fine-mapping predictive model to the input data from the individual.
20 . The method of claim 17 , further comprising: receiving input data from an individual having genes predominantly from one of the different populations; and generating a polygenic risk score for the individual by performing either or both of: applying a population-matched fine-mapping predictive model to all of the input data from the individual, the population-matched fine-mapping predictive model being matched to the population of the individual; and applying a set of population-matched further predictive correlations to all of the input data from the individual, the set of population-matched further predictive correlations being matched to the population of the individual.
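
By way of illustration only, the arithmetic recited in claims 11 and 15 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function and variable names are hypothetical, the effect sizes are assumed to have already been renormalised to equal variance, the correlation matrix is assumed to be supplied, and all numerical values (including the PRS weights) are made up for the example.

```python
from typing import List, Sequence


def residual_effect_sizes(
    beta: Sequence[float],           # marginal effect size beta_i for every variant i
    fine_mapped: Sequence[int],      # indices j of the fine-mapped variants
    p_causal: Sequence[float],       # p_j: probability that fine-mapped variant j is causal
    beta_fm: Sequence[float],        # fine-mapped effect size of each fine-mapped variant
    r: Sequence[Sequence[float]],    # r[i][j]: correlation between variants i and j
) -> List[float]:
    """Return beta_i - sum_j p_j * r_ij * beta_fm_j for every variant i."""
    residuals = []
    for i, b in enumerate(beta):
        # Portion of variant i's marginal signal attributed to the fine-mapped variants.
        explained = sum(
            p * r[i][j] * bt for j, p, bt in zip(fine_mapped, p_causal, beta_fm)
        )
        residuals.append(b - explained)
    return residuals


def polygenic_risk_score(alpha: Sequence[float], genotype: Sequence[float]) -> float:
    """Weighted sum PRS = sum_l alpha_l * x_l over the contributing variants."""
    if len(alpha) != len(genotype):
        raise ValueError("alpha and genotype must have the same length")
    return sum(a * x for a, x in zip(alpha, genotype))


# Hypothetical three-variant region; variants 0 and 1 form one fine-mapped set
# whose causal probabilities sum to one.
beta = [0.5, 0.45, 0.1]
r = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.1],
    [0.0, 0.1, 1.0],
]
residuals = residual_effect_sizes(
    beta, fine_mapped=[0, 1], p_causal=[0.7, 0.3], beta_fm=[0.5, 0.5], r=r
)
# residuals is approximately [0.03, 0.02, 0.085]

# Hypothetical per-variant weights and genotype (allele counts) for one individual.
prs = polygenic_risk_score(alpha=[0.5, 0.0, 0.08], genotype=[2, 1, 0])
```

In this sketch the residuals remain small for the two variants in the fine-mapped set and are nearly unchanged for the weakly correlated third variant, which is the signal left for the subsequent machine learning step to work on.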

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a U.S. National Stage of International Application No. PCT/GB2020/052060, filed Aug. 28, 2020, which claims priority to Great Britain Application No. 1912331.4, filed Aug. 28, 2019, which are hereby incorporated by reference in their entireties for all purposes.

The invention relates to analysing genetic and phenotype data about an organism to obtain information about the organism, particularly in the context of enabling improved polygenic risk scores (PRSs) to be obtained for phenotypes of interest.

A PRS is a quantitative summary of the contribution of an organism's inherited DNA to the phenotypes that it may exhibit. A PRS may include all DNA variants relevant (either directly or indirectly) to a phenotype of interest or may use its component parts if these are more relevant to a particular aspect of an organism's biology (including cells, tissues, or other biological units, mechanisms or processes). A PRS can be used directly, or as part of a plurality of measurements or records about the organism, to infer aspects of its past, current, and future biology.

In the context of improving human health and healthcare, PRSs have a range of practical uses, which include, but are not limited to: predicting the risk of developing a disease or phenotype, predicting age of onset of a phenotype, predicting disease severity, predicting disease subtype, predicting the response to treatment, selecting appropriate screening strategies for an individual, selecting appropriate medication interventions, and setting prior probabilities for other prediction algorithms. PRS may have direct use as a source of input in the application of artificial intelligence and machine learning approaches to making predictions or classifications from other high dimensional input data (for example imaging). They may be used to help train these algorithms, for example to identify predictive measurements based on non-genetic data.
As well as having utility in making predictive statements about an individual, they can also be used to identify cohorts of individuals, for purposes including but not limited to the above applications, by calculating the PRS for a large number of individuals, and then grouping individuals on the basis of the PRSs. PRSs can also aid in the selection of individuals for clinical trials, for example to optimise trial design by recruiting individuals more likely to develop the relevant disease or phenotypes, thereby enhancing the assessment of the efficacy of a new treatment. PRSs carry information about the individuals they are calculated for, and also about their relatives (who share a fraction of their inherited DNA).

Information about the impact of an individual's DNA on their phenotypes can derive from any relevant assessment of the potential impact of carrying any particular combination of DNA variants. In what follows we focus on the analysis of the recent wealth of information that derives from genetic association studies (GAS). These studies systematically assess the potential contribution of DNA variants to the genetic basis of a phenotype. Since the mid-2000s, GAS (typically genome-wide association studies: GWAS, or association studies targeting single variants, or variants in a region of the genome, or GWAS restricted to a particular region of the genome) have been conducted on many thousands of (largely human) phenotypes, in millions of individuals, generating billions of potential links between genotypes and phenotypes.

The resulting raw data is often then simplified to produce summary statistic data. GAS summary statistic data consists of, for each genetic variant (whether imputed or observed), the inferred effect size of the genetic variant on the phenotype of the GAS and the standard error of the inferred effect size.
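
As a hedged illustration of how raw data may be reduced to the two summary statistics just described, the following Python sketch estimates an effect size and its standard error for a single variant. It assumes a quantitative phenotype and a simple linear regression of phenotype on genotype dosage with no covariates; real association studies typically adjust for covariates and may use other models. The function name and the data are hypothetical.

```python
import math
from typing import Sequence, Tuple


def marginal_summary_stats(
    genotypes: Sequence[float], phenotype: Sequence[float]
) -> Tuple[float, float]:
    """Ordinary least squares slope and its standard error for one variant."""
    n = len(genotypes)
    if n != len(phenotype) or n < 3:
        raise ValueError("need at least 3 individuals with matching phenotype values")
    mean_x = sum(genotypes) / n
    mean_y = sum(phenotype) / n
    sxx = sum((x - mean_x) ** 2 for x in genotypes)
    if sxx == 0:
        raise ValueError("variant shows no variation in this sample")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(genotypes, phenotype))
    beta = sxy / sxx                       # inferred effect size per allele copy
    intercept = mean_y - beta * mean_x
    rss = sum((y - intercept - beta * x) ** 2 for x, y in zip(genotypes, phenotype))
    se = math.sqrt(rss / (n - 2) / sxx)    # standard error of the inferred effect size
    return beta, se


# Hypothetical data: allele counts (0, 1 or 2) and a quantitative trait for four individuals.
beta_hat, se_hat = marginal_summary_stats([0, 1, 2, 1], [0.0, 1.0, 2.5, 1.5])
```

Repeating this calculation for every variant yields one effect size and one standard error per variant, which is the form of summary statistic data referred to in the text.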
In other cases the individual level data, consisting of a full genetic profile of the individuals in a study and information about their phenotypes, may be available directly. However, individual level data is typically less widely available due to requirements on the privacy of an individual's data. In what follows, we refer to a phenotype as being synonymous with a single study. However, it is quite often the case that data are available from multiple different studies on the same or similar phenotypes, or from a single cohort from which multiple different phenotypes are measured. A PRS consists of the aggregation of the effects of a large number of genetic variants, typically each having small individual effects, to build an aggregate predictor for a trait of interest. Variants included in such a score can either be “causal variants”, in the sense that the variants directly affect a trait (weakly, but directly), or “tag variants”, which means that they are strongly correlated with other, unknown, variants that are causal, but that the tag variant itself does not have a direct effect on the phenotype. PRSs can be calculated using either individu