US-12626780-B2 - Method and system for selecting, managing, and analyzing data of high dimensionality

Abstract

A system, method and computer program product for analyzing data of high dimensionality (e.g., sequence reads of nucleic acid samples in connection with a disease condition) are provided.

Inventors

Darya Filippova
Anton Valouev
Virgil Nicula
Karthik Jagadeesh
M. Cyrus Maher
Matthew H. Larson
Monica Portela dos Santos Pimentel
Robert Abe Paine Calef

Assignees

Grail, Inc.

Dates

Publication Date: 2026-05-12
Application Date: 2019-03-13

Claims (17)

1. A method of analyzing sequence reads of nucleic acid samples in connection with a cancer condition, comprising: receiving, at a data collection component of a computer system, a first set of sequence reads of the nucleic acid samples from each healthy subject in a reference group of healthy subjects, wherein the nucleic acid samples comprise a set of cell-free DNA (cfDNA) fragments and wherein the first set of sequence reads is derived from a DNA sequencing process and wherein the first set of sequence reads are not known to contain one or more mutations indicative of the cancer condition; aligning, using a processor of the computer system, each sequence read in the first set of sequence reads to regions in a reference genome; establishing, using the processor and subsequent to the aligning, a low variability filter, wherein the establishing comprises: identifying, using the processor, numbers of sequence reads aligned to each of the regions in the reference genome across each healthy subject in the reference group of healthy subjects; deriving, using the processor and based on the identifying, quantity data associated with each of the regions based on the numbers of sequence reads; calibrating, using the processor, the quantity data for each of the regions; computing, using the processor, reference quantities for each of the regions based on the calibrated quantity data, wherein the reference quantities comprise at least a first reference quantity and a second reference quantity, wherein the first reference quantity corresponds to an average of the calibrated quantity data and wherein the second reference quantity is a standard deviation of the calibrated quantity data; determining, using the processor, a difference between the first reference quantity and the second reference quantity for each of the regions; classifying each of the regions as a high variability region or a low variability region by comparing the difference against a predetermined threshold, wherein the high variability region is defined by the difference being greater than the predetermined threshold and wherein the low variability region is defined by the difference being less than the predetermined threshold; receiving, at the data collection component, a training set of sequence reads from subjects in a training group, wherein the training set includes a second set of sequence reads of nucleic acid samples from healthy subjects and a third set of sequence reads of nucleic acid samples from cancer subjects who are known to have the cancer condition, wherein the second set of sequence reads of nucleic acid samples are not known to contain the one or more mutations indicative of the cancer condition and wherein the third set of sequence reads of nucleic acid samples do contain the one or more mutations indicative of the cancer condition and wherein the training set of sequence reads from the subjects in the training group comprise cfDNA fragments of varying length; applying, using the processor, the established low variability filter to the training set of sequence reads; generating, using the processor and based on the applying, a filtered training set of sequence reads, wherein the generating the filtered training set of sequence reads comprises discarding one or more sequence reads in the training set of sequence reads that are aligned to the high variability region in the reference genome; training, using the processor, a machine learning model on the filtered training set of sequence reads; applying, using the processor, a predictive capability of the trained machine learning model to identify differences between the second set of sequence reads of nucleic acid samples from the healthy subjects and the third set of sequence reads of nucleic acid samples from the cancer subjects.
2. The method of claim 1, wherein the cancer condition is a cancer type selected from the group consisting of lung cancer, ovarian cancer, kidney cancer, bladder cancer, hepato-biliary cancer, pancreatic cancer, upper gastrointestinal cancer, sarcoma, breast cancer, liver cancer, prostate cancer, brain cancer, and combinations thereof.
3. The method of claim 1, further comprising: performing initial data processing of the first set of sequence reads of nucleic acid samples from each healthy subject in the reference group of healthy subjects based on a fourth set of sequence reads of nucleic acid samples from a baseline group of healthy subjects, wherein the reference group and the baseline group do not overlap, and wherein the initial data processing comprises correction of GC biases or normalization of numbers of sequence reads that align to regions of the reference genome.
4. The method of claim 1, further comprising: performing initial data processing of the sequence reads of nucleic acid samples from each subject in the training group based on a fourth set of sequence reads of nucleic acid samples from a baseline group of healthy subjects, wherein the baseline group and the training group do not overlap, and wherein the initial data processing comprises correction of GC biases or normalization of numbers of sequence reads aligned to regions of the reference genome.
5. The method of claim 1, wherein the quantity data consists of one quantity corresponding to a total number of sequence reads that align to the low variability region.
6. The method of claim 1, wherein the quantity data comprises multiple quantities each corresponding to a subset of the sequence reads that align to the low variability region, wherein each sequence read within a same subset corresponds to nucleic acid samples having a same predetermined fragment size or size range, wherein sequence reads in different subsets correspond to nucleic acid samples having a different fragment size or size range.
7. The method of claim 1, wherein the one or more parameters are determined by principal component analysis (PCA).
8. The method of claim 1, further comprising: refining the one or more parameters in a multi-fold cross-validation process by dividing the second filtered training set of sequence reads into a filtered training subset and a filtered validation subset.
9. The method of claim 8, wherein the filtered training and validation subsets in one fold of the multi-fold cross-validation process are different from another filtered training and validation subset in another fold of the multi-fold cross-validation process.
10. The method of claim 1, wherein the high variability region in the reference genome corresponds to a plurality of regions in the reference genome that exhibit variability above the predetermined threshold and wherein each of the plurality of regions has the same size.
11. The method of claim 1, wherein the high variability region in the reference genome includes a plurality of high variability regions that correspond to a plurality of regions in the reference genome that exhibit variability above the predetermined threshold and wherein each of the plurality of regions do not have the same size.
12. The method of claim 1, wherein the one or more parameters are determined based on a subset of the training set of sequence reads.
13. The method of claim 1, wherein the nucleic acid samples from the subjects in the training group comprise cfDNA fragments that are longer than the predetermined threshold length, wherein the predetermined threshold length is less than 160 nucleotides.
14. The method of claim 13, wherein the predetermined threshold length is 140 nucleotides or less.
15. The method of claim 13, wherein the sequence reads in the training set includes sequence reads of cfDNA fragments in the nucleic acid samples from the subjects in the training group having a length falling between a second threshold length and a third threshold length, wherein: the second threshold length is from 240 to 260 nucleotides, and the third threshold length is from 290 nucleotides to 310 nucleotides.
16. A computer system comprising: one or more processors; and a non-transitory computer-readable medium including one or more sequences of instructions that, when executed by the one or more processors, cause the processors to receive, at a data collection component of the computer system, a first set of sequence reads of nucleic acid samples from each healthy subject in a reference group of healthy subjects, wherein the nucleic acid samples comprise cell-free DNA (cfDNA) fragments and wherein the first set of sequence reads is derived from a DNA sequencing process and wherein the first set of sequence reads are not known to contain one or more mutations indicative of a cancer condition; align, using the one or more processors, each sequence read in the first set of sequence reads to regions in a reference genome; establish, using the one or more processors and subsequent to the aligning, a low variability filter, wherein the instructions to establish comprise instructions to: identify, using the one or more processors, numbers of sequence reads aligned to each of the regions in the reference genome across each healthy subject in the reference group of healthy subjects; derive, using the one or more processors and based on the identifying, quantity data associated with each of the regions based on the numbers of sequence reads; calibrate, using the one or more processors, the quantity data for each of the regions; compute, using the one or more processors, reference quantities for each of the regions based on the calibrated quantity data, wherein the reference quantities comprise at least a first reference quantity and a second reference quantity, wherein the first reference quantity corresponds to an average of the calibrated quantity data and wherein the second reference quantity is a standard deviation of the calibrated quantity data; determine, using the one or more processors, a difference between the first reference quantity and the second reference quantity for each of the regions; classify, using the one or more processors, each of the regions as a high variability region or a low variability region by comparing the difference against a predetermined threshold, wherein the high variability region is defined by the difference being greater than the predetermined threshold and wherein the low variability region is defined by the difference being less than the predetermined threshold; receive, at the data collection component, a training set of sequence reads from subjects in a training group, wherein the training set includes a second set of sequence reads of nucleic acid samples from healthy subjects and a third set of sequence reads of nucleic acid samples from cancer subjects who are known to have a cancer condition, wherein the second set of sequence reads of nucleic acid samples are not known to contain the one or more mutations indicative of the cancer condition and wherein the third set of sequence reads of nucleic acid samples do contain the one or more mutations indicative of the cancer condition and wherein the training set of sequence reads from the subjects in the training group comprise cfDNA fragments of varying length; apply, using the one or more processors, the established low variability filter to the training set of sequence reads; generate, using the one or more processors and based on the applying, a filtered training set of sequence reads, wherein the instructions to generate the filtered training set of sequence reads comprise instructions to discard one or more sequence reads in the filtered training set of sequence reads that are aligned to the high variability region in the reference genome; training, using the processor, a machine learning model on the filtered training set of sequence reads; applying, using the processor, a predictive capability of the trained machine learning model to identify differences between the second set of sequence reads of nucleic acid samples from the healthy subjects and the third set of sequence reads of nucleic acid samples from the cancer subjects.
17. A non-transitory computer-readable storage medium having stored thereon program code instructions that, when executed by a processor of a computer system, cause the computer system to perform a method comprising: receiving, at a data collection component of a computer system, a first set of sequence reads of nucleic acid samples from each healthy subject in a reference group of healthy subjects, wherein the nucleic acid samples comprise a first set of cell-free DNA (cfDNA) fragments and wherein the first set of sequence reads is derived from a DNA sequencing process and wherein the first set of sequence reads are not known to contain one or more mutations indicative of a cancer condition; aligning, using a processor of the computer system, each sequence read in the first set of sequence reads to regions in a reference genome; establishing, using the processor and subsequent to the aligning, a low variability filter, wherein the establishing comprises: identifying, using the processor, numbers of sequence reads aligned to each of the regions in the reference genome across each healthy subject in the reference group of healthy subjects; deriving, using the processor and based on the identifying, quantity data associated with each of the regions based on the numbers of sequence reads; calibrating, using the processor, the quantity data for each of the regions; computing, using the processor, reference quantities for each of the regions based on the calibrated quantity data, wherein the reference quantities comprise at least a first reference quantity and a second reference quantity, wherein the first reference quantity corresponds to an average of the calibrated quantity data and wherein the second reference quantity is a standard deviation of the calibrated quantity data; determining, using the processor, a difference between the first reference quantity and the second reference quantity for each of the regions; classifying each of the regions as a high variability region or a low variability region by comparing the difference against a predetermined threshold, wherein the high variability region is defined by the difference being greater than the predetermined threshold and wherein the low variability region is defined by the difference being less than the predetermined threshold; receiving, at the data collection component, a training set of sequence reads from subjects in a training group, wherein the training set includes a second set of sequence reads of nucleic acid samples from healthy subjects and a third set of sequence reads of nucleic acid samples from cancer subjects who are known to have a cancer condition, wherein the second set of sequence reads of nucleic acid samples are not known to contain the one or more mutations indicative of the cancer condition and wherein the third set of sequence reads of nucleic acid samples do contain the one or more mutations indicative of the cancer condition and wherein the training set of sequence reads from the subjects in the training group comprise cfDNA fragments of varying length; applying, using the processor, the established low variability filter to the training set of sequence reads; generating, using the processor and based on the applying, a filtered training set of sequence reads, wherein the generating the filtered training set of sequence reads comprises discarding one or more sequence reads in the filtered training set of sequence reads that are aligned to the high variability region in the reference genome; training, using the processor, a machine learning model on the filtered training set of sequence reads; applying, using the processor, a predictive capability of the trained machine learning model to identify differences between the second set of sequence reads of nucleic acid samples from the healthy subjects and the third set of sequence reads of nucleic acid samples from the cancer subjects.
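
The low variability filter recited in the independent claims (compute an average and a standard deviation of the calibrated quantity data per region, take their difference, compare it against a predetermined threshold, then discard training reads aligned to high variability regions) can be illustrated with a minimal sketch. This is not the patented implementation. The function names, the dictionary layout, the bin names, and the threshold value are hypothetical. The claims do not state the sign of the difference, the form of the standard deviation, or how a difference equal to the threshold is treated; the sketch assumes average minus population standard deviation and labels equality as low variability.

```python
from statistics import mean, pstdev


def classify_regions(calibrated, threshold):
    """Label each region "high" or "low" variability.

    calibrated: dict mapping a region name to a list of calibrated
    quantities, one per healthy reference subject (hypothetical layout).
    """
    labels = {}
    for region, values in calibrated.items():
        first_ref = mean(values)     # average of the calibrated quantity data
        second_ref = pstdev(values)  # standard deviation (population form: assumption)
        difference = first_ref - second_ref  # sign convention: assumption
        # Claim wording: greater than the threshold -> high variability,
        # less than the threshold -> low variability. Equality is not
        # addressed in the claims; this sketch treats it as low.
        labels[region] = "high" if difference > threshold else "low"
    return labels


def apply_low_variability_filter(reads, labels):
    """Discard reads aligned to high variability regions.

    reads: list of (read_id, region) pairs. Reads in regions that were
    never classified are also dropped here (assumption).
    """
    return [(rid, region) for rid, region in reads if labels.get(region) == "low"]


# Toy data: two regions, three healthy reference subjects each.
calibrated = {
    "bin_a": [1.0, 1.2, 0.8],
    "bin_b": [4.0, 6.0, 5.0],
}
labels = classify_regions(calibrated, threshold=2.0)
training_reads = [("r1", "bin_a"), ("r2", "bin_b"), ("r3", "bin_a")]
filtered = apply_low_variability_filter(training_reads, labels)
```

In this toy run the filter keeps only the reads aligned to `bin_a`; the filtered list is what a downstream machine learning model would be trained on.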

Description

CROSS-REFERENCE TO RELATED APPLICATION

The present application claims the benefit of U.S. Provisional Application No. 62/642,461 filed Mar. 13, 2018, which is expressly incorporated herein by reference in its entirety for all purposes.

TECHNICAL FIELD

Disclosed herein are methods, systems and computing program products for selecting and analyzing biological data of high dimensionality, in particular, nucleic acid sequencing data obtained using next-generation sequencing technologies.

BACKGROUND

Modern development in biology, especially next-generation sequencing technologies, has generated vast amounts of data. Combing through the data for useful and helpful information, however, remains a big challenge, especially when such useful and helpful information is needed for disease diagnosis and prognosis. For example, the human genome includes over 3 billion base pairs of nucleic acid sequences. Although it is possible to obtain sequence reads of an entire human genome, much of the sequencing data encode information that is irrelevant to disease diagnosis and prognosis. Ways of processing big data are needed in order to efficiently and accurately derive useful and relevant information.

SUMMARY

In one aspect, disclosed herein is a method of analyzing sequence reads of nucleic acid samples in connection with a disease condition.
As disclosed herein, the method can comprise: identifying regions of low variability in a reference genome based on a first set of sequence reads of nucleic acid samples from each healthy subject in a reference group of healthy subjects, wherein each sequence read in the first set of sequence reads of nucleic acid samples from each healthy subject can be aligned to a region in the reference genome; selecting a training set of sequence reads from sequence reads of nucleic acid samples from subjects in a training group, wherein each sequence read in the training set aligns to a region in the regions of low variability in the reference genome, wherein the training set includes sequence reads of nucleic acid samples from healthy subjects and sequence reads of nucleic acid samples from diseased subjects who are known to have the disease condition, and wherein the nucleic acid samples from the training group are of a type that is the same as or similar to that of the nucleic acid samples from the reference group of healthy subjects; determining, using quantities derived from sequence reads of the training set, one or more parameters that reflect differences between sequence reads of nucleic acid samples from the healthy subjects and sequence reads of nucleic acid samples from the diseased subjects within the training group; receiving a test set of sequence reads associated with nucleic acid samples from a test subject whose status with respect to the disease condition is unknown; and predicting a likelihood of the test subject having the disease condition based on the one or more parameters. In some embodiments, the nucleic acid samples comprise cell-free nucleic acid (cfNA) fragments. In some embodiments, the disease condition is cancer. 
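
The last three steps of the method above (determining parameters from the training set, receiving a test set, and predicting a likelihood) can be illustrated with a toy example. The disclosure does not specify the model at this point, so the sketch substitutes a simple class-centroid linear score passed through a logistic function; this is a stand-in chosen for brevity, not the method of the disclosure. The function names, the feature layout (one quantity per low variability region), and the numbers are hypothetical, and the returned value is an uncalibrated score in (0, 1) rather than a validated probability.

```python
import math


def fit_parameters(healthy, diseased):
    """Derive weights and a bias from per-subject feature vectors.

    healthy, diseased: lists of equal-length lists, one vector per
    training subject, e.g. read counts per low variability region
    (hypothetical layout).
    """
    n = len(healthy[0])
    mu_h = [sum(v[i] for v in healthy) / len(healthy) for i in range(n)]
    mu_d = [sum(v[i] for v in diseased) / len(diseased) for i in range(n)]
    # Weights point from the healthy centroid toward the diseased centroid.
    weights = [d - h for d, h in zip(mu_d, mu_h)]
    # The bias places the decision boundary at the midpoint of the centroids.
    midpoint = [(d + h) / 2 for d, h in zip(mu_d, mu_h)]
    bias = -sum(w * m for w, m in zip(weights, midpoint))
    return weights, bias


def predict_likelihood(features, weights, bias):
    """Map a test subject's features to an uncalibrated score in (0, 1)."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-score))


# Toy training group: two features per subject.
healthy = [[1.0, 1.0], [1.0, 1.2]]
diseased = [[2.0, 1.0], [2.2, 1.2]]
weights, bias = fit_parameters(healthy, diseased)

p_like_diseased = predict_likelihood([2.1, 1.1], weights, bias)
p_like_healthy = predict_likelihood([1.0, 1.1], weights, bias)
```

A test subject whose quantities resemble the diseased centroid scores above 0.5 and one resembling the healthy centroid scores below 0.5. A real pipeline would replace this stand-in with a properly trained and validated model.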
In some embodiments, the disease condition is a cancer type selected from the group consisting of lung cancer, ovarian cancer, kidney cancer, bladder cancer, hepatobiliary cancer, pancreatic cancer, upper gastrointestinal cancer, sarcoma, breast cancer, liver cancer, prostate cancer, brain cancer, and combinations thereof. In some embodiments, the method further comprises: performing initial data processing of the first set of sequence reads of nucleic acid samples from each healthy subject in the reference group of healthy subjects based on sequence reads of nucleic acid samples from a baseline group of healthy subjects, wherein the reference group and the baseline group do not overlap, and wherein the initial data processing comprises correction of GC biases or normalization of numbers of sequence reads that align to regions of the reference genome. In some embodiments, the method further comprises: performing initial data processing of the sequence reads of nucleic acid samples from each subject in the training group based on sequence reads of nucleic acid samples from a baseline group of healthy subjects, wherein the baseline group and the training group do not overlap, and wherein the initial data processing comprises correction of GC biases or normalization of numbers of sequence reads aligned to regions of the reference genome. In some embodiments, the identifying regions of low variability in the reference genome further comprises: aligning sequences from the first set of sequence reads of nucleic acid samples from each healthy subject in the reference group of healthy subjects to a plurality of non-overlapping regions of the reference genome, the reference group having a first plurality of healthy subjects; deriving, for each healthy subject in the reference group, a quantity associated with sequence reads that align to a region within the plurality of