US-12620454-B2 - BAMBAM: parallel comparative analysis of high-throughput sequencing data

Abstract

The present invention relates to methods for evaluating and/or predicting the outcome of a clinical condition, such as cancer, metastasis, AIDS, autism, Alzheimer's, and/or Parkinson's disorder. The methods can also be used to monitor and track changes in a patient's DNA and/or RNA during and following a clinical treatment regime. The methods may also be used to evaluate protein and/or metabolite levels that correlate with such clinical conditions. The methods are also of use to ascertain the probability outcome for a patient's particular prognosis.

Inventors

John Zachary Sanborn
David Haussler

Assignees

THE REGENTS OF THE UNIVERSITY OF CALIFORNIA

Dates

Publication Date: 2026-05-05
Application Date: 2021-09-03

Claims (20)

1. A computer-based method of variant calling from at least one tissue sample, the method comprising:
(a) retrieving, at a same time from each of at least two BAM/SAM files stored in a first storage device while keeping each file in synchrony, only a set of aligned short reads of genomic sequence data associated with a respective pileup, each set of aligned short reads being retrieved by a computer processor coupled with a computer readable memory, wherein each set of aligned short reads overlap a given common genomic position of a reference sequence;
(b) storing, in the computer readable memory, each retrieved set of aligned short reads of genomic sequence data, all the stored aligned short reads overlapping the given common genomic position;
(c) calculating, via the computer processor coupled with the computer readable memory, a probability of an allele at the given common genomic position as a function of a number of observed alleles in the aligned short reads of the respective pileup while incorporating a base error rate of a sequencer;
(d) repeating (a)-(c) for each of additional given common genomic positions using additional sets of aligned short reads, wherein the sets of aligned short reads represent at least 10% of a genome, transcriptome, or proteome of the at least one tissue sample; and
(e) storing in a second storage device a variant supported by the allele at at least one of the given common genomic positions.
2. The method of claim 1, wherein the allele represents at least one of a germline allele and a tumor allele.
3. The method of claim 2, further comprising calculating the probability as a function of a base probability from at least one of two parental alleles and the number of observed alleles in the aligned short reads of the respective pileup.
4. The method of claim 1, wherein the allele comprises an allele-specific copy number.
5. The method of claim 4, further comprising expanding or contracting an analysis window around the given common genomic position.
6. The method of claim 4, further comprising calculating the allele-specific copy number as a function of a number of supporting aligned short reads.
7. The method of claim 4, further comprising calculating the allele-specific copy number at the given common genomic position, wherein the given common genomic position has at least two different alleles.
8. The method of claim 7, wherein the allele-specific copy number comprises a majority allele count.
9. The method of claim 7, wherein the allele-specific copy number comprises a minority allele count.
10. The method of claim 4, further comprising identifying a loss of heterozygosity at least at the given common genomic position as a function of the allele-specific copy number.
11. The method of claim 10, wherein the loss of heterozygosity is associated with at least a region of the genome.
12. The method of claim 1, wherein the allele is selected as supporting the variant based on an allelic imbalance at the given common genomic position.
13. The method of claim 1, wherein the allele is selected as supporting the variant regardless of an allelic proportion.
14. The method of claim 1, wherein the allele comprises an allelic state.
15. The method of claim 14, wherein the allelic state represents homozygosity.
16. The method of claim 1, further comprising estimating an amount of a contaminant in the at least one tissue sample as a function of a hemizygous loss.
17. The method of claim 16, wherein the hemizygous loss is based on an allele-specific copy number.
18. The method of claim 17, wherein the contaminant comprises a normal contaminant in the at least one tissue sample.
19. The method of claim 17, wherein the contaminant comprises a normal contaminant in a tumor tissue sample.
20. The method of claim 1, wherein the at least one tissue sample comprises a normal tissue sample.
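
Illustrative sketch (not part of the claims). Steps (a) through (c) of claim 1 can be pictured with a small Python example. It is a minimal sketch under stated assumptions, not the patented implementation: reads are modeled as hypothetical (start, sequence) tuples sorted by start position rather than parsed from real BAM/SAM files, the diploid genotype model with a uniform prior is a common textbook choice assumed here, and the names synchronized_pileups, genotype_posteriors, and the 1% base error rate are invented for illustration.

```python
from itertools import combinations_with_replacement

def synchronized_pileups(reads_a, reads_b, positions):
    """Walk two position-sorted read streams in lockstep.

    Yields (pos, bases_a, bases_b). Only reads overlapping the current
    position are held in memory (a stand-in for steps (a) and (b)).
    `positions` must be ascending; reads are (start, sequence), 0-based.
    """
    iters = [iter(reads_a), iter(reads_b)]
    pending = [next(it, None) for it in iters]
    active = [[], []]
    for pos in positions:
        for i in range(2):
            # Pull in every read that starts at or before this position.
            while pending[i] is not None and pending[i][0] <= pos:
                active[i].append(pending[i])
                pending[i] = next(iters[i], None)
            # Discard reads that end before this position.
            active[i] = [(s, q) for s, q in active[i] if s + len(q) > pos]
        yield (pos,
               [q[pos - s] for s, q in active[0]],
               [q[pos - s] for s, q in active[1]])

def genotype_posteriors(bases, error_rate=0.01):
    """Posterior over diploid genotypes given observed bases (step (c)).

    Assumes a uniform prior and a sequencer that miscalls a base with
    probability `error_rate`, spread evenly over the other three bases.
    Plain products are used, so this toy version may underflow at very
    high coverage.
    """
    def p_base(base, allele):
        return 1.0 - error_rate if base == allele else error_rate / 3.0

    likelihoods = {}
    for genotype in combinations_with_replacement("ACGT", 2):
        like = 1.0
        for base in bases:
            # Each read is drawn from either allele with equal probability.
            like *= 0.5 * p_base(base, genotype[0]) + 0.5 * p_base(base, genotype[1])
        likelihoods[genotype] = like
    total = sum(likelihoods.values())
    return {g: like / total for g, like in likelihoods.items()}

# Hypothetical toy data: three reads per sample over a short reference.
normal_reads = [(0, "ACGTAC"), (2, "GTACGT"), (3, "TACGTA")]
tumor_reads = [(0, "ACGTAC"), (2, "GAACGT"), (3, "AACGTA")]

for pos, normal_bases, tumor_bases in synchronized_pileups(
        normal_reads, tumor_reads, range(3, 5)):
    normal_post = genotype_posteriors(normal_bases)
    tumor_post = genotype_posteriors(tumor_bases)
    print(pos,
          max(normal_post, key=normal_post.get),
          max(tumor_post, key=tumor_post.get))
```

In this toy data, position 3 is covered by T, T, T in the normal sample and T, A, A in the tumor sample, so the most probable genotypes under the assumed model are T/T and A/T respectively. A production tool would read the pileups from indexed BAM/SAM files and would typically work in log space.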

Description

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. application Ser. No. 15/711,487, filed Sep. 21, 2017, which is a continuation of U.S. application Ser. No. 15/167,507, filed May 27, 2016, which is a continuation of U.S. application Ser. No. 13/134,047, filed May 25, 2011, which claims the benefit of U.S. Provisional Application No. 61/396,356, filed May 25, 2010, the contents of which are hereby incorporated by reference in their entirety.

SEQUENCE LISTING

The instant application contains a Sequence Listing which has been submitted electronically in ASCII format and is hereby incorporated by reference in its entirety. Said ASCII copy, created on Jul. 29, 2021, is named 102913-001831US-1260936 Sequence Listing.txt and is 700 bytes in size.

BACKGROUND

A central premise in modern cancer treatment is that patient diagnosis, prognosis, risk assessment, and treatment response prediction can be improved by stratification of cancers based on genomic, transcriptional and epigenomic characteristics of the tumor alongside relevant clinical information gathered at the time of diagnosis (for example, patient history, tumor histology and stage) as well as subsequent clinical follow-up data (for example, treatment regimens and disease recurrence events). With the release of multiple tumor and matched normal whole genome sequences from projects like The Cancer Genome Atlas (TCGA), there is great need for computationally efficient tools that can extract as much genomic information as possible from these enormous datasets (TCGA, 2008). Considering that a single patient's whole genome sequence at high coverage (>30×) can be hundreds of gigabytes in compressed form, an analysis comparing a pair of these large datasets is slow and difficult to manage, but absolutely necessary in order to discover the many genomic changes that occurred in each patient's tumor.
Breast cancer is clinically and genomically heterogeneous and is composed of several pathologically and molecularly distinct subtypes. Patient responses to conventional and targeted therapeutics differ among subtypes, motivating the development of marker-guided therapeutic strategies. Collections of breast cancer cell lines mirror many of the molecular subtypes and pathways found in tumors, suggesting that treatment of cell lines with candidate therapeutic compounds can guide identification of associations between molecular subtypes, pathways and drug response. In a test of 77 therapeutic compounds, nearly all drugs show differential responses across these cell lines and approximately half show subtype-, pathway-, and/or genomic aberration-specific responses. These observations suggest mechanisms of response and resistance that may inform clinical drug deployment as well as efforts to combine drugs effectively. There is currently a need to provide methods that can be used in characterization, diagnosis, treatment, and determining outcome of diseases and disorders.

BRIEF DESCRIPTION OF THE INVENTION

The invention provides methods for generating databases that may be used to determine an individual's risk, in particular, for example, but not limited to, risk of the individual's predisposition to a disease, disorder, or condition; risk at the individual's place of work, abode, at school, or the like; risk of an individual's exposure to toxins, carcinogens, mutagens, and the like, and risk of an individual's dietary habits. In addition, the invention provides methods that may be used for identifying a particular individual, animal, plant, or microorganism.
In one embodiment, the invention provides a method of deriving a differential genetic sequence object, the method comprising: providing access to a genetic database storing (a) a first genetic sequence string representing a first tissue and (b) a second genetic sequence string representing a second tissue, wherein the first and second sequence strings have a plurality of corresponding sub-strings; providing access to a sequence analysis engine coupled with the genetic database; producing, using the sequence analysis engine, a local alignment by incrementally synchronizing the first and second sequence strings using a known position of at least one of the plurality of corresponding sub-strings; using, by the sequence analysis engine, the local alignment to generate a local differential string between the first and second sequence strings within the local alignment; and using, by the sequence analysis engine, the local differential string to update a differential genetic sequence object in a differential sequence database. In a preferred embodiment, the first and second genetic sequence strings represent at least 10% of a genome, transcriptome, or proteome of the first and second tissues, respectively. In an alternative preferred embodiment, the first and second genetic sequence strings represent at least 50% of a genome, transcriptome, or proteome of the first and second tissues, respectively. In an