CN-114945987-B - Estimating tumor purity from a single sample
Abstract
The present disclosure provides methods of predicting tumor purity from tumor samples without the use of matched normal controls. A set of genomic regions is identified based on the nucleic acid sequence data aligned with the reference genome. Each genomic region of the set of genomic regions comprises one or more nucleotide sequence variants of the corresponding genomic region relative to a reference genome. A B allele frequency distribution of the biological sample is determined based on the determined B allele frequency of each genomic region of the set of genomic regions. The B allele frequency distribution is processed using a trained machine learning model to predictively identify a measure of tumor purity in the biological sample.
Inventors
- NICHOLAS PHILLIPS
- JASON HARRIS
Assignees
- Personalis, Inc.
Dates
- Publication Date
- 20260421
- Application Date
- 20201104
- Priority Date
- 20191105
Claims (20)
- 1. A system for determining tumor purity comprising one or more data processors and a non-transitory computer-readable storage medium containing instructions that, when executed on the one or more data processors, cause the one or more data processors to perform the steps of: obtaining nucleic acid sequence data for a plurality of nucleic acid molecules representative of a biological sample of a subject; aligning the nucleic acid sequence data with a reference genome; identifying a set of genomic regions based on the aligned nucleic acid sequence data, wherein each genomic region in the set of genomic regions comprises one or more nucleotide sequence variations relative to a corresponding genomic region of the reference genome; determining a B allele frequency for each genomic region in the set of genomic regions; determining a B allele frequency distribution of the biological sample based on the B allele frequencies of the set of genomic regions; processing the B allele frequency distribution using a trained machine learning model to predict a metric identifying tumor purity in the biological sample, wherein the trained machine learning model comprises a convolutional neural network trained on a training dataset generated from nucleic acid sequence data derived from one or more tumor cells diluted into normal cells; and outputting the metric.
- 2. The system of claim 1, wherein the nucleic acid sequence data is whole-exome sequencing data.
- 3. The system of claim 1, wherein the nucleic acid sequence data is whole genome sequencing data.
- 4. The system of claim 1, further comprising: obtaining the biological sample from the subject; and sequencing a plurality of nucleic acid molecules of the biological sample to generate the nucleic acid sequence data.
- 5. The system of claim 4, further comprising isolating the plurality of nucleic acid molecules prior to sequencing.
- 6. The system of claim 1, wherein identifying the set of genomic regions further comprises: identifying one or more candidate nucleotide sequence variations in the nucleic acid sequence data; and calculating reference and alternate read depths for each of the one or more candidate nucleotide sequence variations.
- 7. The system of claim 1, wherein the B allele frequency distribution is normalized.
- 8. The system of claim 1, wherein the trained machine learning model has a mean absolute error of less than about 0.2.
- 9. The system of claim 1, further comprising outputting a report comprising information identifying the B allele frequency distribution.
- 10. The system of claim 1, further comprising outputting a report comprising a metric identifying an estimate of the tumor purity.
- 11. The system of claim 10, wherein the report further comprises information identifying at least one biomarker.
- 12. The system of claim 10, wherein the report further comprises information identifying at least one prognostic marker.
- 13. The system of claim 10, wherein the report includes information identifying predicted somatic variations.
- 14. The system of claim 1, wherein the biological sample is from a human subject.
- 15. A computer program product comprising instructions configured to cause one or more data processors to perform the steps of: obtaining nucleic acid sequence data for a plurality of nucleic acid molecules representative of a biological sample of a subject; aligning the nucleic acid sequence data with a reference genome; identifying a set of genomic regions based on the aligned nucleic acid sequence data, wherein each genomic region in the set of genomic regions comprises one or more nucleotide sequence variations relative to a corresponding genomic region of the reference genome; determining a B allele frequency for each genomic region in the set of genomic regions; determining a B allele frequency distribution of the biological sample based on the B allele frequencies of the set of genomic regions; processing the B allele frequency distribution using a trained machine learning model to predict a metric identifying tumor purity in the biological sample, wherein the trained machine learning model comprises a convolutional neural network trained on a training dataset generated from nucleic acid sequence data derived from one or more tumor cells diluted into normal cells; and outputting the metric.
- 16. The computer program product of claim 15, wherein the nucleic acid sequence data is whole-exome sequencing data.
- 17. The computer program product of claim 15, wherein the nucleic acid sequence data is whole genome sequencing data.
- 18. The computer program product of claim 15, wherein identifying the set of genomic regions further comprises: identifying one or more candidate nucleotide sequence variations in the nucleic acid sequence data; and calculating reference and alternate read depths for each of the one or more candidate nucleotide sequence variations.
- 19. The computer program product of claim 15, wherein the B allele frequency distribution is normalized.
- 20. The computer program product of claim 15, wherein the trained machine learning model has a mean absolute error of less than about 0.2.
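Claims 1 and 15 recite a training dataset built from tumor-derived sequence data diluted with normal-cell data, and claims 8 and 20 bound the model's average (mean) absolute error. The following is a minimal sketch of both ideas, not the claimed procedure: the function names are hypothetical, and it assumes that the fraction of reads drawn from the tumor equals the target purity, which holds only if tumor and normal cells contribute equal amounts of DNA.

```python
from typing import Sequence, Tuple


def dilute_counts(
    tumor: Tuple[int, int], normal: Tuple[int, int], purity: float
) -> Tuple[float, float]:
    """Expected (ref, alt) depth at one site after in-silico dilution.

    Assumption: `purity` is treated as the fraction of reads that come
    from the tumor data; the rest come from the normal data.
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must lie in [0, 1]")
    ref = purity * tumor[0] + (1.0 - purity) * normal[0]
    alt = purity * tumor[1] + (1.0 - purity) * normal[1]
    return ref, alt


def mean_absolute_error(predicted: Sequence[float], truth: Sequence[float]) -> float:
    """Average of |predicted - truth| over paired purity values."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(predicted, truth)) / len(truth)


if __name__ == "__main__":
    # Tumor data has lost the alternate allele at this site; normal is heterozygous.
    print(dilute_counts(tumor=(100, 0), normal=(50, 50), purity=0.5))
    print(mean_absolute_error([0.5, 0.7], [0.4, 0.9]))
```

Each diluted example would carry its known purity as the training label, so the error of a trained model can be measured directly against that label.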
Description
Estimating tumor purity from a single sample
Cross Reference to Related Applications
The present application claims priority from U.S. provisional patent application No. 62/931,096, filed on November 5, 2019, which is hereby incorporated by reference in its entirety for all purposes.
Technical Field
The present disclosure relates generally to systems and methods for predicting tumor purity from a single sample. More particularly, but not by way of limitation, the present disclosure relates to predicting tumor purity of a biological sample by processing a B allele frequency distribution using a trained machine learning model.
Background
Tumor cellularity, also known as "tumor purity", is the proportion of cancer cells in a sample. Accurate estimation of tumor purity in biological samples may help to improve the accuracy of detecting somatic mutations and/or copy number changes. This is because tumor purity determines the expected allele frequency of somatic mutations present in a biological sample. Detection of somatic mutations and copy number variations can in turn be used to determine the stage of cancer in a subject or to assess whether a particular cancer treatment is effective. Thus, tumor purity can inform determination of cancer stage and/or assessment of treatment effect.
While tumor purity may be an effective indicator, it may also be a confounding variable in several bioinformatic analyses. Conventional techniques for predicting tumor purity may require a pathologist to perform a histopathological assessment by manually examining sample images. However, histopathological assessment, including manual examination of sample images, can be subjective and inaccurate.
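The dependence of somatic allele frequency on tumor purity noted in the Background can be made concrete with a standard mixture relation. This is an illustrative sketch, not a formula stated in the disclosure. It assumes a clonal somatic mutation, normal cells that are diploid at the locus, and the following symbols introduced here: purity $p$, tumor total copy number $C_T$ at the locus, and number of mutated copies $m$.

```latex
\mathrm{VAF}_{\text{expected}} \;=\; \frac{p\,m}{p\,C_T + 2\,(1-p)}
\qquad\Longrightarrow\qquad
\mathrm{VAF}_{\text{expected}} \;=\; \frac{p}{2}
\quad\text{when } m = 1,\; C_T = 2 .
```

For example, under these assumptions a heterozygous somatic mutation in a copy-neutral region of a sample with $p = 0.6$ would be expected at an allele frequency of about 0.3.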
Other conventional techniques for predicting tumor purity require comparing values of nucleic acid sequence data (e.g., putative somatic variants) derived from a given tumor sample with corresponding values derived from a matched normal control sample. However, such normal control samples may not be available. For example, a matched normal control is unavailable if the sample provider does not collect or sequence one. Conventional techniques estimate tumor purity of a sample as a function of the allele fraction of somatic mutations characteristic of an individual tumor. In the absence of matched normal samples, the identification of these somatic mutations is less accurate and the accuracy of purity prediction is greatly reduced. Thus, there is a need to accurately predict tumor purity in samples, to facilitate detection of somatic mutations and copy number changes, without relying on subjective analysis (e.g., histopathological evaluation) or the presence of normal control samples.
Summary of the Invention
In some embodiments, methods of predicting tumor purity are provided. The method may include obtaining nucleic acid sequence data for a plurality of nucleic acid molecules representative of a tumor sample of a subject. The method may further comprise aligning the nucleic acid sequence data with a reference genome. The method may further comprise identifying a set of genomic regions based on the aligned nucleic acid sequence data. In some cases, each genomic region of the set of genomic regions comprises one or more nucleotide sequence variants relative to a corresponding genomic region of the reference genome. The method may further comprise determining a B allele frequency for each genomic region of the set of genomic regions. The method may further comprise determining a B allele frequency distribution of the biological sample based on the B allele frequencies of the set of genomic regions.
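The two B allele frequency steps just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: each genomic region is represented by a single variant site with a (reference depth, alternate depth) pair, the distribution is a fixed-width histogram normalized to sum to 1, and the names `b_allele_frequency`, `baf_distribution`, `n_bins`, and `min_depth` are hypothetical.

```python
from typing import Iterable, List, Tuple


def b_allele_frequency(ref_depth: int, alt_depth: int) -> float:
    """Fraction of reads at a site that support the alternate (B) allele."""
    total = ref_depth + alt_depth
    if total == 0:
        raise ValueError("no reads cover this site")
    return alt_depth / total


def baf_distribution(
    sites: Iterable[Tuple[int, int]], n_bins: int = 10, min_depth: int = 20
) -> List[float]:
    """Normalized histogram of per-site B allele frequencies over [0, 1].

    Sites with fewer than `min_depth` total reads are skipped
    (an assumed quality filter, not taken from the source).
    """
    counts = [0] * n_bins
    for ref, alt in sites:
        total = ref + alt
        if total == 0 or total < min_depth:
            continue
        # Integer arithmetic gives floor(BAF * n_bins) without rounding error;
        # a BAF of exactly 1.0 is placed in the last bin.
        idx = min(alt * n_bins // total, n_bins - 1)
        counts[idx] += 1
    kept = sum(counts)
    if kept == 0:
        return [0.0] * n_bins
    return [c / kept for c in counts]


if __name__ == "__main__":
    example_sites = [(50, 50), (90, 10), (30, 70), (5, 5)]  # last site is too shallow
    print(baf_distribution(example_sites))
```

A fixed-length vector of this kind is one plausible input format for a trained model; the number of bins and the depth threshold would need to be chosen for the assay in question.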
The method may further comprise processing the B allele frequency distribution using a trained machine learning model to estimate a metric identifying tumor purity of the biological sample. The method may further comprise outputting the metric. In some embodiments, a system is provided that includes one or more data processors and a non-transitory computer-readable storage medium containing instructions that, when executed on the one or more data processors, cause the one or more data processors to perform a portion or all of one or more methods disclosed herein. In some embodiments, a computer program product is provided that is tangibly embodied in a non-transitory machine-readable storage medium and includes instructions configured to cause one or more data processors to perform a portion or all of one or more methods disclosed herein. Some embodiments of the present disclosure include a system comprising one or more data processors. In some embodiments, a system includes a non-transitory computer-readable storage medium containing instructions that, when executed on one or more data processors, cause the one or more data processors to perform a portion or all of one or more methods disclosed herein, and/or a portion or all of one or more processes disclosed herein. Some