US-20260125750-A1 - COMPOSITIONS AND METHODS FOR IDENTIFYING NUCLEIC ACID MOLECULES


Abstract

The present disclosure provides methods and compositions for sequencing nucleic acid molecules and identifying individual sample nucleic acid molecules using Molecular Index Tags (MITs). Furthermore, reaction mixtures, kits, and adapter libraries are provided.

Inventors

Bernhard Zimmermann
Ryan SWENERTON
Matthew Rabinowitz
Styrmir Sigurjonsson
George Gemelos
Apratim GANGULY
Himanshu SETHI

Assignees

NATERA, INC.

Dates

Publication Date: 20260507
Application Date: 20251107

Claims (20)

1. A method comprising: (a) extracting cell-free DNA (cfDNA) from a biological sample of a subject having cancer or suspected of having cancer; (b) producing a non-naturally occurring composition of enriched DNA, wherein the producing comprises: (i) tagging at least one adaptor to at least one end of the extracted cfDNA or DNA fragment derived therefrom to produce adapted DNA, wherein the at least one adaptor comprises a molecular index tag (MIT), (ii) performing universal amplification on the adapted DNA to produce amplified adapted DNA, and (iii) selectively enriching for at least a portion of the amplified adapted DNA to produce enriched DNA, wherein the selectively enriching comprises performing one-sided PCR using a universal primer and a plurality of target-specific primers to amplify at least a portion of the amplified adapted DNA or using a plurality of hybrid capture probes to capture at least a portion of the amplified adapted DNA; and (c) analyzing the enriched DNA, wherein the analyzing comprises: (i) performing massively parallel sequencing on the enriched DNA to obtain sequence reads, wherein the sequence reads comprise sequencing the MIT and a sequence of the extracted cfDNA or DNA fragment derived therefrom, (ii) grouping the sequence reads based on a set of sequence features, and (iii) using the grouped sequence reads to identify one or more cancer mutations in the biological sample of the subject.
2. The method of claim 1, wherein the set of sequence features for grouping the sequence reads comprises the same target locus, the same MIT in the same relative position to the target locus, and optionally the same fragment end position of the extracted cfDNA or DNA fragment derived therefrom.
3. The method of claim 1, wherein the biological sample is a blood, plasma, serum, or urine sample.
4. The method of claim 1, wherein the biological sample is a plasma sample.
5. The method of claim 1, wherein between 50 and 1,000 different MITs are tagged to the extracted cfDNA or DNA fragment derived therefrom, wherein each MIT comprises 3 to 8 nucleotides in length, wherein the sequences of different MITs differ from each other by at least 2 nucleotides.
6. The method of claim 1, wherein between 100 and 500 different MITs are tagged to the extracted cfDNA or DNA fragment derived therefrom, wherein each MIT comprises 4 to 8 nucleotides in length, wherein the sequences of different MITs differ from each other by at least 2 nucleotides.
7. The method of claim 1, wherein the adaptor further comprises a universal priming sequence, wherein the MIT is positioned internal to the universal priming sequence in the adapted DNA.
8. The method of claim 1, wherein the adaptor further comprises a universal priming sequence, wherein step (b)(ii) comprises performing universal amplification using the universal priming sequence to produce amplified adapted DNA.
9. The method of claim 2, wherein the grouping comprises grouping sequence reads having the same target locus, the same pair of MITs in the same relative positions to the target locus, and the same start and end genomic coordinates of the extracted cfDNA or DNA fragment derived therefrom when mapped to a reference genome.
10. The method of claim 9, wherein the grouping further comprises pairing a first family of grouped sequence reads having a positive genomic orientation and a second family of grouped complementary sequence reads having a negative genomic orientation, wherein the first family and second family comprise complementary MITs in the same relative position.
11. The method of claim 2, wherein the grouping of the sequence reads using the MIT and the fragment end position reduces error rate for identifying the cancer mutation, wherein the error rate is calculated by counting all base calls across the target locus that do not correspond to a reference genome and dividing them by the total base calls across the target locus.
12. The method of claim 9, wherein the grouping of the sequence reads using the pair of MITs and start and end genomic coordinates of the extracted cfDNA or DNA fragment derived therefrom when mapped to a reference genome reduces error rate for identifying the cancer mutation, wherein the error rate is calculated by counting all base calls across the target locus that do not correspond to a reference genome and dividing them by the total base calls across the target locus.
13. The method of claim 10, wherein the grouping of the sequence reads using the pair of MITs and start and end genomic coordinates of the extracted cfDNA or DNA fragment derived therefrom when mapped to a reference genome reduces error rate for identifying the cancer mutation, wherein the error rate is calculated by counting all base calls within all paired families across the target locus that do not correspond to a reference genome and dividing them by the total base calls within all paired families across the target locus.
14. The method of claim 1, wherein step (b)(iii) comprises selectively enriching 50-5,000 target loci.
15. The method of claim 1, wherein step (b)(iii) comprises selectively enriching 100-2,500 target loci.
16. The method of claim 14, wherein step (b)(iii) comprises performing one-sided PCR using a universal primer and a plurality of target-specific primers to amplify at least a portion of the amplified adapted DNA.
17. The method of claim 14, wherein step (b)(iii) comprises using a plurality of hybrid capture probes to capture at least a portion of the amplified adapted DNA.
18. The method of claim 1, wherein the cancer mutation comprises a single nucleotide variant, an insertion, a deletion, or a copy number variation.
19. The method of claim 1, wherein the enriched DNA is further amplified to introduce a sample-specific barcode, and wherein the enriched DNA from multiple samples are pooled and sequenced together.
20. The method of claim 1, wherein the method further comprises estimating the fraction of cancer DNA in the extracted cfDNA based on the sequence reads.

Description

RELATED APPLICATIONS

This application is a continuation of U.S. Utility application Ser. No. 18/075,180, filed Dec. 5, 2022. U.S. Utility application Ser. No. 18/075,180 is a continuation of U.S. Utility application Ser. No. 17/494,726, filed Oct. 5, 2021 (now U.S. Pat. No. 11,519,028). U.S. Utility application Ser. No. 17/494,726 is a continuation of U.S. Utility application Ser. No. 16/418,104, filed May 21, 2019 (now U.S. Pat. No. 11,530,442). U.S. Utility application Ser. No. 16/418,104 is a continuation of U.S. Utility application Ser. No. 15/716,331, filed Sep. 26, 2017 (now U.S. Pat. No. 10,577,650). U.S. Utility application Ser. No. 15/716,331 (now U.S. Pat. No. 10,577,650) is a divisional of U.S. Utility application Ser. No. 15/372,279, filed Dec. 7, 2016 (now U.S. Pat. No. 10,011,870). The entirety of this application is hereby incorporated herein by reference for the teachings therein.

SEQUENCE LISTING

The instant application contains a Sequence Listing which has been submitted electronically in XML format and is hereby incorporated by reference in its entirety. Said XML copy, created on Feb. 23, 2023, is named N_018_US_07_SL.xml and is 26,652 bytes in size.

FIELD OF THE INVENTION

The present disclosure relates generally to methods for analyzing nucleic acids.

BACKGROUND OF THE INVENTION

Next-generation sequencing has greatly increased the throughput of sequencing methods and resulted in new applications for sequencing with important real-world implications, such as improvements in cancer diagnostics and non-invasive prenatal testing for disorders such as Down syndrome. There are various technologies for performing next-generation sequencing, each of which is associated with specific types of errors. In addition, these methods share general sources of error, such as errors that occur during sample preparation. Sample preparation for next-generation sequencing typically involves numerous amplification steps, each of which generates errors.
Amplification reactions, such as PCR, used in sample preparation for high-throughput sequencing can include amplifying the initial nucleic acid in the sample to generate the library to be sequenced, clonally amplifying the library, typically onto a solid support, and additional amplification reactions to add information or functionality such as sample-identifying barcodes. Errors can be introduced during any of the amplification reactions, for example through the misincorporation of bases by a polymerase used for the amplification. It can be difficult to distinguish errors introduced during sample preparation, and errors that occur during a sequencing reaction, from real and informative SNPs or mutations present in the initial sample, especially when the SNPs or mutations are present at a low frequency. In addition, calling the base at each nucleotide can introduce errors as well, usually caused by a low signal intensity and/or the surrounding nucleic acid sequence.

There are several known methods to identify errors caused by sample preparation. One method is to have greater sequencing depth such that the sample nucleic acid segment is read multiple times from the same molecule, or from different copies of the same nucleic acid molecule. These multiple reads can be aligned and a consensus sequence can be generated. However, SNPs or mutations with low frequency in the population of nucleic acid molecules will appear similar to errors introduced during amplification or base calling. Another method to identify these errors involves tagging nucleic acid molecules such that each nucleic acid molecule incorporates a unique identifier before being sequenced. The sequencing results from identically tagged nucleic acid molecules are pooled, and the consensus sequence from these pooled results is more likely to be the true sequence of the nucleic acid from the sample.
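As a rough illustration of the pooling approach described above, a consensus sequence for a set of identically tagged reads can be taken as the per-position majority base call. The sketch below is an assumption-laden example rather than a method taken from this disclosure: the function name `consensus`, the `min_fraction` threshold, the use of `N` for positions without a clear majority, and the example reads are all hypothetical, and the reads are assumed to be aligned and of equal length.

```python
from collections import Counter


def consensus(sequences, min_fraction=0.5):
    """Per-position majority base; 'N' where no base exceeds min_fraction."""
    if not sequences:
        return ""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be the same length")
    out = []
    # zip(*sequences) walks the aligned reads column by column.
    for column in zip(*sequences):
        base, count = Counter(column).most_common(1)[0]
        out.append(base if count / len(column) > min_fraction else "N")
    return "".join(out)


# Made-up family of four identically tagged reads; the third read carries
# a single discordant base at position 3, as an amplification error might.
family = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTAC"]
print(consensus(family))  # ACGTAC
```

In this example the discordant base is outvoted three to one, so the consensus matches the other three reads. A two-read family that disagrees at a position has no majority there and yields `N` under the assumed threshold.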
Amplification errors can be identified if some of the identically tagged nucleic acid molecules have a different sequence.

Despite these prior methods, there is a need to discover advantageous combinations of parameters for methods of tagging nucleic acid molecules that are highly effective and readily manufacturable, especially for analyzing complex samples, including mammalian cDNA or genomic samples such as, for example, circulating DNA samples. Many prior art methods require the generation of large numbers of unique identifiers and may also result in the need for longer unique identifiers. The reaction mixtures in such methods are designed so there is a large excess of unique identifiers relative to sample nucleic acid molecules. In addition to the high cost of making such libraries of unique identifiers, increasing the lengths of the unique identifiers reduces the amount of sample nucleic acid sequence that can be read in the already limited read lengths of most next-generation sequencers. In other prior art disclosures, which sometimes are only prophetic, detailed combinations of