US-20260125739-A1 - GENERATION OF PHASED READ-SETS FOR GENOME ASSEMBLY AND HAPLOTYPE PHASING

Abstract

Disclosed herein are methods, compositions and systems that facilitate accurate phasing of sequence data, such as genomic sequence data, through the segmentation and rearrangement of nucleic acid molecules in such a way as to preserve individual molecules' phase or physical linkage information. This is variously accomplished by binding molecules independent of their phosphodiester backbones, cleaving the molecules, ligating the resulting segments, and sequencing the rearranged molecules through long-read sequencing technology to recover sequence information spanning more than one segment.

Inventors

Richard E. Green, Jr.
Daniel S. Rokhsar
Paul Hartley
Marco Blanchette

Assignees

DOVETAIL GENOMICS, LLC

Dates

Publication Date: 2026-05-07
Application Date: 2025-06-10

Claims (20)

1-44. (canceled)
45. A rearranged nucleic acid molecule, comprising (a) a first segment; (b) a second segment; and (c) a third segment; (d) said first segment and said second segment being joined at a first junction; and (e) said second segment and said third segment being joined at a second junction, wherein said rearranged nucleic acid molecule has a length of at least 5 kilobases (kb), and wherein said first segment, said second segment and said third segment exist in phase separated by at least 10 kb in an unrearranged nucleic acid molecule, and wherein at least 70% of said rearranged nucleic acid molecule maps to said unrearranged nucleic acid molecule.
46. The rearranged nucleic acid molecule of claim 45, wherein said first segment, said second segment and said third segment comprise separate genomic nucleic acid sequence from a common nucleic acid molecule of a genome.
47. The rearranged nucleic acid molecule of claim 45, wherein said rearranged nucleic acid molecule is at least 30 kb in length.
48. The rearranged nucleic acid molecule of claim 45, wherein said rearranged nucleic acid comprises a hairpin loop at a double-stranded terminal end, and wherein said rearranged nucleic acid molecule comprises a single strand comprising a 30 kb inverted repeat.
49. The rearranged nucleic acid molecule of claim 45, wherein at least 80% of said rearranged nucleic acid molecule maps to said common unrearranged nucleic acid molecule.
50. A nucleic acid sequence library of a nucleic acid sample, said nucleic acid sequence library comprising a population of nucleic acid sequence reads having a mean length of at least 1 kb, said nucleic acid sequence reads independently comprising at least 300 bases of sequence from two separate in-phase regions of said nucleic acid sample, said two separate in-phase regions separated by a distance greater than 10 kb in said nucleic acid sample.
51. The nucleic acid sequence library of claim 50, wherein said nucleic acid sequence reads independently comprise at least 500 bases of sequence from two separate in-phase regions of said nucleic acid sample.
52. The nucleic acid sequence library of claim 50, wherein said two separate in-phase regions are separated by a distance greater than 20 kb in said nucleic acid sample.
53. The nucleic acid sequence library of claim 50, wherein said nucleic acid sequence library comprises sequence reads from at least 80% of said nucleic acid sample.
54. A method of sequencing a nucleic acid molecule, comprising: (a) obtaining a first nucleic acid molecule comprising a first segment, a second segment and a third segment sharing a common phosphodiester backbone, wherein none of said first segment, said second segment and said third segment are adjacent on said first nucleic acid molecule; (b) partitioning said nucleic acid molecule such that said first segment, said second segment and said third segment are associated independent of their common phosphodiester backbone; (c) cleaving said nucleic acid molecule to generate fragments such that there is no continuous phosphodiester backbone linking said first segment, said second segment and said third segment; (d) ligating said fragments such that said first segment, said second segment and said third segment are consecutive on a rearranged nucleic acid molecule sharing a common phosphodiester backbone; and (e) sequencing at least a portion of said rearranged nucleic acid molecule such that at least 5,000 bases of said rearranged nucleic acid molecule are sequenced in a single read.
55. The method of claim 54, wherein (b) comprises contacting said nucleic acid molecule to a binding moiety such that said first segment, said second segment and said third segment are bound in a common complex independent of their common phosphodiester backbone.
56. The method of claim 55, wherein said binding moiety comprises a population of DNA-binding proteins.
57. The method of claim 56, wherein said population of DNA-binding proteins comprises nuclear proteins.
58. The method of claim 56, wherein said population of DNA-binding proteins comprises nucleosomes.
59. The method of claim 56, wherein said population of DNA-binding proteins comprises histones.
60. The method of claim 55, wherein said binding moiety comprises a population of DNA-binding nanoparticles.
61. The method of claim 54, wherein (c) comprises contacting said nucleic acid molecule to a restriction endonuclease, a nonspecific endonuclease, a tagmentation enzyme, or a transposase.
62. The method of claim 54, wherein (c) comprises shearing said nucleic acid molecule.
63. The method of claim 54, wherein (b) comprises separating said nucleic acid molecule from other nucleic acid molecules of a sample, diluting said nucleic acid sample, distributing said nucleic acid molecule into a microdroplet of an emulsion, or a combination thereof.
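
The numeric thresholds recited in claim 45 (read length of at least 5 kb, segments separated by at least 10 kb in the unrearranged molecule, at least 70% of the read mapping back) can be illustrated with a short Python sketch. This is only an informal illustration and not a construction of the claim: the `Segment` record, the function names, and the choice to require pairwise separation among three segments that are consecutive on the read are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Segment:
    """One aligned piece of a rearranged read (hypothetical record layout)."""
    read_start: int  # 0-based start on the rearranged read
    read_end: int    # exclusive end on the rearranged read
    ref_start: int   # 0-based start on the unrearranged molecule
    ref_end: int     # exclusive end on the unrearranged molecule


def mapped_fraction(read_length, segments):
    """Fraction of read bases covered by segments (assumed not to overlap on the read)."""
    return sum(s.read_end - s.read_start for s in segments) / read_length


def ref_gap(a, b):
    """Distance between two segments on the unrearranged molecule; 0 if they overlap."""
    first, second = sorted((a, b), key=lambda s: s.ref_start)
    return max(0, second.ref_start - first.ref_end)


def meets_thresholds(read_length, segments,
                     min_length=5_000, min_separation=10_000, min_mapped=0.70):
    """Check the three numeric thresholds under the assumptions stated above."""
    if read_length < min_length or len(segments) < 3:
        return False
    if mapped_fraction(read_length, segments) < min_mapped:
        return False
    # Walk the read in order and test each run of three consecutive segments,
    # i.e. segments joined at a first and a second junction.
    ordered = sorted(segments, key=lambda s: s.read_start)
    for trio in zip(ordered, ordered[1:], ordered[2:]):
        if all(ref_gap(a, b) >= min_separation for a, b in combinations(trio, 2)):
            return True
    return False
```

In practice the segment coordinates would come from aligning a long read to a reference or assembly; here they are supplied by hand, and whether the segments are truly in phase is taken as given rather than tested.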

Description

CROSS-REFERENCE

This application is a continuation of U.S. patent application Ser. No. 17/197,551, filed Mar. 10, 2021, which is a continuation of U.S. patent application Ser. No. 16/078,741, filed Aug. 22, 2018, now U.S. Pat. No. 10,975,417, which is a national stage entry of International Application No. PCT/US17/19099, filed Feb. 23, 2017, which claims the benefit of U.S. Provisional Application No. 62/298,906, filed Feb. 23, 2016, which is hereby explicitly incorporated by reference in its entirety, and this application also claims the benefit of U.S. Provisional Application No. 62/298,966, filed Feb. 23, 2016, which is hereby explicitly incorporated by reference in its entirety, and this application also claims the benefit of U.S. Provisional Application No. 62/305,957, filed Mar. 9, 2016, which is hereby explicitly incorporated by reference in its entirety.

BACKGROUND

It remains difficult in theory and in practice to determine haplotype phase information of complex DNA samples, such as those having diploid or polyploid genomes, or those comprising substantial amounts of repetitive or identical sequence. Difficulties arise from loci of interest being separated by highly repetitive regions or by long stretches of identical sequence, such that standard assembly of read information is insufficient to assign phase information to alleles at a locus.

SUMMARY

Disclosed herein are methods, compositions and systems related to the accurate phasing of nucleic acid sequence data through the generation and sequencing, such as long-read sequencing, of segmentally rearranged nucleic acid molecules such as chromosomes.
Disclosed herein are methods of generating long-distance phase information from a first DNA molecule, comprising a) providing a first DNA molecule having a first segment and a second segment, wherein the first segment and the second segment are not adjacent on the first DNA molecule; b) contacting the first DNA molecule to a DNA binding moiety such that the first segment and the second segment are bound to the DNA binding moiety independent of a common phosphodiester backbone of the first DNA molecule; c) cleaving the first DNA molecule such that the first segment and the second segment are not joined by a common phosphodiester backbone; d) attaching the first segment to the second segment via a phosphodiester bond to form a reassembled first DNA molecule; and e) sequencing at least 4 kb of consecutive sequence of the reassembled first DNA molecule comprising a junction between the first segment and the second segment in a single sequencing read, wherein first segment sequence and second segment sequence represent long-distance phase information from the first DNA molecule. In some aspects the DNA binding moiety comprises a plurality of DNA-binding molecules, such as DNA-binding proteins. In some aspects the population of DNA-binding proteins comprises nuclear proteins broadly, nucleosomes, or in some cases, more specifically histones. In some aspects contacting the first DNA molecule to a plurality of DNA-binding moieties comprises contacting to a population of DNA-binding nanoparticles.
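
Why steps a) through e) preserve phase can be seen in a toy Python model. The sketch below is illustrative only and makes several simplifying assumptions not found in the disclosure: the molecule is a short string, fragments carry their original coordinates along with them (standing in for alignment of the read back to a reference), every fragment of the molecule is religated, and fragment orientation is never flipped. All function names are hypothetical.

```python
import random


def cleave(molecule, cut_sites):
    """Step c): cut the molecule, keeping each fragment's original start coordinate."""
    bounds = [0, *sorted(cut_sites), len(molecule)]
    return [(s, molecule[s:e]) for s, e in zip(bounds, bounds[1:]) if e > s]


def religate(fragments, rng):
    """Steps b) and d): fragments held in one complex are rejoined in arbitrary order."""
    shuffled = fragments[:]
    rng.shuffle(shuffled)
    return shuffled


def sequence_read(rearranged):
    """Step e): a single long read spanning every junction of the rearranged molecule."""
    return "".join(seq for _, seq in rearranged)


def phased_alleles(rearranged, variant_positions):
    """Collect the base seen at each variant position of the original molecule."""
    alleles = {}
    for start, seq in rearranged:
        for pos in variant_positions:
            if start <= pos < start + len(seq):
                alleles[pos] = seq[pos - start]
    return alleles


# Two toy haplotypes that differ only at positions 3 and 35.
hap1 = "ACGT" * 10
hap2 = hap1[:3] + "G" + hap1[4:35] + "C" + hap1[36:]

rng = random.Random(0)
joined = religate(cleave(hap2, [10, 20, 30]), rng)
print(sequence_read(joined))
print(phased_alleles(joined, [3, 35]))
```

Because every fragment in the rearranged molecule derives from the same starting molecule, the alleles recovered from one read belong to one haplotype regardless of the order in which the fragments were rejoined.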
Often, the first DNA molecule has a third segment not adjacent on the first DNA molecule to the first segment or the second segment, wherein the contacting in (b) is conducted such that the third segment is bound to the DNA binding moiety independent of the common phosphodiester backbone of the first DNA molecule, wherein the cleaving in (c) is conducted such that the third segment is not joined by a common phosphodiester backbone to the first segment and the second segment, wherein the attaching comprises attaching the third segment to the second segment via a phosphodiester bond to form the reassembled first DNA molecule, and wherein the consecutive sequence sequenced in (e) comprises a junction between the second segment and the third segment in a single sequencing read. The method often comprises contacting the first DNA molecule to a cross-linking agent, such as formaldehyde. In some aspects the DNA binding moiety is bound to a surface comprising a plurality of DNA binding moieties. In some aspects the DNA binding moiety is bound to a solid framework comprising a bead. In some aspects cleaving the first DNA molecule comprises contacting to a restriction endonuclease, a nonspecific endonuclease, a tagmentation enzyme, or a transposase. In some aspects cleaving the first DNA molecule comprises shearing the first DNA molecule. Optionally, the method comprises adding a tag to at least one exposed end. Exemplary tags comprise a labeled base, a methylated base, a biotinylated base, uridine, or any other noncanonical base. In some aspects the tag generates a blunt-ended exposed end. In some aspects the method comprises adding at least one base to a recessed strand of a first segment sticky end. In some aspects the method comprises adding a linker oligo comprising an overhang that anneals to the first segment sticky end. I