EP-4735634-A1 - SYSTEMS AND METHODS OF SEQUENCING POLYNUCLEOTIDES WITH MODIFIED BASES

Abstract

Described are DNA sequencing systems and methods. Systems and methods for determining the nucleotide sequence of a polynucleotide may include determining the sequence of a polynucleotide with more than four types of bases by increasing the number of encoding states in sequencing by synthesis (SBS) systems. The systems and methods may expand the encoding space of current systems by discretizing the amplitude space in a channel into more than two states, and/or by adding additional channels.
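The size of the encoding space described above can be illustrated with a short calculation. This is a minimal sketch, assuming each channel's amplitude levels can be resolved independently of the other channels, so the result is an upper bound rather than a figure taken from the disclosure. The function name `encoding_states` is hypothetical.

```python
from math import prod


def encoding_states(levels_per_channel):
    """Return the upper bound on distinct signal states.

    levels_per_channel holds one integer per imaging channel: the number
    of amplitude levels that channel is discretized into (2 = on/off).
    Assumes levels in different channels combine independently.
    """
    if any(levels < 1 for levels in levels_per_channel):
        raise ValueError("each channel needs at least one amplitude level")
    return prod(levels_per_channel)


# Conventional two-channel scheme: two on/off channels.
two_channel = encoding_states([2, 2])
# Discretizing each of the two channels into three amplitude levels.
three_levels = encoding_states([3, 3])
# Keeping on/off levels but adding one more channel.
extra_channel = encoding_states([2, 2, 2])

print(two_channel, three_levels, extra_channel)
```

Under this assumption, either approach yields more than the four states needed for the natural bases, which leaves room to encode modified bases.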

Inventors

  • LU, BO
  • WEI, WEI
  • GOH, Hui Kheng Karen
  • VIECELI, JOHN SILVIO
  • TEO, Yin Nah

Assignees

  • Illumina, Inc.

Dates

Publication Date: 2026-05-06
Application Date: 2024-06-25

Claims (20)

WHAT IS CLAIMED IS:

  1. A method of sequencing polynucleotides bound to a flow cell and having one or more modified nucleotides, comprising: detecting fluorescent emissions from a first labeled nucleotide at a first wavelength; detecting fluorescent emissions from a second labeled nucleotide at a second wavelength, wherein the first wavelength is different from the second wavelength; detecting fluorescent emissions from a third labeled nucleotide at the first and second wavelengths; detecting the absence of fluorescent emissions from a fourth labeled nucleotide; detecting fluorescent emissions from a fifth labeled nucleotide at a third wavelength; and determining the sequence of the polynucleotides and the one or more modified nucleotides based on the detected fluorescent emissions, wherein at least one of the labeled nucleotides is a modified nucleotide.
  2. The method of claim 1, wherein one or more modified nucleotides is selected from the group comprising 5-methylcytosine, N6-methyladenine, and inosine.
  3. The method of claim 1, wherein detecting the fluorescent emissions at the third wavelength comprises detecting the fluorescent emissions at a first intensity and at a second intensity, wherein the first intensity corresponds to a first modified nucleotide and wherein the second intensity corresponds to a second modified nucleotide, wherein the third wavelength is different than the first wavelength and the second wavelength.
  4. The method of claim 1, wherein the flow cell comprises wells configured to bind polynucleotides; further wherein the incorporation of one of the at least four labelled nucleotide conjugates into a well is detected from at least one signal state.
  5. The method of claim 4, wherein the presence of an empty well is determined from a dark state.
  6. The method of claims 4 or 5, wherein each of the at least four labelled nucleotides are distributed as a cloud with an intensity level of at least one signal state such that each labelled nucleotide is more similar to nucleotides of the same label than to those from different labelled nucleotides.
  7. The method of claim 6, wherein the at least four labelled nucleotides are distributed as a cloud with an intensity level of at least two signal states.
  8. The method of claim 6, wherein empty wells are distributed as a cloud with an intensity of at least one signal state such that each empty well is more similar to empty wells than to those from the different detectable nucleotide conjugate types, and there are at least five clouds.
  9. The method of claim 1, further comprising determining the presence of an empty well based on the detection pattern of the at least four labelled nucleotides into the polynucleotide thereby determining the sequence of a polynucleotide, wherein determining the presence of an empty well comprises using four clouds, and the empty well is labelled as one of the at least four different types of detectable nucleotide conjugates.
  10. The method of claim 1, further comprising determining the presence of an empty well based on the detection pattern of the at least four labelled nucleotides into the polynucleotide thereby determining the sequence of a polynucleotide, wherein determining the presence of an empty well comprises using five clouds, and the empty well is labelled as a separate cloud.
  11. The method of any of claims 1 through 10, wherein a labelling of empty wells is performed by analyzing if a dark state exists for the first ten cycles of a sequencing run.
  12. The method of claim 11, wherein a labelling of empty wells is performed by analyzing if a series of a single nucleotide exists in the first ten cycles of a sequencing, and relabeling subsequent nucleotides as an empty well.
  13. The method of any of claims 1 through 12, wherein a population of a cloud that comprises a nucleotide and an empty well are segmented into two populations.
  14. The method of any of claims 1 through 13, wherein the population of the cloud is segmented by Otsu’s method.
  15. The method of any of claims 1 through 14, wherein the population of the cloud is segmented by a k-means algorithm.
  16. The method of any of claims 1 through 15, wherein the population of the cloud is segmented by an expectation maximization algorithm.
  17. The method of any of claims 1 through 16, wherein the determining step is performed for at least one of a single cycle, a few early cycles, or every cycle in a sequencing run.
  18. The method of any of claims 1 through 17, wherein the determining step labels a fifth cloud with a placeholder label until the cloud is confirmed to correspond to an empty well after the first 20 cycles in a sequencing run.
  19. The method of any of claims 1 through 18, wherein the determining step is present for every cycle in a sequencing run, and the determining step is further used to detect that an insert was completely sequenced.
  20. A non-transitory computer-readable medium storing a polynucleotide sequencing program including instructions that, when executed by a processor, causes a polynucleotide sequencing apparatus, to: detect fluorescent emissions from a first labeled nucleotide at a first wavelength; detect fluorescent emissions from a second labeled nucleotide at a second wavelength, wherein the first wavelength is different from the second wavelength; detect fluorescent emissions from a third labeled nucleotide at the first and second wavelengths; detect the absence of fluorescent emissions from the fourth labeled nucleotide; detect fluorescent emissions from one or more modified nucleotides at a third wavelength; and determine the sequence of the polynucleotides and the one or more modified nucleotides based on the detected fluorescent emissions.
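The detection pattern recited in claims 1, 3, and 20 can be read as a lookup from per-wavelength signals to a nucleotide label. The following is a minimal sketch under stated assumptions, not the claimed implementation: intensities are assumed to be already corrected and normalized so that an "on" signal sits near 1.0, the two third-wavelength intensity levels of claim 3 are assumed to sit near 1.0 and 2.0, and the function name, thresholds, and output labels are all hypothetical.

```python
def call_signal_state(ch1, ch2, ch3, on_threshold=0.5, ch3_high_threshold=1.5):
    """Map intensities at three wavelengths to a nucleotide label.

    ch1, ch2, ch3: normalized intensities at the first, second, and third
    wavelengths. The thresholds are illustrative assumptions.
    """
    first_on = ch1 >= on_threshold
    second_on = ch2 >= on_threshold
    third_on = ch3 >= on_threshold

    if third_on:
        # No pattern in the claims combines the third wavelength with the
        # first or second, so such a signal is left uncalled here.
        if first_on or second_on:
            return "no_call"
        # Claim 3: two intensity levels at the third wavelength correspond
        # to two different modified nucleotides.
        return "modified_2" if ch3 >= ch3_high_threshold else "modified_1"
    if first_on and second_on:
        return "nucleotide_3"  # emits at the first and second wavelengths
    if first_on:
        return "nucleotide_1"  # emits at the first wavelength only
    if second_on:
        return "nucleotide_2"  # emits at the second wavelength only
    return "nucleotide_4"      # dark: absence of fluorescent emissions


print(call_signal_state(1.0, 1.0, 0.0))
print(call_signal_state(0.0, 0.0, 2.0))
```

Fixed thresholds are used only to keep the sketch short; a production base caller would more plausibly fit the clouds per cycle, as the dependent claims on cloud segmentation suggest.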
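Claims 13 and 14 recite splitting a cloud that mixes a nucleotide and empty wells into two populations using Otsu's method. The sketch below is one possible histogram-based version of that method for one-dimensional intensities. It assumes the two populations differ in residual intensity; the function name, bin count, and sample values are hypothetical and synthetic.

```python
def otsu_threshold(values, bins=64):
    """Return the threshold that maximizes between-class variance.

    values: one intensity per well in the mixed cloud.
    bins: number of equal-width histogram bins (an illustrative choice).
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all intensities are identical; nothing to split")
    width = (hi - lo) / bins

    # Build the histogram; the maximum value falls into the last bin.
    hist = [0] * bins
    for v in values:
        hist[min(int((v - lo) / width), bins - 1)] += 1

    centers = [lo + (i + 0.5) * width for i in range(bins)]
    total = len(values)
    sum_all = sum(c * h for c, h in zip(centers, hist))

    best_var, best_idx = -1.0, 0
    count_low, sum_low = 0, 0.0
    for i in range(bins - 1):
        count_low += hist[i]
        sum_low += centers[i] * hist[i]
        count_high = total - count_low
        if count_low == 0 or count_high == 0:
            continue
        mean_low = sum_low / count_low
        mean_high = (sum_all - sum_low) / count_high
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = count_low * count_high * (mean_low - mean_high) ** 2
        if var_between > best_var:
            best_var, best_idx = var_between, i
    # The threshold is the upper edge of the best low-class bin.
    return lo + (best_idx + 1) * width


# Synthetic intensities: empty wells versus wells holding the dark nucleotide.
empty_wells = [0.010, 0.020, 0.030, 0.020, 0.035]
dark_nucleotide = [0.20, 0.22, 0.25, 0.21, 0.24]
threshold = otsu_threshold(empty_wells + dark_nucleotide)
labels = ["empty" if v < threshold else "nucleotide"
          for v in empty_wells + dark_nucleotide]
print(threshold, labels)
```

The k-means and expectation maximization alternatives of claims 15 and 16 would replace only the thresholding step; the labelling that follows would stay the same.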

Description

ILLINC.763WO / IP-2525-PCT PATENT

SYSTEMS AND METHODS OF SEQUENCING POLYNUCLEOTIDES WITH MODIFIED BASES

INCORPORATION BY REFERENCE TO ANY PRIORITY APPLICATIONS

[0001] This application claims priority to U.S. Provisional Application No. 63/511,366 filed June 30, 2023, the content of which is incorporated by reference in its entirety.

BACKGROUND

Field

[0002] The present disclosure relates to DNA sequencing systems and methods. In particular, this disclosure relates to improved detection methods for detecting more than four nucleotides by increasing the types of nucleotides detectable in an image, and by increasing the number of images detecting nucleotides.

Background

[0003] Current sequencing technologies involve determining DNA or RNA sequences by deciphering four natural bases in the genome: A, T (U), G, and C. However, many of these DNA or RNA sequences have modified nucleotide bases. These modified bases play essential roles in biological processes such as epigenetic studies, epi-transcriptomics, human diseases, and cancer. A common form of DNA modification is a methylated C (5-methylcytosine or 5-MeC) base found in CpG dinucleotides. RNA modifications can also arise from noncoding RNA, such as ribosomal RNA and transfer RNA. The current standard for DNA methylation analysis typically uses genome sequencing of bisulfite-converted DNA. Since uracil, read as thymine, will bind to complementary adenosine, 5-MeC can be partially inferred with existing four base detection systems. However, as the number and complexity of chemical base modifications continue to grow, it may be advantageous to be able to detect more than the four unmodified bases during a DNA sequencing process.

[0004] Current base calling schemes for four bases generally include either two or four-channel base calling. Some sequencing systems, such as those from Illumina, Inc. (San Diego, CA) use onboard real-time analysis (RTA) to turn raw image data into base calls.
This process can be massively parallelized in order to occur in real-time on the instrument. The number of images fed into the RTA base calling software could be either four images (referred to as four-channel base calling) or two images (referred to as two-channel base calling).

[0005] For two-channel base calling systems, the four DNA bases are encoded using two bits of information from the two channels (“on” and “off” states). After extracting the signal from the images, the intensities are first normalized and scaled through various correction algorithms. A probabilistic model that uses an assumption of equal base diversity is typically used during base-calling to map a probability density distribution of detecting each base based on the observed fluorescence emissions. Once this probabilistic model is determined, the base calls can then be assigned based on the inference outcome of the probabilistic model (i.e., the most likely base). In contrast, a typical four channel base calling system might simply associate each color channel with a specific fluorescently labeled nucleotide base. Using a corrected intensity, the base-calling step assigns a base call label based on the channel with the highest amplitude. Both processes may be repeated for every well on the flow cell surface and every chemistry cycle in the sequencing instrument.

[0006] Typical two channel and four channel systems are designed to detect only four distinct fluorescent signals corresponding to the four natural bases, and as configured, cannot determine more than four types of nucleotides. Existing RTA models assume equal base diversity and are trained on the four natural bases and are not currently suitable for detecting more than four bases. The dyes and ratios of dyes used for typical two channel and four channel systems are optimized to resolve four distinct fluorescent signals.
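The two conventional base-calling schemes of paragraph [0005] can be sketched side by side. This is a simplified illustration, not the RTA implementation: the four-channel caller picks the channel with the highest corrected intensity, and the two-channel caller stands in for the probabilistic model with a nearest-centroid rule, which gives the most likely base only under the added assumptions of equal base frequencies and equal isotropic Gaussian noise around each centroid. The assignment of bases to on/off patterns in `TWO_CHANNEL_CENTROIDS` is hypothetical, as are all names.

```python
def call_four_channel(intensities):
    """Return the base whose channel has the highest corrected intensity.

    intensities: dict mapping a base label to that channel's intensity.
    """
    return max(intensities, key=intensities.get)


# Hypothetical on/off encoding of the four bases in two channels.
# The actual base-to-pattern assignment is not given in the text.
TWO_CHANNEL_CENTROIDS = {
    "A": (1.0, 1.0),  # on in both channels
    "C": (1.0, 0.0),  # on in the first channel only
    "T": (0.0, 1.0),  # on in the second channel only
    "G": (0.0, 0.0),  # dark in both channels
}


def call_two_channel(ch1, ch2, centroids=TWO_CHANNEL_CENTROIDS):
    """Return the base whose centroid is nearest to the observed point.

    Equivalent to choosing the most likely base when all bases are equally
    frequent and noise is the same isotropic Gaussian for every base.
    """
    def squared_distance(base):
        c1, c2 = centroids[base]
        return (ch1 - c1) ** 2 + (ch2 - c2) ** 2

    return min(centroids, key=squared_distance)


print(call_four_channel({"A": 0.10, "C": 0.70, "G": 0.15, "T": 0.05}))
print(call_two_channel(0.9, 0.1))
```

Both callers hard-code exactly four outcomes, which is the limitation the surrounding paragraphs describe: a fifth signal has no label to map to.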
Accordingly, typical systems and methods currently do not distinguish between any additional bases and the four natural bases, leading to errors in base identification.

SUMMARY

[0007] An aspect of the disclosure is directed to a method of sequencing polynucleotides bound to a flow cell and having one or more modified nucleotides, including: detecting fluorescent emissions from a first labeled nucleotide at a first wavelength; detecting fluorescent emissions from a second labeled nucleotide at a second wavelength, wherein the first wavelength is different from the second wavelength; detecting fluorescent emissions from a third labeled nucleotide at the first and second wavelengths; detecting the absence of fluorescent emissions from a fourth labeled nucleotide; detecting fluorescent emissions from a fifth labeled nucleotide at a third wavelength; and determining the sequence of the polynucleotides and the one or more modified nucleotides based on the detected fluorescent emissions wherein at least one of the labeled nucleotides is a modified nucleotide.

[0008] In some embodiments, the one or more modified nucleotides can inclu