EP-4735637-A2 - MODIFYING SEQUENCING CYCLES OR IMAGING DURING A SEQUENCING RUN TO MEET CUSTOMIZED COVERAGE ESTIMATION
Abstract
This disclosure describes methods, non-transitory computer-readable media, and systems that can modify sequencing runs to ensure all genomic samples meet a target read-coverage level. The disclosed systems can estimate read coverage for each genomic sample in a genomic pool based on (i) clusters belonging to each sample derived from indexing sequences and/or (ii) filter metrics corresponding to each sample within a flow-cell pool. The disclosed systems can modify a sequencing run based on the estimated read coverage and a target read coverage. For example, the disclosed systems can adjust a number of sequencing cycles within a sequencing run to ensure that all genomic samples meet the target read coverage. Additionally, or alternatively, the disclosed systems can determine a set of flow cell tiles to be imaged to ensure that all genomic samples meet the target read coverage.
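The estimate-then-adjust idea in the abstract can be illustrated with a minimal sketch. It is not the disclosed implementation. It assumes the common approximation that mean coverage equals sequenced bases divided by the size of the targeted region, and that each genomic sequencing cycle adds one base to every read. All function names, parameters, and sample values are hypothetical.

```python
import math

def estimate_coverage(clusters, pass_fraction, cycles_per_read,
                      reads_per_cluster, region_size):
    """Approximate mean coverage as sequenced bases / region size (assumption)."""
    passing_clusters = clusters * pass_fraction
    bases = passing_clusters * reads_per_cluster * cycles_per_read
    return bases / region_size

def customized_cycles(pool, target_coverage, reads_per_cluster, region_size):
    """Smallest per-read cycle count at which every sample meets the target."""
    needed = []
    for clusters, pass_fraction in pool.values():
        passing_clusters = clusters * pass_fraction
        bases_required = target_coverage * region_size
        needed.append(math.ceil(bases_required / (passing_clusters * reads_per_cluster)))
    # The least-represented sample dictates the cycle count for the whole run.
    return max(needed)

# Hypothetical pool: sample -> (clusters assigned via index reads, pass-filter fraction)
pool = {"sample_a": (4_000_000, 0.75), "sample_b": (2_000_000, 0.75)}
REGION_SIZE = 5_000_000      # hypothetical 5 Mb target region
CURRENT_CYCLES = 100         # currently selected cycles per read
READS_PER_CLUSTER = 2        # paired-end run

current = {
    name: estimate_coverage(c, pf, CURRENT_CYCLES, READS_PER_CLUSTER, REGION_SIZE)
    for name, (c, pf) in pool.items()
}
cycles = customized_cycles(pool, 90.0, READS_PER_CLUSTER, REGION_SIZE)
print(current, cycles)
```

In this hypothetical pool, sample_b falls short of a 90x target at the currently selected 100 cycles, so the run would be extended. A pool in which every sample already exceeded the target would yield a smaller cycle count instead.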
Inventors
- FUHRMANN, Alexander
- SANGIORGIO, Paul
- COREY, Victoria
Assignees
- Illumina, Inc.
Dates
- Publication Date
- 20260506
- Application Date
- 20240626
Claims (20)
- 1. A system comprising: an imaging system; a fluidic system; and a computing engine comprising: at least one processor; and a non-transitory computer readable medium comprising instructions that, when executed by the at least one processor, cause the system to: determine, from a subset of sequencing cycles of a sequencing run for genomic samples, base calls for indexing sequences within clusters of oligonucleotides; determine, based on the indexing sequences, respective numbers of clusters of oligonucleotides belonging to respective genomic samples of the genomic samples; estimate read-coverage levels for the genomic samples based on the respective numbers of clusters of oligonucleotides belonging to respective genomic samples and a currently selected number of sequencing cycles for the sequencing run; generate, for the sequencing run and based on the estimated read-coverage levels, a customized number of sequencing cycles sufficient to generate nucleotide reads satisfying a target read-coverage level for each genomic sample of the genomic samples; and execute the sequencing run until finishing the customized number of sequencing cycles.
- 2. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to estimate the read-coverage levels by: determining filter metrics indicating subsets of clusters of oligonucleotides satisfying a filtering threshold for signals of the clusters of oligonucleotides; and estimating the read-coverage levels for the genomic samples based on the filter metrics and the respective numbers of clusters of oligonucleotides belonging to respective genomic samples.
- 3. The system of claim 2, further comprising instructions that, when executed by the at least one processor, cause the system to determine the filter metrics by determining, in a pass filter map, a percentage of clusters belonging to each genomic sample that satisfy a chastity filter for signals emitted from the clusters of oligonucleotides.
- 4. The system of claim 2, further comprising instructions that, when executed by the at least one processor, cause the system to estimate the read-coverage levels for the genomic samples by: determining, based on the filter metrics and the respective numbers of clusters of oligonucleotides belonging to respective genomic samples, a number of filter-passing clusters of oligonucleotides for each genomic sample of the genomic samples that satisfy the filtering threshold; and estimating a minimum number of nucleotide reads covering genomic regions of each genomic sample based on the number of filter-passing clusters of oligonucleotides.
- 5. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: determine, based on the estimated read-coverage levels, a customized set of flow cell regions to be imaged from a flow cell sufficient to generate the nucleotide reads satisfying the target read-coverage level for each genomic sample of the genomic samples; and execute the sequencing run by capturing images of the customized set of flow cell regions for the customized number of sequencing cycles using the imaging system.
- 6. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to perform the subset of sequencing cycles according to an order of indexing cycles before genomic sequencing cycles by: determining base calls for a first indexing sequence appended to a sample genomic sequence of a genomic sample; determining base calls for a second indexing sequence appended to the sample genomic sequence of the genomic sample; and after determining the base calls for the first indexing sequence and the second indexing sequence, determining base calls for a first nucleotide read corresponding to a first portion of the sample genomic sequence and determining base calls for a second nucleotide read corresponding to a second portion of the sample genomic sequence.
- 7. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to determine the respective numbers of clusters of oligonucleotides belonging to the respective genomic samples by: identifying, from among the indexing sequences, assigned indexing sequences matching indexing sequences registered for the sequencing run and unassigned indexing sequences that do not match the indexing sequences registered for the sequencing run; removing, from data for the sequencing run, a subset of clusters of oligonucleotides corresponding to the unassigned indexing sequences; determining respective subsets of assigned indexing sequences that correspond to the respective genomic samples; and determining, from among the respective subsets of assigned indexing sequences, a number of clusters of oligonucleotides belonging to each genomic sample.
- 8. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to generate the customized number of sequencing cycles for the sequencing run by increasing or decreasing a preset number of sequencing cycles for the sequencing run.
- 9. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to generate the customized number of sequencing cycles for the sequencing run by: identifying a minimum number of sequencing cycles and a maximum number of sequencing cycles for the sequencing run; and increasing or decreasing a preset number of sequencing cycles for the sequencing run to the customized number of sequencing cycles within the minimum number of sequencing cycles and the maximum number of sequencing cycles.
- 10. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to estimate the read-coverage levels by: determining, from the sequencing run, a number of unique nucleotide reads aligned with a reference genome; determining, from the sequencing run, a number of filter-passing nucleotide reads from filter-passing clusters of oligonucleotides with signals that satisfy a filtering threshold; determining a bioinformatics efficiency metric by dividing the number of unique nucleotide reads by the number of filter-passing nucleotide reads; and estimating the read-coverage levels for the genomic samples based on the bioinformatics efficiency metric and the respective numbers of clusters of oligonucleotides belonging to respective genomic samples.
- 11. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to detect a reagent volume of a reagent cartridge in fluid communication with the fluidic system and operate the fluidic system to perform one or more additional sequencing cycles relative to the currently selected number of sequencing cycles until finishing the customized number of sequencing cycles by aspirating one or more reagents from the reagent cartridge.
- 12. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to terminate operation of the fluidic system from performing one or more sequencing cycles of the currently selected number of sequencing cycles to finish the sequencing run after performing the customized number of sequencing cycles.
- 13. A system comprising: an imaging system; a fluidic system; and a computing engine comprising: at least one processor; and a non-transitory computer readable medium comprising instructions that, when executed by the at least one processor, cause the system to: determine, from a subset of sequencing cycles of a sequencing run for genomic samples, base calls for indexing sequences within clusters of oligonucleotides; determine, based on the indexing sequences, respective numbers of clusters of oligonucleotides belonging to respective genomic samples of the genomic samples; estimate read-coverage levels for the genomic samples based on the respective numbers of clusters of oligonucleotides belonging to respective genomic samples; determine, from a flow cell and based on the estimated read-coverage levels, a customized set of flow cell regions to be imaged sufficient to generate nucleotide reads satisfying a target read-coverage level for each genomic sample of the genomic samples; and execute the sequencing run by capturing images of the customized set of flow cell regions during sequencing cycles of the sequencing run.
- 14. The system of claim 13, further comprising instructions that, when executed by the at least one processor, cause the system to determine the customized set of flow cell regions by determining a customized number of flow cell regions to be imaged sufficient to generate the nucleotide reads satisfying the target read-coverage level for each genomic sample.
- 15. The system of claim 13, further comprising instructions that, when executed by the at least one processor, cause the system to determine the customized set of flow cell regions by determining, from a flow cell, a set of tiles to be imaged sufficient to generate the nucleotide reads satisfying the target read-coverage level for each genomic sample.
- 16. The system of claim 13, further comprising instructions that, when executed by the at least one processor, cause the system to capture the images of the customized set of flow cell regions without adjusting a currently selected number of sequencing cycles for the sequencing run.
- 17. The system of claim 13, further comprising instructions that, when executed by the at least one processor, cause the system to determine the customized set of flow cell regions by increasing or decreasing a number of flow cell regions from an initial set of flow cell regions selected for the sequencing run.
- 18. The system of claim 13, further comprising instructions that, when executed by the at least one processor, cause the system to estimate the read-coverage levels by: determining filter metrics indicating subsets of clusters of oligonucleotides satisfying a filtering threshold for signals of the clusters of oligonucleotides; and estimating the read-coverage levels for the genomic samples based on the filter metrics and the respective numbers of clusters of oligonucleotides belonging to respective genomic samples.
- 19. The system of claim 18, further comprising instructions that, when executed by the at least one processor, cause the system to determine the filter metrics by determining, in a pass filter map, a percentage of clusters belonging to each genomic sample that satisfy a chastity filter for signals emitted from the clusters of oligonucleotides.
- 20. The system of claim 18, further comprising instructions that, when executed by the at least one processor, cause the system to estimate the read-coverage levels for the genomic samples by: determining, based on the filter metrics and the respective numbers of clusters of oligonucleotides belonging to respective genomic samples, a number of filter-passing clusters of oligonucleotides for each genomic sample of the genomic samples that satisfy the filtering threshold; and estimating a minimum number of nucleotide reads covering genomic regions of each genomic sample based on the number of filter-passing clusters of oligonucleotides.
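Several of the claimed steps reduce to simple bookkeeping, sketched below for illustration only. The sketch covers counting clusters per sample after discarding unassigned indexing sequences (claim 7), the bioinformatics efficiency ratio (claim 10), bounding a customized cycle count (claim 9), and sizing a set of tiles to image (claims 14 and 15). The tile calculation assumes clusters are spread uniformly across tiles, which the claims do not state. All names and values are hypothetical and none come from the patent.

```python
import math
from collections import Counter

def count_clusters_per_sample(index_calls, registered):
    """Count clusters per sample, skipping indexes not registered for the run."""
    counts = Counter()
    for index_seq in index_calls:            # one index call per cluster
        sample = registered.get(index_seq)   # None means an unassigned index
        if sample is not None:
            counts[sample] += 1
    return counts

def bioinformatics_efficiency(unique_aligned_reads, filter_passing_reads):
    """Unique aligned reads divided by filter-passing reads."""
    return unique_aligned_reads / filter_passing_reads

def clamp_cycles(requested, minimum, maximum):
    """Keep a customized cycle count within the allowed range."""
    return max(minimum, min(requested, maximum))

def tiles_needed(coverage_all_tiles, target_coverage, total_tiles):
    """Tiles to image, assuming coverage scales linearly with tile count."""
    fraction = target_coverage / coverage_all_tiles
    return min(total_tiles, math.ceil(fraction * total_tiles))

# Hypothetical inputs
registered = {"ACGT": "sample_a", "TTAG": "sample_b"}
calls = ["ACGT", "ACGT", "TTAG", "GGGG", "ACGT"]   # "GGGG" is unassigned

counts = count_clusters_per_sample(calls, registered)
efficiency = bioinformatics_efficiency(900, 1000)
cycles = clamp_cycles(350, minimum=100, maximum=300)
tiles = tiles_needed(120.0, 90.0, total_tiles=96)
print(dict(counts), efficiency, cycles, tiles)
```

Passing the lowest per-sample coverage estimate to `tiles_needed` keeps every sample at or above the target. When that estimate is already below the target, the function falls back to imaging all tiles.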
Description
MODIFYING SEQUENCING CYCLES OR IMAGING DURING A SEQUENCING RUN TO MEET CUSTOMIZED COVERAGE ESTIMATION
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to and the benefit of U.S. Provisional Patent Application No. 63/511,564, entitled “MODIFYING SEQUENCING CYCLES OR IMAGING DURING A SEQUENCING RUN TO MEET CUSTOMIZED COVERAGE ESTIMATION,” filed on June 30, 2023, and U.S. Provisional Patent Application No. 63/517,160, entitled “MODIFYING SEQUENCING CYCLES OR IMAGING DURING A SEQUENCING RUN TO MEET CUSTOMIZED COVERAGE ESTIMATION,” filed on August 2, 2023. Each of the aforementioned applications is hereby incorporated by reference in its entirety.
BACKGROUND
[0002] In recent years, biotechnology firms and research institutions have improved hardware and software for sequencing nucleotides and determining nucleobase calls for genomic samples. For instance, some existing sequencing machines and sequencing-data-analysis software (together “existing sequencing systems”) predict individual nucleobases within sequences by using conventional Sanger sequencing or sequencing-by-synthesis (SBS) methods. When using SBS, existing sequencing systems can monitor many thousands to billions of oligonucleotides being synthesized in parallel from templates to predict nucleobase calls for growing nucleotide reads. During a sequencing run in many existing sequencing systems, a camera captures images of irradiated fluorescent tags incorporated into oligonucleotides. After capturing such images, some existing sequencing systems determine nucleobase calls for nucleotide reads corresponding to respective clusters of oligonucleotides on a flow cell or other nucleotide-sample substrate for a given sequencing run. For example, some existing sequencing systems utilize sequencing-data-analysis software to analyze image data captured during sequencing cycles to determine nucleobase calls for given clusters of oligonucleotides and sequence such calls across sequencing cycles to determine nucleotide reads for the given clusters.
[0003] As part of such improved genomic sequencing, biotechnology firms and research institutions have also improved methods of simultaneously pooling and sequencing large numbers of genomic samples. Existing sequencing systems may pool genetic samples from different individuals to increase the number of samples analyzed in a single sequencing run. For instance, existing sequencing systems may utilize sample multiplexing (or multiplex sequencing) to add individual “barcode” or indexing sequences to each deoxyribonucleic acid (DNA) fragment during library preparation. The indexing sequences correspond to individual genomic samples within the sample pool. After the indexing sequences have been identified, existing sequencing systems may perform demultiplexing to identify which indexing sequences (and which clusters of oligonucleotides on a flow cell) correspond with which genomic samples.
[0004] Despite recent advances in multiplexing and per-cycle image analysis, existing sequencing systems cannot accurately determine nucleotide-read coverage for a given genomic sample until after concluding a sequencing run and face other technical shortcomings that vary the level of nucleotide-read coverage for samples provided by read data from a given sequencing run. In multiplexed sequencing, for example, the number of nucleotide fragments from each genomic sample in clusters may not be evenly distributed, leading to variations in nucleotide-read depth or coverage. This uneven representation sometimes results in a sequencing device executing an insufficient number of sequencing cycles or images (or otherwise under-sequencing) for a sequencing run to generate the requisite numbers or length of nucleotide reads that satisfy a target level of coverage for a given sample. While sequencing devices can under-sequence DNA fragments extracted from some samples, sequencing devices can sometimes execute an excessive number of sequencing cycles or images (or otherwise over-sequence) for a sequencing run to generate the requisite numbers or length of nucleotide reads to satisfy the target coverage level.
[0005] Due to the uncertainty and variation of the read data coverage for a given sample produced by a given sequencing run, existing sequencing systems often inefficiently consume an inordinate amount of computing time, memory, and consumable materials to compensate for run-to-run variations. Some existing sequencing systems inefficiently consume an inordinate amount of computing time and memory to address under-sequenced samples. For instance, existing sequencing systems often perform additional sequencing cycles during a sequencing run to avoid under-sequencing some samples. The additional sequencing cycles require an excessive amount of computing time, memory, and reagents. As a result of performing additional sequencing cycles within a sequencing run, existing sequencing systems often over-sequence sam