CN-121999874-A - Gene library acquisition method for improving single cell amplified genome coverage

Abstract

The invention relates to the field of bioinformatics and discloses a gene library acquisition method for improving single-cell amplified genome (SAG) coverage. To address the insufficient genome coverage of single-cell amplified genomes, the method combines composite library construction with optimized data integration. For the same single-cell sample, at least one original library and at least one sub-library with different average fragment lengths are constructed and sequenced. Barcode parsing and effective single-cell identification are then performed on the sequencing data using a customizable sequence-feature detection method. Based on the effective single-cell identification result, all sequencing data belonging to the same single cell are identified, collected, and merged into a per-cell integrated sequence data set, which constitutes the single-cell gene library data. The resulting gene library data have high SAG coverage and offer higher reliability and application value for downstream single-cell genome analysis.
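
The integration step summarized in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `merge_libraries`, the representation of a read as a `(barcode, sequence)` tuple, and the toy barcodes are all assumptions made for illustration. The sketch pools reads from the original library and the sub-library for every barcode that has already been judged to be an effective single cell.

```python
from collections import defaultdict


def merge_libraries(original_reads, sub_reads, valid_barcodes):
    """Pool reads from both libraries per effective cell barcode.

    original_reads, sub_reads: iterables of (barcode, sequence) tuples
        (hypothetical read representation).
    valid_barcodes: barcodes already identified as effective single cells.
    Returns a dict mapping barcode -> list of (library_name, sequence).
    """
    valid = set(valid_barcodes)
    merged = defaultdict(list)
    for library_name, reads in (("original", original_reads), ("sub", sub_reads)):
        for barcode, sequence in reads:
            # Reads whose barcode is not an effective cell are discarded.
            if barcode in valid:
                merged[barcode].append((library_name, sequence))
    return dict(merged)


# Toy data: barcode AAAC is an effective cell, the others are not.
original = [("AAAC", "ACGTACGT"), ("AAAG", "TTTTCCCC"), ("AAAC", "GGGGAAAA")]
sub = [("AAAC", "CCCCGGGG"), ("TTTT", "ACACACAC")]
sag = merge_libraries(original, sub, {"AAAC"})
```

In this toy run, `sag` holds three reads for barcode `AAAC`: two from the original library and one from the sub-library.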

Inventors

CHAO SHAN
LIN JUNKAI
YU JIALE
ZHENG GUANTAO
PEI HAO

Assignees

墨卓生物科技(浙江)有限公司
上海墨卓生物科技有限公司

Dates

Publication Date: 20260508
Application Date: 20260408

Claims (10)

1. A gene library acquisition method for improving single-cell amplified genome coverage, characterized by comprising the following steps: constructing gene amplification libraries for the same single-cell amplified genome sample to be tested, wherein constructing the gene amplification libraries comprises constructing at least one original library and at least one sub-library, and the average fragment length of the original library is greater than that of the sub-library; performing high-throughput sequencing on the original library and the sub-library to obtain corresponding sequencing sequence data; identifying and collecting sequencing sequence data from the same single cell according to the cell barcode sequence contained in the sequencing sequence data, so as to distinguish the effective sequencing sequence data of the single-cell amplified genome of that single cell; and merging the sequencing sequence data from the same single cell to obtain the gene library data of the single-cell amplified genome of that single cell.
2. The gene library acquisition method for improving single-cell amplified genome coverage according to claim 1, wherein said constructing gene amplification libraries comprises: pre-amplifying nucleic acids extracted from a single-cell sample to obtain a pre-amplified product comprising long fragments; performing a first purification on the pre-amplified product to construct the original library; and performing a second purification and fragmentation on the pre-amplified product after the first purification to obtain short fragments within a preset length range, so as to construct the sub-library.
3. The gene library acquisition method for improving single-cell amplified genome coverage according to claim 2, wherein said original library has an average fragment length of 600 bp to 5 kb and is not fragmented, and said sub-library has an average fragment length of 300 bp to 500 bp.
4. The gene library acquisition method for improving single-cell amplified genome coverage according to claim 1, wherein said identifying and collecting sequencing sequence data from the same single cell comprises: defining the feature-sequence attributes of the cell barcode and the unique molecular identifier in a sequencing sequence according to a predefined library structure configuration file, wherein the attributes comprise at least the sequence position, the length, and the fault-tolerant matching rule against a valid sequence list; organizing the valid sequence list in a prefix tree data structure to support fast fault-tolerant matching; for each sequencing sequence, locating the feature-sequence region according to the library structure configuration file, and using the prefix tree to perform fault-tolerant matching between the sequence in that region and the valid sequence list; and collecting the successfully matched sequencing sequences into the single-cell group corresponding to the matched valid cell barcode.
5. The method of claim 4, wherein, after or during the step of identifying and collecting sequencing sequence data from the same single cell, an effective single-cell identification step is performed, comprising: counting, for the sequencing sequence data of the original library, the number of unique molecular identifiers associated with each identified cell barcode; sorting all cell barcodes in descending order of the number of unique molecular identifiers; and identifying, from the sorted list, the set of cell barcodes containing single-cell amplified genomes of effective single cells, based on the sorting result and a preset threshold.
6. The gene library acquisition method for improving single-cell amplified genome coverage according to claim 5, wherein said data merging step comprises: based on the data analysis result of the original library, extracting and merging all sequencing sequence data in the original library identified as coming from effective single cells together with all sequencing sequence data in the sub-library carrying the same cell barcodes, to generate the gene library data of the single-cell amplified genome of the same single cell.
7. The gene library acquisition method for improving single-cell amplified genome coverage according to claim 4, wherein said identifying and collecting sequencing sequence data from the same single cell further comprises a deduplication step for the genomic sequencing data, comprising: for sequencing sequence data grouped by cell barcode and unique molecular identifier, aligning the target fragment regions of the sequences within the same group; generating a consensus sequence based on the alignment result; and, when the proportion of original sequences supporting the consensus sequence relative to the total number of sequences in the group exceeds a preset threshold, outputting the consensus sequence as the unique representative sequence of the group.
8. The gene library acquisition method for improving single-cell amplified genome coverage according to claim 1, wherein said high-throughput sequencing of said original library and said sub-library comprises: allocating the sequencing data volume differentially between the original library and the sub-library, wherein the data volume allocated to the sub-library is not lower than the data volume allocated to the original library.
9. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, realizes the steps of the method according to any one of claims 1 to 8.
10. A data processing system comprising a processor and a memory, characterized in that: the memory stores a computer program; the processor is configured to execute the computer program to implement the steps of the method according to any one of claims 1 to 8.
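
The barcode matching of claim 4 and the effective-cell identification of claim 5 can be illustrated with a minimal Python sketch. It is offered under stated assumptions and is not the claimed implementation: the class `BarcodeTrie`, the functions `assign_reads` and `call_valid_cells`, the `layout` dictionary keys, and the toy reads are hypothetical. Fault tolerance is modeled here as substitution-only (Hamming) mismatches, ambiguous matches are rejected, and the preset threshold is interpreted as a minimum number of distinct unique molecular identifiers per barcode.

```python
from collections import defaultdict


class BarcodeTrie:
    """Prefix tree over a valid barcode list, searched with a mismatch budget."""

    def __init__(self, whitelist):
        self.root = {}
        for barcode in whitelist:
            node = self.root
            for base in barcode:
                node = node.setdefault(base, {})
            node["$"] = barcode  # terminal marker stores the full barcode

    def match(self, query, max_mismatches=1):
        """Return the closest valid barcode, or None if absent or ambiguous."""
        hits = {}
        stack = [(self.root, 0, 0)]  # (node, position in query, mismatches so far)
        while stack:
            node, pos, mismatches = stack.pop()
            if pos == len(query):
                if "$" in node:
                    hits[node["$"]] = mismatches
                continue
            for base, child in node.items():
                if base == "$":
                    continue
                cost = mismatches + (base != query[pos])
                if cost <= max_mismatches:  # prune branches over budget
                    stack.append((child, pos + 1, cost))
        if not hits:
            return None
        best = min(hits.values())
        winners = [b for b, m in hits.items() if m == best]
        return winners[0] if len(winners) == 1 else None


def assign_reads(reads, trie, layout, max_mismatches=1):
    """Group the UMI of each matched read under its corrected cell barcode."""
    groups = defaultdict(list)
    for read in reads:
        cb = read[layout["cb_start"]:layout["cb_start"] + layout["cb_len"]]
        hit = trie.match(cb, max_mismatches)
        if hit is not None:
            umi = read[layout["umi_start"]:layout["umi_start"] + layout["umi_len"]]
            groups[hit].append(umi)
    return groups


def call_valid_cells(groups, min_umis):
    """Rank barcodes by distinct UMI count (descending) and apply a threshold."""
    counts = {cb: len(set(umis)) for cb, umis in groups.items()}
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [cb for cb, n in ranked if n >= min_umis]


# Hypothetical library layout: 4-base barcode followed by a 3-base UMI.
layout = {"cb_start": 0, "cb_len": 4, "umi_start": 4, "umi_len": 3}
trie = BarcodeTrie(["ACGT", "TTTT"])
reads = ["ACGTAAACCC", "ACGAAATGGG", "TTTTCCCAAA", "GGGGAAATTT"]
groups = assign_reads(reads, trie, layout)
valid_cells = call_valid_cells(groups, min_umis=2)
```

In this toy run the second read, whose barcode `ACGA` differs from `ACGT` by one base, is assigned to `ACGT`, the fourth read is discarded, and only `ACGT` reaches the two-UMI threshold. A trie lets the search abandon a branch as soon as the mismatch budget is exceeded, which is the property the claims rely on for fast fault-tolerant matching against a large valid sequence list.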

Description

Gene library acquisition method for improving single cell amplified genome coverage

Technical Field

The invention relates to the field of bioinformatics, specifically to the acquisition of gene sequence data for single-cell amplified genomes, and in particular to a method for acquiring a single-cell amplified genome library with high coverage.

Background

Single-cell sequencing technology, in particular single-cell microbial genome sequencing (microbe-seq), is an important means of detecting microbial genes. It provides a high-resolution view of the genetic heterogeneity, species composition, functional potential, and antibiotic resistance of individual cells in a microbial community. By encapsulating individual microbial cells in microfluidic droplets and combining whole genome amplification with high-throughput sequencing, the technology obtains genome data representing a single microbial cell, namely a single-cell amplified genome (SAG). Because the initial amount of DNA in a single cell is extremely small, factors such as amplification bias, nonlinear amplification, and limited sequencing data volume mean that, after whole genome amplification, the resulting sequencing reads cover only a low proportion of the target microbial genome. Single-cell sequencing therefore faces a common and prominent problem of insufficient coverage per SAG, that is, the sequencing reads of a single-cell amplified genome cover a low fraction of the target genome. Low coverage can severely affect the depth and reliability of downstream analysis, for example by leading to incomplete genome assembly, missing gene annotations, and difficulty in performing accurate single nucleotide polymorphism (SNP) or copy number variation (CNV) analysis, which limits the value of the technology in precise microbiome research.
Currently, efforts to improve single-cell genome data quality focus mainly on optimizing experimental procedures, such as improving amplification enzymes or library construction reagents, so that sequencing reads cover more of the target genome. However, these methods tend to be costly and procedurally complex, and they yield limited improvement. At the bioinformatics analysis level, existing single-cell sequencing data processing software (such as STARsolo and Cell Ranger) is designed mainly for mammalian cell transcriptomes and assumes a fixed library structure. Such software generally uses whitelist matching based on a hash table and Hamming distance, which is slow when the fault-tolerance requirement is high or the whitelist is large, and it cannot flexibly adapt to the diverse, non-standard library structures of different single-cell microbial library construction platforms (such as BacDrop and smRandom-seq). More importantly, existing analysis workflows typically process a single type of sequencing library and lack an effective strategy for intelligently integrating data from different types of sequencing libraries derived from the same single-cell sample to systematically improve final genome coverage. There is therefore a need in the art for a single-cell sequencing data processing method that is both flexible and efficient, so as to fundamentally address the core problem of insufficient single-cell amplified genome coverage.

Disclosure of Invention

To solve the technical problem of insufficient single-cell amplified genome coverage, the invention provides a gene library acquisition method for improving single-cell amplified genome coverage. Through a strategy of composite library construction and optimized data integration, the method effectively improves the coverage of the resulting single-cell amplified genome library data, thereby providing a higher-quality data basis for subsequent accurate genome assembly, gene annotation, and mutation analysis. The specific technical scheme of the invention is as follows. The invention provides a gene library acquisition method for improving single-cell amplified genome coverage, which comprises the following steps: constructing gene amplification libraries for the same single-cell amplified genome sample to be tested, wherein constructing the gene amplification libraries comprises constructing at least one original library and at least one sub-library, and the average fragment length of the original library is greater than that of the sub-library; performing high-throughput sequencing on the original library and the sub-library to obtain corresponding sequencing sequence data; identifying and collecting sequencing sequence data from the same single cell according to the cell barcode sequence contained in the sequencing sequence data, so as to distinguish the effective sequencing sequence data of the single-cell amplified genome of that single cell; And (3) carrying o