CN-121999865-A - Integrated genome analysis method based on low-depth sequencing

Abstract

The invention relates to the field of biomedicine and discloses an integrated genome analysis method based on low-depth sequencing. The method comprises: obtaining low-depth whole-genome sequencing data; calling SNVs, indels and CNVs after quality control and alignment; performing multi-dimensional functional annotation that combines genomic position, coding impact, splicing disruption, conservation, regulatory elements and three-dimensional chromatin interaction; integrating information from databases such as ClinVar and HGMD; and generating a clinically interpretable report according to the ACMG/AMP standards by means of a phenotype-driven rule engine. By constructing a complete analysis chain covering sequence variant detection, multi-dimensional functional annotation, three-dimensional genome association, public database integration and phenotype-driven interpretation, the invention overcomes the fundamental defect that traditional low-depth sequencing outputs only a raw variant list and cannot provide a basis for clinical action.

Inventors

YAN LIYING
CHEN YIDONG
SONG SHI
LI HANNA
GUAN SHUO
WANG YUQIAN
QIAO JIE
QIN MENG
ZHU XIAOHUI
KUO YING
WANG YUN
YANG JUN
LI JIACHENG
YAN ZHIQIANG

Assignees

Peking University Third Hospital (Peking University Third School of Clinical Medicine)

Dates

Publication Date: 20260508
Application Date: 20260122

Claims (10)

1. An integrated genome analysis method based on low-depth sequencing, comprising: acquiring raw low-depth whole-genome sequencing data of an individual to be tested; performing quality control filtering and alignment on the raw data, calling genomic variants, and generating a standardized variant call file; extracting all single nucleotide variant sites, small indels and copy number variant regions from the variant call file; performing multi-level functional annotation on the variant sites and regions, wherein the multi-level functional annotation comprises genomic position annotation, coding region impact prediction, splice site disruption evaluation, conservation score calculation, regulatory element overlap analysis and three-dimensional chromatin interaction association inference; integrating known clinically relevant variant information from public pathogenicity databases, and prioritizing the functional annotation results; and, in combination with the individual's phenotype information, employing a rule-based decision engine to generate a final clinical interpretability report.
2. The integrated genome analysis method based on low-depth sequencing according to claim 1, wherein acquiring the raw low-depth whole-genome sequencing data of the individual to be tested comprises: collecting a peripheral blood sample from the individual and extracting genomic deoxyribonucleic acid; constructing a library using random primers, with the sequencing depth controlled within a coverage range of 0.5x to 3x; and generating paired-end reads on a high-throughput sequencing platform, with a read length of 150 base pairs.
3. The integrated genome analysis method based on low-depth sequencing according to claim 2, wherein the quality control filtering and alignment of the raw data comprises: performing quality trimming on the reads using a sliding window method, and removing bases with quality values lower than 20; removing reads containing adapter contamination or low-complexity sequences; aligning the filtered reads to the human reference genome version GRCh38, the alignment algorithm being BWA-MEM; after alignment, performing duplicate marking, local realignment and base quality score recalibration; and, based on the processed alignment file, calling GATK HaplotypeCaller to identify single nucleotide variants and small indels, and using the CNVkit tool to detect copy number variant regions.
4. The integrated genome analysis method based on low-depth sequencing according to claim 3, wherein, in the multi-level functional annotation: the genomic position annotation determines whether a variant is located in an exon, an intron, a promoter, an enhancer, an insulator or a silencer region; the coding region impact prediction uses the SnpEff tool to identify amino acid substitutions, premature termination or frameshift variants; the splice site disruption evaluation calculates donor and acceptor site score changes with the SpliceAI algorithm, with the threshold set to greater than 0.2; the conservation score calculation uses a combined GERP++ and PhyloP scoring system, a combined score higher than 2 being regarded as highly conserved; the regulatory element overlap analysis determines whether a variant falls within an active promoter or a strong enhancer region based on chromatin state maps defined by the ENCODE and Roadmap Epigenomics projects; and the three-dimensional chromatin interaction association inference uses chromatin loop structures constructed from Hi-C data to determine whether a non-coding variant physically interacts with a distal gene promoter.
5. The integrated genome analysis method based on low-depth sequencing according to claim 4, wherein integrating the known clinically relevant variant information from public pathogenicity databases comprises: querying variant entries with definite clinical significance recorded in the ClinVar database; comparing against common polymorphic site frequencies in the dbSNP database, and excluding benign polymorphisms with a population frequency higher than 1%; searching for reported pathogenic variants in the HGMD Professional database; and retrieving somatic mutation hotspot information from the COSMIC database to aid tumor-related interpretation; wherein the databases are deployed locally and synchronized by periodic updates.
6. The integrated genome analysis method based on low-depth sequencing according to claim 5, wherein prioritizing the functional annotation results comprises: first, selecting variants that are located in the coding region of a known pathogenic gene and cause loss of protein function; second, including non-coding variants that are located in a regulatory element and have a three-dimensional interaction with a disease-related gene; third, considering novel variants that show high conservation and a significant splicing effect but are not recorded in any database; and finally, excluding neutral variants that lie in gene desert regions or do not overlap any functional element; wherein different evidence-level weights are assigned during the ranking.
7. The integrated genome analysis method based on low-depth sequencing according to claim 6, wherein employing the rule-based decision engine in combination with the individual's phenotype information to generate the final clinical interpretability report comprises: receiving standard human phenotype term codes entered by a clinician; mapping the codes to relevant disease entries in the OMIM and Orphanet databases; extracting a list of pathogenic genes associated with the disease entries; selecting, from the prioritized results, the variants located in the gene list; classifying the pathogenicity of each candidate variant according to 20 evidence criteria formulated in the ACMG/AMP guidelines, the classification results comprising 5 categories: pathogenic, likely pathogenic, uncertain significance, likely benign and benign; and generating a structured report that includes variant positions, allele frequencies, functional predictions, database support, degree of phenotype matching, and final classification conclusions.
8. The integrated genome analysis method based on low-depth sequencing according to claim 7, wherein the loss of protein function is defined as a nonsense variant, a frameshift indel, a canonical splice site variant or a whole-exon deletion, and the rules for applying the ACMG/AMP evidence criteria comprise: PS1 and PM5 require a strict match of amino acid position and change type; PM1 is defined as a variant located in a mutational hot spot region and/or a critical functional domain without known benign variation; PM2 is defined as a gnomAD frequency lower than 1/1000; PM4 is defined as a protein length change caused by an in-frame deletion/insertion in a non-repeat region or by stop codon loss; PP3 requires at least two algorithms to support a deleterious effect; and BP4 requires a majority of algorithms to support a benign effect.
9. The integrated genome analysis method based on low-depth sequencing according to claim 8, wherein the regulatory element overlap analysis uses a 25-state model generated by ChromHMM to determine whether a variant falls within an active promoter, a strong enhancer or a weak enhancer region, the overlap determination uses the BEDTools intersect tool, and an overlap of at least one base is required for a positive result.
10. The integrated genome analysis method based on low-depth sequencing according to claim 9, wherein the three-dimensional chromatin interaction association inference uses a high-resolution Hi-C interaction matrix from the GM12878 cell line at a resolution of 5000 base pairs; chromatin loop anchor pairs are extracted with the Juicebox tool; the coordinates of non-coding variants are matched against all loop anchors; and, if a variant is located within one anchor of a loop and the other anchor of that loop covers the promoter region of a known pathogenic gene, a physical interaction association is determined.
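
The anchor-matching rule of claim 10 can be illustrated with a minimal Python sketch. It assumes that loop anchor pairs and promoter intervals have already been exported to in-memory lists with 0-based, half-open coordinates; the names `Interval`, `Loop`, `Promoter`, `interacting_genes`, `GENE_A` and `GENE_B` are hypothetical and do not come from the patent, and no Juicebox or Hi-C library is invoked.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # 0-based, exclusive (assumed convention)


class Loop(NamedTuple):
    anchor_a: Interval
    anchor_b: Interval


class Promoter(NamedTuple):
    gene: str
    region: Interval


def contains(iv: Interval, chrom: str, pos: int) -> bool:
    """True if the position lies inside the interval."""
    return iv.chrom == chrom and iv.start <= pos < iv.end


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def interacting_genes(chrom: str, pos: int,
                      loops: List[Loop],
                      promoters: List[Promoter]) -> List[str]:
    """Genes whose promoter is covered by the opposite anchor of any
    loop that has the variant inside one of its anchors."""
    genes = set()
    for loop in loops:
        # Check both orientations: the variant may sit in either anchor.
        for near, far in ((loop.anchor_a, loop.anchor_b),
                          (loop.anchor_b, loop.anchor_a)):
            if contains(near, chrom, pos):
                genes.update(p.gene for p in promoters
                             if overlaps(far, p.region))
    return sorted(genes)


# Hypothetical example: one loop with two 5 kb anchors on chr1.
loops = [Loop(Interval("chr1", 100_000, 105_000),
              Interval("chr1", 500_000, 505_000))]
promoters = [Promoter("GENE_A", Interval("chr1", 503_000, 504_000)),
             Promoter("GENE_B", Interval("chr2", 503_000, 504_000))]
print(interacting_genes("chr1", 102_500, loops, promoters))  # ['GENE_A']
```

The linear scan is adequate for illustration only; a whole-genome run would more plausibly index the anchors by chromosome or use an interval tree.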

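The tiered prioritization of claim 6, together with the thresholds stated in claim 4 (SpliceAI score change greater than 0.2, combined conservation score higher than 2), might be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the `Variant` fields and function names are hypothetical, the third tier is read as requiring both high conservation and a significant splicing effect, and a fallback tier 4 for variants that match no rule is added here. The claims do not give numeric evidence-level weights, so the tier number alone is used as the sort key.

```python
from dataclasses import dataclass
from typing import List, Optional

SPLICEAI_THRESHOLD = 0.2      # claim 4: score change must exceed 0.2
CONSERVATION_THRESHOLD = 2.0  # claim 4: combined score must exceed 2


@dataclass
class Variant:
    # All fields are hypothetical, pre-computed annotation flags.
    name: str
    in_known_disease_gene_coding: bool = False
    loss_of_function: bool = False
    in_regulatory_element: bool = False
    loops_to_disease_gene: bool = False
    spliceai_delta: float = 0.0
    conservation: float = 0.0
    in_database: bool = False
    overlaps_functional_element: bool = True


def priority_tier(v: Variant) -> Optional[int]:
    """Return 1 to 4 (1 is highest priority) or None if excluded."""
    if v.in_known_disease_gene_coding and v.loss_of_function:
        return 1
    if v.in_regulatory_element and v.loops_to_disease_gene:
        return 2
    if (not v.in_database
            and v.conservation > CONSERVATION_THRESHOLD
            and v.spliceai_delta > SPLICEAI_THRESHOLD):
        return 3
    if not v.overlaps_functional_element:
        return None  # gene desert / no functional element: excluded
    return 4  # assumption: retained with lowest priority


def prioritize(variants: List[Variant]) -> List[str]:
    """Drop excluded variants and return names ordered by tier."""
    tiered = [(priority_tier(v), v) for v in variants]
    kept = [(t, v) for t, v in tiered if t is not None]
    kept.sort(key=lambda tv: tv[0])  # stable sort keeps input order in a tier
    return [v.name for _, v in kept]
```

For example, a frameshift in a known disease gene, an enhancer variant looping to a disease gene, a novel conserved splice-altering variant and a gene desert variant would come out in that order, with the last one removed from the list.
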
Description

Integrated genome analysis method based on low-depth sequencing

Technical Field

The invention belongs to the field of biomedicine, and particularly relates to an integrated genome analysis method based on low-depth sequencing.

Background

With the spread of high-throughput sequencing technologies, genomic analysis plays an increasingly critical role in disease screening, precision medicine and population genetics research. Low-depth whole genome sequencing (low-pass whole genome sequencing) is widely used for copy number variation detection and genotyping in large-scale cohort studies because of its low cost and high throughput. However, this strategy lacks a systematic, multi-dimensional assessment of the functional status of non-coding regions, so that a large number of potentially pathogenic changes located in regulatory elements cannot be annotated effectively, which severely limits its value in clinical interpretation.

Functional annotation based on epigenomic features has become a key direction for deepening variant interpretation. Epigenetic markers such as open chromatin regions and histone modification enrichment sites can indicate gene regulatory activity and provide a biological context for non-coding variants. Existing methods depend on experimentally measured epigenetic data (such as ATAC-seq and ChIP-seq), but such data are costly to acquire and strongly tissue-specific, and are difficult to integrate at scale into a routine low-depth sequencing workflow. In the low-depth sequencing setting, problems such as missing functional annotation, weak ability to recognize regulatory regions and poor model generalization are common.
Traditional annotation tools rely on static databases and cannot adapt dynamically to the regulatory landscapes of different cell types or pathological states, while directly introducing complex deep learning models runs into bottlenecks such as heavy consumption of computing resources and scarce training data, which makes such models difficult to deploy in a routine analysis pipeline. For rapid clinical interpretation, there is a need for an integrated analysis method that is lightweight, portable and able to focus preferentially on regulation-related variant regions, so as to extract as much functional meaning as possible from genomic variation at limited sequencing depth.

Disclosure of Invention

The invention provides an integrated genome analysis method based on low-depth sequencing, and aims to solve the technical defect that existing low-depth sequencing technology focuses only on detecting gene sequence variation and lacks systematic functional annotation and clinical interpretation of variant sites. By constructing a multi-level bioinformatics processing flow, the method deeply fuses raw sequencing data, genome structural features, epigenetic regulatory information, transcriptional regulatory networks and known pathogenicity databases, and realizes end-to-end automated analysis from raw sequence reads to a variant report with clear clinical interpretation value.
The integrated genome analysis method based on low-depth sequencing comprises the following steps: acquiring raw low-depth whole-genome sequencing data of an individual to be tested; performing quality control filtering and alignment on the raw data, calling genomic variants and generating a standardized variant call file; extracting all single nucleotide variant sites, small indels and copy number variant regions from the variant call file; performing multi-level functional annotation on the variant sites and regions, wherein the multi-level functional annotation comprises genomic position annotation, coding region impact prediction, splice site disruption evaluation, conservation score calculation, regulatory element overlap analysis and three-dimensional chromatin interaction association inference; integrating known clinically relevant variant information from public pathogenicity databases and prioritizing the functional annotation results; and, in combination with the individual's phenotype information, employing a rule-based decision engine to generate a final clinical interpretability report.

Further, acquiring the raw low-depth whole-genome sequencing data of the individual to be tested specifically comprises: collecting a peripheral blood sample from the individual and extracting genomic deoxyribonucleic acid; constructing a library using random primers, with the sequencing depth controlled within a coverage range of 0.5x to 3x; and generating paired-end reads on a high-throughput sequencing platform, with a read length of 150 base pairs. The quality control filtering and alignment of the raw