CN-121983126-A - High-throughput gene sequencing data analysis system based on deep learning

Abstract

The invention discloses a deep learning-based high-throughput gene sequencing data analysis system, belonging to the technical field of bioinformatics. The system first performs quality control and unified coordinate mapping on multi-source sequencing reads to construct a unified tensor representation. A deep feature extraction network then produces deep features, which are fused with external information to generate a comprehensive feature vector, and a gradient boosting tree model calculates a variant comprehensive risk index from that vector. The system next integrates the longitudinal variant risk with immune receptor rearrangement sequencing data, reconstructs the clonal phylogenetic structure, and estimates the intrinsic growth rate, competition coefficient and immune clearance intensity coefficient of each clone to obtain a clone abundance time series that is consistent in time and space. Finally, it constructs a clone evolution feature vector and maps it to an immune escape risk index through a support vector machine model, thereby ranking and grading high-risk clones. Throughout the process, a complete chain is preserved that can be traced back to the underlying evidence.

Inventors

  • Qi Xianjia
  • CHEN PING
  • ZHAO PENGYANG
  • LI HAIMIN
  • QI YUAN

Assignees

  • 上海旭燃生物科技有限公司

Dates

Publication Date
2026-05-05
Application Date
2025-12-16

Claims (10)

  1. A deep learning-based high-throughput gene sequencing data analysis system, characterized by comprising the following modules: a data fusion expression module, configured to perform quality control and unified coordinate mapping on multi-source sequencing reads so as to construct a unified tensor representation; a variant feature generation module, connected to the data fusion expression module and configured to generate deep feature vectors of candidate variant sites based on the unified tensor representation, to form comprehensive feature vectors by fusing external information, and to calculate a variant comprehensive risk index using a machine learning model; a clonal immune dynamics analysis module, connected to the variant feature generation module and configured to integrate the variant comprehensive risk indexes at a plurality of time points with immune receptor rearrangement sequencing data, to reconstruct the clonal phylogenetic structure, and to estimate the intrinsic growth rate, the inter-clone competition coefficients and the immune clearance intensity coefficient of each clone, thereby obtaining a clone abundance time series; and a future risk inference module, connected to the clonal immune dynamics analysis module and configured to construct a clone evolution feature vector based on the clone abundance time series, the intrinsic growth rate of each clone, the inter-clone competition coefficients, the immune clearance intensity coefficient and the variant comprehensive risk index, and to map the clone evolution feature vector to an immune escape risk index so as to rank and grade high-risk clones.
  2. The deep learning-based high-throughput gene sequencing data analysis system of claim 1, wherein the data fusion expression module is configured to normalize or standardize each feature channel and then to set a missing-data marker for genomic regions whose coverage is too low to be detectable.
  3. The deep learning-based high-throughput gene sequencing data analysis system of claim 1, wherein the variant feature generation module generates the deep feature vector via a convolutional neural network.
  4. The deep learning-based high-throughput gene sequencing data analysis system of claim 3, wherein the convolutional neural network comprises at least two convolutional layers and a global pooling layer.
  5. The deep learning-based high-throughput gene sequencing data analysis system of claim 1, wherein the machine learning model used in the variant feature generation module to calculate the variant comprehensive risk index is a gradient boosting tree model.
  6. The deep learning-based high-throughput gene sequencing data analysis system of claim 1, wherein the clonal immune dynamics analysis module fits and estimates the clone abundance time series by constructing a kinetic model that simultaneously couples the autonomous proliferation capacity of each clone, characterized by its intrinsic growth rate; the interactions between clones, characterized by the inter-clone competition coefficients; and the clearance effect of the immune system, characterized by the immune clearance intensity coefficients.
  7. The deep learning-based high-throughput gene sequencing data analysis system of claim 6, wherein the parameters of the kinetic model are estimated using a nonlinear least squares method, and biological plausibility constraints are imposed during the solution process.
  8. The deep learning-based high-throughput gene sequencing data analysis system of claim 1, wherein the clone evolution feature vector constructed by the future risk inference module comprises trajectory features extracted from the clone abundance time series, the intrinsic growth rate of the clone, the immune clearance intensity coefficient, the inter-clone competition coefficients, and derived features obtained by statistically summarizing the variant comprehensive risk indexes of all variants within the clone.
  9. The deep learning-based high-throughput gene sequencing data analysis system of claim 1, wherein the future risk inference module uses a support vector machine model to map the clone evolution feature vector to the immune escape risk index.
  10. The deep learning-based high-throughput gene sequencing data analysis system of claim 1, wherein a complete traceability chain is established between the modules, so that the finally output immune escape risk index can be traced back to the underlying original sequencing evidence and the related external information.
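
As an illustration of the kinetic model and constrained fitting in claims 6 and 7, the following is a minimal sketch and not the patented implementation. The claims name the three parameter groups but give no equations, so a generalized Lotka-Volterra form is assumed here: dx_i/dt = x_i (r_i - Σ_j a_ij x_j - c_i I), where r_i is the intrinsic growth rate, a_ij the inter-clone competition coefficients, c_i the immune clearance intensity coefficient and I a constant immune pressure. All function names and numeric values are hypothetical. To stay short, only the growth rates are fitted, and a naive bounded pattern search stands in for a proper nonlinear least squares solver; the bounds play the role of the biological plausibility constraints.

```python
def simulate(x0, r, a, c, immune, dt, steps):
    """Forward-Euler integration of the ASSUMED model
    dx_i/dt = x_i * (r_i - sum_j a[i][j] * x_j - c[i] * immune)."""
    x = list(x0)
    traj = [list(x)]
    for _ in range(steps):
        nxt = []
        for i in range(len(x)):
            competition = sum(a[i][j] * x[j] for j in range(len(x)))
            net_rate = r[i] - competition - c[i] * immune
            # Clip at zero so abundances never become negative.
            nxt.append(max(x[i] + dt * x[i] * net_rate, 0.0))
        x = nxt
        traj.append(list(x))
    return traj


def sse(r, observed, obs_steps, x0, a, c, immune, dt):
    """Sum of squared residuals at the observed time steps."""
    traj = simulate(x0, r, a, c, immune, dt, max(obs_steps))
    return sum(
        (traj[s][i] - obs[i]) ** 2
        for s, obs in zip(obs_steps, observed)
        for i in range(len(x0))
    )


def fit_growth_rates(r_init, objective, bounds=(0.0, 2.0),
                     step=0.2, tol=1e-4, max_iter=1000):
    """Bounded pattern search over growth rates (a stand-in optimizer).

    Each parameter is nudged up or down by `step`, kept inside `bounds`,
    and the move is accepted only if the objective strictly decreases.
    The step is halved whenever no move helps.
    """
    lo, hi = bounds
    r = list(r_init)
    best = objective(r)
    for _ in range(max_iter):
        if step <= tol:
            break
        improved = False
        for i in range(len(r)):
            for delta in (step, -step):
                trial = list(r)
                trial[i] = min(max(r[i] + delta, lo), hi)
                score = objective(trial)
                if score < best:
                    r, best, improved = trial, score, True
                    break
        if not improved:
            step /= 2.0
    return r, best


# Synthetic two-clone example; every number below is made up for illustration.
a = [[1.0, 0.2], [0.2, 1.0]]   # competition coefficients (held fixed)
c = [0.2, 0.1]                 # immune clearance coefficients (held fixed)
x0 = [0.1, 0.1]                # initial clone abundances
r_true = [0.8, 0.5]            # growth rates used to generate the "data"
obs_steps = [0, 10, 20, 30, 40, 50]

truth = simulate(x0, r_true, a, c, immune=1.0, dt=0.1, steps=50)
observed = [truth[s] for s in obs_steps]


def objective(r):
    return sse(r, observed, obs_steps, x0, a, c, 1.0, 0.1)


r_init = [0.4, 0.4]
init_sse = objective(r_init)
r_fit, fit_sse = fit_growth_rates(r_init, objective)
```

In a fuller treatment, the competition and clearance coefficients would be estimated jointly with the growth rates, and a trust-region or Levenberg-Marquardt solver with bound constraints would replace the pattern search.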

Description

High-throughput gene sequencing data analysis system based on deep learning

Technical Field

The invention relates to the technical field of bioinformatics, and in particular to a deep learning-based high-throughput gene sequencing data analysis system.

Background

With the development of high-throughput gene sequencing and of immunotherapies such as immune checkpoint inhibitors, exome sequencing, whole genome sequencing, circulating tumor DNA sequencing and immune receptor rearrangement sequencing have gradually been applied to treatment decisions and efficacy evaluation for patients with advanced cancer. In clinical practice, an immunotherapy regimen lasting several months is often formulated or adjusted on the basis of deep sequencing results from a single tumor biopsy sample or a few such samples, combined with the mutation spectrum, the tumor mutational burden and the predicted clonal structure. However, tissue biopsies are highly invasive, carry high risk and are difficult to repeat frequently; single-point sampling is also easily affected by the spatial heterogeneity within a tumor and can hardly represent the diversity of tumor clones comprehensively. Most existing analysis workflows treat sequencing data from different time points as static, mutually independent snapshots. They lack a unified time reference and a systematic method for longitudinal integration, and they do not dynamically model the growth, decline and competition of tumor clones under the selection pressure of treatment and of the immune system.

Disclosure of Invention

To overcome the above drawbacks of the prior art, embodiments of the present invention provide a deep learning-based high-throughput gene sequencing data analysis system that addresses the problems set forth in the background.
To achieve the above purpose, the present invention provides the following technical solution: a deep learning-based high-throughput gene sequencing data analysis system comprising the following modules:

  • The data fusion expression module performs quality control and unified coordinate mapping on multi-source sequencing reads so as to construct a unified tensor representation.
  • The variant feature generation module is connected to the data fusion expression module. It generates deep feature vectors of candidate variant sites based on the unified tensor representation, forms comprehensive feature vectors by fusing external information, and calculates a variant comprehensive risk index using a machine learning model.
  • The clonal immune dynamics analysis module is connected to the variant feature generation module. It integrates the variant comprehensive risk indexes at a plurality of time points with immune receptor rearrangement sequencing data, reconstructs the clonal phylogenetic structure, and estimates the intrinsic growth rate, the inter-clone competition coefficients and the immune clearance intensity coefficient of each clone, thereby obtaining a clone abundance time series.
  • The future risk inference module is connected to the clonal immune dynamics analysis module. It constructs a clone evolution feature vector based on the clone abundance time series, the intrinsic growth rate of each clone, the inter-clone competition coefficients, the immune clearance intensity coefficient and the variant comprehensive risk index, and maps the clone evolution feature vector to an immune escape risk index so as to rank and grade high-risk clones.
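
The last module's step of building a clone evolution feature vector and ranking clones can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented method: the particular features, the linear scoring function standing in for a trained support vector machine, the weights and the grading threshold are all hypothetical.

```python
from statistics import mean


def clone_feature_vector(abundance, growth_rate, clearance,
                         competition_row, variant_risks):
    """Assemble a hypothetical clone evolution feature vector."""
    # Average change in abundance per time step (a simple trajectory feature).
    slope = (abundance[-1] - abundance[0]) / (len(abundance) - 1)
    return [
        abundance[-1],           # most recent abundance
        max(abundance),          # peak abundance
        slope,                   # trajectory trend
        growth_rate,             # intrinsic growth rate
        clearance,               # immune clearance intensity coefficient
        mean(competition_row),   # average competition exerted on this clone
        max(variant_risks),      # highest variant risk index in the clone
        mean(variant_risks),     # average variant risk index in the clone
    ]


def linear_score(features, weights, bias=0.0):
    """Stand-in for a trained model's decision function (weights are made up)."""
    return sum(w * f for w, f in zip(weights, features)) + bias


def rank_and_grade(features_by_clone, weights, high_threshold=1.0):
    """Sort clones by descending score and attach a coarse grade."""
    scored = [
        (clone_id, linear_score(feats, weights))
        for clone_id, feats in features_by_clone.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [
        (clone_id, score, "high" if score >= high_threshold else "low")
        for clone_id, score in scored
    ]


# Two made-up clones: B is shrinking under strong clearance, A is expanding.
features_by_clone = {
    "B": clone_feature_vector([0.40, 0.30, 0.20, 0.10], 0.2, 0.6,
                              [0.2, 0.3], [0.3, 0.1]),
    "A": clone_feature_vector([0.05, 0.10, 0.20, 0.35], 0.8, 0.1,
                              [0.2, 0.3], [0.9, 0.6]),
}
# Hypothetical weights: clearance and competition lower the risk score.
weights = [1.0, 0.5, 2.0, 1.0, -1.0, -0.5, 1.0, 0.5]
ranking = rank_and_grade(features_by_clone, weights)
```

Keeping the clone identifier attached to each score is one simple way to keep the output traceable to the inputs it was derived from.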
In a preferred embodiment, the data fusion expression module, when constructing the unified tensor representation, first normalizes or standardizes each feature channel and then sets missing-data markers on genomic regions whose coverage is too low to be detectable. In a preferred embodiment, the variant feature generation module generates the deep feature vector via a convolutional neural network. In a preferred embodiment, the convolutional neural network comprises at least two convolutional layers and one global pooling layer. In a preferred embodiment, the machine learning model used in the variant feature generation module to calculate the variant comprehensive risk index is a gradient boosting tree model. In a preferred embodiment, the clonal immune dynamics analysis module fits and estimates the clone abundance time series by constructing a kinetic model that simultaneously couples the autonomous proliferation capacity of each clone, characterized by its intrinsic growth rate; the interactions between clones, characterized by the inter-clone competition coefficients; and the clearance effect of the immune system, characterized by the immune clearance intensity coefficients. In a preferred embodiment, the parameters of the kinetic model are estimated using a nonlinear least squares method and biological plausibility constraints are imposed in the solut