
CN-122000019-A - Genetic disease pathogenic site ranking and diagnosis assistance method and system based on multimodal artificial intelligence


Abstract

The invention discloses a method and system, based on multimodal artificial intelligence, for ranking the pathogenic sites of genetic diseases and assisting diagnosis. The method comprises: first collecting genetic disease data to construct a heterogeneous knowledge graph; obtaining patient data and executing an analysis flow that generates multimodal feature representations of candidate pathogenic sites from four dimensions (rule-based variant characteristics, tissue-specific mechanism, protein three-dimensional structure, and real-time literature evidence); using a ranking model to fuse and rank the features; generating an interpretable report and clinical examination suggestions based on the uncertainty of the ranking result; and, after a doctor performs the examinations according to the suggestions and feeds back new data, repeating the analysis flow until a preset iteration target is reached. The invention overcomes the limitations of static analysis models, achieves multidimensional evidence fusion and embedding in the clinical workflow, and remarkably improves the interpretation accuracy and diagnostic efficiency for the pathogenic sites of genetic diseases.

Inventors

  • WANG HAISHUAI
  • LI SHIYU
  • YAN ZIQI

Assignees

  • Zhejiang University (浙江大学)

Dates

Publication Date
2026-05-08
Application Date
2025-12-25

Claims (10)

  1. A method, based on multimodal artificial intelligence, for ranking the pathogenic sites of genetic diseases and assisting diagnosis, characterized by comprising the following steps:
     S1, collecting a genetic disease data set, and constructing a knowledge graph after preprocessing;
     S2, receiving initial patient data provided by a user;
     S3, screening candidate pathogenic mutations step by step based on the provided data and generating corresponding multidimensional feature representations:
     S3-1, acquiring gene and mutation information based on the initial patient data, weighting each item of information, and generating a rule feature vector of each candidate mutation; screening according to the rule feature vector to obtain a first candidate pathogenic mutation set;
     S3-2, based on the first candidate pathogenic mutation set, obtaining mutation, tissue and biological pathway information related to the candidate mutations, constructing a heterogeneous graph model representing biological association relations, performing feature aggregation and representation learning on the candidate mutation nodes using a heterogeneous graph neural network, and generating a mechanism feature vector of each candidate mutation; screening according to the mechanism feature vector to obtain a second candidate pathogenic mutation set;
     S3-3, for the second candidate pathogenic mutation set, modeling and analyzing the protein structure change corresponding to each candidate mutation, extracting structural features reflecting the change in structural stability, the conformational disturbance or the functional influence, and generating a structural feature vector of each candidate mutation; screening according to the structural feature vector to obtain a third candidate pathogenic mutation set;
     S3-4, based on external knowledge information, carrying out literature evidence support analysis on the third candidate pathogenic mutation set, and extracting and quantifying literature evidence feature vectors of the candidate mutations;
     wherein the feature generation processes in steps S3-1 to S3-3 are configured to introduce auxiliary information related to the candidate mutations by retrieving or calling an external database or literature resource, and to adjust the degree of influence of the external information in the feature generation process through corresponding learnable gating parameters;
     S4, based on the rule feature vector, the mechanism feature vector, the structural feature vector and the literature evidence feature vector, jointly ranking the candidate pathogenic mutations to generate a ranking result for the third candidate pathogenic sites;
     S5, evaluating the uncertainty of the ranking result, and generating an interpretability report and one or more clinical examination suggestions; and
     S6, receiving new clinical data fed back by the user after the examination is performed according to the clinical examination suggestions, integrating the new clinical data into the patient data set, and repeating steps S3 to S5 until a preset iteration termination condition is reached.
  2. The method for ranking the pathogenic sites of genetic diseases and assisting diagnosis based on multimodal artificial intelligence as set forth in claim 1, wherein the preprocessing in step S1 comprises giving gene mutation data, phenotype data, tissue-related data, pathway data, cell data and gene expression data a unified standardized representation, and applying outlier rejection and noise filtering to data that do not meet a preset quality threshold; and the knowledge graph uses an entity-relation form to explicitly model phenotypes, genes, mutations, tissues, cell types, molecular pathways and the association relations among them, and is used for uniformly representing prior knowledge and structured factual information related to genetic diseases.
  3. The method for ranking the pathogenic sites of genetic diseases and assisting diagnosis based on multimodal artificial intelligence as set forth in claim 1, wherein the initial patient data of step S2 comprises a standard Variant Call Format (VCF) file of the patient and corresponding clinical phenotype description information.
  4. The method for ranking the pathogenic sites of genetic diseases and assisting diagnosis based on multimodal artificial intelligence as set forth in claim 1, wherein the external information retrieved or invoked in steps S3-1, S3-2 and S3-3 comprises mutation annotation information, gene function information, tissue-specific or cell-type-specific information, biological pathway information, protein structure related information, and database information or literature evidence information related to the candidate mutations; and steps S3-1, S3-2 and S3-3 are each provided with mutually independent learnable gating parameters for respectively adjusting the contribution of different types of external information in the generation of the rule feature vector, the mechanism feature vector and the structural feature vector.
  5. The method for ranking the pathogenic sites of genetic diseases and assisting diagnosis based on multimodal artificial intelligence as set forth in claim 1, wherein the gene and mutation information obtained in step S3-1 includes basic mutation attributes, gene-specific rules, tissue matching degree and external evidence information; the gene-specific rules comprise the ACMG/AMP guidelines; the basic mutation attributes comprise mutation type, allele frequency and related disease information; the tissue matching degree is calculated based on the phenotype and on the enrichment, in a tissue, of the pathway in which the gene is located; and the external evidence comprises known pathogenic variant databases, literature evidence and clinical data.
  6. The method for ranking the pathogenic sites of genetic diseases and assisting diagnosis based on multimodal artificial intelligence according to claim 1, wherein the heterogeneous graph model in step S3-2 comprises variant nodes, gene nodes, phenotype nodes, cell type nodes and biological pathway nodes, the nodes are connected by edges representing biological association relations, and the edge weights are determined based on external database information or statistical analysis results; and the heterogeneous graph neural network performs representation learning on the heterogeneous graph, takes the variant node corresponding to a candidate mutation as a query node, and performs weighted aggregation and feature propagation over the gene nodes, phenotype nodes, cell type nodes and biological pathway nodes associated with that variant node to generate the mechanism feature vector.
  7. The method for ranking the pathogenic sites of genetic diseases and assisting diagnosis based on multimodal artificial intelligence according to claim 1, wherein the literature evidence analysis in step S3-4 is implemented through retrieval of, and reasoning over, external knowledge by an artificial intelligence model.
  8. The method for ranking the pathogenic sites of genetic diseases and assisting diagnosis based on multimodal artificial intelligence as set forth in claim 1, wherein the joint ranking of step S4 inputs the rule feature vector, the mechanism feature vector, the structural feature vector and the literature feature vector from step S3 into the ranking model and comprehensively scores the candidate pathogenic mutations, the scoring function being a trainable ranking model that maps the four feature vectors to a ranking score; the candidate mutations are ranked according to the ranking score.
  9. The method for ranking the pathogenic sites of genetic diseases and assisting diagnosis based on multimodal artificial intelligence as set forth in claim 1, wherein the uncertainty in step S5 is calculated by the Deep SHAP method, which yields the contribution of the rule feature vector, the mechanism feature vector, the structural feature vector and the literature feature vector of step S3 to the ranking result of step S4.
  10. A system, based on multimodal artificial intelligence, for ranking the pathogenic sites of genetic diseases and assisting diagnosis, comprising:
     a data acquisition and preprocessing module, used for acquiring genetic disease data and preprocessing the data;
     a knowledge graph module, used for constructing a heterogeneous knowledge graph from the acquired genetic disease data;
     a learnable gating parameter module, used for generating a group of learnable gating parameters for the rule reasoning module, the mechanism analysis module and the structure analysis module;
     a rule reasoning module, used for querying various kinds of genetic information;
     a mechanism analysis module, used for fusing mutations with phenotype, gene, expression level and pathway information;
     a structure analysis module, used for querying the protein structure variation information corresponding to the genes;
     a literature evidence analysis module, used for querying, in real time, literature related to the genes, mutations and phenotypes;
     a ranking module, used for integrating the results generated by the rule vector module, the mechanism vector module, the structure vector module and the literature vector module to generate a ranking result of candidate pathogenic sites of genetic diseases;
     an interpretability report module, used for generating an interpretability report from the results generated by the vector modules and the ranking module;
     an iteration module, which repeats steps S3 to S5 until the preset iteration termination condition is reached; and
     a visual interaction module, used for visualizing the clinical report and interacting with a user.
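
The staged screening, gated use of external information, and joint ranking recited in claims 1, 4 and 8 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the function names (`gated_feature`, `screen`, `joint_rank`), the sigmoid gate, the mean-value screening criterion, and the linear scoring model are all hypothetical stand-ins chosen for brevity, and the numbers are toy values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_feature(base, external, gate_logit):
    """Blend base features with external evidence.

    Assumption: the gate is sigmoid(gate_logit); in a trained model the
    logit would be a learnable parameter, here it is just a number.
    """
    gate = sigmoid(gate_logit)
    return [b + gate * e for b, e in zip(base, external)]


def screen(candidates, features, keep):
    """Keep the `keep` candidates with the largest mean feature value.

    Assumption: the mean is a placeholder screening criterion.
    """
    def mean(c):
        return sum(features[c]) / len(features[c])

    return sorted(candidates, key=mean, reverse=True)[:keep]


def joint_rank(candidates, feature_groups, weights):
    """Score each candidate with a linear model over concatenated features.

    Assumption: a weighted sum stands in for the trainable ranking model.
    """
    scored = []
    for c in candidates:
        concat = [x for group in feature_groups for x in group[c]]
        scored.append((c, sum(w * x for w, x in zip(weights, concat))))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data: three hypothetical candidate variants.
candidates = ["var_a", "var_b", "var_c"]
rule = {
    "var_a": gated_feature([0.2], [0.2], gate_logit=0.0),
    "var_b": gated_feature([0.8], [0.2], gate_logit=0.0),
    "var_c": gated_feature([0.5], [0.0], gate_logit=0.0),
}
first_set = screen(candidates, rule, keep=2)  # var_b and var_c survive

mechanism = {"var_b": [0.1], "var_c": [0.9]}
ranking = joint_rank(first_set, [rule, mechanism], weights=[1.0, 1.0])
print(ranking)
```

Each screening stage narrows the candidate set before the next, more expensive feature is computed, which is why the mechanism features above are only defined for the survivors of the first stage.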

Description

Genetic disease pathogenic site ranking and diagnosis assistance method and system based on multimodal artificial intelligence

Technical Field

The invention relates to a method and system, based on multimodal artificial intelligence, for ranking the pathogenic sites of genetic diseases and assisting diagnosis, and belongs to the interdisciplinary field of artificial intelligence and bioinformatics.

Background

Whole exome sequencing (WES) is currently the main technique for the molecular diagnosis of genetic diseases. However, its clinical diagnostic rate is generally only between 30% and 50%. This is mainly due to multiple limitations of existing bioinformatics-based pathogenicity prediction systems, which leave a large number of variants of uncertain significance (VUS) that are difficult to interpret. Most existing models or systems provide static, generic scores based on the DNA sequence and neglect differences in the transcription and pathway activity of a mutation in specific tissues, so they cannot explain tissue-specific phenotypes. For missense mutation interpretation, most systems rely on sequence analysis and lack the ability to verify function in three-dimensional space using high-precision structural models such as AlphaFold, so a large number of mutations that cause disease by damaging protein structure or interactions are missed. At present, most systems still call static databases such as ClinVar, which makes automatic, real-time retrieval and quantitative evaluation of newly published literature evidence difficult, and they cannot flexibly adapt to dynamically evolving, fine-grained interpretation rules such as the American College of Medical Genetics and Genomics/Clinical Genome Resource (ACMG/ClinGen) guidelines.
In addition, most existing bioinformatics analysis pipelines perform a one-time static analysis: the model cannot actively recommend the most informative next examination according to the uncertainty in the initial analysis result, and the data obtained from subsequent examinations cannot be fed back to the model for self-correction and weight adjustment. As a result, an intelligent closed loop of "prediction-recommendation-verification-re-prediction" cannot be formed, which limits the practical value of the model in real-world clinical pathways.

Disclosure of Invention

To address the above problems and difficulties in the prior art, the invention provides a method and system, based on multimodal artificial intelligence, for ranking the pathogenic sites of genetic diseases and assisting diagnosis. By constructing an intelligent system that supports dynamic iteration, it automatically integrates multidimensional evidence such as genome, transcriptome, protein structure and literature knowledge, thereby improving the accuracy of pathogenic site ranking, providing actionable diagnostic suggestions for clinicians, and forming an intelligent assistance closed loop deeply integrated with the clinical workflow.
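
The "prediction-recommendation-verification-re-prediction" closed loop can be sketched as a simple iteration. This is an illustrative sketch only: the entropy of a softmax over ranking scores is used here as an assumed stand-in for the uncertainty measure, and `analyze`, `request_exam`, `max_rounds` and `entropy_threshold` are hypothetical names and parameters that do not come from the patent text.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ranking_entropy(scores):
    """Shannon entropy (in nats) of the softmax over ranking scores.

    Assumption: high entropy means the top candidate is still ambiguous.
    """
    return -sum(p * math.log(p) for p in softmax(scores) if p > 0)


def closed_loop(analyze, request_exam, patient_data,
                max_rounds=3, entropy_threshold=0.5):
    """Re-run the analysis until the ranking is confident enough.

    analyze(patient_data) -> dict mapping candidate to score.
    request_exam(scores)  -> dict of new clinical data to merge in.
    """
    history = []
    scores = {}
    for round_index in range(max_rounds):
        scores = analyze(patient_data)
        entropy = ranking_entropy(list(scores.values()))
        history.append((round_index, entropy))
        if entropy <= entropy_threshold:
            break  # termination condition reached
        # Merge the fed-back examination results into the patient data.
        patient_data = {**patient_data, **request_exam(scores)}
    return scores, history


# Toy stand-ins: the ranking becomes decisive once exam data is present.
def toy_analyze(data):
    if "exam" in data:
        return {"var_1": 5.0, "var_2": 0.0}
    return {"var_1": 1.0, "var_2": 1.0}


def toy_request_exam(scores):
    return {"exam": True}


final_scores, history = closed_loop(toy_analyze, toy_request_exam, {})
print(final_scores, history)
```

In the toy run the first round yields two equally scored candidates, so an examination is requested; the second round, with the new data merged in, produces a decisive ranking and the loop stops.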
The first aspect of the invention provides a method, based on multimodal artificial intelligence, for ranking the pathogenic sites of genetic diseases and assisting diagnosis, which comprises the following steps:
S1, collecting a genetic disease data set, and constructing a knowledge graph after preprocessing;
S2, receiving initial patient data provided by a user;
S3, screening candidate pathogenic mutations step by step based on the provided data and generating corresponding multidimensional feature representations:
S3-1, acquiring gene and mutation information based on the initial patient data, weighting each item of information, and generating a rule feature vector of each candidate mutation; screening according to the rule feature vector to obtain a first candidate pathogenic mutation set;
S3-2, based on the first candidate pathogenic mutation set, obtaining mutation, tissue and biological pathway information related to the candidate mutations, constructing a heterogeneous graph model representing biological association relations, performing feature aggregation and representation learning on the candidate mutation nodes using a heterogeneous graph neural network, and generating a mechanism feature vector of each candidate mutation; screening according to the mechanism feature vector to obtain a second candidate pathogenic mutation set;
S3-3, for the second candidate pathogenic mutation set, modeling and analyzing the protein structure change corresponding to each candidate mutation, extracting structural features reflecting the change in structural stability, the conformational disturbance or the functional influence, and generating a structural feature vector of each candidate mutation; screening according to the structural feature vector to obtain a third candidate pathogenic mutation set;
S3-4, based on the external knowledge information, carrying out literature evidence suppor