CN-121983166-A - Rare disease drug redirection path mining method based on knowledge graph

Abstract

The invention discloses a rare disease drug redirection path mining method based on a knowledge graph. The method comprises: collecting data from public databases and the clinical environment; dividing the collected data into structured data and unstructured data; constructing preset rules and processing the structured data to obtain triples; building the knowledge graph from the triple data; extracting positive samples and negative samples from the knowledge graph and training the model by optimizing a loss function; extracting from the knowledge graph all paths that conform to the templates; and inputting those paths into the model to obtain path scores. The method improves the consistency and reliability of the graph and enables interpretable reasoning about potential relationships between diseases and drugs.

Inventors

JIN BO
Chen Kongyu
ZHANG LIANG

Assignees

Dalian University of Technology (大连理工大学)

Dates

Publication Date: 20260505
Application Date: 20251222

Claims (6)

1. A rare disease drug redirection path mining method based on a knowledge graph, characterized by comprising the following steps:
Step 1, constructing a knowledge graph;
Step 1-1, collecting data: the public databases comprise DrugBank, ChEMBL, UniProt, HGNC, DisGeNET, OMIM, Orphanet, ClinicalTrials.gov, PubMed, Reactome, KEGG, Pathway Commons and SIDER, providing multi-dimensional data covering drug information, target proteins, genes, diseases, pathways and adverse reactions; the clinical data comprise real-world data on Marfan syndrome, including brief structured information in electronic medical records, wherein the brief structured information comprises diagnoses, medication records and examination results;
Step 1-2, constructing triples and metadata: dividing the collected data into structured data and unstructured data, wherein the structured data comprise JSON files, tabular data and database tables, and the unstructured data comprise literature texts, clinical report summaries and database annotation texts; and constructing preset rules to process the structured data to obtain triples;
Step 1-3, importing all triples and metadata into a Neo4j graph database to construct a preliminary knowledge base to be optimized, and performing structural optimization on the knowledge base to be optimized, including entity standardization with merging of synonymous entities and low-confidence edge filtering; the knowledge graph is obtained once the structural optimization is complete;
Step 2, constructing a drug redirection path mining model based on the knowledge graph;
Step 2-1, extracting positive examples and negative examples from the knowledge graph, wherein the positive examples are known disease-drug associations and the negative examples are randomly drawn unrelated drug-disease pairs;
Step 2-2, for each disease-drug pair, extracting from the knowledge graph all paths that conform to a template to form a candidate path set, the templates comprising: disease-gene-protein-drug, disease-pathway-protein-drug, disease-gene-pathway-drug, and disease-protein-signaling pathway-drug; taking the candidate path set and its associated metadata as input, and outputting a score and interpretation information for each path (formula not reproduced), wherein the path features include the node types, relationship types and statistical information in the path, and the path metadata include the source database, literature evidence and confidence; paths whose score exceeds a set threshold of 0.7 to 0.8 are taken as candidate results, and the high-scoring paths are finally output to support subsequent clinical verification.
2. The rare disease drug redirection path mining method based on a knowledge graph according to claim 1, wherein the preset rules constructed to process the structured data are as follows: for JSON files, traversing the key-value pairs to extract field contents and generate triples; for tabular data, extracting the required column fields row by row to generate triples; for database tables, executing SQL statements to select the designated fields and converting them into triples according to the rules; entity and relation extraction is performed using a large language model (LLM) to obtain triples; each triple carries corresponding metadata, including the entity source database, the relation source, a data timestamp, the original text fragment and a document ID.
3. The rare disease drug redirection path mining method based on a knowledge graph according to claim 1, wherein the entity standardization with merging of synonymous entities comprises: extracting important keywords according to expert opinion; for these keywords, manually writing regular-expression matching rules, based on the synonymous entities present in the knowledge graph, that replace the original words with standardized names; and, for the remaining non-standardized entities, computing semantic vector similarity with Sentence-BERT, entities with a similarity greater than 0.85 being regarded as synonymous and merged; the low-confidence edge filtering comprises: each triple, i.e. one edge in the knowledge graph, comprises two connected entities and the relation between them, and the confidence of the edge is calculated from the entity source database and the relation source recorded in the metadata (formula not reproduced), wherein the confidence of an edge lies between 0 and 1; the credibility of the entity source database is 1.0 for an authoritative database, 0.9 for a secondary database and 0.8 for a clinical data source; the credibility of the relation source is 1.0 for manually curated or expert-annotated relations, 0.9 for relations recorded in an authoritative database and 0.8 for relations derived from the literature; the data source credibility weight and the evidence support weight are set empirically; 0.8 is typically taken as the low-confidence threshold, and edges whose confidence falls below the threshold are removed; the entity source databases comprise the authoritative databases DrugBank, ChEMBL, UniProt, HGNC, DisGeNET, OMIM and Orphanet, the secondary databases Reactome, KEGG, Pathway Commons, SIDER and ClinicalTrials.gov, and clinical data.
4. The rare disease drug redirection path mining method based on a knowledge graph according to claim 1, wherein step 2 further comprises: the constructed knowledge graph is denoted $G = (V, R, E)$, where $V$ is the node set, $R$ is the set of relation types and $E$ is the edge set; the R-GCN node embedding update formula is:
$$h_i^{(l+1)} = \sigma\left(\sum_{r \in R} \sum_{j \in N_i^r} \frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)} + W_0^{(l)} h_i^{(l)}\right)$$
where $h_i^{(l)}$ is the embedding of node $i$ at layer $l$, $N_i^r$ is the set of neighbor nodes of node $i$ under relation type $r$, $h_j^{(l)}$ is the embedding of a neighbor node $j$ of node $i$ under relation type $r$, $W_r^{(l)}$ is the weight matrix of the corresponding relation type, $W_0^{(l)}$ is the self-loop weight matrix of layer $l$, $c_{i,r}$ is the normalization constant and $\sigma$ is a nonlinear activation function; the node embedding of the last R-GCN layer, i.e. layer $L$, is denoted $z_i = h_i^{(L)}$; the model maximizes the predicted probability of positive sample edges and minimizes that of negative sample edges, with the formula:
$$p(d, m) = \operatorname{sigmoid}\left(z_d^{T} z_m\right)$$
where $z_d$ is the embedding of the disease node, $z_m$ is the embedding of the drug node and $T$ is the matrix transpose operator; the training loss is the cross-entropy, with the formula:
$$\mathcal{L} = -\sum_{(d,m)} \left[ y_{dm} \log p(d, m) + \left(1 - y_{dm}\right) \log\left(1 - p(d, m)\right) \right]$$
where $y_{dm} = 1$ for a positive sample and $y_{dm} = 0$ for a negative sample.
5. The rare disease drug redirection path mining method based on a knowledge graph according to claim 1, wherein the positive examples are known disease-drug associations, comprising: atenolol improving the cardiovascular diastolic function of patients with Marfan syndrome, losartan reducing the risk of aortic dilation, and propranolol controlling heart rate abnormalities.
6. The rare disease drug redirection path mining method based on a knowledge graph according to claim 1, wherein the negative examples are randomly drawn unrelated drug-disease pairs, comprising Marfan syndrome-amoxicillin, Marfan syndrome-ibuprofen and Marfan syndrome-omeprazole.
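
The template-based path extraction and score thresholding of step 2-2 in claim 1 can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the toy graph, the node names (`disease_1`, `gene_1`, ...), the function names and the hard-coded scores are hypothetical, and the scores stand in for the output of the trained model, which is not reproduced here. The templates and the 0.7 threshold are taken from the claim.

```python
from collections import defaultdict

# Node-type templates listed in claim 1, step 2-2.
TEMPLATES = [
    ("disease", "gene", "protein", "drug"),
    ("disease", "pathway", "protein", "drug"),
    ("disease", "gene", "pathway", "drug"),
    ("disease", "protein", "signaling_pathway", "drug"),
]


def build_adjacency(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def template_paths(adjacency, node_type, disease, drug, templates=TEMPLATES):
    """Return all simple disease-to-drug paths whose node types match a template."""
    found = set()
    for template in templates:
        if node_type[disease] != template[0]:
            continue
        stack = [(disease,)]
        while stack:
            path = stack.pop()
            if len(path) == len(template):
                if path[-1] == drug:
                    found.add(path)
                continue
            wanted = template[len(path)]  # node type required at the next position
            for neighbor in adjacency[path[-1]]:
                if node_type[neighbor] == wanted and neighbor not in path:
                    stack.append(path + (neighbor,))
    return sorted(found)


def select_candidates(scores, threshold=0.7):
    """Keep paths scoring above the threshold, highest score first."""
    kept = [(path, score) for path, score in scores.items() if score > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical toy graph; the names are placeholders, not real biomedical facts.
node_type = {
    "disease_1": "disease",
    "gene_1": "gene",
    "protein_1": "protein",
    "pathway_1": "pathway",
    "drug_1": "drug",
}
edges = [
    ("disease_1", "gene_1"),
    ("gene_1", "protein_1"),
    ("protein_1", "drug_1"),
    ("disease_1", "pathway_1"),
    ("pathway_1", "protein_1"),
    ("gene_1", "pathway_1"),
    ("pathway_1", "drug_1"),
]

adjacency = build_adjacency(edges)
paths = template_paths(adjacency, node_type, "disease_1", "drug_1")

# Hypothetical scores standing in for the model output.
hypothetical_scores = dict(zip(paths, [0.74, 0.82, 0.65]))
candidates = select_candidates(hypothetical_scores, threshold=0.7)

for path, score in candidates:
    print(" -> ".join(path), score)
```

In a Neo4j deployment the same matching would more naturally be expressed as a graph query; the in-memory search above is used only to keep the sketch self-contained.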
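
The rule-based triple construction of claim 2 can likewise be sketched in Python for the JSON and tabular cases (the SQL and LLM cases are not shown). This is a sketch under assumptions: the field names, the relation label `associated_with`, the source name `example_db`, the fixed timestamp and the sample records are all hypothetical. Only the metadata fields (entity source database, relation source, timestamp, text fragment, document ID) come from the claim.

```python
import csv
import io
import json


def make_triple(head, relation, tail, source_db, relation_source, timestamp,
                text="", doc_id=""):
    """Bundle one (head, relation, tail) triple with the metadata named in claim 2."""
    return {
        "head": head,
        "relation": relation,
        "tail": tail,
        "metadata": {
            "entity_source_db": source_db,
            "relation_source": relation_source,
            "timestamp": timestamp,
            "text_fragment": text,
            "doc_id": doc_id,
        },
    }


def triples_from_json(json_text, head_key, **meta):
    """Traverse key-value pairs: each non-head key becomes a relation, each value a tail."""
    record = json.loads(json_text)
    head = record[head_key]
    triples = []
    for key, value in record.items():
        if key == head_key:
            continue
        values = value if isinstance(value, list) else [value]
        for tail in values:
            triples.append(make_triple(head, key, tail, **meta))
    return triples


def triples_from_table(csv_text, head_col, relation, tail_col, **meta):
    """Read the table row by row and emit one triple per row from two chosen columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [make_triple(row[head_col], relation, row[tail_col], **meta) for row in reader]


# Hypothetical sample inputs and metadata values.
meta = {
    "source_db": "example_db",
    "relation_source": "database_recorded",
    "timestamp": "2025-01-01T00:00:00Z",
}
json_text = '{"name": "drug_1", "targets": ["protein_1", "protein_2"], "indication": "disease_1"}'
csv_text = "gene,disease\ngene_1,disease_1\ngene_2,disease_1\n"

json_triples = triples_from_json(json_text, "name", **meta)
table_triples = triples_from_table(csv_text, "gene", "associated_with", "disease", **meta)

for triple in json_triples + table_triples:
    print(triple["head"], triple["relation"], triple["tail"])
```

Keeping the metadata attached to each triple is what later allows edge confidence to be computed from the entity source database and the relation source.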

Description

Rare disease drug redirection path mining method based on knowledge graph

Technical Field

The invention relates to the technical fields of biomedical informatics and artificial intelligence, and in particular to a rare disease drug redirection path mining method based on a knowledge graph.

Background

With the rapid development of biomedical data and artificial intelligence technology, drug redirection has become an important means of improving the efficiency of new drug development. Because rare diseases have few patients and scarce clinical samples, traditional drug research and development for them is slow and costly, and potential therapeutic drugs need to be discovered by data-driven methods. A variety of biomedical databases are currently available, covering multi-dimensional information on diseases, genes, proteins, drugs and clinical trials, and they provide a basis for systematic research. However, these data sources are scattered and structurally heterogeneous, and are difficult to apply directly. Knowledge graph technology offers a new approach to mining disease-drug relationships by integrating multi-source data and constructing a semantic association network. Meanwhile, graph neural networks (GNNs) excel at modeling graph-structured data and can mine potential disease-drug associations through link prediction. However, existing research focuses mainly on common diseases; for rare diseases the data are sparse and the evidence is insufficient, so the reliability and interpretability of the prediction results are low. Therefore, a rare disease drug redirection method based on a knowledge graph and a graph neural network is needed to integrate multi-source data, optimize model performance, improve the interpretability of results and support rare disease drug discovery.

The techniques and data sources used in the present invention come from the following references:
R-GCN comes from SCHLICHTKRULL M S, KIPF T N, BLOEM P, et al. Modeling Relational Data with Graph Convolutional Networks[C]//2018 European Semantic Web Conference. Springer, Cham, 2018: 593-607. https://www.microsoft.com/en-us/research/publication/modeling-relational-data-with-graph-convolutional-networks/.
The meta-path templates come from ZHANG M L, ZHAO B W, SU X R, et al. RLFDDA: a meta-path based graph representation learning model for drug–disease association prediction[J]. BMC Bioinformatics, 2022, 23(1): 516. DOI:10.1186/s12859-022-05069-z.
The large language model (LLM) extraction comes from ZHOU S, YU S. High-throughput biomedical relation extraction for semi-structured web articles empowered by large language models[J]. BMC Medical Informatics and Decision Making, 2025, 25(1): 351. DOI:10.1186/s12911-025-03204-3.
DrugBank comes from WISHART D S, KNOX C, GUO A C, et al. DrugBank: a comprehensive resource for in silico drug discovery and exploration[J]. Nucleic Acids Research, 2006, 34(suppl_1): D668-D672. DOI:10.1093/nar/gkj067.
ChEMBL comes from GAULTON A, BELLIS L J, BENTO A P, et al. ChEMBL: a large-scale bioactivity database for drug discovery[J]. Nucleic Acids Research, 2011, 40(D1): D1100-D1107. DOI:10.1093/nar/gkr777.
UniProt comes from THE UNIPROT CONSORTIUM. UniProt: the universal protein knowledgebase[J]. Nucleic Acids Research, 2016, 45(D1): D158-D169. DOI:10.1093/nar/gkw1099.
HGNC comes from EYRE T A, DUCLUZEAU F, SNEDDON T P, et al. The HUGO Gene Nomenclature Database, 2006 updates[J]. Nucleic Acids Research, 2006, 34(suppl_1): D319-D321. DOI:10.1093/nar/gkj147.
DisGeNET comes from PIÑERO J, QUERALT-ROSINACH N, BRAVO À, et al. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes[J]. Database, 2015, 2015: bav028. DOI:10.1093/database/bav028.
OMIM comes from MCKUSICK V. Online Mendelian inheritance in man (OMIM) database[J]. Bethesda: National Center for Biotechnology Information for the National Institute of Health, 2004.
Orphanet comes from https://www.orpha.net/.
ClinicalTrials.gov comes from https://clinicaltrials.gov/.
PubMed comes from https:// PubMed.
Reactome comes from CROFT D, O'KELLY G, WU G, et al. Reactome: a database of reactions, pathways and biological processes[J]. Nucleic Acids Research, 2010, 39(suppl_1): D691-D697. DOI:10.1093/nar/gkq1018.
KEGG comes from https://www.genome.jp/KEGG/.
Pathway Commons comes from CERAMI E G, GROSS B E, DEMIR E, et al. Pathway Commons, a web resource for biological pathway data[J]. Nucleic Acids Research, 2010, 39(suppl_1): D685-D690. DOI:10.1093/nar/gkq1039.
SIDER comes from KUHN M, LETUNIC I, JENSEN L J, et al. The SIDER database of drugs and side effects[J]. Nucleic Acids Research, 2015, 44(D1): D1075-D1079. DOI:10.1093/nar/gkv1075.

Disclosure of Invention

The invention aims to solve the problems in the prior art that data sources are scattered, structurally heterogeneous and difficult to apply directly. To solve these problems, the invention provides a rare disease drug redirection path mining method based on a knowledge graph, which comprises the following steps: step 1, constructing a knowledge graph; The method com