CN-122024816-A - Genetic disease candidate gene ranking method and device based on heterogeneous graph embedding

Abstract

The application discloses a genetic disease candidate gene ranking method and device based on heterogeneous graph embedding, and relates to the field of bioinformatics. The method comprises: constructing a phenotype-gene heterogeneous network, and determining edge weights in the heterogeneous network according to the association frequency of genes and phenotypes in clinical medical records; capturing heterogeneous neighbor nodes, according to the types of the neighbor nodes, by a meta-path-based weighted random walk, and obtaining node embeddings in the heterogeneous network; and evaluating the possibility that candidate genes are pathogenic genes according to the node embeddings corresponding to the phenotype nodes and the node embeddings corresponding to the gene nodes, and ranking the candidate genes by priority according to the evaluation results. The method effectively captures heterogeneous information in a biological network and improves the accuracy of candidate gene prioritization. Generating the edge weights from historical medical record data improves the expressive capacity of the heterogeneous graph and the credibility of the pathogenic gene prediction results.
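The edge-weighting step can be illustrated with a minimal Python sketch. It is not the implementation of the application: the record format (a list of observed HPO terms plus one diagnosed gene), the function name `edge_weights`, and the scaling by the largest count are all assumptions made for illustration, since the abstract only states that weights follow the association frequency in clinical medical records.

```python
from collections import Counter

def edge_weights(records):
    """Count how often each (phenotype, gene) pair co-occurs across records,
    then scale the counts to (0, 1] by dividing by the largest count.
    The scaling rule is an assumption, not taken from the application."""
    counts = Counter()
    for phenotypes, gene in records:
        # Use a set so a phenotype repeated within one record counts once.
        for hpo in set(phenotypes):
            counts[(hpo, gene)] += 1
    top = max(counts.values(), default=1)
    return {pair: n / top for pair, n in counts.items()}

# Hypothetical records for illustration: (observed HPO terms, diagnosed gene).
records = [
    (["HP:0001250", "HP:0001263"], "SCN1A"),
    (["HP:0001250"], "SCN1A"),
    (["HP:0001263"], "MECP2"),
]
weights = edge_weights(records)
```

Pairs seen in more records receive larger weights, so a weighted random walk over the network would visit frequently co-recorded phenotype-gene pairs more often.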

Inventors

YAN SHANKAI
YANG XIN
Zhan Buchao
HE DONGMEI
ZHANG JIANGBO
DONG SIQI

Assignees

Hainan University (海南大学)

Dates

Publication Date: 20260512
Application Date: 20260108

Claims (10)

1. A genetic disease candidate gene ranking method based on heterogeneous graph embedding, characterized by comprising the following steps: constructing a phenotype-gene heterogeneous network, and determining edge weights in the heterogeneous network according to the association frequency of genes and phenotypes in clinical medical records; capturing heterogeneous neighbor nodes, according to the types of the neighbor nodes, by a meta-path-based weighted random walk, and obtaining node embeddings in the heterogeneous network; and evaluating the possibility that candidate genes are pathogenic genes according to the node embeddings corresponding to phenotype nodes and the node embeddings corresponding to gene nodes, and ranking the candidate genes by priority according to the evaluation result.
2. The method of claim 1, wherein said constructing a phenotype-gene heterogeneous network comprises: constructing a hierarchical tree according to HPO data; and, according to gene-phenotype annotation data, using a reverse-order recursive method to eliminate redundant gene annotations of non-leaf nodes step by step, working upwards from the last node of the hierarchical tree, to obtain the heterogeneous graph.
3. The method of claim 1, wherein said capturing heterogeneous neighbor nodes, according to the types of the neighbor nodes, by a meta-path-based weighted random walk, and obtaining node embeddings in the heterogeneous network comprises: determining neighbor nodes, selecting next-hop nodes by weighted random walk according to the edge weights, and obtaining node sequences from a plurality of next-hop nodes; and optimizing the node sequences using a skip-gram model, and obtaining the node embeddings according to the optimized node sequences.
4. The method of claim 1, wherein said evaluating the possibility that candidate genes are pathogenic genes according to the node embeddings corresponding to phenotype nodes and the node embeddings corresponding to gene nodes comprises: computing the dot product of the node embedding corresponding to the phenotype node and the node embedding corresponding to the gene node, and evaluating the possibility that the candidate gene is a pathogenic gene according to the dot product result.
5. The method of claim 4, wherein said ranking the candidate genes by priority according to the evaluation result comprises: ranking the candidate genes by priority according to the magnitude of the dot product result.
6. The method according to claim 1, wherein the method further comprises: determining pathogenic genes from the candidate genes according to the ranking result.
7. A genetic disease candidate gene ranking device based on heterogeneous graph embedding, characterized in that the device comprises: a graph construction module, used for constructing a phenotype-gene heterogeneous network and determining edge weights in the heterogeneous network according to the association frequency of genes and phenotypes in clinical medical records; an embedding vector generation module, used for capturing heterogeneous neighbor nodes, according to the types of the neighbor nodes, by a meta-path-based weighted random walk, and obtaining node embeddings in the heterogeneous network; and an output module, used for evaluating the possibility that candidate genes are pathogenic genes according to the node embeddings corresponding to phenotype nodes and the node embeddings corresponding to gene nodes, and ranking the candidate genes by priority according to the evaluation result.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
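
The walk and ranking steps recited in claims 3 to 5 can be sketched in Python as follows. This is only a sketch under stated assumptions, not the claimed implementation: the graph layout (`graph`, `node_type`), the alternating phenotype/gene meta-path, the function names, and the summation of dot products over several patient phenotypes are hypothetical choices. The skip-gram step of claim 3 is omitted; in practice the generated walks would be passed to a skip-gram model to learn the embeddings, and the vectors below are placeholders.

```python
import random

def metapath_walk(graph, node_type, start, metapath, length, rng):
    """Generate one walk. graph maps node -> {neighbor: edge weight}.
    metapath is a cyclic list of node types, e.g. ["phenotype", "gene"].
    Each hop only considers neighbors whose type matches the next meta-path
    position and samples among them in proportion to edge weight."""
    walk = [start]
    pos = metapath.index(node_type[start])
    while len(walk) < length:
        pos = (pos + 1) % len(metapath)
        wanted = metapath[pos]
        candidates = [(n, w) for n, w in graph[walk[-1]].items()
                      if node_type[n] == wanted]
        if not candidates:
            break  # dead end: no neighbor of the required type
        nodes, weights = zip(*candidates)
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk

def rank_genes(phenotype_vecs, gene_vecs):
    """Score each gene by the dot product of its embedding with each phenotype
    embedding (summed over phenotypes, an assumption), highest score first."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scores = {g: sum(dot(p, v) for p in phenotype_vecs)
              for g, v in gene_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy network: two phenotype nodes and one gene node.
graph = {
    "P1": {"G1": 1.0, "P2": 5.0},
    "P2": {"P1": 5.0, "G1": 0.5},
    "G1": {"P1": 1.0, "P2": 0.5},
}
node_type = {"P1": "phenotype", "P2": "phenotype", "G1": "gene"}
walk = metapath_walk(graph, node_type, "P1", ["phenotype", "gene"], 5,
                     random.Random(0))

# Placeholder embeddings standing in for skip-gram output.
ranking = rank_genes([[1.0, 0.0]], {"G1": [0.2, 0.9], "G2": [0.8, 0.1]})
```

Although the P1-P2 edge has the largest weight, the walk never follows it, because the meta-path requires a gene node after every phenotype node. Restricting hops by node type is what distinguishes this from an untyped weighted walk.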

Description

Genetic disease candidate gene ranking method and device based on heterogeneous graph embedding

Technical Field

The application relates to the technical field of bioinformatics, in particular to a genetic disease candidate gene ranking method and device based on heterogeneous graph embedding.

Background

Mendelian genetic diseases affect millions of newborns each year. Early identification of pathogenic genes is critical for preventing disease progression and devising an effective therapeutic strategy. The widespread use of next-generation sequencing (NGS) technology has greatly improved the molecular diagnosis of Mendelian genetic diseases. However, identifying pathogenic genes from NGS data is time-consuming and remains a significant challenge in real-world clinical applications. With the continued development of the Human Phenotype Ontology (HPO), using deep phenotype data as input (the phenotype-driven approach) has become a routine way of ranking candidate causal genes.

Recent approaches fall into two groups. The first relies on data statistics, for example PhenoApt (a phenotype-driven gene prioritization tool), AMELIE (Automatic Mendelian Literature Evaluation) and GADO (GeneNetwork Assisted Diagnostic Optimizer). The second uses graph embedding: it merges data from biomedical databases such as HPO, OMIM (Online Mendelian Inheritance in Man) and wikiPath into a heterogeneous network and mines that network to support disease diagnosis, for example CADA (Case Annotations and Disease Annotations) and HANRD (Heterogeneous Association Network for Rare Diseases).

PhenoApt is a method capable of predicting both a causative gene and a disease.
It builds rule-based directed graphs from existing biological databases (e.g., HPO and OMIM), constructs a PMI (pointwise mutual information) matrix using a shortest-path method and a related-probability method, and learns node embeddings by minimizing the difference between the dot product of two node embeddings and the corresponding entry of the PMI matrix. It then uses these node embeddings to compute scores for prioritizing candidate genes. In addition, PhenoApt uses the TF-IDF (term frequency-inverse document frequency) algorithm to calculate the HP frequency, i.e., the HPO term frequency. This frequency serves as the weight of the HP term in the candidate gene score, which improves prediction accuracy.

GADO uses gene expression data from 31499 human RNA-seq samples to predict disease symptoms. For each component, it assesses whether the eigenvector coefficients of genes associated with a particular phenotype differ significantly from those of a set of background genes. This yields a matrix describing how informative each component is for each HPO term. By correlating this matrix with the eigenvector coefficients of each gene, it is possible to infer which HPO disease phenotype terms may be caused by pathogenic variants in that gene.

For CADA, Peng et al. construct a base graph from HPO and its annotations. They enrich the edge information of the graph by integrating clinical data from ClinVar and other sources, and then apply the graph embedding algorithm Node2vec to obtain an embedding for each node.

Among the above methods, those based on data statistics tend to miss high-order information between phenotypes and genes, so their diagnostic performance tends to be inferior to that of graph-embedding-based algorithms. However, current graph-embedding-based algorithms ignore the heterogeneous information in heterogeneous graphs.
In particular, when processing biomedical databases, existing graph embedding methods focus on homogeneous information and ignore valuable heterogeneous information.

Disclosure of Invention

In view of the above, it is necessary to provide a genetic disease candidate gene ranking method and device based on heterogeneous graph embedding, so as to improve the accuracy with which candidate genes are located.

In a first aspect, the application provides a genetic disease candidate gene ranking method based on heterogeneous graph embedding. The method comprises the following steps: constructing a phenotype-gene heterogeneous network, and determining edge weights in the heterogeneous network according to the association frequency of genes and phenotypes in clinical medical records; capturing heterogeneous neighbor nodes, according to the types of the neighbor nodes, by a meta-path-based weighted random walk, and obtaining node embeddings in the heterogeneous network; and evaluating the possibility that the candidate genes are pathogenic genes according to node embedding corresp