CN-122024827-A - Rice drought-tolerance gene prediction method and system based on knowledge graph embedding

Abstract

The invention discloses a method and a system for predicting rice drought-tolerance genes based on knowledge graph embedding. The method comprises: constructing an ontology layer of a rice knowledge graph; collecting multi-source rice gene data from different databases and performing knowledge extraction; fusing the extracted knowledge according to the specifications of the ontology layer to obtain knowledge triples; importing the knowledge triples into a graph database to construct the data layer of the rice knowledge graph; training an Att-CompGCN model on the rice knowledge graph to learn low-dimensional embedded representations of entities and relations; using the trained Att-CompGCN model to predict the association probability between every rice gene and the drought-tolerance trait, and sorting in descending order to obtain a candidate gene list; and performing GO enrichment analysis and protein interaction network verification on the candidate genes to screen high-confidence drought-tolerance genes. The invention can obtain more accurate prediction results.
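
The data flow summarized in the abstract (triples become nodes and edges, then genes are ranked by predicted probability) can be sketched in a few lines of Python. This is only an illustrative sketch: the triples, relation names, gene identifiers and the `toy_score` lookup are hypothetical placeholders. In the described method the probabilities come from the trained Att-CompGCN model and the triples are stored in a graph database.

```python
from collections import defaultdict

# Hypothetical knowledge triples: (head entity, relation, tail entity).
TRIPLES = [
    ("GeneA", "encodes", "ProteinA"),
    ("GeneB", "encodes", "ProteinB"),
    ("GeneA", "participates_in", "PathwayX"),
    ("GeneC", "participates_in", "PathwayX"),
    ("GeneA", "associated_with", "drought_tolerance"),
]

# Hypothetical stand-in for the model's predicted association probabilities.
TOY_SCORES = {"GeneA": 0.91, "GeneB": 0.12, "GeneC": 0.57}


def build_graph(triples):
    """Store head/tail entities as nodes and each relation as a labelled edge."""
    nodes = set()
    edges = defaultdict(list)  # head -> list of (relation, tail)
    for head, relation, tail in triples:
        nodes.update((head, tail))
        edges[head].append((relation, tail))
    return nodes, dict(edges)


def toy_score(gene, trait):
    """Placeholder scorer; a trained model would be queried here instead."""
    return TOY_SCORES[gene]


def rank_candidates(genes, score_fn, trait):
    """Score every gene against the trait and sort by descending score."""
    scored = [(gene, score_fn(gene, trait)) for gene in genes]
    return sorted(scored, key=lambda item: item[1], reverse=True)


nodes, edges = build_graph(TRIPLES)
ranking = rank_candidates(["GeneA", "GeneB", "GeneC"], toy_score, "drought_tolerance")
```

The head of `ranking` plays the role of the candidate gene list that is then passed on to enrichment analysis and network verification.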

Inventors

  • ZHANG HONGYAN
  • LIU JING
  • QIAO BO
  • DENG YONG

Assignees

  • Hunan Agricultural University (湖南农业大学)
  • Yuelushan Laboratory (岳麓山实验室)

Dates

Publication Date
2026-05-12
Application Date
2026-04-09

Claims (10)

  1. A rice drought-tolerance gene prediction method based on knowledge graph embedding, characterized by comprising the following steps: constructing an ontology layer of the rice knowledge graph, wherein the ontology layer defines an entity type system comprising the core classes of genes, transcripts, proteins, pathways, ontologies and literature references together with their hierarchical subclasses, and defines the data attribute specifications of each entity and the semantic relations among entities; collecting multi-source data of rice genes from different databases, performing knowledge extraction, fusing the extracted knowledge according to the specifications of the ontology layer to obtain different knowledge triples, and importing the knowledge triples into a graph database to construct the data layer of the rice knowledge graph, wherein each knowledge triple comprises a head entity, a tail entity and the relation between them, the head entity and the tail entity each serve as a node in the graph, and the relation between them serves as an edge in the graph; training a graph neural network model Att-CompGCN on the rice knowledge graph to learn low-dimensional embedded representations of entities and relations, wherein the Att-CompGCN model adopts a message-passing graph convolution framework in which, during the message-passing stage, each graph convolution layer first uses an adaptive gating fusion module to compute gating values from the neighbor-node embeddings and relation embeddings of a target node in the graph and to compute fused features from the gating values, then uses a relation-aware local attention module to compute, from the fused features and the relation embeddings, an attention score for each edge from a neighbor node to the target node and to compute, from the attention scores, attention weights over the edges pointing to the same target node, and finally performs weighted aggregation of the fused features using the attention weights to update the embedded representation of the target node; using the trained Att-CompGCN model to predict the association probabilities between all rice gene entities and the drought-tolerance trait entities in the rice knowledge graph, and sorting by probability score in descending order to obtain candidate genes; and performing GO enrichment analysis and protein interaction network verification on the candidate genes to screen high-confidence drought-tolerance genes.
  2. The rice drought-tolerance gene prediction method based on knowledge graph embedding according to claim 1, wherein collecting multi-source data of rice genes from different databases and performing knowledge extraction comprises: if the database is a structured data source, identifying the values of all fields in the database according to preset field mapping rules, creating the corresponding entities or attributes, and creating the corresponding semantic edges according to the correspondence of data in the same row; if the database is a semi-structured data source, using a dedicated parser to identify keywords and extract the corresponding entities and relations; and if the database is a weakly structured data source, locating the target page and obtaining the complete HTML page content, traversing the page DOM structure, locating the region containing the gene list, extracting the elements corresponding to the gene entries one by one within that region, parsing them to obtain internal numbers, and finally matching the internal numbers to standard gene names in a standard database.
  3. The rice drought-tolerance gene prediction method based on knowledge graph embedding according to claim 2, wherein the weakly structured data source is the KEGG PATHWAY detail pages and the standard database is the NCBI Gene database; and matching an internal number to a standard gene name in the standard database specifically comprises: generating a query URL for the NCBI Gene database from the internal number, crawling the structured information of the corresponding page, parsing the specific region of the NCBI Gene page, and obtaining the corresponding gene name by matching the regular expression LOC_Os\d{2}g\d+|Os\d{2}g\d+.
  4. The rice drought-tolerance gene prediction method based on knowledge graph embedding according to claim 1, wherein fusing the extracted knowledge according to the specifications of the ontology layer to obtain different knowledge triples comprises: establishing cross-database entity mapping rules with the core classes of the ontology layer as the reference, and associating entity information from different data sources with the same entity to eliminate entity redundancy; when attributes from different data sources conflict, taking the corresponding identifier in an authoritative database as the reference and correcting deviations using the standard definition of the EC number in the ontology; aggregating and supplementing the scattered attributes of different data sources based on the unified entity identifier; and associating and integrating the knowledge triples of different data sources according to the semantic relations defined by the ontology.
  5. The rice drought-tolerance gene prediction method based on knowledge graph embedding according to claim 1, wherein the gating value is calculated from the neighbor-node embedding and the relation embedding of the target node in the graph by an expression [formula not reproduced in this text] whose terms denote, respectively: the embedding of the j-th neighbor node of the target node; the relation embedding of the target node; the sum of the element-wise products; and the Sigmoid activation function.
  6. The rice drought-tolerance gene prediction method based on knowledge graph embedding according to claim 5, wherein the fused features are calculated from the gating values by an expression [formula not reproduced in this text] in which the remaining term denotes a composition operation function.
  7. The rice drought-tolerance gene prediction method based on knowledge graph embedding according to claim 6, wherein the attention score of each edge from a neighbor node to the target node is calculated from the fused features and the relation embedding by an expression [formula not reproduced in this text] whose terms denote, respectively: the attention score of the j-th neighbor node with respect to the target node; the LeakyReLU activation function; and the gating-enhanced embedding of the j-th neighbor node.
  8. The rice drought-tolerance gene prediction method based on knowledge graph embedding according to claim 7, wherein the attention weights of all edges pointing to the same target node are calculated from the attention scores by an expression [formula not reproduced in this text] whose terms denote, respectively: the normalized attention weight, at the given layer, of the edge from the j-th neighbor node to target node i; the attention scores, at that layer, of the edges from the j-th and the k-th neighbor nodes to target node i; and the neighbor set of target node i.
  9. The rice drought-tolerance gene prediction method based on knowledge graph embedding according to claim 8, wherein the embedded representation of the target node is updated by weighted aggregation of the fused features using the attention weights according to a mathematical expression [formula and layer indices not reproduced in this text] whose terms denote, respectively: the embedded representation of target node i at the indexed layer; the layer's weight matrix specific to the direction of relation r in the knowledge triple; the layer's attention weight; the layer's gated-fused embedding of the j-th neighbor node; and a nonlinear activation function.
  10. A rice drought-tolerance gene prediction system based on knowledge graph embedding, characterized by comprising a processor and a computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium and, when executed by the processor, implements the steps of the rice drought-tolerance gene prediction method based on knowledge graph embedding according to any one of claims 1-8.
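
The message-passing steps recited in claims 5 to 9 (gating, fusion, attention scoring, normalization, weighted aggregation) can be illustrated for a single target node. The following is a minimal pure-Python sketch under stated assumptions, not the claimed formulas, which are not reproduced in this text. It assumes a scalar gate computed as the Sigmoid of the summed element-wise products, an element-wise product as the composition operation, a gated blend of the composed and original neighbor embeddings, a LeakyReLU score over the concatenated fused feature and relation embedding, a softmax over incoming edges, and `tanh` as the nonlinear activation. The function name `att_compgcn_update`, the attention vector `a` and the single shared matrix `W` (in place of relation-direction-specific matrices) are all hypothetical simplifications.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def matvec(W, v):
    return [dot(row, v) for row in W]


def att_compgcn_update(neighbors, W, a):
    """Update one target node from (neighbor_embedding, relation_embedding) pairs.

    Assumed forms (illustrative only):
      gate    g_j  = sigmoid(sum(h_j * h_r))
      fusion  f_j  = g_j * (h_j * h_r) + (1 - g_j) * h_j
      score   e_j  = LeakyReLU(a . [f_j ; h_r])
      weight  w_j  = softmax of e_j over the target node's incoming edges
      update  h_i' = tanh(sum_j w_j * (W f_j))
    """
    fused, scores = [], []
    for h_j, h_r in neighbors:
        g = sigmoid(dot(h_j, h_r))  # scalar gating value
        f = [g * x * r + (1.0 - g) * x for x, r in zip(h_j, h_r)]
        fused.append(f)
        scores.append(leaky_relu(dot(a, f + h_r)))  # f + h_r concatenates lists

    # Softmax over edges pointing to the same target (shifted for stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Attention-weighted aggregation of the transformed fused features.
    agg = [0.0] * len(W)
    for w, f in zip(weights, fused):
        agg = [acc + w * m for acc, m in zip(agg, matvec(W, f))]
    return [math.tanh(x) for x in agg], weights


# Toy inputs: two neighbors with 2-dimensional embeddings.
neighbors = [([0.5, -0.2], [1.0, 0.3]), ([0.1, 0.4], [-0.5, 0.8])]
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.3, -0.1, 0.2, 0.4]
h_new, weights = att_compgcn_update(neighbors, W, a)
```

In this sketch the attention weights over the target node's incoming edges always sum to one, so a target node with a single neighbor simply receives that neighbor's transformed fused feature.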

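
The gene-name matching step of claim 3 can be exercised with Python's standard `re` module. This is a minimal sketch, assuming the intended pattern is `LOC_Os\d{2}g\d+|Os\d{2}g\d+` written without internal spaces. The helper name `extract_gene_names` and the sample page text with its identifiers are hypothetical and do not come from an actual NCBI Gene page.

```python
import re

# Pattern from claim 3: MSU-style (LOC_Os..g..) or RAP-style (Os..g..) locus names.
GENE_NAME_PATTERN = re.compile(r"LOC_Os\d{2}g\d+|Os\d{2}g\d+")


def extract_gene_names(page_text):
    """Return the unique gene names found in the text, in order of first appearance."""
    names = []
    for match in GENE_NAME_PATTERN.finditer(page_text):
        name = match.group(0)
        if name not in names:
            names.append(name)
    return names


# Hypothetical text standing in for the parsed region of a gene page.
sample = "Also known as: LOC_Os01g12345; Os01g0123400, Os03g0100200. Locus tag Os01g0123400"
names = extract_gene_names(sample)
```

Because the `LOC_`-prefixed alternative is tried first at each position, a name such as `LOC_Os01g12345` is captured whole rather than as its `Os01g12345` suffix.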
Description

Rice drought-tolerance gene prediction method and system based on knowledge graph embedding

Technical Field

The invention relates to gene prediction technology, and in particular to a rice drought-tolerance gene prediction method and system based on knowledge graph embedding.

Background

Drought is one of the main abiotic stress factors affecting the growth, development and yield of rice, and developing and utilizing drought-tolerance gene resources is important for ensuring food security. Currently, drought-tolerance-related genes are discovered by QTL mapping based on linkage analysis, transcriptome analysis based on differences in expression profiles, and homology comparison based on sequence similarity. These methods usually focus on a single type of data (e.g., genomic sequence or gene expression data) and struggle to fully characterize the multi-level interactions between genes and other molecular entities (e.g., proteins, metabolites, regulatory elements) in complex biological networks. As a result, traditional approaches tend to have limited predictive ability for complex quantitative traits, such as drought tolerance, that are regulated jointly by multiple genes, and they have difficulty providing interpretable clues to the biological mechanisms behind gene function.

A knowledge graph, as a structured semantic knowledge base, can fuse and organize multi-source heterogeneous biological data such as genes, proteins, metabolic pathways, phenotypic traits and literature in graph form, providing a data basis for understanding gene function at the system level. Graph neural network technology has recently developed rapidly and performs well at processing graph-structured data and learning low-dimensional embedded representations of nodes (e.g., genes). However, designing a graph neural network model that fully captures the complex semantic relationships in a knowledge graph (in particular, the high heterogeneity of relations) and applying it effectively to the accurate prediction of crop drought-tolerance genes remains a challenge in current research.

Disclosure of Invention

To address the problems in the prior art, the invention provides a method and a system for predicting rice drought-tolerance genes based on knowledge graph embedding, which can obtain more accurate prediction results.

To solve the above technical problems, the invention adopts the following technical solution:

A rice drought-tolerance gene prediction method based on knowledge graph embedding comprises the following steps:

Constructing an ontology layer of the rice knowledge graph, wherein the ontology layer defines an entity type system comprising the core classes of genes, transcripts, proteins, pathways, ontologies and literature references together with their hierarchical subclasses, and defines the data attribute specifications of each entity and the semantic relations among entities;

Collecting multi-source data of rice genes from different databases, performing knowledge extraction, fusing the extracted knowledge according to the specifications of the ontology layer to obtain different knowledge triples, and importing the knowledge triples into a graph database to construct the data layer of the rice knowledge graph, wherein each knowledge triple comprises a head entity, a tail entity and the relation between them, the head entity and the tail entity each serve as a node in the graph, and the relation between them serves as an edge in the graph;

Training a graph neural network model Att-CompGCN on the rice knowledge graph to learn low-dimensional embedded representations of entities and relations, wherein the Att-CompGCN model adopts a message-passing graph convolution framework in which, during the message-passing stage, each graph convolution layer first uses an adaptive gating fusion module to compute gating values from the neighbor-node embeddings and relation embeddings of a target node in the graph and to compute fused features from the gating values, then uses a relation-aware local attention module to compute, from the fused features and the relation embeddings, an attention score for each edge from a neighbor node to the target node and to compute, from the attention scores, attention weights over the edges pointing to the same target node, and finally performs weighted aggregation of the fused features using the attention weights to update the embedded representation of the target node;

Using the trained Att-CompGCN model to predict the association probabilities between all rice gene entities and the drought-tolerance trait entities in the rice knowledge graph, and sorting by probability score in descending order to obtain candidate genes;

and performing GO enrichment analysis and protein interaction network ve