CN-122019841-A - Graph-model-based microbial data semantic processing method and system


Abstract

The invention discloses a graph-model-based microbial data semantic processing method and system, relating to the technical field of knowledge graphs and microbial information processing. The method comprises: acquiring microbial data from heterogeneous data sources and standardizing it; constructing a domain ontology model conforming to microbial taxonomy standards; using a pre-trained language model to perform entity disambiguation and unified identifier mapping; extracting semantic relation triplets by combining distant supervision with deep learning; constructing a microbial knowledge attribute graph and storing it in a graph database; performing knowledge reasoning and function prediction with a graph attention network; and supporting natural language semantic retrieval.

Inventors

SU XINGMEI
WU YANMEI
WANG JINXU
ZENG QIXIAN
SHE JUNZE
ZHONG SUMEI
ZHAO CHONGZHI
HUANG YAXUAN
FENG JINDI
LIU YIFEI

Assignees

Hanshan Normal University (韩山师范学院)

Dates

Publication Date: 2026-05-12
Application Date: 2026-01-30

Claims (10)

1. A graph-model-based microbial data semantic processing method, characterized by comprising the following steps:
a data acquisition and preprocessing step of acquiring heterogeneous microbial data from a strain preservation library database, a genome sequence database, a microbiological literature library and an experiment record system, and performing format unification, missing-value filling and noise filtering on the heterogeneous microbial data to generate a standardized microbial data set;
an ontology model construction step of constructing a microbial domain ontology model according to microbial taxonomy specifications, wherein the microbial domain ontology model comprises a classification hierarchy running from the strain level through the species level and the genus level to the family level, and an attribute relation framework covering gene attributes, metabolite attributes, culture condition attributes, physiological characteristic attributes and application function attributes;
an entity recognition and linking step of extracting a candidate entity set from the standardized microbial data set using a named entity recognition algorithm, calculating entity disambiguation vector representations based on the microbial domain ontology model to handle the synonymy (one entity with several names) and homonymy (one name shared by several entities) present in the candidate entity set, mapping each entity in the candidate entity set to the corresponding concept node in the microbial domain ontology model, and generating a unified identifier mapping result;
a relation extraction step of identifying, with a deep learning relation extraction model and based on the unified identifier mapping result and the standardized microbial data set, semantic associations between strains and genes, between strains and metabolites, between strains and habitats, and between strains and functions, generating a semantic relation triplet set, and calculating a relation confidence value for each triplet in the semantic relation triplet set;
a knowledge graph construction step of taking the unified identifier mapping result as the node set and the semantic relation triplet set as the edge set, constructing a microbial knowledge attribute graph structure, and storing the microbial knowledge attribute graph structure in a graph database storage format;
a knowledge reasoning step of performing a graph neural network reasoning operation based on the microbial knowledge attribute graph structure, calculating the functional similarity between strains, predicting potential functional attributes of unlabeled strains, and generating knowledge reasoning result data, wherein the knowledge reasoning result data comprises strain function prediction data and metabolic pathway association prediction data; and
a semantic retrieval step of receiving a natural language query request input by a user, converting the natural language query request into a graph query statement, executing a graph traversal retrieval operation on the microbial knowledge attribute graph structure, and generating a semantic retrieval response result in combination with the knowledge reasoning result data.
2. The graph-model-based microbial data semantic processing method according to claim 1, wherein in the data acquisition and preprocessing step, the format unification comprises:
converting structured strain records from the strain preservation library database into a unified entity attribute table format;
converting FASTA-format and GenBank-format sequence files from the genome sequence database into a standardized sequence annotation format;
parsing unstructured text content from the microbiological literature library into a semi-structured markup document format; and
converting experimental parameter data from the experiment record system into a key-value attribute list format.
3. The method according to claim 1, wherein in the entity recognition and linking step, calculating the entity disambiguation vector representations based on the microbial domain ontology model comprises:
obtaining the context text fragment of each candidate entity of the candidate entity set in the standardized microbial data set, and encoding the context text fragment with a pre-trained language model to generate a context semantic feature vector;
extracting the definition text and hierarchical path information of each concept node from the microbial domain ontology model, and jointly encoding the definition text and the hierarchical path information with the pre-trained language model to generate an ontology concept embedding vector; and
calculating cosine similarity values between the context semantic feature vector and each ontology concept embedding vector in the microbial domain ontology model and, when the cosine similarity value exceeds a preset similarity threshold, mapping the corresponding candidate entity to the concept node with the highest similarity to generate the entity disambiguation vector representation.
4. The graph-model-based microbial data semantic processing method according to claim 1, wherein in the ontology model construction step, constructing the attribute relation framework comprises:
defining an encoding relation type and a regulation relation type between the strain entity type and the gene entity type;
defining a production relation type and a degradation relation type between the strain entity type and the metabolite entity type;
defining an inhabits relation type and an isolated-from relation type between the strain entity type and the habitat entity type;
defining a has-function relation type and a participates-in-process relation type between the strain entity type and the function entity type; and
setting a relation attribute constraint for each relation type, the relation attribute constraint comprising a relation directionality constraint, a relation cardinality constraint and a relation transitivity constraint.
5. The graph-model-based microbial data semantic processing method according to claim 1, wherein in the relation extraction step, identifying the semantic associations using the deep learning relation extraction model comprises:
automatically generating a distant-supervision training sample set by distant supervision, based on the relation types defined in the microbial domain ontology model and known relation instances in public microbial databases;
constructing a relation classification deep neural network comprising an entity pair encoding layer, a relation representation layer and a classification output layer, and training the relation classification deep neural network with the distant-supervision training sample set; and
inputting the entity pairs in the unified identifier mapping result, together with their contexts, into the trained relation classification deep neural network to obtain a relation type prediction result and the relation confidence value for each entity pair.
6. The graph-model-based microbial data semantic processing method according to claim 1, wherein in the knowledge graph construction step, storing the microbial knowledge attribute graph structure in a graph database storage format comprises:
storing each entity in the unified identifier mapping result as a node record in a graph database, and configuring a node type label and node attribute fields for each node record;
storing each triplet in the semantic relation triplet set as an edge record in the graph database, and configuring a relation type label, the relation confidence value and a relation source attribute for each edge record; and
establishing a node index structure and an edge index structure, wherein the node index structure comprises a classification index based on node type and a full-text index based on node attributes, and the edge index structure comprises a fast query index based on relation type.
7. The graph-model-based microbial data semantic processing method according to claim 1, wherein in the knowledge reasoning step, performing the graph neural network reasoning operation based on the microbial knowledge attribute graph structure comprises:
aggregating neighborhood information for each node in the microbial knowledge attribute graph structure with a graph attention network to generate node-level embedding vector representations;
calculating a strain similarity measure between any two strain nodes based on the node-level embedding vector representations, the strain similarity measure jointly reflecting genome similarity, metabolic function similarity and ecological niche similarity between strains;
predicting, with a link prediction algorithm and based on the node-level embedding vector representations, potential relation edges that are absent from the microbial knowledge attribute graph structure but may exist, to generate the strain function prediction data; and
constructing a metabolic pathway association graph structure based on the connections among metabolite nodes, strain nodes and enzyme nodes in the microbial knowledge attribute graph structure, and identifying complete metabolic pathway patterns with a subgraph matching algorithm.
8. The method according to claim 1, wherein in the semantic retrieval step, converting the natural language query request into a graph query statement comprises:
performing intent recognition on the natural language query request to determine whether the query type is an entity attribute query, a relation path query or an aggregate statistics query;
extracting query entities, query relations and query constraints from the natural language query request with a semantic parsing model;
generating a corresponding graph query statement according to the query type, the query entities, the query relations and the query constraints, the graph query statement conforming to a graph database query language specification; and
executing the graph query statement on the microbial knowledge attribute graph structure to obtain graph path traversal results, and fusing and ranking the graph path traversal results with the knowledge reasoning result data to generate the semantic retrieval response result.
9. The graph-model-based microbial data semantic processing method according to claim 1, further comprising a knowledge updating step, the knowledge updating step comprising:
periodically acquiring incremental data from the strain preservation library database, the genome sequence database, the microbiological literature library and the experiment record system;
performing the entity recognition and linking step and the relation extraction step on the incremental data to generate an incremental entity set and an incremental relation set;
detecting conflicts between the incremental entity set and existing nodes in the microbial knowledge attribute graph structure and, when attribute values conflict, determining the version to retain according to the credibility of the data source and the recency of the timestamp; and
merging the incremental entity set and the incremental relation set into the microbial knowledge attribute graph structure, and triggering an incremental reasoning update of the knowledge reasoning step.
10. A graph-model-based microbial data semantic processing system for implementing the graph-model-based microbial data semantic processing method according to any one of claims 1 to 9, comprising:
a data acquisition module for acquiring heterogeneous microbial data from a strain preservation library database, a genome sequence database, a microbiological literature library and an experiment record system, and performing format unification, missing-value filling and noise filtering on the heterogeneous microbial data to generate a standardized microbial data set;
an ontology construction module for constructing a microbial domain ontology model according to microbial taxonomy specifications, wherein the microbial domain ontology model comprises a classification hierarchy running from the strain level through the species level and the genus level to the family level, and an attribute relation framework covering gene attributes, metabolite attributes, culture condition attributes, physiological characteristic attributes and application function attributes;
an entity recognition module for extracting a candidate entity set from the standardized microbial data set using a named entity recognition algorithm, calculating entity disambiguation vector representations based on the microbial domain ontology model to handle the synonymy and homonymy present in the candidate entity set, mapping each entity in the candidate entity set to the corresponding concept node in the microbial domain ontology model, and generating a unified identifier mapping result;
a relation extraction module for identifying, with a deep learning relation extraction model and based on the unified identifier mapping result and the standardized microbial data set, semantic associations between strains and genes, between strains and metabolites, between strains and habitats, and between strains and functions, generating a semantic relation triplet set, and calculating a relation confidence value for each triplet in the semantic relation triplet set;
a graph construction module for taking the unified identifier mapping result as the node set and the semantic relation triplet set as the edge set, constructing a microbial knowledge attribute graph structure, and storing the microbial knowledge attribute graph structure in a graph database storage format;
a knowledge reasoning module for performing a graph neural network reasoning operation based on the microbial knowledge attribute graph structure, calculating the functional similarity between strains, predicting potential functional attributes of unlabeled strains, and generating knowledge reasoning result data comprising strain function prediction data and metabolic pathway association prediction data; and
a semantic retrieval module for receiving a natural language query request input by a user, converting the natural language query request into a graph query statement, executing a graph traversal retrieval operation on the microbial knowledge attribute graph structure, and generating a semantic retrieval response result in combination with the knowledge reasoning result data.
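
The threshold-based entity linking described in claim 3 can be illustrated with a minimal Python sketch. It assumes the context vector and the ontology concept vectors have already been produced by some pre-trained language model; the toy three-dimensional vectors, the concept identifiers `ONT:0001` and `ONT:0002`, the function names and the threshold of 0.8 are all hypothetical and are not taken from the patent.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def link_entity(
    context_vec: Sequence[float],
    concept_vecs: Dict[str, Sequence[float]],
    threshold: float = 0.8,  # assumed value; the claim only says "preset"
) -> Tuple[Optional[str], float]:
    """Return (concept_id, score) for the most similar concept node.

    The concept id is None when the best score does not exceed the threshold,
    i.e. the candidate entity is left unlinked.
    """
    best_id: Optional[str] = None
    best_score = -1.0
    for concept_id, vec in concept_vecs.items():
        score = cosine_similarity(context_vec, vec)
        if score > best_score:
            best_id, best_score = concept_id, score
    if best_score > threshold:
        return best_id, best_score
    return None, best_score


# Toy embeddings (hypothetical values, not model output).
concepts = {
    "ONT:0001": [1.0, 0.0, 0.0],
    "ONT:0002": [0.0, 1.0, 0.0],
}
linked = link_entity([0.9, 0.1, 0.0], concepts)    # close to ONT:0001
unlinked = link_entity([0.5, 0.5, 0.7], concepts)  # close to neither concept
```

In this sketch a mention whose best score stays at or below the threshold is returned unlinked rather than forced onto the nearest concept; how such leftovers are handled is not specified in the claims.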

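The conflict handling in the knowledge updating step of claim 9 can be sketched in the same spirit. The claim names two criteria, data source credibility and timestamp recency, but does not say how they are combined; the sketch below assumes credibility is compared first and recency only breaks ties. The `AttributeVersion` type, the numeric credibility scale and the example values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class AttributeVersion:
    value: str
    source_credibility: float  # assumed scale: higher means more trusted
    timestamp: date


def resolve_conflict(existing: AttributeVersion, incoming: AttributeVersion) -> AttributeVersion:
    """Pick the version to retain when two attribute values disagree."""
    if existing.value == incoming.value:
        return existing  # same value, so there is no conflict to resolve
    # Assumed ordering: credibility first, timestamp recency as tie-breaker.
    existing_key = (existing.source_credibility, existing.timestamp)
    incoming_key = (incoming.source_credibility, incoming.timestamp)
    return incoming if incoming_key > existing_key else existing


# A newer but less credible record does not replace the existing value.
kept_by_credibility = resolve_conflict(
    AttributeVersion("30 C", 0.9, date(2020, 1, 1)),
    AttributeVersion("37 C", 0.6, date(2025, 1, 1)),
)

# With equal credibility, the more recent record is retained.
kept_by_recency = resolve_conflict(
    AttributeVersion("30 C", 0.9, date(2020, 1, 1)),
    AttributeVersion("37 C", 0.9, date(2025, 1, 1)),
)
```

Other combinations, such as a weighted score over both criteria, would fit the claim wording equally well.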
Description

Graph-model-based microbial data semantic processing method and system

Technical Field

The invention relates to the technical field of knowledge graphs and microbial information processing, and in particular to a graph-model-based microbial data semantic processing method and system.

Background

With the rapid development of high-throughput sequencing technology and microbiology research, massive microbe-related data resources have accumulated worldwide. These resources are scattered across many kinds of databases and information systems, including the strain information databases of strain preservation institutions, the sequence databases of genome sequencing centers, academic literature databases, and the experiment record systems of individual laboratories. For historical reasons and because technical standards differ, however, these heterogeneous data sources form serious data silos, which limits the deep integration of microbial data and the ability to discover knowledge from it.

The Chinese patent with publication number CN118152578A discloses a method for constructing a knowledge graph, in which structured data is extracted from a relational database and mapped to an ontology model, and unstructured text is processed by combining named entity recognition with a relation classification model, so that knowledge graph triplets are generated and updated.
However, when that method is applied to the microbial field, the following technical problems arise. First, microbial naming suffers from serious synonymy: the same strain may appear under different names in different data sources, including Latin names, common names, preservation numbers and other identifiers, and existing entity recognition methods struggle to establish the equivalence between these naming forms accurately. Second, microbial taxonomy is under continuous revision; the re-division of genus-level and species-level taxa makes it difficult to map historical data onto the current taxonomy, and existing ontology model construction methods lack support for the dynamic evolution of taxonomy. Third, the functional attributes and ecological characteristics of strains are often implicit in literature descriptions and experimental data, and existing relation extraction methods are not accurate enough at identifying complex semantic relations specific to the microbial field, such as strain-phenotype and strain-habitat associations.

Therefore, a knowledge graph construction method is needed that can effectively integrate heterogeneous microbial data, resolve microbial naming ambiguity, support dynamic construction of a microbial taxonomy ontology and deeply mine functional associations between strains, so as to raise the semantic level and knowledge service capability of microbial data resources.

Disclosure of Invention

The invention aims to provide a graph-model-based microbial data semantic processing method and system that solve the technical problems in the prior art of difficult integration of heterogeneous microbial data, insufficient precision in microbial naming disambiguation and limited ability to mine strain function associations.
The invention provides a graph-model-based microbial data semantic processing method, which comprises: acquiring heterogeneous microbial data from a strain preservation library database, a genome sequence database, a microbiological literature library and an experiment record system, and performing format unification, missing-value filling and noise filtering on the heterogeneous microbial data to generate a standardized microbial data set; constructing a microbial domain ontology model according to microbial taxonomy specifications, the microbial domain ontology model comprising a classification hierarchy running from the strain level through the species level and the genus level to the family level, and an attribute relation framework covering gene attributes, metabolite attributes, culture condition attributes, physiological characteristic attributes and application function attributes; extracting a candidate entity set from the standardized microbial data set using a named entity recognition algorithm, calculating entity disambiguation vector representations based on the microbial domain ontology model to handle the synonymy and homonymy present in the candidate entity set, mapping each entity in the candidate entity set to a corresponding concept in the microbial domain ontology model, generating a concept, generating a mapping relation, extracting a three-dimensional relation, and a three