US-12619652-B2 - Automated ontology creation
Abstract
Class definitions for an ontology of a domain are determined using a materialized instance graph, where the ontology is used for semantic query execution, automated analytical reasoning, or for machine learning. A plurality of instance graphs for a respective plurality of domain instances are received. A materialized instance graph is generated from the plurality of instance graphs. One or more communities represented in the materialized instance graph are determined. Properties associated with respective communities of the one or more communities are determined. Class definitions are generated, where a class corresponds to a community of the one or more communities and at least a portion of properties associated with the community. Class definitions are assigned to the ontology for the domain.
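The pipeline summarized in the abstract can be illustrated with a minimal sketch, under several assumptions that are not taken from the disclosure: the `=` separator used to encode a property-literal pair, the rule that nodes occurring in the same instance are linked by an unlabeled edge, and the use of connected components as a naive stand-in for a graph-clustering or community-detection algorithm. All function names and the toy instances are hypothetical, and the toy graph is far smaller than the hundred-node graph recited in the claims.

```python
from collections import defaultdict
from itertools import combinations


def encode(prop, literal):
    """Encode a property and its literal as one node label (assumed '=' separator)."""
    return f"{prop}={literal}"


def materialize(instance_graphs):
    """Merge per-instance property-value pairs into one undirected graph.

    Assumption: nodes occurring in the same instance are linked by an
    unlabeled edge; identical pairs across instances collapse into one node.
    """
    adjacency = defaultdict(set)
    for pairs in instance_graphs.values():
        nodes = sorted({encode(p, v) for p, v in pairs})
        for node in nodes:
            adjacency[node]  # register the node even if it has no neighbors
        for a, b in combinations(nodes, 2):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return dict(adjacency)


def connected_components(adjacency):
    """Naive stand-in for community detection: plain connected components."""
    seen, communities = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        communities.append(component)
    return communities


def class_definitions(communities):
    """One class per community; its properties are the property parts of its nodes."""
    return [sorted({node.split("=", 1)[0] for node in c}) for c in communities]


# Hypothetical toy data: three domain instances described by property-value pairs.
instances = {
    "pump_1": [("type", "pump"), ("flow_rate", "10")],
    "pump_2": [("type", "pump"), ("flow_rate", "12")],
    "valve_1": [("kind", "valve"), ("diameter", "5")],
}
graph = materialize(instances)
classes = class_definitions(connected_components(graph))
print(classes)
```

In this sketch the shared node `type=pump` is what ties the two pump instances into a single community. A production implementation would likely substitute a real community-detection algorithm for the connected-components step.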
Inventors
- Jan Portisch
- Sandra Bracholdt
Assignees
- SAP SE
Dates
- Publication Date: 2026-05-05
- Application Date: 2023-10-23
Claims (20)
- 1 . A computing system comprising: at least one memory; one or more hardware processing units coupled to the at least one memory; and one or more computer readable storage media storing computer-executable instructions that, when executed, cause the computing system to perform operations comprising: receiving a plurality of instance graphs for a respective plurality of domain instances, each instance graph comprising nodes corresponding to property-value pairs and edges representing associations among those property-value pairs for, and associated with, the corresponding domain instance; generating a materialized instance graph by transforming nodes and property relationships of the plurality of instance graphs, wherein the materialized instance graph aggregates nodes and edges derived from multiple ones of the plurality of instance graphs as a property-encoded graph having unlabeled edges connecting nodes that each represent a property-literal pair encoded together to identify both a property and its corresponding literal value, including shared property-literal nodes corresponding to common property-value pairs across the domain instances, the materialized instance graph comprising at least one hundred nodes; determining, using a graph-clustering or community-detection algorithm, one or more communities represented in the materialized instance graph, each community comprising property-literal nodes that co-occur across multiple domain instances; determining, for each of the one or more communities, properties associated with nodes of the respective communities of the one or more communities based on connectivity of property-literal nodes within the materialized instance graph, each property-literal node being labeled with a property identifier and a corresponding literal value; generating class definitions, where a class corresponds to a community of the one or more communities and each class definition comprises at least a portion of the properties associated 
with that community; and assigning the class definitions to an ontology for the domain; wherein the ontology is used to execute semantic queries, for automated analytical reasoning, or for machine learning.
- 2 . The computing system of claim 1 , wherein the encoding comprises a concatenation of a property in one or more of the plurality of instance graphs and a literal associated with the property.
- 3 . The computing system of claim 1 , wherein determining one or more communities represented in the materialized instance graph comprises removing one or more nodes of the materialized instance graph to provide a plurality of subgraphs, wherein at least a portion of the subgraphs correspond to communities.
- 4 . The computing system of claim 3 , wherein removing one or more nodes comprises removing nodes satisfying a betweenness threshold.
- 5 . The computing system of claim 1 , the operations further comprising: determining cluster scores for properties in at least one respective community of the plurality of communities.
- 6 . The computing system of claim 5 , the operations further comprising: based at least in part on the cluster scores, determining properties to be recommended for inclusion in a class definition corresponding to the at least one respective community; and rendering a user interface displaying at least a portion of the properties determined for the at least one respective community and identifying properties of the at least a portion of the properties that are recommended for inclusion in the class definition.
- 7 . The computing system of claim 5 , the operations further comprising: based at least in part on the cluster scores, determining properties to be recommended as requirements in a class definition corresponding to the at least one respective community; and rendering a user interface displaying at least a portion of the properties determined for the at least one respective community and identifying properties of the at least a portion of the properties that are recommended as requirements for the class definition.
- 8 . The computing system of claim 1 , the operations further comprising: rendering a user interface; and through the user interface, receiving one or more class definition parameters, the one or more class definition parameters comprising an indication of a number of classes to be identified in the determining one or more communities.
- 9 . The computing system of claim 8 , wherein the number of classes is specified as one or more of a maximum number of classes or a minimum number of classes.
- 10 . The computing system of claim 1 , the operations further comprising: rendering a user interface; and through the user interface, receiving one or more class definition parameters, the one or more class definition parameters comprising an indication of a number of properties to be identified for communities of the one or more communities.
- 11 . The computing system of claim 10 , wherein the number of properties is specified as one or more of a maximum number of properties or a minimum number of properties.
- 12 . The computing system of claim 1 , the operations further comprising: analyzing at least a portion of the materialized instance graph corresponding to a class definition; determining a proposed name for the class definition based at least in part on the analyzing; and rendering a user interface displaying the proposed name for the class definition.
- 13 . The computing system of claim 1 , wherein the determining one or more communities comprises determining at least one community and at least another community that is a subcommunity of the at least one community.
- 14 . The computing system of claim 1 , the operations further comprising: receiving an instance graph that is not in the plurality of instance graphs; and classifying a domain instance represented in the instance graph according to the class definitions.
- 15 . A method, implemented in a computing system comprising at least one hardware processor and at least one memory coupled to the at least one hardware processor, the method comprising: receiving a plurality of instance graphs for a respective plurality of domain instances, each instance graph comprising nodes corresponding to property-value pairs and edges representing associations among those property-value pairs for, and associated with, the corresponding domain instance; generating a materialized instance graph by transforming nodes and property relationships of the plurality of instance graphs, wherein the materialized instance graph aggregates nodes and edges derived from multiple ones of the plurality of instance graphs as a property-encoded graph having unlabeled edges connecting nodes that each represent a property-literal pair encoded together to identify both a property and its corresponding literal value, including shared property-literal nodes corresponding to common property-value pairs across the domain instances, the materialized instance graph comprising at least one hundred nodes; determining, using a graph-clustering or community-detection algorithm, one or more communities represented in the materialized instance graph; determining, for each of the one or more communities, properties associated with nodes of the properties associated with respective communities of the one or more communities based on connectivity of property-literal nodes within the materialized instance graph, each property-literal node being labeled with a property identifier and a corresponding literal value; generating class definitions, where a class corresponds to a community of the one or more communities and each class definition comprises at least a portion of the properties associated with that community; and assigning the class definitions to an ontology for the domain; wherein the ontology is used to execute semantic queries, for automated analytical reasoning, or 
for machine learning.
- 16 . The method of claim 15 , wherein the materialized instance graph represents edges of the plurality of instance graphs as nodes in the materialized instance graph and at least a portion of the nodes represent properties in the plurality of instance graphs, the method further comprising: removing one or more nodes of the materialized instance graph to provide a plurality of subgraphs, wherein at least a portion of the subgraphs correspond to communities; determining cluster scores for properties in at least one respective community of the plurality of communities; based at least in part on the cluster scores, determining properties to be recommended for inclusion in a class definition corresponding to the at least one respective community; and rendering a user interface displaying at least a portion of the properties determined for the at least one respective community and identifying properties of the at least a portion of the properties that are recommended for inclusion in the class definition.
- 17 . One or more computer-readable storage media comprising: computer-executable instructions that, when executed by a computing system comprising at least one hardware processor and at least one memory coupled to the at least one hardware processor, cause the computing system to receive a plurality of instance graphs for a respective plurality of domain instances, each instance graph comprising nodes corresponding to property-value pairs and edges representing associations among those property-value pairs for, and associated with, the corresponding domain instance; computer-executable instructions that, when executed by the computing system, cause the computing system to generate a materialized instance graph by transforming nodes and property relationships of the plurality of instance graphs, wherein the materialized instance graph aggregates nodes and edges derived from multiple ones of the plurality of instance graphs as a property-encoded graph having unlabeled edges connecting nodes that each represent a property-literal pair encoded together to identify both a property and its corresponding literal value, including shared property-literal nodes corresponding to common property-value pairs across the domain instances, the materialized instance graph comprising at least one hundred nodes; computer-executable instructions that, when executed by the computing system, cause the computing system to determine, using a graph-clustering or community-detection algorithm, one or more communities represented in the materialized instance graph, each community comprising property-literal nodes that co-occur across multiple domain instances; computer-executable instructions that, when executed by the computing system, cause the computing system to determine, for each of the one or more communities, properties associated with nodes of respective communities of the one or more communities based on connectivity of property-literal nodes within the materialized instance 
graph, each property-literal node being labeled with a property identifier and a corresponding literal value; computer-executable instructions that, when executed by the computing system, cause the computing system to generate class definitions, where a class corresponds to a community of the one or more communities and each class definition comprises at least a portion of the properties associated with the community; and computer-executable instructions that, when executed by the computing system, cause the computing system to assign the class definitions to an ontology for the domain, wherein the ontology is used to execute semantic queries, for automated analytical reasoning, or for machine learning.
- 18 . The one or more computer-readable storage media of claim 17 , wherein at least a portion of the instance graphs use different terminology for properties represented in the instance graphs, further comprising: computer-executable instructions that, when executed by the computing system, cause the computing system to remove one or more nodes of the materialized instance graph to provide a plurality of subgraphs, wherein at least a portion of the subgraphs correspond to communities; computer-executable instructions that, when executed by the computing system, cause the computing system to determine cluster scores for properties in at least one respective community of the plurality of communities; computer-executable instructions that, when executed by the computing system, cause the computing system to, based at least in part on the cluster scores, determine properties to be recommended for inclusion in a class definition corresponding to the at least one respective community; and computer-executable instructions that, when executed by the computing system, cause the computing system to render a user interface displaying at least a portion of the properties determined for the at least one respective community and identifying properties of the at least a portion of the properties that are recommended for inclusion in the class definition.
- 19 . The one or more computer-readable storage media of claim 17 , further comprising: computer-executable instructions that, when executed by the computing system, cause the computing system to receive an instance graph that is not in the plurality of instance graphs; and computer-executable instructions that, when executed by the computing system, cause the computing system to classify a domain instance represented in the instance graph according to the class definitions.
- 20 . The method of claim 15 , further comprising: receiving an instance graph that is not in the plurality of instance graphs; and classifying a domain instance represented in the instance graph according to the class definitions.
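The node-removal step of claims 3 and 4 and the classification step of claims 14, 19, and 20 can be sketched as follows. This is an illustrative sketch only: Brandes' algorithm is used here as one common way to compute betweenness, the threshold value and the toy graph are hypothetical, and the overlap-based `classify` rule is an assumption rather than the classification procedure of the disclosure.

```python
from collections import deque
from itertools import combinations


def build(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def betweenness(adj):
    """Brandes' algorithm for unweighted, undirected graphs (unnormalized)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        stack, queue = [], deque([s])
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return {v: c / 2 for v, c in score.items()}  # each pair was counted twice


def split_communities(adj, threshold):
    """Drop nodes whose betweenness meets the threshold; return remaining components."""
    scores = betweenness(adj)
    kept = {v for v in adj if scores[v] < threshold}
    seen, communities = set(), []
    for start in sorted(kept):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend((adj[v] & kept) - component)
        seen |= component
        communities.append(component)
    return communities


def classify(pairs, class_properties):
    """Assumed rule: pick the class sharing the most property names with the instance."""
    props = {p for p, _ in pairs}
    return max(class_properties, key=lambda name: len(props & class_properties[name]))


# Hypothetical toy graph: two tight groups bridged by one generic node.
pump = ["type=pump", "flow_rate=10", "medium=water"]
valve = ["type=valve", "diameter=5", "seal=rubber"]
edges = [(a, b) for group in (pump, valve) for a, b in combinations(group, 2)]
edges += [("status=active", n) for n in pump + valve]
adj = build(edges)

communities = split_communities(adj, threshold=5.0)
class_properties = {
    f"class_{i}": {n.split("=", 1)[0] for n in c} for i, c in enumerate(communities)
}
label = classify([("diameter", "8"), ("seal", "steel")], class_properties)
print(communities, label)
```

Here the generic node `status=active` lies on every shortest path between the two groups, so it is the only node that meets the threshold. Removing it leaves two subgraphs that serve as communities, and a previously unseen instance is then assigned to the class whose properties it overlaps most.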
Description
FIELD
The present disclosure generally relates to automated processes for determining components of an ontology.
BACKGROUND
Ontologies are important to a variety of computer-implemented processes. For example, ontologies can be used in linking data in the Semantic Web, in natural language processing, in query processing (such as by converting concepts into SQL), and in data integration (integrating data having a common semantic concept). Further, ontologies can be used in artificial intelligence systems, including large language models, the use of which is currently undergoing explosive growth.
Typically, ontologies are created manually. Manual creation of ontologies can be extraordinarily time-consuming, particularly when a large number of concepts are to be expressed in an ontology. Further, manually created ontologies can vary depending on the user developing the ontology, both in the labels used for a common semantic concept and in which concepts are identified; that is, some users may identify ontological concepts that other users overlook. Thus, current techniques for developing ontologies can be very time-consuming, can contain errors (including failing to identify relevant semantic concepts), and can be subject to terminology variation that makes use and comparison of ontologies and ontological processing difficult. Accordingly, room for improvement exists.
SUMMARY
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In one aspect, the present disclosure provides a process of determining properties to be assigned to an ontology using instance graphs. A corpus of documents is received. The corpus of documents represents a plurality of domain instances of a domain. A respective plurality of instance graphs for instances of the plurality of domain instances are generated, providing a plurality of instance graphs. Properties represented in the plurality of instance graphs are determined. At least a portion of the properties are assigned to an ontology for the domain. The ontology is used to execute semantic queries, for automated analytical reasoning, or for machine learning.
In another aspect, the present disclosure provides a process of generating class definitions for an ontology using instance graphs. A plurality of instance graphs for a respective plurality of domain instances are received. A materialized instance graph is generated from the plurality of instance graphs. One or more communities represented in the materialized instance graph are determined. Properties associated with respective communities of the one or more communities are determined. Class definitions are generated, where a class corresponds to a community of the one or more communities and at least a portion of properties associated with the community. The class definitions are assigned to an ontology for the domain. The ontology is used to execute semantic queries, for automated analytical reasoning, or for machine learning.
The present disclosure also includes computing systems and tangible, non-transitory computer readable storage media configured to carry out, or including instructions for carrying out, an above-described method. As described herein, a variety of other features and advantages can be incorporated into the technologies as desired.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 shows an exemplary directed graph.
FIG. 2 shows the domain and range of a property in a schema of a directed graph.
FIG. 3 shows an exemplary SPARQL query and results of the query.
FIG. 4 is a diagram illustrating relationships between, and components of, a knowledge graph, an ontology, and a meta ontology.
FIG. 5 is a flowchart of a method for extracting properties from a set of source documents.
FIG. 6 is a diagram of a computing environment in which disclosed techniques for property extraction can be performed.
FIGS. 7A-7C illustrate example source documents having information from which properties can be extracted.
FIG. 8 illustrates an example user interface for selecting source files for analysis and for identifying instances associated with such source files.
FIG. 9 illustrates an example user interface where a user can define or edit a set of source files for particular instances.
FIG. 10 illustrates an example user interface where a user can view and edit instance graphs created from one or more source files for an instance, including viewing properties extracted from the source documents or property values.
FIG. 11 provides example pseudocode for aligning instance graphs, such as using a common vocabulary, and counting the occurrence of particular properties in a set of instance graphs.
FIG. 12A illustrates an example use