EP-4742054-A1 - DETERMINING AND ENHANCING COMPLETENESS METRICS IN DATA STRUCTURES WITH INSTANCES, CLASSES, AND PROPERTIES


Abstract

The present disclosure provides techniques and solutions for determining the completeness of a collection of datatype instances. Completeness can reflect the degree to which instances include values for instance properties. A first collection of instances is received, each including one or more instance types and stored as one or more data types in a data structure. The instance types define properties of the instances, and some instances lack values for certain properties while having values for others. A first completeness metric for a first instance type is determined by analyzing the presence or absence of property values and comparing unfilled properties to a defined set. If the first completeness metric does not meet a threshold, actions are taken, such as adjusting a computing process to access a second collection of instances or receiving values for missing properties.

Inventors

PORTISCH, JAN
BRACHOLDT, SANDRA
SAGGAU, VOLKER
SHETTY, SHRADDHA

Assignees

SAP SE

Dates

Publication Date: 2026-05-13
Application Date: 2025-11-06

Claims (15)

A computing system comprising: at least one memory; one or more hardware processing units coupled to the at least one memory; and one or more computer readable storage media storing computer-executable instructions that, when executed, cause the computing system to perform operations comprising: receiving a first collection of a plurality of instances, the plurality of instances comprising instances of one or more instance types and being stored as one or more instances of one or more data types in a computer-implemented data structure, wherein the instance types define one or more properties of instances of the instance type, and wherein at least a portion of the instances associated with the one or more instance types do not comprise a value for one or more properties defined by at least one of the respective instance types and comprise a value for one or more other properties defined by at least one of the respective instance types; determining a first completeness metric for a first instance type of the one or more instance types by: (1) analyzing the presence or absence of property values for multiple instances of the plurality of instances having the first instance type; and (2) for given instances of the multiple instances, comparing a number of properties for the first instance type that are unfilled for the given instance to a set of one or more properties defined for the first instance type; at least in part in response to determining that the first completeness metric does not satisfy a first threshold: (A) adjusting a computing process to access a second collection of a plurality of instances; or (B) receiving a value for at least one missing property value of at least one instance of the multiple instances.
The computing system of claim 1, wherein the set of one or more properties defined for the first instance type are defined in an ontology, and the first instance type is a class in the ontology.
The computing system of any one of the preceding claims, the operations further comprising: aggregating instance completeness scores for the multiple instances to provide the first completeness metric, wherein the first completeness metric is a metric indicating a level of property completeness for the multiple instances of the first instance type.
The computing system of claim 3, the operations further comprising: calculating a second completeness metric by aggregating the first completeness metrics determined for respective properties defined for the one or more properties defined for the first instance type.
The computing system of claim 3, wherein the instance completeness score is determined by comparing a number of properties filled for an instance of the multiple instances to an expected completeness score defined for the first instance type.
The computing system of any one of the preceding claims, wherein the first completeness metric is an instance completeness score determined by comparing a number of properties filled for an instance of the multiple instances to an expected completeness score defined for the first instance type.
The computing system of claim 6, the operations further comprising: determining a variance in instance completeness scores for the multiple instances.
The computing system of any one of the preceding claims, the operations further comprising: determining the one or more properties defined by at least one of the respective types by aggregating properties defined for the multiple instances having the first instance type.
The computing system of any one of the preceding claims, the operations further comprising: logging a plurality of queries requesting one or more property values for instances of the multiple instances to provide a set of logged queries; from the set of logged queries, calculating a total number of queries requesting values for a property of the one or more properties of the first instance type for instances of the multiple instances where a queried instance of the multiple instances did not comprise a value for the property of the query; and determining the first completeness metric using the total number of queries.
The computing system of claim 9, the operations further comprising: from the set of logged queries, for instances of the multiple instances, determining a total number of requests for respective instances of the multiple instances; and weighting the first completeness metric by the total number of requests for an instance of the multiple instances.
The computing system of claim 9, the operations further comprising: normalizing the first completeness metric using a total number of queries received for the first collection of a plurality of instances.
The computing system of any one of the preceding claims, wherein the set of one or more properties defined for the first instance type are defined in an ontology, and the first instance type is a class in the ontology, and wherein the first completeness metric is a metric indicating a level of property completeness for the multiple instances of the first instance type, determined by aggregating instance completeness scores for the multiple instances, the operations further comprising: logging a plurality of queries requesting one or more property values for instances of the multiple instances to provide a set of logged queries; from the set of logged queries, calculating a total number of queries requesting values for a property of the one or more properties of the first instance type for instances of the multiple instances where a queried instance of the multiple instances did not comprise a value for the property of the query; and determining a second completeness metric using the total number of queries.
The computing system of claim 12, wherein (A) or (B) are performed at least in part based on determining that the second completeness metric does not satisfy a second threshold.
A method, implemented in a computer system comprising at least one hardware processor and at least one memory coupled to the at least one hardware processor, the method comprising: receiving a first collection of a plurality of instances, the plurality of instances comprising instances of one or more instance types and being stored as one or more instances of one or more data types in a computer-implemented data structure, wherein the instance types define one or more properties of instances of the instance type, and wherein at least a portion of the instances associated with the one or more instance types do not comprise a value for one or more properties defined by at least one of the respective instance types and comprise a value for one or more other properties defined by at least one of the respective instance types; determining a first completeness metric for a first instance type of the one or more instance types by: (1) analyzing the presence or absence of property values for multiple instances of the plurality of instances having the first instance type; and (2) for given instances of the multiple instances, comparing a number of properties for the first instance type that are unfilled for the given instance to a set of one or more properties defined for the first instance type; and (1) aggregating instance completeness scores for the multiple instances to provide the first completeness metric, wherein the first completeness metric is a metric indicating a level of property completeness for the multiple instances of the first instance type; or (2) logging a plurality of queries requesting one or more property values for instances of the multiple instances to provide a set of logged queries; from the set of logged queries, calculating a total number of queries requesting values for a property of the one or more properties of the first instance type for instances of the multiple instances where a queried instance of the multiple instances did not comprise a value for the property of the query; and determining the first completeness metric using the total number of queries.
One or more non-transitory computer-readable storage media comprising: computer-executable instructions that, when executed by a computing system comprising at least one hardware processor and at least one memory coupled to the at least one hardware processor, cause the computing system to execute the method of claim 14.
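
As an illustrative reading of the metrics recited in the claims above, and not part of the claims themselves, the following Python sketch computes a per-instance completeness score, aggregates it per instance type, reports the variance of the scores, and derives a query-log-based miss ratio. All names (`instance_completeness`, `class_completeness`, `missed_query_ratio`, `needs_action`), the sample data, the use of a simple ratio for the instance score, and the use of the arithmetic mean as the aggregate are assumptions made for illustration; the claims do not fix a particular formula.

```python
from statistics import mean, pvariance

# Hypothetical properties defined for one instance type (e.g. by an ontology class).
properties = ["affiliation", "publication_count"]

# Hypothetical instances of that type; a missing key or None counts as unfilled.
instances = {
    "a1": {"affiliation": "Example University", "publication_count": 12},
    "a2": {"publication_count": 7},
}

# Hypothetical query log: (instance id, requested property) pairs.
query_log = [
    ("a1", "affiliation"),
    ("a2", "affiliation"),
    ("a2", "affiliation"),
    ("a2", "publication_count"),
]


def instance_completeness(instance, defined_properties):
    """Compare the count of unfilled properties to the set defined for the type.

    Assumption: the score is 1 - unfilled / defined.
    """
    if not defined_properties:
        return 1.0  # nothing is expected, so nothing is missing
    unfilled = sum(1 for p in defined_properties if instance.get(p) is None)
    return 1.0 - unfilled / len(defined_properties)


def class_completeness(all_instances, defined_properties):
    """Aggregate instance scores (assumed: mean) and return (metric, variance)."""
    scores = [instance_completeness(i, defined_properties) for i in all_instances.values()]
    return mean(scores), pvariance(scores)


def missed_query_ratio(log, all_instances):
    """Count logged queries that requested an unfilled property, normalized by
    the total number of logged queries."""
    if not log:
        return 0.0
    misses = sum(1 for inst_id, prop in log if all_instances[inst_id].get(prop) is None)
    return misses / len(log)


def needs_action(metric, threshold):
    """True when the completeness metric does not satisfy the threshold."""
    return metric < threshold
```

Under these assumptions, a caller would check `needs_action(metric, threshold)` and, when it returns `True`, either switch a process to a second collection of instances or request values for the missing properties. The threshold value itself is left to the caller.
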

Description

FIELD

The present disclosure generally relates to processes for analyzing completeness of data sets.

BACKGROUND

Knowledge graphs (KGs) have found increasing use as computer-implemented data repositories for organizing and representing structured information in a way that facilitates intelligent querying, inference, and analysis. A knowledge graph includes a set of classes (or types) representing conceptual entities, and instances (or entities) representing specific examples of those classes. Each instance is associated with a set of properties (or attributes) that describe its characteristics. For example, in a knowledge graph about academic publications, the class "Author" can have instances such as "John Smith" or "Mary Johnson," each of which is described by properties such as "affiliation" and "publication count."

Knowledge graphs are often aligned with an ontology, which defines the relationships between classes and properties, and the constraints governing them. An ontology acts as a formal framework that provides the schema and logical structure underlying the knowledge graph, so that instances within a class adhere to specific rules and constraints. This allows systems to infer new information, validate consistency, and perform reasoning over the data. For example, an ontology might dictate that every instance of the class "Author" should have a property "affiliation," and that "affiliation" should be an instance of the class "Institution."

However, the effectiveness of a knowledge graph is often contingent on the completeness, as well as the accuracy, of the information it contains. Missing property values for instances can degrade the quality and usefulness of a knowledge graph. For example, if an "Author" instance lacks an "affiliation" property, queries that rely on the "affiliation" information might yield incomplete or erroneous results.

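The "Author" example above can be sketched in a few lines of Python. This is a minimal illustration only: the dictionary-based representation, the names `ONTOLOGY`, `authors`, `missing_properties`, and `authors_with_affiliation`, and the sample values are hypothetical and are not taken from the disclosure.

```python
# Hypothetical ontology fragment: each class lists the properties
# that its instances are expected to carry.
ONTOLOGY = {
    "Author": ["affiliation", "publication_count"],
}

# Hypothetical instances of the class "Author"; a property that is
# absent or None is treated as unfilled.
authors = {
    "John Smith": {"affiliation": "Example University", "publication_count": 12},
    "Mary Johnson": {"publication_count": 7},  # no affiliation recorded
}


def missing_properties(instance, class_name):
    """Return the ontology-defined properties that have no value on the instance."""
    return [p for p in ONTOLOGY[class_name] if instance.get(p) is None]


def authors_with_affiliation(instances, affiliation):
    """A query that relies on 'affiliation'; instances lacking the
    property are silently left out of the result."""
    return [
        name
        for name, props in instances.items()
        if props.get("affiliation") == affiliation
    ]
```

In this sketch, a query by affiliation can never return "Mary Johnson", whatever her actual affiliation is, which is the kind of incomplete result the background describes.
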
Inconsistent or incomplete data can also negatively impact downstream applications such as recommendation engines, predictive models, or data-driven decision-making tools. Accordingly, room for improvement exists.

SUMMARY

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

The present disclosure provides techniques and solutions for determining the completeness of a collection of datatype instances. Completeness can reflect the degree to which instances include values for instance properties. A first collection of instances is received, each including one or more instance types and stored as one or more data types in a data structure. The instance types define properties of the instances, and some instances lack values for certain properties while having values for others. A first completeness metric for a first instance type is determined by analyzing the presence or absence of property values and comparing unfilled properties to a defined set. If the first completeness metric does not meet a threshold, actions are taken, such as adjusting a computing process to access a second collection of instances or receiving values for missing properties.

In one aspect, the present disclosure provides a process of determining data completeness in a computing system, and taking various actions if a threshold completeness is not satisfied. A first collection of a plurality of instances is received. The plurality of instances includes instances of one or more instance types and is stored as one or more instances of one or more data types in a computer-implemented data structure. The instance types define one or more properties of instances of the instance type.
At least a portion of the instances associated with the one or more instance types do not comprise a value for one or more properties defined by at least one of the respective instance types and comprise a value for one or more other properties defined by at least one of the respective instance types. A first completeness metric for a first instance type of the one or more instance types is determined. This involves analyzing the presence or absence of property values for multiple instances of the plurality of instances having the first instance type. For given instances of the multiple instances, a number of properties for the first instance type that are unfilled for the given instance is compared to a set of one or more properties defined for the first instance type. In response to determining that the first completeness metric does not satisfy a first threshold, either a computing process is adjusted to access a second collection of a plurality of instances, or a value for at least one missing property value of at least one instance of the multiple instances is received. The present disclosure also includes computing systems and tangible, non