US-20260127153-A1 - Feature Analysis With Causal Discovery for Dataset Optimization


Abstract

The present disclosure provides systems and methods for analysis and management of complex datasets. An example method can include obtaining a dataset, evaluating one or more feature importance metrics for feature subsets to generate feature importance values, generating one or more feature importance matrices from the feature importance values, and then identifying one or more asymmetries in the feature importance matrices. The identified asymmetries can be used to generate graphical user interface visualizations to facilitate improved understanding and management of the dataset. For example, the identified asymmetries and/or generated visualization can be used for manual and/or automatic generation of feature correction actions which improve the quality of the underlying dataset. Alternatively or additionally, the identified asymmetries and/or generated visualization can be used for the identification of causal relationships in controllable systems, leading to the ability to provide improved control of controllable systems.

Inventors

Jacob Daniel Beel
Bridgette Jayne Befort DeFever
Christopher James Hazard
Jack Xia

Assignees

HOWSO INCORPORATED

Dates

Publication Date: 20260507
Application Date: 20251105

Claims (20)

1. A computer-implemented method for improved feature analysis, the method comprising: obtaining, by a computing system comprising one or more computing devices, a dataset comprising a plurality of cases, wherein each of the plurality of cases has a plurality of values respectively for a plurality of features; evaluating, by the computing system based on the dataset, a feature importance metric for a plurality of feature subsets of the plurality of features to respectively generate a plurality of feature importance values, wherein the feature importance value for each feature subset indicates an importance of the feature subset in predicting another feature subset; generating, by the computing system, a feature importance matrix from the plurality of feature importance values, wherein the feature importance matrix comprises a square matrix with row and column labels equal to the plurality of feature subsets; and identifying, by the computing system, one or more asymmetries exhibited by the feature importance matrix.
2. The computer-implemented method of claim 1, wherein the feature importance metric comprises a feature contribution metric.
3. The computer-implemented method of claim 1, wherein the feature importance metric comprises a mean decrease in accuracy metric.
4. The computer-implemented method of claim 1, further comprising generating, by the computing system, a graphical user interface visualization based on the asymmetries identified from the feature importance matrix.
5. The computer-implemented method of claim 4, wherein the graphical user interface visualization comprises a directed graph, and wherein the directed graph comprises two or more nodes that correspond to two or more of the feature subsets, and wherein the directed graph comprises one or more directed edges that each demonstrate a directional relationship between two of the feature subsets that correspond to two of the nodes, each directional relationship derived from one of the one or more asymmetries.
6. The computer-implemented method of claim 4, wherein the graphical user interface visualization comprises a matrix visualization of at least a portion of the feature importance matrix, wherein the matrix visualization comprises a visual characteristic that identifies the one or more asymmetries.
7. The computer-implemented method of claim 4, wherein the graphical user interface visualization comprises a quadrant visualization that sorts at least some of the plurality of feature subsets into four quadrants, the four quadrants corresponding to: high reduction in uncertainty but low importance; low reduction in uncertainty and low importance; low reduction in uncertainty but high importance; and high reduction in uncertainty and high importance.
8. The computer-implemented method of claim 1, wherein said steps of evaluating, generating, and identifying are performed for each of multiple different feature importance metrics.
9. The computer-implemented method of claim 1, wherein: generating, by the computing system, the feature importance matrix from the plurality of feature importance values comprises normalizing, by the computing system, the plurality of feature importance values; and normalizing, by the computing system, the plurality of feature importance values comprises: normalizing, by the computing system, the plurality of feature importance values by a contribution to a percentage of an overall prediction; or normalizing, by the computing system, the plurality of feature importance values between a mean absolute deviation of the feature subset and a residual of predicting the feature subset given all of the dataset.
10. The computer-implemented method of claim 1, wherein the plurality of feature subsets comprise all feature subsets contained in a superset generated from the plurality of features.
11. The computer-implemented method of claim 1, wherein the plurality of feature subsets equal the plurality of features.
12. The computer-implemented method of claim 1, further comprising automatically generating, by the computing system, one or more feature correction actions based on the one or more asymmetries.
13. The computer-implemented method of claim 12, further comprising automatically performing, by the computing system, the one or more feature correction actions on the dataset.
14. The computer-implemented method of claim 13, further comprising, after performing the one or more feature correction actions, training or re-training, by the computing system, a machine-learned model on the dataset.
15. The computer-implemented method of claim 1, further comprising: identifying, by the computing system, a causal relationship between one or more of the feature subsets and a current state of a controllable system based on the one or more asymmetries; and controlling, by the computing system, the controllable system based on the causal relationship.
16. The computer-implemented method of claim 1, further comprising identifying, by the computing system, one or more graph structures exhibited by the feature importance matrix, wherein the one or more graph structures comprise one or more cliques, ergodic regions, or transitive regions.
17. The computer-implemented method of claim 1, further comprising identifying, by the computing system, one or more graph metrics exhibited by the feature importance matrix, wherein the one or more graph metrics comprise one or more measurements of centrality, assortativity, or modularity.
18. The computer-implemented method of claim 1, wherein said steps of evaluating, generating, and identifying are performed for both a feature contributions metric and a mean decrease in accuracy metric.
19. A computing system for improved feature analysis, the system comprising: one or more processors; and one or more non-transitory computer-readable media storing instructions that, when executed by the one or more processors, cause the system to perform operations comprising: obtaining a dataset comprising a plurality of cases, wherein each of the plurality of cases has a plurality of values respectively for a plurality of features; for a target feature subset of a plurality of feature subsets, evaluating a feature importance value by: determining a first predictive accuracy for the target feature subset based on a first set of input features that includes the target feature subset; determining a second predictive accuracy for the target feature subset based on a second set of input features that excludes the target feature subset; and generating the feature importance value for the target feature subset based on a comparison of the first predictive accuracy and the second predictive accuracy; generating a feature importance matrix from a plurality of feature importance values evaluated for the plurality of feature subsets; and identifying one or more asymmetries exhibited by the feature importance matrix.
20. The computing system of claim 19, wherein the operations further comprise: identifying, based on a magnitude of the feature importance value for the target feature subset, an indication that one or more unobserved features relevant to predicting the target feature subset are absent from the dataset.
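
For illustration only, the evaluating and generating steps recited above can be sketched with a mean-decrease-in-accuracy style comparison. Everything specific in this sketch is an assumption that does not come from the claims: the k-nearest-neighbor predictor, the leave-one-out mean absolute error, the hypothetical function names `knn_loo_error` and `importance_matrix`, the use of single features as the feature subsets, and the synthetic three-feature dataset.

```python
import random

def knn_loo_error(rows, input_cols, target_col, k=3):
    """Leave-one-out mean absolute error of a k-NN regressor for target_col."""
    # With no input features every case is equally near, so average all others.
    k_eff = k if input_cols else len(rows) - 1
    total = 0.0
    for i, row in enumerate(rows):
        # Squared Euclidean distance from this case to every other case.
        dists = sorted(
            (sum((row[c] - other[c]) ** 2 for c in input_cols), j)
            for j, other in enumerate(rows)
            if j != i
        )
        neighbors = [rows[j][target_col] for _, j in dists[:k_eff]]
        total += abs(row[target_col] - sum(neighbors) / len(neighbors))
    return total / len(rows)

def importance_matrix(rows, n_features, k=3):
    """matrix[f][t]: rise in error for target t when feature f is withheld."""
    matrix = [[0.0] * n_features for _ in range(n_features)]
    for t in range(n_features):
        inputs = [c for c in range(n_features) if c != t]
        baseline = knn_loo_error(rows, inputs, t, k)
        for f in inputs:
            reduced = [c for c in inputs if c != f]
            matrix[f][t] = knn_loo_error(rows, reduced, t, k) - baseline
    return matrix

# Hypothetical synthetic cases: b is a function of a, c is unrelated noise.
rng = random.Random(0)
rows = []
for _ in range(300):
    a = rng.uniform(-1.0, 1.0)
    rows.append((a, a * a, rng.uniform(-1.0, 1.0)))

matrix = importance_matrix(rows, n_features=3)
for label, line in zip("abc", matrix):
    print(label, [round(v, 3) for v in line])
```

In this toy dataset, withholding `a` should hurt the prediction of `b` far more than withholding `b` hurts the prediction of `a`, because `b = a * a` discards the sign of `a`. The resulting square matrix is therefore expected to be asymmetric in the `a`/`b` pair, which is the kind of structure the identifying step looks for.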

Description

PRIORITY CLAIM

This application claims priority to and the benefit of U.S. Provisional Application No. 63/716,559, filed Nov. 5, 2024. U.S. Provisional Application No. 63/716,559 is hereby incorporated by reference in its entirety.

FIELD

The present disclosure relates to computer-based reasoning systems and more specifically to data analysis and refinement for computer-based reasoning systems.

BACKGROUND

In the field of data analytics and machine learning, a technical challenge is the effective management and analysis of complex datasets, particularly in understanding the interactions and influences among various features within these datasets. Traditional methods often struggle to accurately quantify and visualize the relationships between different data features, especially when dealing with large volumes of data or datasets with a high degree of interconnectivity among features. For example, many traditional methods do not attempt to determine cause and effect between different data features, but instead focus exclusively on the ability to correctly predict a certain outcome or response. This limitation can lead to inefficiencies in data processing and analysis, as well as potential inaccuracies in the outcomes of predictive models. Conversely, models and techniques which do attempt to develop a better understanding of the causal relationships between features often tend to sacrifice flexibility. Thus, improved techniques are needed which enable feature analysis such as causal discovery without compromising other capabilities of the model.

SUMMARY

A system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions.
One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions. One general aspect includes a computer-implemented method for improved feature analysis. The computer-implemented method includes obtaining, by a computing system that may include one or more computing devices, a dataset that may include a plurality of cases, where each of the plurality of cases has a plurality of values respectively for a plurality of features. The method also includes evaluating, by the computing system based on the dataset, a feature importance metric for a plurality of feature subsets of the plurality of features to respectively generate a plurality of feature importance values, where the feature importance value for each feature subset indicates an importance of the feature subset in predicting another feature subset. The method also includes generating, by the computing system, a feature importance matrix from the plurality of feature importance values, where the feature importance matrix may include a square matrix with row and column labels equal to the plurality of feature subsets. The method also includes identifying, by the computing system, one or more asymmetries exhibited by the feature importance matrix. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods. Example implementations may include any combination of one or more of the following features. In the computer-implemented method, the feature importance metric may include a predictive feature importance metric. The predictive feature importance metric may include a feature contribution metric.
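
As a minimal sketch of the identifying step described above, a feature importance matrix can be compared against its transpose, and each pair whose two directions differ by more than a tolerance can be reported as a directed relationship. The function name `find_asymmetries`, the tolerance value, the convention that an edge points from the more predictive feature to the feature it predicts, and the feature labels and matrix values are all hypothetical choices made for illustration, not details taken from the disclosure.

```python
def find_asymmetries(labels, matrix, tolerance=0.05):
    """Return (source, target, gap) for pairs whose two directions differ.

    matrix[i][j] is read as the importance of labels[i] in predicting
    labels[j]. A pair is reported when the two directions differ by more
    than tolerance, oriented from the more important direction.
    """
    edges = []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            gap = matrix[i][j] - matrix[j][i]
            if abs(gap) > tolerance:
                src, dst = (i, j) if gap > 0 else (j, i)
                edges.append((labels[src], labels[dst], abs(gap)))
    # Largest asymmetries first, so the strongest candidates lead the list.
    return sorted(edges, key=lambda edge: edge[2], reverse=True)

# Hypothetical labels and importance values, chosen only for illustration.
labels = ["temperature", "pressure", "flow"]
matrix = [
    [0.00, 0.40, 0.10],
    [0.05, 0.00, 0.12],
    [0.30, 0.10, 0.00],
]

edges = find_asymmetries(labels, matrix)
for source, target, gap in edges:
    print(f"{source} -> {target} (gap {gap:.2f})")
```

Each returned tuple could serve as one directed edge between two nodes of a graph visualization. The near-symmetric pair (`pressure`, `flow`) falls under the tolerance and yields no edge.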
The method may further include generating, by the computing system, a graphical user interface visualization based on the asymmetries identified from the feature importance matrix. The graphical user interface visualization may include a directed graph. The directed graph may include two or more nodes that correspond to two or more of the feature subsets, and the directed graph may include one or more directed edges that each demonstrate a directional relationship between two of the feature subsets that correspond to two of the nodes, each directional relationship derived from one of the one or more asymmetries. Said steps of evaluating, generating, and identifying may be performed for each of multiple different feature importance metrics. Generating, by the computing system, the feature importance matrix from the plurality of feature importance values may include normalizing, by the computing system, the plurality of feature importance values. Normalizing, by the computing system, the plurality of feature importance values may include normalizing, by the computing system, the plurality of feature importance values by a contribution to a percentage of an overall prediction. The plurality of feature subsets may include all feature subsets contained in a superset ge