US-20260127194-A1 - SYSTEMS AND METHODS FOR AUTOMATICALLY DERIVING DATA TRANSFORMATION CRITERIA
Abstract
Systems, apparatuses, methods, and computer program products are disclosed for automatically deriving data transformation criteria. An example method includes receiving, by communications circuitry, a source dataset and a target dataset and identifying, by a model generator, a target variable. The example method further includes training, by the model generator, a decision tree for the target variable using the source dataset and the target dataset such that the trained decision tree can predict a value for the target variable from new source data. The example method further includes deriving, by a derivation engine, a set of parameters and pseudocode for producing the target variable from the source dataset.
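As a rough illustration of the final derivation step, the sketch below walks each branch of a small, hand-built decision tree and emits one pseudocode rule per leaf. The nested-dict tree layout, the `derive_rules` helper, and the field names (`days_past_due`, `balance`, `risk_tier`) are hypothetical and chosen only for illustration; the disclosure does not prescribe a tree representation or an output syntax.

```python
from typing import List, Tuple, Union

# Hypothetical tree layout (an assumption, not taken from the disclosure):
# an internal node is a dict holding a feature, a threshold, and two subtrees;
# a leaf is simply the predicted value of the target variable.
Tree = Union[dict, str]

TREE: Tree = {
    "feature": "days_past_due",
    "threshold": 30,
    "left": "LOW",             # branch taken when days_past_due <= 30
    "right": {                 # branch taken when days_past_due > 30
        "feature": "balance",
        "threshold": 10000,
        "left": "MEDIUM",      # balance <= 10000
        "right": "HIGH",       # balance > 10000
    },
}


def derive_rules(node: Tree, target: str, path: Tuple[str, ...] = ()) -> List[str]:
    """Return one pseudocode rule per root-to-leaf branch of the tree."""
    if not isinstance(node, dict):
        # Leaf: join the filter criteria collected along this branch.
        condition = " AND ".join(path) if path else "TRUE"
        return [f"IF {condition} THEN {target} = '{node}'"]
    feature, threshold = node["feature"], node["threshold"]
    left = derive_rules(node["left"], target, path + (f"{feature} <= {threshold}",))
    right = derive_rules(node["right"], target, path + (f"{feature} > {threshold}",))
    return left + right


RULES = derive_rules(TREE, target="risk_tier")
for rule in RULES:
    print(rule)
```

Each printed rule pairs the filter criteria along one branch (the feature and comparison) with its associated parameter (the threshold), which is one plausible reading of "a set of parameters and pseudocode" for the target variable.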
Inventors
- Brian Karp
- Damon Antoine Porter
- Stacy Renee Henryson
- Ethan M. Hopkins
- Matthew Dean Wilson
Assignees
- WELLS FARGO BANK, N.A.
Dates
- Publication Date: 2026-05-07
- Application Date: 2025-12-22
Claims (20)
- 1. A method for generating a user interface for automatically derived data transformation criteria, the method comprising: identifying, by a model generator, a target variable; receiving, by communications circuitry, a decision tree for the target variable, wherein the decision tree is trained using a source dataset and a target dataset such that the decision tree can predict a value for the target variable from new source data, wherein the decision tree is predicted to produce a first value for the target variable; identifying, by a prediction engine, a set of exceptions comprising a first difference between the first value for the target variable and a second value for the target variable found in the target dataset; and producing, by the communications circuitry, an exception report comprising the set of exceptions.
- 2. The method of claim 1, wherein the first difference relates to a first historical prediction at a first time, wherein the set of exceptions comprises a second difference related to a second historical prediction at a second time, the method comprising: generating a historical misclassification trend based on the first historical prediction and the second historical prediction, wherein the exception report comprises the historical misclassification trend.
- 3. The method of claim 1, further comprising: receiving, by the communications circuitry, an indication of selecting the target variable in the user interface of the exception report; and displaying, by the communications circuitry, the target variable in the user interface of the exception report.
- 4. The method of claim 1, further comprising: cleansing, by the model generator, the source dataset and the target dataset prior to receiving the decision tree, wherein the decision tree is further trained by optimizing hyperparameters of the decision tree.
- 5. The method of claim 1, further comprising: determining, by the model generator, if an imbalance of values of the target variable exists in the target dataset; and in an instance in which the imbalance of the values of the target variable in the target dataset is determined, modifying, by the model generator, the source dataset and the target dataset to reduce the imbalance.
- 6. The method of claim 5, wherein modifying the source dataset and the target dataset to reduce the imbalance includes: undersampling data points in the source dataset and the target dataset appearing to be overrepresented; or oversampling data points in the source dataset and the target dataset appearing to be underrepresented.
- 7. The method of claim 1, further comprising: presenting, by a visualizer, an interactive dashboard visualization of the decision tree, the interactive dashboard visualization enabling a user to traverse branches of the decision tree.
- 8. The method of claim 1, comprising: deriving a set of parameters and pseudocode for producing the target variable from the source dataset.
- 9. The method of claim 8, wherein deriving the set of parameters and the pseudocode for producing the target variable from the source dataset comprises: extracting, by a derivation engine, filter criteria and associated parameters from each branch of the decision tree; and generating, by the derivation engine and from the filter criteria and the associated parameters for each branch of the decision tree, the set of parameters and the pseudocode for producing the target variable from the source dataset.
- 10. The method of claim 1, further comprising: receiving, by the communications circuitry, a new source dataset and a new target dataset; and generating, by the prediction engine and using the decision tree and the source dataset, a set of predicted target values, wherein the second value belongs to the set of predicted target values.
- 11. An apparatus for generating a user interface for automatically derived data transformation criteria, the apparatus comprising: a model generator configured to: identify a target variable; communications circuitry configured to: receive a decision tree for the target variable, wherein the decision tree is trained using a source dataset and a target dataset such that the decision tree can predict a value for the target variable from new source data, wherein the decision tree is predicted to produce a first value for the target variable; and a prediction engine configured to: identify a set of exceptions comprising a first difference between the first value for the target variable and a second value for the target variable found in the target dataset, wherein the communications circuitry is further configured to produce an exception report comprising the set of exceptions.
- 12. The apparatus of claim 11, wherein the first difference relates to a first historical prediction at a first time, wherein the set of exceptions comprises a second difference related to a second historical prediction at a second time, wherein the prediction engine is further configured to: generate a historical misclassification trend based on the first historical prediction and the second historical prediction, wherein the exception report comprises the historical misclassification trend.
- 13. The apparatus of claim 11, wherein the communications circuitry is further configured to: receive an indication of selecting the target variable in the user interface of the exception report; and display the target variable in the user interface of the exception report.
- 14. The apparatus of claim 11, wherein the model generator is further configured to: cleanse the source dataset and the target dataset prior to receiving the decision tree, wherein the decision tree is further trained by optimizing hyperparameters of the decision tree.
- 15. The apparatus of claim 11, wherein the model generator is further configured to: determine if an imbalance of values of the target variable exists in the target dataset; and in an instance in which the imbalance of the values of the target variable in the target dataset is determined, modify the source dataset and the target dataset to reduce the imbalance.
- 16. The apparatus of claim 15, wherein modifying the source dataset and the target dataset to reduce the imbalance includes: undersampling data points in the source dataset and the target dataset appearing to be overrepresented; or oversampling data points in the source dataset and the target dataset appearing to be underrepresented.
- 17. The apparatus of claim 11, wherein the communications circuitry is further configured to: present, by a visualizer, an interactive dashboard visualization of the decision tree, the interactive dashboard visualization enabling a user to traverse branches of the decision tree.
- 18. The apparatus of claim 11, wherein the model generator is further configured to: derive a set of parameters and pseudocode for producing the target variable from the source dataset.
- 19. The apparatus of claim 18, wherein deriving the set of parameters and the pseudocode for producing the target variable from the source dataset comprises: extracting filter criteria and associated parameters from each branch of the decision tree; and generating, from the filter criteria and the associated parameters for each branch of the decision tree, the set of parameters and the pseudocode for producing the target variable from the source dataset.
- 20. A computer program product for generating a user interface for automatically derived data transformation criteria, the computer program product comprising computer software instructions which, when executed by an apparatus, cause the apparatus to: identify a target variable; receive a decision tree for the target variable, wherein the decision tree is trained using a source dataset and a target dataset such that the decision tree can predict a value for the target variable from new source data, wherein the decision tree is predicted to produce a first value for the target variable; identify a set of exceptions comprising a first difference between the first value for the target variable and a second value for the target variable found in the target dataset; and produce an exception report comprising the set of exceptions.
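The exception-identification step recited in the independent claims can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the `predict` function merely stands in for a trained decision tree, source and target rows are assumed to be aligned by position, and the `find_exceptions` helper, record layout, and field names are hypothetical.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def predict(source_row: Record) -> str:
    """Stand-in for a trained decision tree (hypothetical single-split rule)."""
    return "HIGH" if source_row["days_past_due"] > 30 else "LOW"


def find_exceptions(
    source: List[Record],
    target: List[Record],
    target_variable: str,
    model: Callable[[Record], str],
) -> List[Record]:
    """Collect rows where the predicted value differs from the target dataset.

    Assumes source[i] and target[i] describe the same entity.
    """
    exceptions = []
    for src, tgt in zip(source, target):
        predicted = model(src)            # first value: what the tree predicts
        actual = tgt[target_variable]     # second value: found in the target dataset
        if predicted != actual:
            exceptions.append(
                {"id": src["id"], "predicted": predicted, "actual": actual}
            )
    return exceptions


# Illustrative data only; none of these values come from the disclosure.
SOURCE = [
    {"id": 1, "days_past_due": 5},
    {"id": 2, "days_past_due": 45},
    {"id": 3, "days_past_due": 60},
]
TARGET = [
    {"id": 1, "risk_tier": "LOW"},
    {"id": 2, "risk_tier": "HIGH"},
    {"id": 3, "risk_tier": "LOW"},
]

REPORT = find_exceptions(SOURCE, TARGET, "risk_tier", predict)
for row in REPORT:
    print(f"row {row['id']}: predicted {row['predicted']}, target has {row['actual']}")
```

Here the report is a plain list of mismatched rows; a fuller version might also record when each prediction was made, so that differences could be grouped over time into the kind of misclassification trend described in claims 2 and 12.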
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is a continuation of U.S. patent application Ser. No. 18/802,825, filed Aug. 13, 2024, which is a continuation of U.S. patent application Ser. No. 18/184,933, filed Mar. 16, 2023, and issued as U.S. Pat. No. 12,079,239, which is a continuation of U.S. patent application Ser. No. 17/177,029, filed Feb. 16, 2021, and issued as U.S. Pat. No. 11,636,132, the entire contents of all of which are incorporated herein by reference.
TECHNOLOGICAL FIELD
Example embodiments of the present disclosure relate generally to machine learning and, more particularly, to systems and methods for using machine learning to understand how data is transformed and applied, and leveraging that understanding for error-reduction and predictive analysis.
BACKGROUND
The volume of data available for inspection and use has grown substantially over the last few decades, and it grows at a faster rate each year. In parallel, computing resources have become more powerful and the techniques for analyzing data increase in their nuance. As a result of these changes, there is an ever-increasing reliance on data in all areas of business and life. Moreover, reliance on this data increasingly requires the transformation of data from one format to another, whether because a particular data evaluation requires data to be presented in a new format, because data must be collected from a variety of source repositories that do not store the data in the same format, or for any number of other reasons. Accordingly, transformation of data is an unavoidable aspect of the use of large datasets.
BRIEF SUMMARY
Given that almost all uses of data involve the transformation of the data from its original form into some new form, understanding the ways that data is transformed is critical for monitoring, auditing, or reviewing the ways that the data is used.
Accordingly, the development of new tools for this purpose solves a currently unmet need for technical and automatic solutions that avoid the bias, error, and resource-intensity inherent in manual methods for tracking data lineage.
Historically, documenting the transformations of data has been a manual exercise, and the veracity of the documentation has always been indeterminate (was it done well, or as an afterthought?). In fact, ad hoc manual documentation is largely the default practice even today. However, when an organization must evaluate the lineage of its data to ensure accuracy and avoid errors, the nature of the transformations made to the data is often opaque. This may be because of the number of intermediate transformations between the data's source repository and a given application of that data; because the transformations were not all undertaken in a single location, by a single actor or entity, or at a single time; because documentation was never prepared for every transformation along the way; or because the documentation describing a given data transformation is inaccurate, which may occur for any number of reasons. Accordingly, an organization may not be positioned to understand how its data has been transformed up to the point at which the organization wishes to utilize the data.
As noted above, this lack of authoritative understanding of data transformations presents a critical technical hurdle that organizations must overcome in order to authoritatively rely on the data that is used in various tasks. When the data is used for purposes such as regulatory reporting or mission-critical applications, errors in the data transformations can cause significant failures that can materially impact the organization.
Moreover, where the lineage of a given data element is not known to any individual in an organization, deriving the nature of the transformations that the data element underwent in the course of a given operation poses a significant technical challenge.
Systems, apparatuses, methods, and computer program products are disclosed herein for addressing these technical hurdles by automatically deriving the criteria causing the transformation of data from a source dataset to a target dataset generated from the source dataset. As described below, example embodiments described herein may be provided with the source dataset and the target dataset, and may derive the data transformation criteria for a particular target variable.
In one example embodiment, a system is provided for automatically deriving the data transformation criteria for such a target variable. The system includes communications circuitry configured to receive a source dataset and a target dataset, and a model generator configured to identify a target variable, and train a decision tree for the target variable using the source dataset and the target dataset such that the trained decision tree can predict a value for the target variable from new source data. The system further includes a derivation engine confi