US-12619613-B2 - Systems and methods for generating explainable features for machine learning
Abstract
A system and method are provided for feature engineering. The method may include obtaining data from a plurality of data sources. The method may also include generating, based on the data, features for training machine learning models using a single data pipeline that uses a feature generation logic that is separate from an execution logic. The method may also include generating, based on relations and/or joins in the feature generation logic, metadata for explaining the features, and generating, based on the metadata, lineage information representing a mapping of one or more features to intermediate tables and/or input tables. The method may also include displaying the lineage information representing the mapping, including allowing a user to select and/or drill down different features, intermediate tables and/or input tables. The method may also include storing the features and the metadata for the features to a feature store.
Inventors
- Abhijeet Singh BAIS
- Rajan Narayanan
- Sanjay YERMALKAR
- Steven EGE
- Yinxiang Wang
Assignees
- Elevance Health, Inc.
Dates
- Publication Date
- 20260505
- Application Date
- 20250228
Claims (4)
- 1. A method for engineering explainable features for machine learning, the method comprising: interfacing with a plurality of data sources corresponding to a plurality of cloud data platforms; obtaining data from the plurality of data sources, wherein the data includes two or more of: streams, files and tables; identifying updated data sources for triggering feature generation based on profiling a feature generation logic; determining whether to execute or postpone feature generation based on whether a source of the data is updated; and in accordance with a determination that the source of the data is updated: generating, based on the data, a subset of features based on frequency of training of one or more machine learning models using a single data pipeline that uses the feature generation logic that is separate from an execution logic used to generate the features, wherein the feature generation logic is based on a type of data source corresponding to the data, wherein the execution logic is agnostic to the type of data source, wherein the feature generation logic includes relations and/or joins between input tables and/or intermediate tables of the plurality of data sources, wherein features are generated by configuring and executing the single data pipeline based on the execution logic, wherein the feature generation logic is implemented using SQL queries, wherein the feature generation logic includes toggle switches for turning on or off portions specific to data sources of the plurality of data sources, for debugging purposes; generating, based on the relations and/or joins, metadata for explaining the features for training the one or more machine learning models; generating, based on the metadata, lineage information representing a mapping of one or more features to intermediate tables and/or input tables; and combining the metadata with the feature generation logic and storing, to a feature store, the combination in a text-based format for representing structured data, wherein the feature store is used for training the one or more machine learning models, wherein the feature generation logic provides an interface that allows model developers or data scientists to add features to the feature store.
- 2. The method of claim 1, further comprising: displaying the lineage information representing the mapping of one or more features to intermediate tables and/or input tables, including allowing a user to select and/or drill down different features, intermediate tables and/or input tables.
- 3. The method of claim 1, wherein generating the features comprises configuring and executing queries in the single data pipeline based on the feature generation logic.
- 4. The method of claim 1, further comprising: while generating the features, generating only a subset of the features based on frequency of training of the one or more machine learning models.
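The lineage step recited in the claims (mapping each feature back to intermediate tables and input tables through the relations and/or joins) can be illustrated with a minimal sketch. This is not the patented implementation: the table names, the feature names, the `STEPS` and `FEATURES` structures, and the `build_lineage` helper are all hypothetical, chosen only to show one plausible way such a mapping could be derived and serialized to a text-based structured format.

```python
import json

# Hypothetical feature generation logic: each step names the table it
# produces and the tables it joins or reads from (assumed structure).
STEPS = [
    {"output": "member_claims", "inputs": ["members", "claims"]},
    {"output": "claim_totals", "inputs": ["member_claims"]},
]

# Hypothetical features, each mapped to the table it is selected from.
FEATURES = {"total_paid_90d": "claim_totals", "member_age": "members"}


def build_lineage(steps, features):
    """Map each feature to the intermediate and input tables behind it."""
    produced_by = {step["output"]: step["inputs"] for step in steps}
    lineage = {}
    for feature, table in features.items():
        intermediate, inputs = [], []
        stack, seen = [table], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current in produced_by:
                # Produced by a pipeline step, so it is an intermediate table.
                intermediate.append(current)
                stack.extend(produced_by[current])
            else:
                # Not produced by any step, so it is treated as an input table.
                inputs.append(current)
        lineage[feature] = {
            "intermediate_tables": sorted(intermediate),
            "input_tables": sorted(inputs),
        }
    return lineage


lineage = build_lineage(STEPS, FEATURES)
# Serialize the metadata as JSON, one example of a text-based structured format.
lineage_json = json.dumps(lineage, sort_keys=True, indent=2)
```

A display layer could use such a dictionary to let a user select a feature and drill down from intermediate tables to input tables; how the actual system renders lineage is not specified here.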
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims the benefit of U.S. Provisional Application No. 63/558,933, filed Feb. 28, 2024, the entirety of which is incorporated herein by reference.
FIELD
The invention relates to systems and methods for machine learning, and is more particularly, but not by way of limitation, directed to technology for feature engineering for machine learning.
BACKGROUND
Feature engineering is the process of using domain knowledge to create features (e.g., characteristics or properties) for machine learning algorithms. Feature engineering can be used to improve performance of machine learning algorithms. Feature engineering may include creation of features, such as interaction features, aggregate features and indicators, tailored to augment the predictive power of the learning algorithm.
A feature store may be used to manage and serve machine-learning features. The feature store allows reusability and consistency of features across machine learning models. The feature store serves as a repository that enables storage, description, discovery, and access of features. The feature store provides a uniform interface that allows data modelers and data scientists to use the same features across different models. The feature store helps ensure efficiency, reliability and integrity of machine learning models, and enables collaboration between engineering teams. A data pipeline may be used to create a feature store.
The proliferation of cloud services and execution environments poses many challenges for feature engineering. Combining data from different cloud platforms and/or obtaining a unified view of the data can be difficult due to the lack of common schemas. Features may need to be normalized before model building. Variation in metrics, data collection APIs and monitoring capabilities across cloud providers may impact feature availability. Features and target variables can change over time.
SUMMARY
Accordingly, there is a need for systems and methods for feature engineering that address at least some of the problems described above. Some embodiments provide a feature engineering data pipeline.
The data pipeline may provide consistency. The pipeline may ensure uniformity in the process of data collection, transformation, ingestion, and extraction. This may help ensure that the features generated are consistent across all models, leading to more accurate predictions. The data pipeline may provide scalability and flexibility. With increasing data volume, the pipeline may handle and process growing datasets without compromising performance. The data pipeline may provide integrity. The pipeline may maintain data accuracy and completeness during transformation processes, which are crucial to the quality of the features generated.
For some machine learning applications, features may need to be extracted in real time. Therefore, the pipeline may support real-time data processing. For example, fraud detection may need to predict fraud before claim approval, so features may need to be available at the time of prediction. The data pipeline may include automated data cleansing and formatting processes, which reduce manual intervention.
The data pipeline may also provide observability and lineage. To ensure proper functioning, the pipeline may include mechanisms for tracking data flow, detecting anomalies or issues, and providing alerts to stakeholders. The pipeline may include provisions for tracking changes in feature definitions, transformations applied and raw data, ensuring traceability of the feature generation process.
The feature engineering techniques described herein may include the separation of feature generation logic from the execution environment used to generate the features. The execution environment may be cloud-agnostic. Separation of the feature generation logic from the execution environment allows new features to be added easily.
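The separation of feature generation logic from execution can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed system: the JSON layout, the `enabled` toggle, the `frequency` field, the `claims` table, and the `run_pipeline` and `demo_connection` helpers are hypothetical, and an in-memory SQLite database stands in for a cloud data platform. The point is that the executor knows nothing about individual features; adding a feature means adding a configuration entry.

```python
import json
import sqlite3

# Hypothetical feature generation logic, kept as configuration (assumed layout).
CONFIG = json.loads("""
{
  "features": [
    {"name": "claim_count", "frequency": "daily", "enabled": true,
     "sql": "SELECT member_id, COUNT(*) FROM claims GROUP BY member_id"},
    {"name": "total_paid", "frequency": "weekly", "enabled": true,
     "sql": "SELECT member_id, SUM(paid) FROM claims GROUP BY member_id"},
    {"name": "debug_probe", "frequency": "daily", "enabled": false,
     "sql": "SELECT member_id, MAX(paid) FROM claims GROUP BY member_id"}
  ]
}
""")


def run_pipeline(conn, config, frequency, source_updated):
    """Generic execution logic: runs whatever the configuration describes."""
    if not source_updated:
        # Postpone feature generation when the source has not been updated.
        return {}
    results = {}
    for feature in config["features"]:
        # Skip toggled-off entries and entries scheduled at another frequency.
        if not feature["enabled"] or feature["frequency"] != frequency:
            continue
        rows = conn.execute(feature["sql"]).fetchall()
        results[feature["name"]] = dict(rows)  # {member_id: feature value}
    return results


def demo_connection():
    """Build a tiny in-memory source table (illustrative data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE claims (member_id TEXT, paid REAL)")
    conn.executemany(
        "INSERT INTO claims VALUES (?, ?)",
        [("m1", 10.0), ("m1", 5.0), ("m2", 7.5)],
    )
    return conn


conn = demo_connection()
daily = run_pipeline(conn, CONFIG, "daily", source_updated=True)
weekly = run_pipeline(conn, CONFIG, "weekly", source_updated=True)
skipped = run_pipeline(conn, CONFIG, "daily", source_updated=False)
```

In this sketch the same `run_pipeline` function serves both the daily and weekly subsets, and a new feature requires only a new entry in `CONFIG`, which is the kind of flexibility the separation is meant to provide.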
Moreover, the process may include generation of metadata based on the feature generation logic. The metadata may be used to create lineage for features, which may be used to outline the origin and/or processing of features. Furthermore, profiling of the feature generation logic may be used to determine whether to execute extraction and/or creation of features based on the appropriateness of the source data. Also, the feature engineering techniques described herein enable a flexible execution pipeline that may accommodate various feature generation frequencies (e.g., 2,000 daily features, 23,000 weekly features) generated using the same feature pipeline. Separate feature generation logic also simplifies source additions. For example, the functional logic may be contained in a JSON-based configuration file, so it is easy to add new data sources.
One or more embodiments of the invention are directed to an improved system and method for feature engineering. The method may be performed at a system or a server having one or more processors, memory, and one or more programs