US-12625876-B2 - Extensible data enclave pattern

Abstract

Systems, methods, and non-transitory computer-readable media for forming an extensible data warehouse. A data ingestor application receiving raw data having a first structure. Forming a data lake in the third memory using the raw data. Continuously receiving additional raw data having a plurality of structures. The plurality of structures including the first structure and one or more different structures. The additional raw data supplementing the raw data. Determining each structure of the plurality of structures. Generating a dataset based on the additional raw data and the plurality of structures that are determined. Extracting metadata associated with the additional raw data from the dataset. Creating a catalog of the dataset based on the metadata that is extracted. Modifying the data lake in the third memory to include the additional raw data based on the dataset and the catalog.

Inventors

Justine Celeste Fox
Yanis Ikene

Assignees

Mastercard Technologies Canada ULC

Dates

Publication Date: 20260512
Application Date: 20240306

Claims (20)

1 . A system for forming an extensible data warehouse, the system comprising: a client device including a first electronic processor and a first memory; a storage device including a second electronic processor and a second memory, the storage device associated with the client device; and a server including a third electronic processor and a third memory including a data ingestor application, the third electronic processor configured to: receive, with a data ingestor application, raw data having a first structure, create, with the data ingestor application, a first metadata table for the raw data using an input crawler, the input crawler extracting metadata of the raw data, format, with the data ingestor application, the raw data for storage by performing an extract operation, a transform operation, and a load operation on the raw data, create, with the data ingestor application, a second metadata table for the raw data using an output crawler, the output crawler extracting metadata of the raw data that is formatted for storage, form, with the data ingestor application, a data lake in the third memory using the raw data, create, with the data ingestor application using the first metadata table and the second metadata table, a catalog describing the raw data of the data lake, continuously receive, with the data ingestor application, additional raw data having a plurality of structures, wherein the plurality of structures includes the first structure and one or more different structures, and wherein the additional raw data supplements the raw data, determine, with the data ingestor application, each structure of the plurality of structures, generate, with the data ingestor application, a dataset based on the additional raw data and the plurality of structures that are determined, extract, with the data ingestor application, metadata associated with the additional raw data from the dataset, update, with the data ingestor application, the catalog for the data lake based on the metadata that is extracted from the additional raw data, and modify, with the data ingestor application, the data lake in the third memory to include the additional raw data based on the dataset and the catalog that is updated.
2 . The system of claim 1 , wherein, to receive, with the data ingestor application, the additional raw data having the plurality of structures, the third electronic processor is further configured to horizontally scale resources to receive the additional raw data having the plurality of structures.
3 . The system of claim 1 , wherein the third electronic processor is further configured to: receive, with the data ingestor application, an event notification associated with a change to a dataset stored in the first memory or the second memory, wherein the change is a result of the first electronic processor or the second electronic processor processing a request, and wherein the additional raw data is associated with the change to the dataset stored in the first memory or the second memory, modify, with the data ingestor application, the data lake in the third memory based on the change to the dataset stored in the first memory or the second memory, generate, with the data ingestor application, an event stream associated with the event notification, and transmit, with the data ingestor application via the event stream, the data lake that is modified based on the change to the dataset stored in the first memory or the second memory to a device of the system for storage in memory.
4 . The system of claim 1 , wherein generating the dataset based on the additional raw data and the plurality of structures that are determined, the third electronic processor is further configured to: classify, with the data ingestor application, a source file of the additional raw data based on a structure of the source file, format, with the data ingestor application, the additional raw data that is classified, wherein the additional raw data that is formatted is stored in the data lake, classify, with the data ingestor application, the source file of the additional raw data that is formatted based on a size of the source file of the additional raw data that is formatted, compact, with the data ingestor application, the source file of the additional raw data that is classified as small with a source file of the additional raw data that is classified as large into a larger source file, and store, with the data ingestor application, the additional raw data that is compacted in the data lake.
5 . The system of claim 4 , wherein generating the dataset based on the additional raw data and the plurality of structures that are determined, the third electronic processor is further configured to: define, with the data ingestor application, security policies to restrict permissions to access to the additional raw data of the data lake, and update, with the data ingestor application, the security policies of the data lake in near real-time using the additional raw data, wherein the additional raw data stored in the second memory of the storage device is provided by an authorized collective of computing devices.
6 . The system of claim 4 , wherein each structure of the plurality of structures that is determined is based on a file format and database schema associated with the raw data and the additional raw data.
7 . The system of claim 1 , wherein each structure of the plurality of structures includes a structure type selected from a group consisting of: an array structure, a linked list structure, a stack structure, a queue structure, a hash table structure, a tree structure, a heap structure, and a graph structure.
8 . A method for forming an extensible data warehouse, the method comprising: receiving, with a data ingestor application, raw data having a first structure; creating, with the data ingestor application, a first metadata table for the raw data using an input crawler, the input crawler extracting metadata of the raw data; formatting, with the data ingestor application, the raw data for storage by performing an extract operation, a transform operation, and a load operation on the raw data; creating, with the data ingestor application, a second metadata table for the raw data using an output crawler, the output crawler extracting metadata of the raw data that is formatted for storage; forming, with the data ingestor application, a data lake in a third memory of a server using the raw data; creating, with the data ingestor application using the first metadata table and the second metadata table, a catalog describing the raw data of the data lake; continuously receiving, with a data ingestor application, additional raw data having a plurality of structures, wherein the plurality of structures includes the first structure and one or more different structures, and wherein the additional raw data supplements the raw data; determining, with the data ingestor application, each structure of the plurality of structures; generating, with the data ingestor application, a dataset based on the additional raw data and the plurality of structures that are determined; extracting, with the data ingestor application, metadata associated with the additional raw data from the dataset; updating, with the data ingestor application, the catalog for the data lake based on the metadata that is extracted from the additional raw data; and modifying, with the data ingestor application, the data lake in the third memory to include the additional raw data based on the dataset and the catalog that is updated.
9 . The method of claim 8 , wherein, to receive, with the data ingestor application, the additional raw data having the plurality of structures, the method further comprises: receiving, by horizontally scaled resources, the additional raw data having the plurality of structures.
10 . The method of claim 8 , further comprising: receiving, with the data ingestor application, an event notification associated with a change to a dataset stored in a first memory of a client device or a second memory of a storage device, wherein the change is a result of a first electronic processor of a client device or a second electronic processor of a storage device processing a request, and wherein the additional raw data is associated with the change to the dataset stored in the first memory or the second memory; modifying, with the data ingestor application, the data lake in the third memory based on the change to the dataset stored in the first memory or the second memory; generating, with the data ingestor application, an event stream associated with the event notification; and transmitting, with the data ingestor application via the event stream, the data lake that is modified based on the change to the dataset stored in the first memory or the second memory to a device for storage in memory.
11 . The method of claim 8 , wherein generating the dataset based on the additional raw data and the plurality of structures that are determined, further comprises: classifying, with the data ingestor application, a source file of the additional raw data based on a structure of the source file; formatting, with the data ingestor application, the additional raw data that is classified, wherein the additional raw data that is formatted is stored in the data lake; classifying, with the data ingestor application, the source file of the additional raw data that is formatted based on a size of the source file of the additional raw data that is formatted; compacting, with the data ingestor application, the source file of the additional raw data that is classified as small with a source file of the additional raw data that is classified as large into a larger source file; and storing, with the data ingestor application, the additional raw data that is compacted in the data lake.
12 . The method of claim 11 , wherein generating the dataset based on the additional raw data and the plurality of structures that are determined, further comprises: defining, with the data ingestor application, security policies to restrict permissions to access to the additional raw data of the data lake; and updating, with the data ingestor application, the security policies of the data lake in near real-time using the additional raw data, wherein the additional raw data stored in a second memory of a storage device is provided by an authorized collective of computing devices.
13 . The method of claim 11 , wherein each structure of the plurality of structures that is determined is based on a file format and database schema associated with the raw data and the additional raw data.
14 . The method of claim 8 , wherein each structure of the plurality of structures includes a structure type selected from a group consisting of: an array structure, a linked list structure, a stack structure, a queue structure, a hash table structure, a tree structure, a heap structure, and a graph structure.
15 . A non-transitory computer-readable medium comprising instructions for forming an extensible data warehouse that, when executed by an electronic processor, cause the electronic processor to perform a set of operations comprising: receiving raw data having a first structure; creating a first metadata table for the raw data using an input crawler, the input crawler extracting metadata of the raw data; formatting the raw data for storage by performing an extract operation, a transform operation, and a load operation on the raw data; creating a second metadata table for the raw data using an output crawler, the output crawler extracting metadata of the raw data that is formatted for storage; forming a data lake in a third memory using the raw data; creating, using the first metadata table and the second metadata table, a catalog describing the raw data of the data lake; continuously receiving additional raw data having a plurality of structures, wherein the plurality of structures includes the first structure and one or more different structures, and wherein the additional raw data supplements the raw data; determining each structure of the plurality of structures; generating a dataset based on the additional raw data and the plurality of structures that are determined; extracting metadata associated with the additional raw data from the dataset; updating the catalog for the data lake based on the metadata that is extracted from the additional raw data; and modifying the data lake in the third memory to include the additional raw data based on the dataset and the catalog that is updated.
16 . The non-transitory computer-readable medium of claim 15 , wherein, to receive the additional raw data having the plurality of structures, the set of operations further comprises: receiving, by horizontally scaled resources, the additional raw data having the plurality of structures.
17 . The non-transitory computer-readable medium of claim 15 , further comprising: receiving an event notification associated with a change to a dataset stored in a first memory of a client device or a second memory of a storage device, wherein the change is a result of a first electronic processor of the client device or a second electronic processor of the storage device processing a request, and wherein the additional raw data is associated with the change to the dataset stored in the first memory or the second memory; modifying the data lake in the third memory based on the change to the dataset stored in the first memory or the second memory; generating an event stream associated with the event notification; and transmitting, via the event stream, the data lake that is modified based on the change to the dataset stored in the first memory or the second memory to the client device or the storage device.
18 . The non-transitory computer-readable medium of claim 15 , wherein generating the dataset based on the additional raw data and the plurality of structures that are determined, further comprises: classifying a source file of the additional raw data based on a structure of the source file; formatting the additional raw data that is classified, wherein the additional raw data that is formatted is stored in the data lake; classifying the source file of the additional raw data that is formatted based on a size of the source file of the additional raw data that is formatted; compacting the source file of the additional raw data that is classified as small with a source file of the additional raw data that is classified as large into a larger source file; and storing the additional raw data that is compacted in the data lake.
19 . The non-transitory computer-readable medium of claim 18 , wherein generating the dataset based on the additional raw data and the plurality of structures that are determined, further comprises: defining security policies to restrict permissions to access to the additional raw data of the data lake; and updating security policies of the data lake in near real-time using the additional raw data, wherein the additional raw data stored in a second memory of a storage device is provided by an authorized collective of computing devices.
20 . The non-transitory computer-readable medium of claim 15 , wherein each structure of the plurality of structures includes a structure type selected from a group consisting of: an array structure, a linked list structure, a stack structure, a queue structure, a hash table structure, a tree structure, a heap structure, and a graph structure.
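
Illustrative sketch (not part of the claims): the size-based classification and compaction recited in claims 4, 11, and 18 might look roughly like the following Python. The claims do not state a size threshold, a file representation, or a merge strategy, so `SMALL_FILE_BYTES`, `SourceFile`, `classify_by_size`, and `compact` are all hypothetical names and choices made only for illustration. Here every file classified as small is appended to the first file classified as large, which yields one larger file.

```python
from dataclasses import dataclass

# Hypothetical threshold; the claims do not say what counts as "small".
SMALL_FILE_BYTES = 1024


@dataclass
class SourceFile:
    """In-memory stand-in for a formatted source file (an assumption)."""
    name: str
    data: bytes

    @property
    def size(self) -> int:
        return len(self.data)


def classify_by_size(f: SourceFile, threshold: int = SMALL_FILE_BYTES) -> str:
    """Label a formatted source file as 'small' or 'large'."""
    return "small" if f.size < threshold else "large"


def compact(files: list[SourceFile], threshold: int = SMALL_FILE_BYTES) -> list[SourceFile]:
    """Append every small file to the first large file, producing a larger file.

    Other large files are returned unchanged. If there is no small file or
    no large file, the input is returned as-is.
    """
    small = [f for f in files if classify_by_size(f, threshold) == "small"]
    large = [f for f in files if classify_by_size(f, threshold) == "large"]
    if not small or not large:
        return list(files)
    target, rest = large[0], large[1:]
    merged = SourceFile(
        name=f"{target.name}+compacted",
        data=target.data + b"".join(f.data for f in small),
    )
    return [merged, *rest]


files = [
    SourceFile("a", b"x" * 10),
    SourceFile("b", b"y" * 2000),
    SourceFile("c", b"z" * 5),
]
compacted = compact(files)
```

With the three example files, the two small files are folded into `b`, leaving a single file of 2015 bytes. A production system would more likely compact by rewriting columnar files in object storage; that detail is outside what the claims specify.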

Description

RELATED APPLICATION

This application claims priority to U.S. Provisional Patent Application No. 63/488,881, filed on Mar. 7, 2023, the entire contents of which are hereby incorporated by reference.

FIELD OF THE INVENTION

The present disclosure relates generally to data enclaves. More specifically, the present disclosure relates to systems, methods, and non-transitory computer-readable media for forming an extensible data warehousing platform with built-in data security controls.

BACKGROUND

A data enclave is a tool designed to share information derived from raw data rather than sharing the actual raw data. Data enclaves provide a confidential, protected environment in which authorized users can access sensitive data remotely while providing a secure dissemination platform. In a networked database, data enclaves make available only aggregate results, such as coefficients and counts. Data enclaves are implemented as a cloud-based platform that replaces on-premise infrastructure and provides both the safe storage of datasets and scalable computing resources that operate on raw data that no longer needs to be on the user's physical desktop computer.

SUMMARY

The transformation of raw data into business insights and information, in real time and/or in a batch process, is becoming increasingly problematic because of how quickly software development teams produce new software and how quickly new data sources collect new data from consumers. As data payloads become increasingly detailed, solutions are needed to make the payloads accessible so that they yield actual business value. While data ingestion, data storage, and data access can be scaled with fixed data structures, it is common to encounter variable data structures and mutating data attributes that break data analysis and big data pipeline tools, resulting in a nonfunctioning data lake because something changed in a data store in a recent deployment.
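
To make the variable-structure problem concrete, the following Python sketch shows one simple way an ingestor could assign a coarse structure label to an incoming payload before deciding how to parse it. It is only an illustration under stated assumptions: the function name `detect_structure` and the labels it returns are hypothetical, and the heuristic (try JSON first, then check for a consistent field count per row) is not taken from the disclosure.

```python
import csv
import json


def detect_structure(payload: str) -> str:
    """Return a coarse, illustrative label for the payload's structure.

    The labels ('json', 'pipe-delimited', 'comma-delimited', 'freeform')
    are assumptions made for this sketch.
    """
    text = payload.strip()
    if not text:
        return "freeform"

    # A payload that parses to a JSON object or array is treated as JSON.
    try:
        parsed = json.loads(text)
        if isinstance(parsed, (dict, list)):
            return "json"
    except json.JSONDecodeError:
        pass

    # Otherwise, treat it as delimited only if every row splits into the
    # same number of fields and that number is greater than one.
    lines = text.splitlines()
    for delimiter, label in (("|", "pipe-delimited"), (",", "comma-delimited")):
        counts = [len(row) for row in csv.reader(lines, delimiter=delimiter)]
        if counts and counts[0] > 1 and all(c == counts[0] for c in counts):
            return label

    return "freeform"


samples = {
    "json": '{"user": "ann", "amount": 12.5}',
    "csv": "id,name\n1,Ann\n2,Bob",
    "pipe": "id|name\n1|Ann\n2|Bob",
    "text": "customer called about a late payment",
}
labels = {key: detect_structure(value) for key, value in samples.items()}
```

The heuristic is deliberately naive: a single sentence that happens to contain a comma would be labeled comma-delimited, and a delimited file with a missing column in one row would fall through to freeform. Those are exactly the cases in which a fixed-schema pipeline tends to break.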
To solve the challenges with variable data structures (e.g., freeform text data, JSON, comma delimited, pipe delimited, missing or additional columns) and mutating data attributes, a platform is needed to collect, manage, organize, sort, and process an unlimited number of different data repositories into a data lake with full data access control and the ability to extract, transform, and load data with as little human involvement as possible. In one instance, a platform that combines big data pipeline concepts with the concept of a data lake that stores data in any format is designed to flexibly accommodate (e.g., horizontally scale) a fast-paced product development environment where data structures may change unexpectedly. Another benefit of the platform described herein is the ability to automatically detect the format of data and make the data available in a data warehousing platform for data analytics and data science, with built-in rigorous data processing and data security controls to ensure the privacy, integrity, and confidentiality of data access and usage.

One embodiment described herein is a system for forming an extensible data warehouse. The system includes a client device including a first electronic processor and a first memory, a storage device including a second electronic processor and a second memory, and a server including a third electronic processor and a third memory. The third memory includes a data ingestor application.
The data ingestor application receives raw data having a first structure, forms a data lake in the third memory using the raw data, continuously receives additional raw data having a plurality of structures, wherein the plurality of structures includes the first structure and one or more different structures, and wherein the additional raw data supplements the raw data, determines each structure of the plurality of structures, generates a dataset based on the additional raw data and the plurality of structures that are determined, extracts metadata associated with the additional raw data from the dataset, creates a catalog of the dataset based on the metadata that is extracted, and modifies the data lake in the third memory to include the additional raw data based on the dataset and the catalog.

Another embodiment described herein is a method. The method includes receiving, with a data ingestor application, raw data having a first structure. The method includes forming, with the data ingestor application, a data lake in a third memory of a server using the raw data. The method includes continuously receiving, with the data ingestor application, additional raw data having a plurality of structures, wherein the plurality of structures includes the first structure and one or more different structures, and wherein the additional raw data supplements the raw data. The method includes determining, with the data ingestor application, each structure of the plurality of structures. The method includes generating, with the data ingestor application, a dataset